gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	9d09230c26	Better docs for verbose output of Pileup	2012-08-15 21:55:08 -04:00
Mark DePristo	c0a31b2e5b	CombineVariants parallel integration tests -- All tests but one (using old bad VCF3 input) run unmodified with parallel code. -- Disabled UNSAFE_VCF_PROCESSING for all but that test, which changes md5s because the output files have fixed headers -- Minor optimizations to simpleMerge	2012-08-15 21:13:16 -04:00
Mark DePristo	669c43031a	BCF2 optimizations; parallel CombineVariants -- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header. Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently) -- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils. Fixed bug in edge case that never occurred in practice -- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping it's right) -- More ways to access the data in VCFHeader -- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output -- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary. We just guess that on average we need ~10 elements for the attribute map -- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet -- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement	2012-08-15 21:13:16 -04:00
Mark DePristo	ae4d4482ac	Parallel combine variants! -- CombineVariants is now TreeReducible! -- Integration tests running in parallel all pass except one (will fix) due to incorrect use of db=0 flag on input from old VCF format	2012-08-15 21:13:15 -04:00
Mark DePristo	bd7ed0d028	Enable efficient parallel output of BCF2 -- Previous IO stub was hardcoded to write VCF. So when you ran -nt 2 -o my.bcf you actually created intermediate VCF files that were then encoded single threaded as BCF. Now we emit natively per thread BCF, and use the fast mergeInfo code to read BCF -> write BCF. Upcoming optimizations to avoid decoding genotype data unnecessarily will enable us to really quickly process BCF2 in parallel -- VariantContextWriterStub forces BCF output for intermediate files -- Nicer debug log message in BCF2Codec -- Turn off debug logging of BCF2LazyGenotypesDecoder -- BCF2FieldWriterManager now uses .debug not .info, so you won't see all of that field manager debugging info with BCF2 any longer -- VariantContextWriterFactory.isBCFOutput now has version that accepts just a file path, not path + options	2012-08-15 21:13:15 -04:00
Mark DePristo	9459e6203a	Clean, documented implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates -- Expanded unit tests -- Support for clean logging of results to logger -- Refactored MyTime into AutoFormattingTime in Utils, out of TraversalEngine, for cleanliness and reuse -- Added docs and contracts to StateMonitoringThreadFactory	2012-08-15 21:13:15 -04:00
Mark DePristo	be3230a1fd	Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates -- Created makeCombinations utility function (very useful!). Moved template from VariantContextTestProvider -- UnitTests for basic functionality	2012-08-15 21:13:15 -04:00
Mark DePristo	f277d7c09e	Removing parallelism bottleneck in the GATK -- GenomeLocParser cache was a major performance bottleneck in parallel GATK performance. With 10 threads, > 50% of each thread's time was spent blocking on the MasterSequencingDictionary object. Made this a thread local variable. -- Now we can run the GATK with 48 threads efficiently on GSA4! -- Running -nt 1 => 75 minutes (didn't let it run all of the way through so likely would take longer) -- Running -nt 24 => 3.81 minutes	2012-08-15 21:13:15 -04:00
Eric Banks	87e41c83c5	In AlleleCount stratification, check to make sure the AC (or MLEAC) is valid (i.e. not higher than number of chromosomes) and throw a User Error if it isn't. Added a test for bad AC.	2012-08-14 15:02:30 -04:00
Eric Banks	8e3774fb0e	Fixing behavior of the --regenotype argument in SelectVariants to properly run in GenotypeGivenAlleles mode. Added integration tests to cover recent SV changes.	2012-08-14 14:21:42 -04:00
Eric Banks	34b62fa092	Two changes to SelectVariants: 1) don't add DP INFO annotation if DP wasn't used in the input VCF (it was adding DP=0 previously). 2) If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the VC.	2012-08-14 12:54:31 -04:00
Eric Banks	cfb994abd2	Trivial removal of unused variable (mentioned in resolved JIRA entry)	2012-08-13 22:55:02 -04:00
Khalid Shakir	f809f24afb	Removed SelectHeader's --include_reference_name option since the reference is always included. In SelectHeaders instead of including the path to the file, only include the name of the reference since dbGaP does not like paths in headers.	2012-08-13 16:49:27 -04:00
Mark DePristo	6ad75d2f5c	Reverting changes to BCF2 ranges -- The previously expanded ones are actually the missing values in the range. The previous ranges were correct. Removed the TODO to confirm them, as they are now officially confirmed	2012-08-13 15:06:28 -04:00
Mark DePristo	4d3fad38e9	Increase allowable range for BCF2 by -1 on low-end	2012-08-13 14:20:26 -04:00
Mark DePristo	f032e0aba4	A bit better output for ContextCovariate context size logging	2012-08-12 13:45:52 -04:00
Mark DePristo	243af0adb1	Expanded the BQSR reporting script -- Includes header page -- Table of arguments (Arguments) -- Summary of counts (RecalData0) -- Summary of counts by qual (RecalData1) -- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly) -- BQSR.R loads all relevant libraries now, including gplots, grid, and gsalib to run correctly	2012-08-12 13:45:14 -04:00
Mark DePristo	458bbdee8f	Add useful logger.info telling us the mismatch and indel context sizes	2012-08-12 10:27:05 -04:00
Eric Banks	40f0320a1c	When adding a unit test to LIBS for X and = CIGAR operators, I uncovered a bug with the implementation of the ReadBackedPileup.depthOfCoverage() method.	2012-08-10 14:58:29 -04:00
Eric Banks	eca9613356	Adding support of X and = CIGAR operators to the GATK	2012-08-10 14:54:07 -04:00
Ami Levy Moonshine	68fb04b8f7	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable into testing	2012-08-09 16:48:22 -04:00
Mark DePristo	06258c8a01	BCF2 optimizations -- Added Write method to BCF2 types that directly converts int value to byte stream. Deleted writeRawBytes(int) -- encodeTypeDescriptor semi-inlined into encodeType so that the tests for overflow are done in just one place -- Faster implementation of determineIntegerType for int[] values	2012-08-09 16:36:18 -04:00
Mark DePristo	c6bd9b15ff	BCF2 optimizations -- BCF2Type enum has an overloaded method to read the type as an int from an input stream. This gets rid of a case statement and replaces it with just minimum tiny methods that should be better optimized. As side effect of this optimization is an overall cleaner code organization	2012-08-09 16:36:18 -04:00
Mark DePristo	9a0dda71d4	BCF2 optimizations -- All low-level reads throw IOException instead of catching it directly. This allows us to not try/catch in readByte, improving performance by 5% or so -- Optimize encodeTypeDescriptor with final variables. Avoid using Math.min instead do inline comparison -- Inlined willOverflow directly in its single use	2012-08-09 16:36:18 -04:00
Ryan Poplin	9887bc4410	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-09 16:31:06 -04:00
Ryan Poplin	f4c72a26d5	A few quick, minor findbugs fixes.	2012-08-09 16:30:58 -04:00
Ryan Poplin	c7f22e410f	A few quick, minor findbugs fixes.	2012-08-09 16:22:08 -04:00
Eric Banks	def077c4e5	There's actually a subtle but important difference between foo++ and ++foo	2012-08-09 12:42:50 -04:00
Ryan Poplin	e48727dae3	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-09 10:31:10 -04:00
Guillermo del Angel	5be7e0621d	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-09 09:58:34 -04:00
Guillermo del Angel	71ee8d87b3	Rename per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarify wording in VCF header	2012-08-09 09:58:20 -04:00
Eric Banks	35cec8530c	Make coverage threshold in FindCoveredIntervals a command-line argument	2012-08-08 21:44:24 -04:00
Ryan Poplin	1223d77546	Removing argument from HaplotypeCaller that was made unnecessary by recent improvements to triggering around large events	2012-08-08 15:13:20 -04:00
Eric Banks	0a2a646a52	Other random FindBugs fixes	2012-08-08 14:56:27 -04:00
Eric Banks	4c84cc9486	Quick pass of FindBugs 'should be static inner class' fixes.	2012-08-08 14:42:06 -04:00
Eric Banks	a0196c9f5b	Quick pass of FindBugs 'method invokes inefficient Number constructor' fixes.	2012-08-08 14:34:16 -04:00
Eric Banks	4b2e3cec0b	Quick pass of FindBugs 'inefficient use of keySet iterator instead of entrySet iterator' fixes for core tools.	2012-08-08 14:29:41 -04:00
Guillermo del Angel	3e2752667c	Intermediate checkin for ReducedReads with HaplotypeCaller - change min read count over k-mer to average count over k-mer when doing assembly of a reduced read (not optimal, currently trying max and then will decide on best approach), fix merge conflicts	2012-08-08 12:07:33 -04:00
David Roazen	a7811d673f	Update URL for phone home / GATK key documentation output by the GATK upon error	2012-08-08 09:29:54 -04:00
Mark DePristo	cda8d944b7	Bugfixes for BCF with VQSR -- Old version converted doubles directly from strings. New version uses VariantContext getAttributeAsDouble() that looks at the values directly to determine how to convert from Object to Double (via Double.valueOf, (Double), or (Double)(Integer)). -- getAttributeAsDouble() is now smart in converting integers to doubles as needed -- Removed unnecessary logging info in BCF2Codec -- Added integration tests to ensure that VQSR works end-to-end with BCF2 using sites version of the file khalid sent to me -- Added vqsr.bcf_test.snps.unfiltered.bcf file for this integration test	2012-08-07 17:22:39 -04:00
Mark DePristo	80b94a4f9a	AdaptiveContexts implement pruning to a given chi2 p value -- Added bonferroni corrected p-value pruning, so you tell it how significant of a difference you are willing to collapse in the tree, and it prunes the tree down to this maximum threshold -- Penalty is now a phred-scaled p-value not the raw chi2 value -- Split command line arguments in VisualizeContextTree into separate arguments for each type of pruning	2012-08-07 17:22:39 -04:00
Mark DePristo	982c735c76	VisualizeAdaptiveTree now considers only leaf nodes when computing max/min penalty	2012-08-07 17:22:39 -04:00
Ryan Poplin	15085bf03e	The UnifiedGenotyper now makes use of base insertion and base deletion quality scores if they exist in the reads.	2012-08-07 13:58:22 -04:00
Guillermo del Angel	97c5ed4feb	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-06 20:22:31 -04:00
Guillermo del Angel	238d55cb61	Fixes for running HaplotypeCaller with reduced reads: a) minor refactoring, pulled out code to compute mean representative count to ReadUtils, b) Don't use min representative count over kmer when constructing de Bruijn graph - this creates many paths with multiplicity=1 and makes us lose a lot of SNP's at edge of capture targets. Use mean instead	2012-08-06 20:22:12 -04:00
Ryan Poplin	f1c30c3a59	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-06 12:02:26 -04:00
Mark DePristo	44f160f29f	indelGOP and indelGCP are now advanced, not hidden arguments	2012-08-06 11:42:55 -04:00
Mark DePristo	2f004665fb	Fixing public -> private dep	2012-08-06 11:42:55 -04:00
Mark DePristo	7bf5ca51ee	Major bugfix for adaptive contexts -- Basically I was treating the context history in the wrong direction, effectively predicting the further bases in the context based on the closer one. Totally backward. Updated the code to build the tree in the right direction. -- Added a few more useful outputs for analysis (minPenalty and maxPenalty) -- Misc. cleanup of the code -- Overall I'm not 100% certain this is even the right way to think about the problem. Clearly this is producing a reasonable output but the sum of chi2 values over the entire tree is just enormous. Perhaps a MCMC convergence / sampling criterion would be a better way to think about this problem?	2012-08-06 11:42:55 -04:00
Mark DePristo	b4841548f1	Bug fixes and misc. improvements to running the adaptive context tools -- Better output file name defaults -- Fixed nasty bug where I included nonexistent quals in the contexts to process because they showed up in the Cycle covariate -- Data is processed in qual order now, so it's easier to see progress -- Logger messages explaining where we are in the process -- When in UPDATE mode we still write out the information for an equivalent prune by depth for post analysis	2012-08-06 11:42:55 -04:00
Ryan Poplin	b8709d8c67	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-06 11:41:28 -04:00
Eric Banks	210db5ec27	Update -maxAlleles argument to -maxAltAlleles to make it more accurate. The hidden GSA production -capMaxAllelesForIndels argument also gets updated.	2012-08-06 11:31:18 -04:00
Eric Banks	8f95a03bb6	Prevent NumberFormatExceptions when parsing the VCF POS field	2012-08-06 11:19:54 -04:00
Ryan Poplin	b7eec2fd0e	Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly.	2012-08-05 12:29:10 -04:00
Mark DePristo	e1bba91836	Ready for full-scale evaluation adaptive BQSR contexts -- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees -- Docs, algorithm descriptions, etc so that it makes sense what's going on -- VisualizeContextTree should really be simplified into a single tool that just visualizes the trees when / if we decide to make adaptive contexts standard part of BQSR -- Misc. cleaning, organization of the code (recalibration tests were in private but corresponding actual files were public)	2012-08-03 16:02:53 -04:00
Guillermo del Angel	6f8e7692d4	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-03 12:24:37 -04:00
Guillermo del Angel	9e25b209e0	First pass of implementation of Reduced Reads with HaplotypeCaller. Main changes: a) Active region: scale PL's by representative count to determine whether region is active. b) Scale per-read, per-haplotype likelihoods by read representative counts. A read representative count is (temporarily) defined as the average representative count over all bases in read, TBD whether this is good enough to avoid biases in GL's. c) DeBruijn assembler inserts kmers N times in graph, where N is min representative count of read over kmer span - TBD again whether this is the best approach. d) Bug fixes in FragmentUtils: logic to merge fragments was wrong in cases where there is discrepancy of overlaps between unclipped/soft clipped bases. Didn't affect things before but RR makes prevalence of hard-clipped bases in CIGARs more prevalent so this was exposed. e) Cache read representative counts along with read likelihoods associated with a Haplotype. Code can/should be cleaned up and unified with PairHMMIndelErrorModelCode, as well as refactored to support arbitrary ploidy in HaplotypeCaller	2012-08-03 12:24:23 -04:00
Ryan Poplin	8817fc70d1	Merged bug fix from Stable into Unstable	2012-08-03 10:45:01 -04:00
Ryan Poplin	f40d0a0a28	Updating VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller. Integration tests change because of the MNPs in dbSNP.	2012-08-03 10:44:36 -04:00
Joel Thibault	51bd03cc36	Add RemoveProgramRecords annotation to ActiveRegionWalker	2012-08-03 09:54:16 -04:00
Joel Thibault	addbfd6437	Add a RemoveProgramRecords annotation * Add the RemoveProgramRecords annotation to LocusWalker	2012-08-03 09:54:16 -04:00
Joel Thibault	524d7ea306	Choose whether to keep program records based on Walker * Add keepProgramRecords argument * Make removeProgramRecords / keepProgramRecords override default	2012-08-03 09:54:16 -04:00
Mark DePristo	e04989f76d	Bugfix for new PASS position in dictionary in BCF2	2012-08-03 09:42:21 -04:00
Mark DePristo	fb5dabce18	Update BCF2 to include a minor version number so we can rev (and report errors) with BCF2 -- We are now likely to fail with an error when reading old BCF files, rather than just giving bad results -- Added new class BCFVersion that consolidates all of the version management of BCF	2012-08-02 17:30:30 -04:00
Eric Banks	e3f89fb054	Missing/malformed GATK report files are user errors	2012-08-02 11:33:21 -04:00
Mark DePristo	c3c3d18611	Update BCF2 to put PASS as offset 0 not at the end -- Unfortunately this commit breaks backward compatibility with all existing BCF2 files...	2012-08-01 17:09:22 -04:00
Mark DePristo	ccac77d888	Bugfix for incorrect allele counting in IndelSummary -- Previous version would count all alt alleles as present in a sample, even if only 1 were present, because of the way VariantEval subsetted VCs -- Updated code for subsetting VCs by sample to be clearer about how it handles rederiving alleles -- Update a few pieces of code to get previous correct behavior -- Updated a few MD5s as now ref calls at sites in dbSNP are counted as having a comp sites, and therefore show up in known sites when Novelty strat is on (which I think is correct) -- Walkers that used old subsetting function with true are now using clearer version that does rederive alleles by default	2012-08-01 15:45:12 -04:00
Joel Thibault	2b25df3d53	Add removeProgramRecords argument * Add unit test for the removeProgramRecords	2012-08-01 15:33:05 -04:00
Ryan Poplin	d53105668b	Merged bug fix from Stable into Unstable	2012-08-01 14:53:06 -04:00
Ryan Poplin	fabca66d09	Another fix to VQSR docs	2012-08-01 14:52:49 -04:00
Ryan Poplin	2be29ebd22	Merged bug fix from Stable into Unstable	2012-08-01 14:35:30 -04:00
Ryan Poplin	4093909a56	Updating VQSR docs. Removing references to old best practices pages.	2012-08-01 14:30:24 -04:00
Eric Banks	52b93cab62	Merged bug fix from Stable into Unstable	2012-08-01 13:17:36 -04:00
Eric Banks	22bf052828	Fixing BQSR GATK docs	2012-08-01 13:17:16 -04:00
Eric Banks	459832ee16	Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions as reported a while back on GS	2012-08-01 10:45:04 -04:00
Eric Banks	a4a41458ef	Update docs of FastaAlternateReferenceMaker as promised in older GS thread	2012-08-01 10:33:41 -04:00
Eric Banks	38e5419b11	Merged bug fix from Stable into Unstable	2012-08-01 09:50:31 -04:00
Eric Banks	56f8afab97	Requested by Geraldine: adding a utility to register deprecated walkers (and the major version of the first release since they were removed) so that the User Error printed out for e.g. CountCovariates now states: Walker CountCovariates is no longer available in the GATK; it has been deprecated since version 2.0.	2012-08-01 09:50:00 -04:00
Guillermo del Angel	0528337467	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-31 18:17:50 -04:00
Guillermo del Angel	4a23f3cd11	Simple cleanup of pool caller code - since usage is much more general than just calling pools, AF calculation models and GL calculation models are renamed from Pool -> GeneralPloidy. Also, don't have users specify special arguments for -glm and -pnrm. Instead, when running UG with sample ploidy != 2, the correct general ploidy modules are automatically detected and loaded. -glm now reverts to old [SNP|INDEL|BOTH] usage	2012-07-31 16:34:20 -04:00
Eric Banks	6cb10cef96	Fixed older GS reported bug. Actually, the problem really lies in Picard (can't set max records in RAM without it throwing an exception, reported on their JIRA) so I just masked out the problem by removing this never-used argument from this rarely-used tool.	2012-07-31 16:00:36 -04:00
Eric Banks	ab53d73459	Quick fix to user error catching	2012-07-31 15:50:32 -04:00
Eric Banks	10111450aa	Fixed AlignmentUtils bug for handling Ns in the CIGAR string. Added a UG integration test that calls a BAM with such reads (provided by a user on GetSatisfaction).	2012-07-31 15:37:22 -04:00
Mark DePristo	f7133ffc31	Cleanup syntax errors from BQSR reorganization	2012-07-31 08:11:05 -04:00
Mark DePristo	dad9bb1192	Changes order of writing BaseRecalibrator results so that if R blows up you still get a meaningful tree	2012-07-31 08:11:04 -04:00
Mark DePristo	0c4e729e13	Working version of adaptive context calculations -- Uses chi2 test for independence to determine if subcontext is worth representing. Gives excellent visual results -- Writes out analysis output file producing excellent results in R -- Trivial reformatting of MathUtils	2012-07-31 08:11:04 -04:00
Mark DePristo	93640b382e	Preliminary version of adaptive context covariate algorithm -- Works according to visual inspection of output tree	2012-07-31 08:11:04 -04:00
Mark DePristo	315d25409f	Improvement to RecalDatum and VisualizeContextTree -- Reorganize functions in RecalDatum so that error rate can be computed independently. Added unit tests. Removed equals() method, which is buggy without its associated implementation for hashcode -- New class RecalDatumTree based on QualIntervals that inherits from RecalDatum but includes the concept of sub data -- VisualizeContextTree now uses RecalDatumTree and can trivially compute the penalty function for merging nodes, which it displays in the graph	2012-07-31 08:11:04 -04:00
Mark DePristo	57b45bfb1e	Extensive unit tests, contracts, and documentation for RecalDatum	2012-07-31 08:11:03 -04:00
Mark DePristo	e00ed8bc5e	Cleanup BQSR classes -- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration. It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers. As code becomes embedded throughout GATK it should be refactored to live in utils -- Removed unnecessary imports of BQSR in VQSR v3 -- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate -- Update PluginManager to sort the plugins and interfaces. This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.	2012-07-31 08:11:03 -04:00
Mark DePristo	191294eedc	Initial cleanup of RecalDatum for move and further refactoring -- Moved Datum, the now unnecessary superclass, into RecalDatum -- Fixed some obviously dangerous synchronization errors in RecalDatum, though these may not have caused problems because they may not have been called in parallel mode	2012-07-31 08:11:03 -04:00
Mark DePristo	0670316288	Be clearer that dcov 50 is good for 4x, should use 200 for >30x	2012-07-31 08:11:02 -04:00
Mark DePristo	874dbf5b58	Maximum wait for GATK run report upload reduced to 10 seconds	2012-07-31 08:11:02 -04:00
Ryan Poplin	7ed06ee7b9	Updating FindCoveredIntervals to use the changes to the ActiveRegionWalker.	2012-07-30 12:16:27 -04:00
Ryan Poplin	13591b169f	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-30 12:13:24 -04:00
Eric Banks	0b30588d67	Catch yet another class of User Errors	2012-07-30 11:59:56 -04:00
Eric Banks	5743694196	Merged bug fix from Stable into Unstable	2012-07-30 11:35:28 -04:00
Eric Banks	79195b97a3	Adding categories for the remaining uncategorized walkers	2012-07-30 11:35:08 -04:00
Eric Banks	2b1b00ade5	All integration tests and VC/Allele unit tests are passing	2012-07-27 17:03:49 -04:00
Eric Banks	beb7610195	Resolving merge conflicts	2012-07-27 15:52:02 -04:00
Eric Banks	27e7e11ec0	Allele refactoring checkpoint #3 : all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this.	2012-07-27 15:48:40 -04:00
Ryan Poplin	22bb4804f0	HaplotypeCaller now uses an excessive number of high quality soft clips as a triggering signal in order to capture both end points of a large deletion in a single active region.	2012-07-27 12:44:02 -04:00
Ryan Poplin	a0890126a8	ActiveRegionWalker's isActive function returns a results object now instead of just a double.	2012-07-27 11:01:39 -04:00
Eric Banks	ef335b6213	Several more walkers have been brought up to use the new Allele representation.	2012-07-27 02:14:25 -04:00
Eric Banks	9e2209694a	Re-enable reverse trimming of alleles in UG engine when sub-selecting alleles after genotyping. UG integration tests now pass.	2012-07-27 00:47:15 -04:00
Eric Banks	baf3e33730	Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass.	2012-07-26 23:27:11 -04:00
Ryan Poplin	35e803e110	Merged bug fix from Stable into Unstable	2012-07-26 14:00:04 -04:00
Ryan Poplin	4f741b4cd7	Smoothing in the BQSR bins should be one error observation and one non-error observation.	2012-07-26 13:59:02 -04:00
Guillermo del Angel	2ae890155c	Improvements to indel calling in pool caller: a) Compute per-read likelihoods in reference sample to determine whether a read is informative or not. b) Fixed bugs in unit tests. c) Fixed padding-related bugs when computing matches/mismatches in ErrorModel, d) Added a couple of more integration tests to increase test coverage, including testing odd ploidy	2012-07-26 13:43:00 -04:00
Eric Banks	a694d1b5de	Merge branch 'master' into allelePadding	2012-07-26 01:53:14 -04:00
Eric Banks	32516a2f60	Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels will fail at this point.	2012-07-26 01:50:39 -04:00
Mark DePristo	8c418a15da	Sorting out HMS error handling (fingers crossed) -- Check if a traversal error occurred in the last shard -- Catch ExecutionException from the TreeReducer and throw as our HMS exception -- ShardTraverser just throws the exception as formatted by the HMS, rather than wrapping it as a RuntimeException itself -- EngineFeaturesIntegrationTests now uses public exampleFASTA (faster), and does 1000x iterations (slower)	2012-07-25 23:13:12 -04:00
Mark DePristo	9242f63a4d	On the way to really sorting out HMS error handling -- Better error message when a traversal error occurs (a real bug) -- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50x times -- A bit of cleanup in WalkerTest	2012-07-25 22:11:10 -04:00
Eric Banks	7eb3f54750	Added category docs for the remaining public walkers (I think I got them all). I removed a couple of totally unnecessary walkers.	2012-07-25 21:40:28 -04:00
Eric Banks	2982b24c4b	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-07-25 20:36:53 -04:00
Eric Banks	0a98a6aa8d	Adding extraDocs tag per Mauricio's request	2012-07-25 18:23:18 -04:00
Mauricio Carneiro	fce5cb9f35	Few category changes	2012-07-25 17:23:02 -04:00
Eric Banks	05fa377a8e	Adding GATK categories to standard walkers. Will add to remaining walkers after the next successful release (so that I can see which walkers are public and still need it).	2012-07-25 16:05:47 -04:00
Mauricio Carneiro	d46cf47bd1	Updating Read Filter documentation	2012-07-25 15:05:47 -04:00
Eric Banks	6a3bfa3811	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-07-25 14:11:11 -04:00
Eric Banks	357e0b35af	Register GATK-full-only walkers and rethrow the missing walker error as a not supported in GATK lite error	2012-07-25 14:11:03 -04:00
Roger Zurawicki	5b74763096	Removed Categories. We will use DocumentedGATKFeatures to create categories in our documentation. Eric I guess will be in charge of this. We need to remove walkers and think how to categorize everything. Tools can be hidden from GATKdocs with the @Hidden annotation Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-07-25 13:46:24 -04:00
Eric Banks	a5721a8846	Context covariate optimizations were not suited for multiple threads, so I removed them (since that ended up being much, much easier than trying to make the covariates thread local). Added -nt 2 layer to BQSR integration tests to confirm that it now works with multiple threads.	2012-07-25 13:38:07 -04:00
Eric Banks	e0c07f5567	Reverting old commits that made error handling better because ultimately they made things worse.	2012-07-25 12:37:59 -04:00
Mark DePristo	fcefa61bce	Remove reference dependence in BCF2Codec -- Adding BCF2Codec to VCF.jar and associated unit tests Signed-off-by: Mark DePristo <depristo@broadinstitute.org>	2012-07-25 08:56:38 -04:00
Mark DePristo	19a257a5c1	Multiple bugfixes -- VariantFiltration now properly sets passFilters in VC -- BCF2 writer now properly decodes lazy BCF genotype data that it uses. Improper use generated a horrible subtle bug but the good news is that the extra checks I put in (unnecessarily a few days ago) caught the bug! Signed-off-by: Mark DePristo <depristo@broadinstitute.org>	2012-07-25 08:56:38 -04:00
Mark DePristo	3066894215	Bugfix for BCF2 -- Always decode genotypes block when writing out a BCF file. If the header changes (and we currently don't know this easily) then the dictionary keys used in the genotypes block may be invalid. Temporarily added a private static boolean that turns off writing of the blocks until Eric and his team rewrite the header. Signed-off-by: Mark DePristo <depristo@broadinstitute.org>	2012-07-25 08:56:38 -04:00
Guillermo del Angel	eb55061fd0	a) Document BEAGLE codec, b) Bug fix: inbreeding coefficient shouldn't be computed for non-diploid organisms in current implementation	2012-07-24 12:16:15 -04:00
Mauricio Carneiro	348e86159e	Moving doclets to public	2012-07-23 23:52:14 -04:00
Mauricio Carneiro	5cd98a36b9	Making ForumAPIUtils public	2012-07-23 17:44:24 -04:00
Mauricio Carneiro	3d92f041f3	forgot to delete the merging line	2012-07-23 17:35:07 -04:00
Roger Zurawicki	f3c504769b	Added the ability to update the Forum. GATKDocs looks for a key on gsa4, and updates the forum with new walker if it exists. More changes were made to the GATKDocs. Works nicely with bootstrap on and offline. Cleaned up the code as well. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-07-23 17:17:33 -04:00
Khalid Shakir	46ca49b63d	Removed 'Walker' suffix from packages/GATKEngine.xml that were breaking the packaged release. Archived AnalyzeCovariates scripts and removed references in build packages / GATK extensions.	2012-07-23 16:32:31 -04:00
Ryan Poplin	2a14bbe4f0	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-23 11:28:26 -04:00
Ryan Poplin	10d143c35c	Adding error model header names in the BQSR recal plot. Making the downsampling of points look a little nicer.	2012-07-23 11:28:17 -04:00
Eric Banks	675ccab2fa	Renaming BQSR to BaseRecalibrator	2012-07-23 10:17:17 -04:00
Ryan Poplin	2e486d83e2	Updating HaplotypeCaller docs and expanding integration tests.	2012-07-23 10:05:42 -04:00
Mauricio Carneiro	921eaad33f	Generalized the default platform parameter in BQSRv2 Parameter wasn't working outside of the BQSR walker. It now takes the information on the recalibration report in other tools (PrintReads for example) and treats all reads as coming from the defined default platform.	2012-07-20 17:29:13 -04:00
Mauricio Carneiro	5dc2143142	Removed support for walkers ending with "Walker" from the engine. If your walker has "Walker" in the name, you will have to use "Walker" on the -T to access it.	2012-07-20 17:27:11 -04:00
Mauricio Carneiro	d446d34227	GATK Error messages now point to the new website instead of GetSatisfaction.	2012-07-20 17:27:11 -04:00
Mauricio Carneiro	116885a450	Removed the "Walker" suffix from all walkers that had it. * Did not touch archived walkers... those can be named whatever. * Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...) * Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.	2012-07-20 17:27:11 -04:00
Christopher Hartl	3ee46cced2	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-19 21:25:40 -04:00
Christopher Hartl	af383c30b5	Ensure that the gene summary has a header line	2012-07-19 21:24:04 -04:00
Mark DePristo	2ca5fc62a2	Support for MISSING BCF2 type -- Heng wants to use 0x0? to represent any missing type value, which in our implementation was invalid. Updated our codebase to support this construct. Heng said he'll update the BCF2 quick reference. -- Enabled integration test reading Heng's ex2.bcf file -- GATK now only warns in the case where the END info field isn't the same (or +1 due to padding) as the getEnd() function as determined by the GATK. Turns out there's a single record in the 1000G SV call set that doesn't have the right length -- VariantContextTestProvider now tests that X = Y where X -> writing -> reading -> writing -> reading = Y for a variety of variant context inputs X -- Added integration test reading 1000G SV chr1 calls (from Chris)	2012-07-19 16:14:26 -04:00
Guillermo del Angel	c16f9f2f15	a) Use new method to check for GATK Lite, b) minor improvements to indel pool caller (more to come): brain-dead, quick way to limit number of alt alleles to genotype. We can't process too many alt alleles because of the combinatorial explosion of GL values with high ploidy, and some STR validation targets had up to 12 alt alleles, resulting in GL vectors of > 1e8 elements. Can't use pileup elements since typically not many alleles will be in one pileup, and different alleles will appear in different samples, TBD a nicer solution. c) Commit to posterity scala script for large scale validation calling, still work in progress	2012-07-19 10:24:08 -04:00
Eric Banks	e370030e6c	As requested by Mark, I've broken out the code to pull out the protected subclass when available (and otherwise use the public version) into the GATKLiteUtils class. People should use this code instead of reimplementing all of the java reflection on their own.	2012-07-18 22:44:37 -04:00
Eric Banks	d46ccec04e	Adding Unit Tests to cover the exception catching for Picard errors: because we are using String matching, we want to ensure that we know if/when the exception text changes underneath us.	2012-07-18 21:48:58 -04:00
Mark DePristo	74e153ff4a	FisherStrand now uses RankSumTest isUsableBase to decide if a read should be included in testing -- Previously used hardcoded MAPQ > 20 && QUAL > 20 but now uses isUsableBase -- Updating MD5s as appropriate	2012-07-18 16:07:47 -04:00
Mark DePristo	dede3a30e9	Improvements to the validation report of VariantEval -- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status. This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF. The previous version was counting sites polymorphic in mom against the calls in NA12878. -- Added testdata VCF and integrationtests to ensure this behavior continues in the future -- TODO: actually run integration tests when I have an internet connection	2012-07-18 16:07:47 -04:00
Mark DePristo	559a4826be	Improvements to the validation report of VariantEval -- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status. This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF. The previous version was counting sites polymorphic in mom against the calls in NA12878. -- Added testdata VCF and integrationtests to ensure this behavior continues in the future	2012-07-18 16:07:46 -04:00
Mark DePristo	dc292c0317	FisherStrand now includes all reads and bases, regardless of mapping quality and base quality, just like other annotations -- This actually proved to be a problem with Ion Torrent data where the base quality can be quite low, and so we need to include Q15 bases for calling effectively.	2012-07-18 16:07:46 -04:00
Eric Banks	2c0f073ab1	Make -qq arg hidden for now since it's still very experimental	2012-07-18 15:43:25 -04:00
Eric Banks	b46c85e8b4	More bad BAM file catching	2012-07-18 15:26:31 -04:00
Eric Banks	659eee13a6	Handle NPE generated in UG when non-standard reference bases are present in the fasta	2012-07-18 15:16:27 -04:00
Eric Banks	9af2cfe283	Catch underlying file system problems that get masked as Tribble index errors. There's also a quick patch to the HMS that isn't really the ultimate fix needed; Mark and I will review at a later point.	2012-07-18 15:11:38 -04:00
Eric Banks	4c730542f0	Handle RuntimeExceptions thrown by Picard that are really User Errors. I will add unit tests for these as best I can later.	2012-07-18 13:56:35 -04:00
Eric Banks	ae08d35138	Catch 'too many open files' errors that show up when trying to read the bam index. All that needs to be done is to flesh out the original error message (because it will get caught later and rethrown correctly).	2012-07-18 12:57:34 -04:00
Eric Banks	f2fe59a9d4	Wow, there are a ton of errors captured having to do with being unable to merge the temp Tribble output. I'm expanding the error message a bit to help see if we can do anything going forward.	2012-07-18 12:31:59 -04:00
Eric Banks	e4db8dde91	Enabled a whole other bunch of integration tests for BQSRv2. While I was there I also changed the default context size for indels to 3 (from 8) since that's what works best in the current implementation (as suggested by Ryan). At this point, all of the new core tools (ReduceReads, BQSRv2, HaplotypeCaller, UG extensions) have been moved over to protected and should be stable. Looks like we are pretty much ready for GATK 2.0!	2012-07-17 23:36:43 -04:00
Eric Banks	a8d08ea18d	As a user pointed out, it is not valid for a GenomeLoc to have a start or stop equal to 0.	2012-07-17 22:18:43 -04:00
Guillermo del Angel	29273abab7	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-17 16:58:12 -04:00
Guillermo del Angel	731bbba2e6	Bug fixes for integration test, use correct new UG syntax	2012-07-17 16:57:59 -04:00
Eric Banks	33be41ecf5	Cleaning up integration test	2012-07-17 16:06:04 -04:00
Eric Banks	8dbc9cb29c	Add the ability to emit the original quals in the OQ tag	2012-07-17 15:52:56 -04:00
Guillermo del Angel	40b8c7172c	Pool Caller refactoring in preparation of GATK 2.0: a) PoolCallerUnifiedArgumentCollection disappeared, and arguments moved to UnifiedArgumentCollection. b) PoolCallerWalker is no longer needed and redundant, all functionality subsumed by UG. UG now checks if GATK is lite - if so, don't allow ploidy > 2. c) Moved pool classes from private to protected. d) Changed the way to specify ploidy. Instead of specifying samples per pool and having ploidy = 2*samplesPerPool, have user specify ploidy directly, which is cleaner. Update tests accordingly. We can now call triploid seedless grape genotypes correctly in theory. e) Renamed argument -reference to -reference_sample_calls since the former is ambiguous and it's not clear what it refers to.	2012-07-17 15:27:04 -04:00
Laurent Francioli	68d0e4dd6d	- Multi-allelic sites are now correctly ignored - Reporting of mendelian violations enhanced - Corrected TP overflow by capping it to Byte.MAX_VALUE - Updated integrationtests to reflect changes in MVF file output Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-07-17 15:21:10 -04:00
Eric Banks	b0d99fd10d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-17 15:12:28 -04:00
Eric Banks	305db8c0d1	Total rewrite of the isGATKLite() functionality with help of Khalid/David. PluginManager was not working for us.	2012-07-17 15:11:03 -04:00
Ryan Poplin	6efbcd99f1	HaplotypeCaller is now an AnnotatorCompatibleWalker with all the rights and privileges pertaining thereto. Enabling the ClippingRankSumTest after showing it was useful for 1000 Genomes calling.	2012-07-17 14:38:36 -04:00
Eric Banks	110886e8b9	Oops, got the logic wrong.	2012-07-17 13:37:11 -04:00
Eric Banks	a963b37424	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-17 13:15:37 -04:00
Eric Banks	3a64398d07	Cleaned up the isGATKLite check	2012-07-17 12:46:16 -04:00
Eric Banks	62c5228048	1) Revert previous change - indel recalibration is turned on by default and users of the Lite version will need to turn it off to avoid a User Error. 2) Implemented the engine.isGATKLite() method.	2012-07-17 12:23:40 -04:00
Chris Saunders	1913d1bbd0	Put RunReport S3 upload on timeout thread Move the RunReport S3 upload process onto a separate thread with a timeout allowing the parent to continue. Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>	2012-07-17 12:19:39 -04:00
Eric Banks	40618ac471	A bunch of BQSR changes: 1) by default we do not emit indel quals, but they can be turned on with --enable_indel_quals. 2) We check whether or not we are running in Lite mode (not done yet) and if so and the user is trying to recalibrate indels, we throw a User Error (not supported). 3) Like v1 we now allow the user to set the qual value below which we don't recalibrate (this was the remaining source of differences in the v1 vs. v2 plots).	2012-07-17 10:52:43 -04:00
Eric Banks	d5b3a2eabf	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-17 00:32:53 -04:00
Eric Banks	f657b8bda8	Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow.	2012-07-17 00:32:34 -04:00
Eric Banks	a003148d50	Move AnalyzeCovariates over too.	2012-07-16 16:11:56 -04:00
Eric Banks	0a89adbcdb	Add utility decorators so that classes can tell you which package source they come from if they want to (suggested by Khalid). Using those decorators, we can easily pull out the BQSR updateDataForPileupElement() method into a standard RecalibrationEngine and an AdvancedRecalibrationEngine and use the protected one (AdvancedRE) if available (otherwise, the public one).	2012-07-16 15:34:50 -04:00
Eric Banks	52baac1e16	Move BQSRv2 into public and v1 into the archive.	2012-07-16 14:23:38 -04:00
Khalid Shakir	07822d6c0f	Fixed input annotations for master/test files on DiffObjectsWalker.	2012-07-16 13:33:11 -04:00
Eric Banks	2a830939df	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-14 23:49:59 -04:00
Eric Banks	f29cadd7e2	By default, don't quantize quals in BQSRv2	2012-07-14 23:49:48 -04:00
Eric Banks	75543a3f22	ReadClipper.clipRead's claim that it doesn't modify the original read was false. Ultimately, GATKSAMRecord.clone (as documented) creates a soft copy of the read - so modifying e.g. the bases of the cloned read means that you modify the bases of the original read too. Because of this, when the BQSRv2 Context covariate was writing Ns over the low quality tails of the reads they got propagated out to the output BAM file (very bad). I've updated the ReadClipper docs and cleaned up the code (no reason to use a clone of the read anymore given that we are already modifying the original). For now, the simplest thing is to have the Context covariate store the original bases, overwrite low quality Ns, compute covariates, and rewrite the original bases; we can update later if needed.	2012-07-13 18:50:27 -04:00
Ryan Poplin	443f02ffc2	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-13 16:09:24 -04:00
Khalid Shakir	6dfcc486e8	In ApplyRecalibration marking filter as PASS instead of '.' when the site passes by calling .passFilters().	2012-07-13 15:40:56 -04:00
Ami Levy Moonshine	5d0a7335ea	Remove unnecessary use in the PRIORITY list; remove unneeded imports	2012-07-13 15:27:08 -04:00
Ryan Poplin	d70bb59182	HaplotypeCaller now calls insertion events that aren't fully assembled as symbolic alleles.	2012-07-10 14:22:23 -06:00
Guillermo del Angel	279dff9f81	Bug fix when specifying a JEXL expression for a field that doesn't exist: we should treat the whole expression as false, but we were rethrowing the JEXL exception in this case. Added integration test to cover this in SelectVariants	2012-07-10 13:59:00 -04:00
Mauricio Carneiro	7eb45b4038	Fixed BQSR IntegrationTests * BinaryTag covariate is Experimental, not Standard (this was breaking integration tests) * New parameter in the Recalibration report requires new MD5 for one of the integration tests.	2012-07-09 13:55:12 -04:00
Eric Banks	dd0c47ab7e	Don't cast to a specific walker type since any walker can use the VA engine	2012-07-09 10:25:58 -04:00
Mark DePristo	5b0ade67c8	Updates to VCF processing for better BCF processing -- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order]. Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy -- Updating GATK code to use the appropriate header function (this is why so many files have changed) -- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this) -- Bugfix for adding contig lines to BCF2 header dictionary -- VCFHeader metaData no longer sorted internally. The system now maintains the data in header order, and only sorts output as requested in API -- VCFWriter and BCF2Writer now explicitly sort their header lines -- Don't allow filters to be added that are PASS in the contract	2012-07-08 15:44:33 -07:00
Mark DePristo	63f5262e45	mergeInfoWithMaxAC is no longer hidden in CombineVariants	2012-07-08 15:44:32 -07:00
Mark DePristo	66aee613e2	Bugfix for set key in mergeInfoWithMaxAC. -- Previous version was always setting set=source of info with highest AC. Should actually have been set to the set annotation value itself.	2012-07-08 15:44:32 -07:00
Mark DePristo	91f0ed8059	Fixed nasty Rscript typo in VariantRecalibrator when compactPDF is available	2012-07-08 15:44:32 -07:00
Mark DePristo	87b090c362	Update VariantRecalibrator error message to use -resource not old -B syntax	2012-07-08 15:44:31 -07:00
Mauricio Carneiro	125e6c1a47	added BinaryTagCovariate for ancient dna analysis	2012-07-06 15:03:20 -04:00
Mauricio Carneiro	f603d4c48c	Fixing PairHMMIndelErrorModel boundary issue. When checking the limits of a read to clip, it wasn't considering reads that may already have been clipped before.	2012-07-06 11:48:04 -04:00
Eric Banks	dd571d9aa0	Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags.	2012-07-04 01:22:20 -04:00
Eric Banks	33306d2e20	Changing the logic of the -standard argument; the way it stands currently one can never turn off the cycle or context covariates. Now they are on by default and users must opt out of them to turn them off.	2012-07-04 00:21:21 -04:00
Eric Banks	7d30558e6f	Only 'pad' the cycle covariate for indels, not substitutions	2012-07-03 23:47:01 -04:00
Eric Banks	22f1afddaa	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-03 14:55:59 -04:00
Eric Banks	617eebd204	More misc cleanup	2012-07-03 14:55:37 -04:00
Eric Banks	344c3aeb1d	Cleanup from previous commit	2012-07-03 14:42:44 -04:00
Ryan Poplin	9e8e78de15	Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels.	2012-07-03 14:30:37 -04:00
Eric Banks	0b37d44b0d	Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup.	2012-07-03 13:05:11 -04:00
Eric Banks	031322ff00	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-03 00:12:59 -04:00
Eric Banks	a4670113bd	Refactored/renamed the nested integer array; cleaned up code a bit.	2012-07-03 00:12:33 -04:00
Ryan Poplin	f92139dd82	Ooops, UG VA path for rank sum tests isn't happy with empty lists. Disabling clipping rank sum test for now.	2012-07-02 21:12:42 -04:00
Ryan Poplin	7e7b4cd1b9	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-02 16:37:54 -04:00
Ryan Poplin	b807ff63ef	HaplotypeCaller now creates MNP and complex substitutions by using LD information to decide if events segregate together on haplotypes. Added unit test.	2012-07-02 16:37:39 -04:00
Mauricio Carneiro	3cea080aa8	Cache SoftStart() and SoftEnd() in the GATKSAMRecord these are costly operations when done repeatedly on the same read.	2012-07-02 16:22:00 -04:00
Mauricio Carneiro	88a02fa2cb	Fixing bug for reads with cigars like 9S54H. When hard-clipping, predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.	2012-07-02 16:22:00 -04:00
Eric Banks	cac72bce91	Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit.	2012-07-02 14:33:33 -04:00
Mark DePristo	bcd2e13d8b	Adding duplicate header line keys is a logger.debug not logger.warn message now	2012-07-02 11:39:34 -04:00
Mark DePristo	01e04992f8	Fixed incompatibilities in AbstractVCFCodec that resulted in key=; being parsed and written as key; in VCF output	2012-07-02 11:38:59 -04:00
Eric Banks	c94c8a9c09	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-02 08:53:01 -04:00
Mark DePristo	7aff4446d4	Added unit tests for header repairing capabilities in the GATK engine	2012-07-01 15:38:10 -04:00
Mark DePristo	480b32e759	BCF2 is now officially zero-based open-interval, and that's how the GATK does it now	2012-07-01 14:59:27 -04:00
Ryan Poplin	b6093ff02c	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-01 10:32:37 -04:00
Mark DePristo	5ad9a98a15	Minor bugfixes / consistency fixes to filter strings of Genotypes and AC/AF annotations -- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order -- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values	2012-06-30 11:22:49 -04:00
Mark DePristo	385a3c630f	Added check in VariantContext.validate to ensure that getEnd() == END value when present -- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward -- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF	2012-06-30 11:22:48 -04:00
Mark DePristo	893630af53	Enabling symbolic alleles in BCF2 -- Bugfix for VCFDiffableReader: don't add null filters to object -- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles -- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication. Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where there's a biallelic symbolic allele -- Brand new home for allele clipping / padding routines in VCFAlleleClipper. Actually documented this code, which results in lots of **** negative comments on the code quality. Eric has promised that he and Ami are going to rethink this code from scratch. Fixed many nasty bugs in here, cleaning up unnecessary branches, etc. Added UnitTests in VCFAlleleClipper that actually test the code fully. In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how these tests need to be fixed. Even introduced some minor optimizations -- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason. Fixed. -- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles	2012-06-30 11:22:48 -04:00
Mark DePristo	16276f81a1	BCF2 with support symbolic alleles -- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys -- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest	2012-06-30 11:22:48 -04:00
Mark DePristo	6bea28ae6f	Genotype filters are now just Strings, not Set<String>	2012-06-30 11:22:47 -04:00
Guillermo del Angel	f631be8d80	UnifiedGenotyperEngine.calculateGenotypes() is not only used in UG but in other walkers - vc attributes shouldn't be inherited by default or it may cause undefined behaviour in those walkers, so only inherit attributes from input vc in case of UG calling this function	2012-06-29 23:51:52 -04:00
Guillermo del Angel	65037b87da	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-29 11:08:44 -04:00
Guillermo del Angel	5a9a37ba01	Pool caller improvements: a) Log ref sample depth at every called site (will add more ref-related annotations later), b) Make -glm POOLBOTH work in case we want to genotype SNPs and indels together, c) indel bug fix (pool and non-pool): prevent a bad GenomeLoc from being formed if we're running GGA and incoming alleles are larger than ref window size (typically 400 bp)	2012-06-29 11:08:16 -04:00
Eric Banks	96ea334bf2	Disable caching in BQSR for now since it significantly slows down computation; will look into this in a bit.	2012-06-28 15:27:44 -04:00
Ryan Poplin	05791ebf80	Adding the Clipping rank sum test: If alternate-supporting reads have more hard clipping than reference-supporting reads this is evidence for error.	2012-06-28 13:22:56 -04:00
Ryan Poplin	d12ec92a55	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-28 12:57:59 -04:00
Ryan Poplin	5bb0693888	Bug fix for HC GGA mode. Shouldn't try to add an indel into the haplotype if that haplotype already contains the event of interest. Misc minor assembly param changes. Turning off capping of base qualities by base indel qualities until we can evaluate that change.	2012-06-28 12:57:51 -04:00
Khalid Shakir	1ce0b9d519	Throwing UnknownTribbleType exception instead of CommandLineException when an unknown tribble type is specified.	2012-06-28 11:28:04 -04:00
Mark DePristo	734bb5366b	Special case the situation where we have ploidy == 0 (no GT values) to implicitly assume we have diploid samples -- numLikelihoods no longer allows even ploidy == 0 in requires -- VCFCompoundHeaderLine handles the case where ploidy == 0 => implicit ploidy == 2	2012-06-28 10:06:07 -04:00
Mark DePristo	64d7e93209	Massive bugfixes -- Previous version was reading the size of the encoded genotypes vector for each genotype. This only worked because I never wrote out genotype field values with > 15 elements. Mauricio's killer DiagnoseTargets VCF uncovered the bug. Unfortunately since symbolic allele clipping is still busted those tests are still disabled -- GenotypeContext getMaxPloidy was returning -1 in the case where there are no genotypes, but the answer should be 0.	2012-06-28 10:06:06 -04:00
Mark DePristo	7144154f53	VCFWriter and BCFWriter no longer allow missing samples in the VC compared to their header -- They now throw an error, as it's really unsafe to write out ./. as a special case in the VCFWriter as occurred previously. -- Added convenience method in VariantContextUtils.addMissingSamples(vc, allSamples) that returns a complete VC where samples are given ./. Genotype objects -- This allows us to properly pass tests of creating / writing / reading VCFs and BCFs, which previously differed because the VC from the VCF would actually be different from its original VC -- Updated UG, UGEngine, GenotypeAndValidateWalker, CombineVariants, and VariantsToVCF to manage the master list of samples they are writing out and addMissingSamples via the VCU function	2012-06-28 10:06:06 -04:00
Mark DePristo	4811a00891	GENOTYPE_FILTER_KEY is now a VCFStandardHeaderLine	2012-06-28 10:06:05 -04:00
Mark DePristo	93426a44b1	Fixes for DiagnoseTargets to be VCF/BCF2 spec compliant -- Don't use DP for average interval depth but rather AVG_INTERVAL_DP, which is a float now, not an int -- Don't add PASS filter value to genotypes, as this is actually considered failing filters in the GATK. Genotype filters should be empty for PASSing sites	2012-06-28 10:06:05 -04:00
Eric Banks	dc7636b923	Refactor the ContextCovariate to significantly reduce runtime	2012-06-28 02:29:35 -04:00
Eric Banks	1fafd9f6c8	NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation.	2012-06-27 16:55:49 -04:00
Khalid Shakir	746a5e95f3	Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding. Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals. IntervalUtils return unmodifiable lists so that utilities don't mutate the collections. Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus. Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values. JobRunInfo handles the null done times when jobs crash with strange errors.	2012-06-27 01:15:22 -04:00
Mark DePristo	1f45551a15	Bugfixes to G count types in VCF header -- Previously VCF header lines of count type G assumed that the sample would be diploid. -- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class -- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class -- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations -- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples -- Added extensive unit tests that compare A and G type values in genotypes	2012-06-26 15:28:34 -04:00
Mark DePristo	39c849aced	Bugfix to ensure the DB=1 old files decode properly	2012-06-26 15:28:33 -04:00
Mark DePristo	c1ac0e2760	BCF2 cleanup -- allowMissingVCFHeaders is now part of -U argument. If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING. Updated lots of files to use this -- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup. This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.	2012-06-26 15:28:33 -04:00
Mark DePristo	11dbfc92a7	Horrible bugfix to decodeLoc() in BCF2Codec -- Just completely wrong. -- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem -- Added samtools ex2.bcf file for decoding to our integrationtests	2012-06-26 15:28:32 -04:00
Mark DePristo	7dbba465ee	Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf	2012-06-26 15:28:32 -04:00
Roger Zurawicki	7eb3e4da41	Added integration Tests for DiagnoseTargets Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-25 17:02:46 -04:00
Joel Thibault	f0c54d99ed	Account for a null attributes object * field attributesCanBeModified - a null attributes object can't be modified in its current state * method makeAttributesModifiable() - initialize a null attributes object to empty	2012-06-25 12:07:36 -04:00
Joel Thibault	fd9effbfe2	Fix Exception typo	2012-06-25 12:05:04 -04:00
Ryan Poplin	429ad44421	Bug fix for read pos rank sum test annotation. Shouldn't be using the un-hardclipped start as the alignment start.	2012-06-22 14:53:29 -04:00
Ryan Poplin	735b59d942	Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero.	2012-06-22 12:38:48 -04:00
Ryan Poplin	0650b349d7	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-22 10:42:49 -04:00
Guillermo del Angel	eed32df30d	a) Sanity check in PoolCaller: if user didn't specify correct -glm or -pnrm models then error out with useful message, b) Have VariantsToTable deal with case where sample names have spaces: technically they're allowed (or at least not explicitly forbidden) but they'll produce R-incompatible tables. TBD which other tools have issues, or whether there's a generic fix for this	2012-06-21 21:19:55 -04:00
Mark DePristo	734756d6b2	Final fixes before BCF2 mark III push -- Added MLEAC and MLEAF format lines to PoolCallerWalker -- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation -- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing	2012-06-21 15:17:22 -04:00
Mark DePristo	31ee8aa01a	JEXL update -- Update to 2.1.1 from 2.0 -- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching. So "AF < 0.5" works even in the presence of multi-allelics now.	2012-06-21 15:17:21 -04:00
Mark DePristo	549293b6f7	Bugfixes towards final BCF2 implementation -- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF -- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing -- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one -- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing -- Keys whose value is null are put into the VariantContext info attributes now	2012-06-21 15:17:21 -04:00
Mark DePristo	567dba0f76	Cleanup of VCF header lines and constants, BCF2 bugfixes -- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller -- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers -- VCF parsers now automatically repair standard VCF header lines when reading the header -- Updating integration tests to reflect header changes -- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test -- SelectHeaders now always updates the header to include the contig lines -- SelectVariants add UG header lines when in regenotype mode -- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY -- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs) -- Throw error when VCF has unbounded non-flag values that don't have = value bindings -- By default we no longer allow writing of BCF2 files without contig lines in the header	2012-06-21 15:16:31 -04:00
Mark DePristo	fba7dafa0e	Finalizing BCF2 mark III commit -- Moved GENOTYPE_KEY vcf header line to VCFConstants. This general migration and cleanup is on Eric's plate now -- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header. Still doesn't work... -- Updating integration test files. Moved many more files into public/testdata. Updated their headers to all work correctly with new strict VCF header checking. -- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt -- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine. DB = 0 is never seen in the output VCFs now -- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status -- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be -- VariantsToVCF now properly writes out the GT FORMAT field -- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code. Eric said he and Ami will clean up this whole piece of infrastructure -- Fixed bug in BCF2Codec that wasn't setting the phase field correctly. UnitTested now -- PASS string now added at the end of the BCF2 dictionary after discussion with Heng -- Fixed bug where I was writing out all field values as BigEndian. Now everything is LittleEndian. -- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException -- Cleaned up unused code -- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding -- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples -- We always write the number of genotype samples into the BCF2 nSamples header. How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names... -- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally -- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele -- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values. Genotype objects are all PASS implicitly, or explicitly filtered. We only write out the FT values if at least one sample is filtered. Removed interface functions and cleaned up code -- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works. In general, ** NEVER COPY CODE ** if you need to share functionality make a function, that's why they were invented! -- Increased the default number of records to read for DiffObjects to 1M	2012-06-21 15:16:27 -04:00
Mark DePristo	9c81f45c9f	Phase I commit to get shadowBCFs passing tests -- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header. This helps avoid some of the low-level errors I saw in SelectVariants. This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument -- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA -- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample -- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original -- Performance optimizations in BCF2 codec and writer -- using arrays not lists for intermediate data structures -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over -- VCFHeader (which needs a complete rewrite, FYI Eric) -- Warn and fix on the way flag values with counts > 0 -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow. Duplicates are detected once at header creation. -- Explicitly track FILTER fields for efficient lookup in their own hashmap -- Automatically add PL field when we see a GL field and no PL field -- Added get and has methods for INFO, FILTER, and FORMAT fields -- No longer add AC and AF values to the INFO field when there's no ALT allele -- Memory efficient comparison of VCF and BCF files for shadow BCF testing. Now there's no (memory) constraint on the size of the files we can compare -- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF	2012-06-21 15:16:26 -04:00
Mauricio Carneiro	ab53220635	Refactor on how RR treats soft clips * Sites with more soft clipped bases than regular will force-trigger a variant region * No more unclipping/reclipping, RR machinery now handles soft clips natively. * implemented support for base insertion and base deletion quality scores in synthetic and regular reads. * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present. note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!	2012-06-21 14:02:03 -04:00
Ryan Poplin	769e190202	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-20 09:59:55 -04:00
Christopher Hartl	fe1d6e3953	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-19 08:02:00 -04:00
Christopher Hartl	79ef3325bd	Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations.	2012-06-19 07:50:13 -04:00
Eric Banks	62cee2fb5b	Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature.	2012-06-18 21:36:27 -04:00
Eric Banks	4393adf9e7	If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it.	2012-06-18 13:36:14 -04:00
Ryan Poplin	707151f0a4	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 12:55:58 -04:00
Eric Banks	82a2c40338	Emit the MLE AC and AF in the INFO field of the UG output	2012-06-18 12:19:36 -04:00
Ryan Poplin	5ec737f008	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 08:51:48 -04:00
Ryan Poplin	e3147969d9	Smith Waterman parameters have somehow diverged too far from what is used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled.	2012-06-18 08:51:41 -04:00
Eric Banks	677babf546	Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort.	2012-06-15 15:55:03 -04:00
Eric Banks	783b7f6899	Misc cleanup	2012-06-15 10:39:19 -04:00
Eric Banks	0c218e4822	Refactoring mostly for readability (and small performance improvement)	2012-06-15 10:36:41 -04:00
Eric Banks	c54e84e739	Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations.	2012-06-15 09:28:56 -04:00
Eric Banks	61fcbcb190	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-15 02:45:57 -04:00
Eric Banks	4895fe2289	No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java)	2012-06-15 02:32:00 -04:00
Mark DePristo	0384ce5d34	Simple optimizations for BCF2Encoder -- Inline encodeString that doesn't go via List<Byte> intermediate -- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2 -- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation -- Final UG integration test update	2012-06-14 16:42:39 -04:00
Mark DePristo	68eed7b313	Optimizations for VCF and BCF2 -- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking. encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version. Lots of contracts. User code updated to use specialized versions where possible -- Misc code refactoring -- Updated VCF float formatting to always include 3 sig digits for values < 1, and 2 for > 1. Updating MD5s accordingly -- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations	2012-06-14 16:42:39 -04:00
Mark DePristo	09df584788	Fixed nasty bug where we weren't closing the underlying PositionalOutputStream in IndexingVariantContextWriter	2012-06-14 16:42:39 -04:00
Mark DePristo	fbc45e14d3	Cleanup formatting of VCF floats -- Final integrationtest update before commit (and fixing new formatting changes)	2012-06-14 16:42:38 -04:00
Mark DePristo	8b01969762	More code cleanup and optimizations to BCF2 writer -- Cleanup a few contracts -- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields -- A few simple optimizations -- VCF header samples stored in String[] in the writer for fast access -- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset -- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files -- A few editorial comments in VCFHeader	2012-06-14 16:42:38 -04:00
Mark DePristo	e34ca0acb1	Passing all unittests -- Final merge conflicts resolved -- BCF2Writer now supports case where a sample is present in the header but the sample isn't in the VC, in which case we create an empty sample and encode that	2012-06-14 16:42:38 -04:00
Mark DePristo	71da76039e	Final support for variable length lists of strings in BCF2 -- Updating many MD5s as well.	2012-06-14 16:42:38 -04:00
Mark DePristo	bd9d40fb84	Code cleanup and more documentation for BCFFieldWriters -- Update integration tests where appropriate	2012-06-14 16:42:37 -04:00
Mark DePristo	856905ee5b	Cleanup Genotypes -- Renamed getAttribute to getExtendedAttribute, as this is really what this function does -- Added a few more genotype tests	2012-06-14 16:42:36 -04:00
Mark DePristo	31997f8092	Bugfixes on the way to passing integration tests -- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate -- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it. Very expensive but convenient. -- Fixed nasty subsetting bug in SelectVariants with excluding samples -- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT -- Bugfix for dropping old style GL field values -- Added test to VCFWriter to ensure that we have the same number of samples in the VC as in the header -- Bugfix for Allele.getBaseString to properly show NO_CALL alleles -- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes	2012-06-14 16:42:33 -04:00
Mark DePristo	ea1b699778	Cleanup the interface for BCF2FieldEncoder -- Now uses a much clearer approach. Update all user classes to new interface	2012-06-14 16:42:33 -04:00
Mark DePristo	dd6aee347a	Genotype encoding uses the BCF2FieldEncoder system	2012-06-14 16:42:33 -04:00
Mark DePristo	9ac4203254	GenotypeAnnotations now accept a GenotypeBuilder and directly update the builder with their value -- Cleans up interface and avoids significant amounts of gross typing code	2012-06-14 16:42:32 -04:00
Mark DePristo	7506994d09	Nearing final BCF commit -- Cleanup some (but not all) VCF3 files. Turns out there are lots so... -- Refactored genotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec. Now VCF3 properly handles the new GenotypeBuilder interface -- Misc. bugfixes in GenotypeBuilder	2012-06-14 16:42:32 -04:00
Mark DePristo	6272612808	Testing utility to perform diffs N times	2012-06-14 16:42:32 -04:00
Mark DePristo	8014178f2f	Algorithmically faster version of DiffEngine -- Now only includes leaf nodes in the summary, i.e., summaries of the form ".....*.X", which are really the most valuable to see. This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm -- Now computes the max number of elements to read correctly. Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size. -- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well. -- Added integration test for new leaf and old pairwise calculations -- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields	2012-06-14 16:42:30 -04:00
Mark DePristo	2a86b81a3f	Initial version of clean, fast formatting routines built dynamically from a VCF header -- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level. -- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values -- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder -- Allowed us to naturally support encoding of lists of strings -- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion -- Fixes for integration test failures -- Enabling contig updates -- WalkerTest now prints out relative paths where possible to make cut/paste/run easier	2012-06-14 16:42:30 -04:00
Mark DePristo	51a3b6e25e	No more makePrecisionFormatStringFromDenominatorValue -- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formatting. -- Created a smart float formatter in VCFWriter, with unit tests -- Removed makePrecisionFormatStringFromDenominatorValue and its uses -- Fix broken contracted -- Refactored some code from the encoder to utils in BCF2 -- HaplotypeCaller's GenotypingEngine was using old version of subset to context. Replaced with a faster call that I think is correct. Ryan, please confirm.	2012-06-14 16:42:30 -04:00
Mark DePristo	43ad890fcc	Finalizing BCF2 v2 -- FastGenotypes are the default in the engine. Use --useSlowGenotypes engine argument to return to old representation -- Cleanup of BCF2Codec. Good error handling. Added contracts and docs. -- Added a few more contracts and docs to BCF2Decoder -- Optimized encodePrimitive in BCF2Encoder -- Removed genotype filter field exceptions -- Docs and cleanup of BCF2GenotypeFieldDecoders -- Deleted unused BCF2TestWalker -- Docs and cleanup of BCF2Types -- Faster version of decodeInts in VCFCodec -- BCF2Writer -- Support for writing a sites only file -- Lots of TODOs for future optimizations -- Removed lack of filter field support -- No longer uses the alleleMap from VCFWriter, which was an Allele -> String, now uses Allele -> Integer which is faster and more natural -- Lots of docs and contracts -- Docs for GenotypeBuilder. More filter creation routines (unfiltered, for example) -- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters. Better genotype comparisons	2012-06-14 16:42:29 -04:00
Mark DePristo	37e5d32019	Remove logger.info statement	2012-06-14 16:42:29 -04:00
Mark DePristo	01ddf9555a	Performance optimizations for Genotype field decoding for GT field -- Fast path decoder for biallelic diploid GT fields that avoids allocating the same genotypes over and over -- Contracts -- final classes	2012-06-14 16:42:28 -04:00
Mark DePristo	7fbca7013e	Don't add missing value binding from field to Genotype object in VCF3Codec	2012-06-14 16:42:28 -04:00
Mark DePristo	4a4d3cde3d	UnitTests for decodeIntArray method	2012-06-14 16:42:27 -04:00
Mark DePristo	5b8bd81991	An option to not actually write out the results of select variants -- Useful for performance testing of the SV operations themselves.	2012-06-14 16:42:26 -04:00
Mark DePristo	6f7a01e00d	Bugfix for BCF2 reader / writer for > 0x0FFF samples :-) -- Should be 0x00FFFFFF in the mask	2012-06-14 16:42:26 -04:00
Mark DePristo	1d4eb46606	Efficient reading of genotype fields v1 -- decodeIntArray in BCF2 decoder allows us to more efficiently read ints and int[] from stream directly into Genotype object -- Code cleanup / contracts added where appropriate -- V2 will have a yet more optimized path...	2012-06-14 16:42:26 -04:00
Mark DePristo	37b8d70321	Hidden option to SelectVariants to force the genotypes information to be decoded by computing AC	2012-06-14 16:42:25 -04:00
Mark DePristo	17fbd103d0	Smarter infrastructure to decode genotypes in BCF -- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder. The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2 -- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder. In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form -- Reduced the amount of data further to be computed in the DiffEngine. The DiffEngine algorithm needs to be rethought to be efficient...	2012-06-14 16:42:25 -04:00
Mark DePristo	889e3c4583	Code cleanup before major refactor	2012-06-14 16:42:25 -04:00
Mark DePristo	cebd37609c	Finalizing new Genotype object and associated routines -- Builder now provides a deprecated log10pError function to make a new GQ value -- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions -- Lots of contracts -- Bugfixes throughout	2012-06-14 16:42:25 -04:00
Mark DePristo	8b0a629a31	Terrible bugfix -- The way I was handling the contig offset ordering wasn't correct. Now the contigs are always indexed in the order in which their corresponding populate() functions are called, so that the order of the contigs is given by the order in which they are in the file, or in our refDict. It has nothing to do with the contig index itself. -- SelectVariants no longer prints all samples to the screen if you aren't selecting any explicitly	2012-06-14 16:42:24 -04:00
Mark DePristo	d37a8a0bc8	Efficient Genotype object Intermediate commit -- Created a new Genotype interface with a more limited set of operations -- Old genotype object is now SlowGenotype. New genotype object is FastGenotype. They can be used interchangeably -- There's no way to create Genotypes directly any longer. You have to use GenotypeBuilder just like VariantContextBuilder -- Modified lots and lots of code to use GenotypeBuilder -- Added a temporary hidden argument to engine to use FastGenotype by default. Current default is SlowGenotype -- Lots of bug fixes to BCF2 codec and encoder. -- Feature additions -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes -- Cleaned up semantics of subContextFromSamples. There's one function that either rederives or not the alleles from the subsetted genotypes -- MASSIVE BUGFIX in SelectVariants. The code has been decoding genotypes always, even if you were not subsetting down samples. Fixed!	2012-06-14 16:42:24 -04:00
Mark DePristo	a648b5e65e	First step towards an efficient Genotype object -- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness. Tested utility of this approach by rewriting -- and then commenting out -- a path in BCF2Codec that could use this new code. Much cleaner interface now, but not yet hooked up to anything -- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class -- Code cleanup. Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS. Uses in code replaced with constant	2012-06-14 16:42:23 -04:00
Mark DePristo	ff9ac4b5f8	BCF2 genotype decoding is now lazy -- Refactored BCF2Codec into a LazyGenotypesDecoder object that provides on-demand genotype decoding of BCF2 data blocks a la VCFCodec. -- VCFHeader has getters for sampleNamesInOrder and sampleNameToOffset instead of protected variables directly accessed by vcfcodec	2012-06-14 16:42:23 -04:00
Mark DePristo	9eb83a0771	Enable adding contigs to VariantContextWriters on output	2012-06-14 16:42:23 -04:00
Mark DePristo	b0ea14ef0f	VCFHeader getMetaData returns 4.1 version not 4.0	2012-06-14 16:42:22 -04:00
Mauricio Carneiro	7d12429917	First step towards indel qualities in RR. Let the BI's and BD's pass through the reduce reads machinery	2012-06-14 15:37:39 -04:00
Mauricio Carneiro	e68038c5d8	Refactor post-processing downsampling using David's generic downsampler interface	2012-06-14 15:37:32 -04:00
Eric Banks	de5508fcea	Bug fixes for cycle and context covariates	2012-06-14 13:01:14 -04:00
Eric Banks	5c3c6cbc40	Long -> long conversions in BQSR	2012-06-14 09:07:02 -04:00
Eric Banks	29a74908bb	The next round of BQSR optimizations: no more Long[] array creation	2012-06-14 00:05:42 -04:00
Guillermo del Angel	cd2074b1dc	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 20:59:30 -04:00
Guillermo del Angel	92669a0468	Second intermediate commit for indel pool caller - now works (more or less) in reference sample-free mode. Still needs a lot of cleanups/add more tests and not done w/refactoring quite yet	2012-06-13 20:59:17 -04:00
David Roazen	0550b27799	Make downsampler classes themselves generic (instead of just the Downsampler interface) This is in response to a request from Mauricio to make it easier to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords) without having to do any cumbersome typecasting. Sadly, Java language limitations make this sort of solution the best choice. Thanks to Khalid for his feedback on this issue. Also: -added a unit test to verify GATKSAMRecord support with no typecasting required -added some unit tests for the FractionalDownsampler that Mauricio will/might be using -moved classes from private to public to better sync up with my local development branch for engine integration	2012-06-13 16:43:39 -04:00
Guillermo del Angel	67c0569f9c	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-13 11:50:00 -04:00
Eric Banks	81993b08e2	Don't put null entries into the key array	2012-06-13 11:43:44 -04:00
Roger Zurawicki	bdf5945dcc	Fixed bugs in DiagnoseTargets. DT would not report bad mates! That has been fixed. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:26 -04:00
Roger Zurawicki	538cdf9210	Created the FindCoveredIntervals. Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class. Minor tweaks. FindCoveredIntervals supports Gathering. FindCoveredIntervals outputs an interval list instead of GATKReport. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-13 11:15:25 -04:00
Guillermo del Angel	aee66ab157	Big UG refactoring and intermediate commit to support indels in pool caller (not done yet). Lots of code pulled out of long spaghetti-like functions and modularized to be easily shareable. Add functionality in ErrorModel to count indel matches/mismatches (but left part disabled as not to change integration tests in this commit), add computation of pool genotype likelihoods for indels (not fully working yet in more realistic cases, only working in artificial nice pools). Lots of TBDs still but existing UG and pool SNP functionality should be intact	2012-06-13 11:14:44 -04:00
Eric Banks	37f56ce8fd	A couple of minor updates to BQSR	2012-06-12 16:12:13 -04:00
Eric Banks	277493dd83	Yet more instances of Lists changed over to native arrays	2012-06-12 15:56:09 -04:00
Eric Banks	613badc835	Very minor optimizations for the context covariate	2012-06-12 15:47:32 -04:00
Eric Banks	0f79adb2aa	Changing more Java Lists to native arrays in BQSR for performance optimization.	2012-06-12 15:41:01 -04:00
Eric Banks	1da3e43679	Wow, apparently it's way, way less efficient to iterate over Java Lists than native arrays. With this change and the bit fiddling, Ryan's 10-day test case now runs in 1 day. More to come.	2012-06-12 13:32:56 -04:00
Eric Banks	fec0bd5e11	Fixing UG argument docs	2012-06-12 09:46:16 -04:00
Eric Banks	a4defdfb29	Adding a GT header line to SomaticIndelDetector output	2012-06-12 09:39:17 -04:00
Eric Banks	891ce51908	Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements.	2012-06-12 09:19:36 -04:00
Eric Banks	ff5749599d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 15:46:17 -04:00
Eric Banks	fea625632f	Don't use asList because it maintains an iterator to the original list and then the result can't be used to create a new one	2012-06-11 15:45:58 -04:00
Ryan Poplin	e4d371dc80	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-11 10:38:50 -04:00
Ryan Poplin	683d4b508e	Bug fix in fragment utils: the read name wasn't being set in the merged read. Misc minor updates to the HaplotypeCaller.	2012-06-11 10:38:35 -04:00
Mauricio Carneiro	4aad7e23ef	New ReduceReads v2 with unclipped variant regions and soft-clipped bases * Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it. * Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute * Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads * Updated all integration tests	2012-06-08 14:58:31 -04:00
Eric Banks	afa9b2718a	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-08 13:54:48 -04:00
Eric Banks	92280b4068	BQSR optimization: cache the BitSetUtils.bitSetFrom() calls since they are called over and over again with the same values. Another 10% reduction in runtime.	2012-06-08 13:54:37 -04:00
Eric Banks	898a0e6161	Minor optimizations	2012-06-08 12:07:58 -04:00
Ryan Poplin	0a37e19998	Bug fix in VQSR so that the VCF index will be created for the recalFile.	2012-06-08 11:51:28 -04:00
Eric Banks	d463ab2cbf	BQSR optimization: String manipulation is extremely expensive in Java (accounts for 8% of BQSR runtime). Instead use byte[] and StringBuilder when possible.	2012-06-08 10:42:42 -04:00
Eric Banks	2bd48a7351	Bad comments made it into the previous commit	2012-06-07 23:12:56 -04:00
Eric Banks	31c3a6be48	BQSR optimization: getRequiredCovariates() and getOptionalCovariates() were creating a new List every time they were being called, and unfortunately getRequiredCovariates().size() is used as the stop condition in for-loops throughout the code. Just maintaining the original list of covariates results in a 15% reduction in runtime for BQSR.	2012-06-07 20:04:10 -04:00
Eric Banks	0fb9179f76	BQSR optimization: don't clone the original quals for each read, we can just overwrite the original array	2012-06-07 19:41:03 -04:00
Ryan Poplin	d449f169d3	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-07 10:56:55 -04:00
Ryan Poplin	0b4281fdd0	misc minor update to HC debug output for when there are a lot of samples	2012-06-07 10:56:41 -04:00
Eric Banks	bad50a1b05	Fix docs	2012-06-06 22:45:38 -04:00
Eric Banks	b093ba9dcc	Stabilized NGSPlatform code: don't assume all reads have read groups (e.g. artificial SAM records)	2012-06-06 15:17:30 -04:00
Eric Banks	54f682a99c	Unify to NGSPlatform framework. TechnologyComposition annotation now generalizes to Illumina and not just SLX.	2012-06-06 11:44:37 -04:00
Eric Banks	dd46d843fb	IR should skip Ion reads just like it does with 454 reads; Tim has confirmed that official platform name for Ion.	2012-06-06 11:04:55 -04:00
Guillermo del Angel	2cbd6e5f90	Merged bug fix from Stable into Unstable	2012-06-05 15:58:23 -04:00
Guillermo del Angel	ce4dc2128d	Adding minor clarification to -mbq argument documentation	2012-06-05 15:17:56 -04:00
Eric Banks	e02ec8c8b6	Don't update the record ID unless we are actually going to emit the record	2012-06-04 14:58:50 -04:00
Eric Banks	8405156ae1	Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities.	2012-06-04 14:28:32 -04:00
Ryan Poplin	f11e7ebc3a	Fixing the previous fix related to clipping. Adding extra reference padding in the HaplotypeCaller to get those larger alleles during GGA.	2012-06-04 12:49:36 -04:00
Ryan Poplin	320956ee4b	Bug fix in clipping function in ReadUtils for when the read ends at exactly the clipping boundary. Bug fixes in HaplotypeCaller GGA mode for when Smith-Waterman produces a different allele than what was given in the input alleles VCF. GGA mode now works with multiallelic records. Adding min pruning factor argument which is combined with the pruning factor that is determined dynamically by the coverage.	2012-06-04 10:55:36 -04:00
Guillermo del Angel	7a54baf08c	Merged bug fix from Stable into Unstable	2012-06-03 08:42:08 -04:00
Guillermo del Angel	47df7bbc14	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-06-03 08:38:54 -04:00
Guillermo del Angel	2ddbdee3bc	Fixed broken VariantEval stratifications VariantType and IndelSize - integration tests to follow	2012-06-03 08:38:38 -04:00
Mauricio Carneiro	12a8c54f9a	Fixing VCF header for filter elements (thanks Eric)	2012-06-01 15:45:15 -04:00
Eric Banks	3a15ba2102	Malformed VCF headers should be User Errors	2012-05-31 16:05:53 -04:00
Khalid Shakir	c4f7df4dce	When an underlying exception occurs because of the user error, if the exception instance does not include a message instead of telling the user "because null", tell them "because <exception class name>".	2012-05-30 16:39:06 -04:00
Ryan Poplin	421d0d1435	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-30 15:21:35 -04:00
Ryan Poplin	5dd811f84a	Adding genotype given alleles mode to the HaplotypeCaller.	2012-05-30 15:07:01 -04:00
Eric Banks	d09b8d5584	Fixing docs	2012-05-30 13:24:08 -04:00
Mauricio Carneiro	d6e1205310	Updating default values for DiagnoseTargets	2012-05-30 12:43:07 -04:00
Khalid Shakir	c3c7f17d90	Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000. Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the non-existent dir, now checking to see if the network directory is available and throwing a SkipException to bypass the test when it cannot be run. TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors pop up later when the networked fastas aren't found.	2012-05-29 11:18:22 -04:00
Roger Zurawicki	b8b139841d	DiagnoseTargets with working Q1,Median,Q3 - Merged Roger's metrics with Mauricio's optimizations - Added Stats for DiagnoseTargets - now has functions to find the median depth, and upper/lower quartile - the REF_N callable status is implemented - The walker now runs efficiently - Diagnose Targets accepts overlapping intervals - Diagnose Targets now checks for bad mates - The read mates are checked in a memory efficient manner - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker. - Fixed some bugs - Removed rod binding Added more Unit tests - Test callable statuses on the locus level - Test bad mates - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-05-29 10:16:45 -04:00
Eric Banks	50031b63c5	Fix possible NPE from NBaseCount annotation module	2012-05-29 09:46:00 -04:00
Mark DePristo	454c8e63e6	Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s -- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec. As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)	2012-05-28 20:20:05 -04:00
Mark DePristo	7ce24a96f1	PBT now uses getGenotypeLikelihoodString to avoid NPE when there are no PLs present	2012-05-28 20:18:16 -04:00
Mark DePristo	1818c29371	Fixed long-standing bug in beagle codec that was passing on the header record for decoding	2012-05-28 20:17:26 -04:00
Mark DePristo	5894d045cb	Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests -- This version of BCF should actually work properly for most files, assuming headers are properly defined. -- Lots of bug fixes to BCF2 codec -- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL. NOTE THIS SEMANTICS change -- Equals() method for GenotypeLikelihoods, using PLs. -- VCFCodec no longer adds empty bindings to missing input field values. NOTE THIS CHANGE -- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work. The BCF2 codec now makes VCs marked as fully decoded -- stringToBytes returns empty list for null or "" string in BCF2Encoder -- Proper handling of genotype ordering in BCF2 reader / writer -- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily -- Many failing MD5s now due to double -> int change in GQ, will update later	2012-05-27 11:17:17 -04:00
Mark DePristo	86e5a066fc	Even more conservative limit on number of differences to summarize at 1000	2012-05-27 11:17:13 -04:00
Mark DePristo	31f4e5b52e	Stop unlimited runtimes in DiffEngine when you have lots of differences -- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000	2012-05-27 11:17:13 -04:00
Mauricio Carneiro	4109fcbb08	Merged bug fix from Stable into Unstable	2012-05-25 13:03:05 -04:00
Mauricio Carneiro	2be5704a25	Fixed haplotype boundary bug in PairHMMIndelErrorModel haplotypes were being clipped to the reference window when their unclipped ends went beyond the reference window. The unclipped ends include the hard clipped bases, therefore, if the reference window ended inside the hard clipped bases of a read, the boundaries would be wrong (and the read clipper was throwing an exception). * updated code to use SoftEnd/SoftStart instead of UnclippedEnd/UnclippedStart where appropriate. * removed unnecessary code to remove hard clips after processing. * reorganized the logic to use the assigned read boundaries throughout the code (allowing it to be final).	2012-05-25 13:00:45 -04:00
Guillermo del Angel	175bb35e70	Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former)	2012-05-25 12:56:23 -04:00
Mark DePristo	7280cdf937	Bugfixes and testdata cleanup -- Cut down the size of a few large files in public/testdata that were only used in part -- Refactor vcf Filename => shadow BCF filename to BCF2Utils. Fix bug in WalkerTest due to the way this was handled previously	2012-05-24 13:26:05 -04:00
Mark DePristo	e9c22b9aad	Final updates to integration tests for BCF2 -- Fully working version -- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf -- Moved MedianUnitTest to its proper home in Utils -- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests. From this website it's easy to see md5 diffs, etc. This is a vastly better way to manage unit and integration test output	2012-05-24 10:58:59 -04:00
Mark DePristo	ade1843818	Bugfix for not setting header in AbstractVCFCodec	2012-05-24 10:58:58 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Mark DePristo	69ee4d0454	Moved getMetaDataForField to VariantContextUtils	2012-05-24 10:57:09 -04:00
Mark DePristo	f77d2e6965	Renamed NO_HEADER to the more accurate no_cmdline_in_header -- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well	2012-05-24 10:57:08 -04:00
Mark DePristo	4bde24f020	Bugfix for VCFWriter in the case where there are no genotypes in the VC but genotypes in the header	2012-05-24 10:57:08 -04:00
Mark DePristo	4846bf5c8e	@Hidden --also_generate_bcf engine argument produces both VCF and BCF files for -o my.vcf -- Going to be useful going forward for integration tests so they will generate both VCF and BCF files automatically	2012-05-24 10:57:07 -04:00
Mark DePristo	bb0d87666a	Finally just deleted equals() method in GATKArgumentCollection. -- We never compare these things in the codebase anyway...	2012-05-24 10:57:07 -04:00
Mark DePristo	c8ed0bfc4c	Edge case fixes for BCF2 --handle entirely missing GT in a sample in decodeGenotypeAlleles --Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code -- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples -- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys -- Removed restriction on getPloidy requiring ploidy > 1. It's logically fine to return 0 for a no called sample -- getMaxPloidy() in VC that does what it says -- Support for padding / depadding of generic genotype fields	2012-05-24 10:57:06 -04:00
Mark DePristo	40431890be	-- BCF2 is now a reference dependent codec so it can initialize the contigs in the case where the file doesn't have contigs in it -- BCF2 writer can now work without the contig lines being in the header -- Made GenomeLocParser a final class	2012-05-24 10:57:06 -04:00
Mark DePristo	6301572009	GenotypeLikelihood PLs are capped at Short.MAX_INT now -- UserExceptions in BCF2 now where appropriate -- Asserts for code safety -- Public -> protected encode(Object v) method is for testing only	2012-05-24 10:57:06 -04:00
Mark DePristo	d52bc31a47	Bugfix for doNotWriteGenotypes mode -- Was outputing GT ./. in sites only mode. Fixed	2012-05-24 10:57:05 -04:00
Mark DePristo	64d4238e2f	99% working version of BCF2 encoder / decoder -- fixed final bugs with PL encoding / decoding -- Ready for testing by other members of the group -- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations -- Fixed a nasty bug in the filter field -- Note that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types Read 1500 genotypes file in VCF -> VCF : 11 seconds Read 1500 genotypes file in VCF -> BCF : 9.5 seconds VariantEval 1500 genotypes file in VCF : 3 seconds VariantEval 1500 genotypes file in BCF : 3 seconds	2012-05-24 10:57:05 -04:00
Mark DePristo	b5bce8d3f9	AD should be UNBOUNDED, actually -- Pass in # alt alleles as appropriate for getCount in VCF header line	2012-05-24 10:57:05 -04:00
Mark DePristo	aaf11f00e3	Near final BCF2 implementation -- Trivial import changes in some walkers -- SelectVariants has a new hidden mode to fully decode a VCF file -- DepthPerAlleleBySample (AD) changed to have not UNBOUNDED by A type, which is actually the right type -- GenotypeLikelihoods now implements List<Double> for convenience. The PL duality here is going to be removed in a subsequent commit -- BugFixes in BCF2Writer. Proper handling of padding. Bugfix for nFields for a field -- padAllele function in VariantContextUtils -- Much better tests for VariantContextTestProvider, including loading parts of dbSNP 135 and the Phase II 1000G call set with genotypes to test encoding / decoding of fields.	2012-05-24 10:57:02 -04:00
Mark DePristo	dfee17a672	Generalize / unify code for handling strings -- List<String> is converted inside of the codec to a collapsed string, and exploded in the decoder. -- Unified the type conversion code in BCFWriter to simplify the mapping from VCF type => BCF type and special value recoding -- Code cleanup and renaming	2012-05-24 10:57:02 -04:00
Mark DePristo	b4a5acd6f4	Added some genotype tests for BCF2, which all pass. Of course that's because I commented out the ones that didn't	2012-05-24 10:57:01 -04:00
Mark DePristo	373ae39e86	Testing of BCF codec -- Rev.d tribble -- Minor code cleanup -- BCF2 encoder / decoder use Double not Float internally everywhere -- Generalized VC testing framework	2012-05-24 10:57:01 -04:00
Mark DePristo	fb1911a1b6	-- Convenience constructor for VariantContextBuilder that creates a new one based on an existing builder -- Convenience routine for creating alleles from strings of bases -- Convenience constructor for VCFFilterHeader line whose description is the same as name -- VariantContextTestProvider creates all sorts of types of VariantContexts for testing purposes. Can be reused throughout code for BCF, VCF, etc. -- Created basic BCF2WriterCodec tests that consumes VariantContextTestProvider contexts, writes them to disk with BCF2 writer, and checks that they come back equals to the original VariantContexts. Actually worked for some complex tests in the first go	2012-05-24 10:57:01 -04:00
Mark DePristo	4968dcd36a	Throw an error when genotype fields with mixed vector lengths are encountered	2012-05-24 10:57:00 -04:00
Mark DePristo	afd2f1a3f9	Individual VariantContextWriters are now package protected -- Added VCFHeader() constructor that makes an empty header, and updated VariantRecalibrator to use it -- Update build.xml to build vcf.jar with updated paths and bcf2 support.	2012-05-24 10:57:00 -04:00
Mark DePristo	24864fd5b0	GATK now writes BCF output to any file with .bcf extension -- Moved VCF and BCF writers to variantcontext.writers -- Updated vcf.jar build path -- Refactored VCFWriter and other code. Now the best (and soon to be only) way to create these files is through a factory method called VariantContextWriterFactory. Renamed the general VCFWriter interface to VariantContextWriter which is implemented by VCFWriter and BCF2Writer.	2012-05-24 10:57:00 -04:00
Mark DePristo	e2311294c0	Removed unused ManualSortingVCFWriter	2012-05-24 10:56:59 -04:00
Mark DePristo	93cef82637	BCF2 header encoding decoding at final spec	2012-05-24 10:56:58 -04:00
Mark DePristo	ce9e9eebb1	No dictionary in header. Now built dynamically from the header in the writer and codec -- Created BCF2Utils and moved BCF2Constants and TypeDescriptor methods there	2012-05-24 10:56:58 -04:00
Mark DePristo	c3b8048e2e	Moving around classes in VCF and BCF2 -- Refactored VCF writers into vcf.writers package -- Moved BCF2Writer to bcf2.writer -- Updates to all of the walkers using VCFWriter to reflect new packages -- A large number of files had their headers cleaned up because of this as well	2012-05-24 10:56:58 -04:00
Mark DePristo	679ffdd333	Move BCF2 from private utils to public codecs	2012-05-24 10:56:56 -04:00
Mark DePristo	450f098a61	BCF2 encoder / decoder implement new site / genotype block organization -- Supports final organization of data blocks into sites data and genotypes data	2012-05-24 10:56:55 -04:00
Mark DePristo	27b51d4dea	Enable on the fly indexing of BCF2	2012-05-24 10:56:54 -04:00
Mark DePristo	81bd7646d6	Fix for MISSING floats -- Restructured code to separate the MISSING value in java (currently everywhere a null) from the byte representation on disk (an int). -- Now handles correctly MISSING qual fields	2012-05-24 10:56:53 -04:00
Mark DePristo	3afbc50511	More BCF2 improvements -- Refactored setting of contigs from VCFWriterStub to VCFUtils. Necessary for proper BCF working -- Added VCFContigHeaderLine that manages the order for sorting, so we now emit contigs in the proper order. -- Cleaned up VCFHeader operations -- BCF now uses the right header files correctly when encoding / decoding contigs -- Clean up unused tools -- Refactored header parsing routines to make them more accessible -- More minor header changes from Intellij	2012-05-24 10:56:52 -04:00
Mark DePristo	0799855479	Archiving GCF -- Rider update to CramByPiece.scala	2012-05-24 10:56:51 -04:00
Guillermo del Angel	43919078cd	Merged bug fix from Stable into Unstable	2012-05-23 21:21:01 -04:00
Guillermo del Angel	4bc04e2a9e	Correct way in which start/stop positions in a VC are computed when creating an indel VC. Old way was incorrect in case GENOTYPE_GIVEN_ALLELES was specified with a complex record. New way should work in general for all cases and is simpler.	2012-05-23 21:19:30 -04:00
Ryan Poplin	08dfd6cab6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-21 16:47:07 -04:00
Ryan Poplin	04000d920c	Bug fix in BadCigar read filter for index out of bounds exception when used with a bam file that contains unmapped reads.	2012-05-21 16:46:59 -04:00
Eric Banks	666862af19	Added @Hidden option for GSA production use to cap the max alleles for indels at a lower number than for SNPs	2012-05-21 16:03:29 -04:00
Khalid Shakir	e57cd78bba	Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each. This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource. Ex: public Wrapper getNewWrapper(File path) { FileStream myStream = new FileStream(path); // This stream must be eventually closed. return new Wrapper(myStream); } public void close(Wrapper wrapper) { wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream. }	2012-05-21 15:41:56 -04:00
Eric Banks	7f5ec17d22	Fixed up the comments in the GATKReportTable code and added some sanity checks to make sure that the user doesn't inconsistently add rows and corresponding IDs to the table.	2012-05-21 14:16:13 -04:00
Eric Banks	92d8aa3d4c	Don't exception out in these VE modules if the VCF has records that aren't just SNPs or indels	2012-05-21 09:38:52 -04:00
Eric Banks	3af3834d50	Fixing 2 bugs in the SAMRecord printing argument descriptor code (as reported by Kristian): * For some reason, the original implementor decided to use Booleans instead of booleans and didn't always check for null so we'd occasionally get a NPE. Switched over to booleans. * We'd also generate a NPE if SAMRecord writing specific arguments (e.g. --simplifyBAM) were used while writing to stdout.	2012-05-18 11:55:41 -04:00
Eric Banks	52c206d5db	Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports.	2012-05-18 02:32:20 -04:00
Eric Banks	03d40272c8	Removed old GATKReport code and moved the new stuff in its place.	2012-05-18 01:44:31 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Guillermo del Angel	5189b06468	New annotation for indels that describe if they're STR's and their characteristics. If an indel is a STR, 3 fields are added to INFO: STR (boolean), RU = repeat unit (String), RPA = number of repetitions per allele. So, for example, if ATATAT* context gets changed to ATAT and ATATATAT, then RU=AT and RPA=3,2,4. Will be made standard annotation shortly. Added unit tests for new functionality. Pending: refactor VariantContextUtils.isRepeat() to unify code, and fix VariantEval functionality.	2012-05-17 15:28:19 -04:00
Eric Banks	0f7c917e7a	Better error checking and messages for bad alleles	2012-05-17 13:36:42 -04:00
Eric Banks	d44886d9e8	Very naughty bug: VE output is not at all gatherable but no one told this to Queue. Fixed.	2012-05-15 10:29:04 -04:00
Eric Banks	819c3d0c15	Adding to the Hrun docs	2012-05-15 10:27:52 -04:00
Guillermo del Angel	5fc3adbb04	One more VariantsToTable bug fix	2012-05-14 14:10:07 -04:00
Guillermo del Angel	04d691f04a	Forgot to update MD5's due to new Exact AF model in pool caller (all changes legit, minor QUAL/QD/SB differences). Fixed bug in VariantsToTable from previous commit	2012-05-14 14:01:29 -04:00
Guillermo del Angel	ae26f0fe14	a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing	2012-05-14 10:55:35 -04:00
Ryan Poplin	c9dd0f3173	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-10 13:09:10 -04:00
Ryan Poplin	0cdadffe14	Committing the best of the frantic pre-CSHL experiments: Better algorithm for partitioning reads amongst the alleles they support. Require the read's original alignment to actually overlap the variant. QD uses the non-informative reads when calculating D. More HC-specific annotations for potential use in a statistical filtering strategy. Increasing the minimum kmer length in the assembly graphs. Misc minor bug fixes.	2012-05-10 13:09:03 -04:00
Guillermo del Angel	27b1aa5dd3	Don't allow N's in insertions when discovering indels. Maybe better solution will be to use them as wildcards and merge them with compatible regular insertion alleles but for now it's easier to ignore them. Minor refactoring of Allele.accepableAlleleBases to support this. Added unit test to test consensus allele counter in presence of N's	2012-05-10 10:29:19 -04:00
Eric Banks	4f37d6d399	Fixing docs	2012-05-10 00:56:00 -04:00
Mark DePristo	c81acfc15d	Working implementation of BCF2 -- Nearly complete on spec implementation. Slow but clean -- Some refactoring of VariantContext to support common functions for BCF and VCF	2012-05-08 19:46:51 -04:00
Mark DePristo	a5193c2399	Mostly complete reference implementation of BCF2 -- Can run VariantEval on 3000 sample exome VCF and get the same output as the original VCF	2012-05-08 19:46:51 -04:00
Eric Banks	473d07b0c5	fixing up docs from previous Pool Caller commit	2012-05-08 11:02:55 -04:00
Eric Banks	b4999d14c1	updating docs	2012-05-08 10:58:46 -04:00
Guillermo del Angel	33a1dd2048	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 10:42:12 -04:00
Eric Banks	5cf4fd63c2	Catch malformed base qualities and throw as a User Error	2012-05-08 09:34:57 -04:00
Guillermo del Angel	a4f4b5007b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-08 09:34:33 -04:00
Guillermo del Angel	605984353f	Pool Caller improvements: a) New non-standard private annotation Heteroplasmy which measures mean heteroplasmy (pool AF) across called samples, meant for easier mtDNA calling. Pure homoplasmic variants (pool AF = 1 or 0) would have heteroplasmy=1. b) Don't output pool genotypes by default for large pool sizes because it makes file sizes explode and they're unreadable. c) Refactored classes ExactACCounts and ExactACSet and moved to superclass AlleleFrequencyCalculationModel because both Pool and Exact AF calculation models will use it. d) Initial refactorings and skeleton for linearized multi-allelic exact model (not done yet). e) Unit test for Pool AF calculation model.	2012-05-08 09:33:38 -04:00
Eric Banks	c40cda7e3c	Nope, loads of integration tests had to be changed.	2012-05-07 14:30:42 -04:00
Eric Banks	66838a073e	Very annoying: we have been emitting an extra TAB in the header of the VCF (which breaks some parsers) for sites-only file. Hopefully not too many integration tests will need to be fixed...	2012-05-07 12:20:11 -04:00
David Roazen	6b769e91d8	BCF2: third checkpoint * writer mostly implemented * walkers to convert BCF2 <-> VCF * almost working for sites-only files; genotypes still need work * initial performance tests this afternoon will be on sites-only files	2012-05-04 13:00:15 -04:00
Eric Banks	f3433201b1	Merged bug fix from Stable into Unstable	2012-05-03 11:11:00 -04:00
Eric Banks	557da77a1a	Don't compute QD if there is no QUAL; added integration test for this	2012-05-03 11:02:37 -04:00
Eric Banks	1fc7b5d58b	Merged bug fix from Stable into Unstable	2012-05-03 10:37:58 -04:00
Laurent Francioli	567d01cee8	- Added option to output the father's allele first in phased child haplotypes - BUG corrected causing wrong phasing of child/father pairs Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:49 -04:00
Laurent Francioli	96e5a26223	PED support for Inbreeding Coefficient annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-05-03 10:36:20 -04:00
Mark DePristo	43d97c2e00	Rev Tribble to r97, adding binary feature support From tribble logs: Binary feature support in tribble -- Massive refactoring and cleanup -- Many bug fixes throughout -- FeatureCodec is now general, with decode etc. taking a PositionBufferedStream as an argument not a String -- See ExampleBinaryCodec for an example binary codec -- AbstractAsciiFeatureCodec provides to its subclass the same String decode, readHeader functionality before. Old ASCII codecs should inherit from this base class, and will work without additional modifications -- Split AsciiLineReader into a position tracking stream (PositionalBufferedStream). The new AsciiLineReader takes as an argument a PositionalBufferedStream and provides the readLine() functionality of before. Could potentially use optimizations (its a TODO in the code) -- The Positional interface includes some more functionality that's now necessary to support the more general decoding of binary features -- FeatureReaders now work using the general FeatureCodec interface, so they can index binary features -- Bugfixes to LinearIndexCreator off by 1 error in setting the end block position -- Deleted VariantType, since this wasn't used anywhere and it's a particularly clean why of thinking about the problem -- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package -- TabixReader requires an AsciiFeatureCodec as it's currently only implemented to handle line oriented records -- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles Ascii and binary features -- Removed unused functions here and there as encountered -- Fixed build.xml to be truly headless -- FeatureCodec readHeader returns a FeatureCodecHeader obtain that contains a value and the position in the file where the header ends (not inclusive). TribbleReaders now skip the header if the position is set, so its no longer necessary, if one implements the general readHeader(PositionalBufferedStream) version to see header lines in the decode functions. Necessary for binary codecs but a nice side benefit for ascii codecs as well -- Cleaned up the IndexFactory interface so there's a truly general createIndex function that takes the enumerated index type. Added a writeIndex() function that writes an index to disk. -- Vastly expanded the index unit tests and reader tests to really test linear, interval, and tabix indexed files. Updated test.bed, and created a tabix version of it as well. -- Significant BinaryFeaturesTest suite. -- Some test files have indent changes	2012-05-03 07:31:48 -04:00
Mark DePristo	58c470a6c5	Rev'ing Tribble from 53 to 94 -- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code -- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase	2012-05-03 07:31:47 -04:00
Khalid Shakir	b8b7f28aa9	Revving Picard to pick up new SamFileHeaderMerger. Updated ReadFilter abstract class to implement (via UnsupportedOperationException) the new SamRecordFilter.filterOut(). In IndelRealignerIntegrationTest updates for Picard fixes to SAMRecord.getInferredInsertSize() in svn r1115 & r1124. - Ran FixMates to create new input BAM since running IR with variable maxReadsInMemory means all reads weren't realigned leading to different outputs. - Updated md5s to match new expectations after looking at TLEN diff engine output.	2012-05-02 16:47:28 -04:00
Mauricio Carneiro	f51a1d0d61	Better error message to the BAMScheduler In the case where the BAM file was aligned using a reference but analysis is being attempted with a different reference.	2012-05-02 16:10:00 -04:00
Mauricio Carneiro	940029fa5d	Fixing on-the-fly recalibration (caught by Ryan) low quality bases in the tails were being turned to N's in the final read.	2012-05-02 16:06:04 -04:00
Eric Banks	623b36fbc4	Add header lines for AC,AF, and AN tags	2012-05-02 15:33:34 -04:00
Guillermo del Angel	429800a192	Fix corner case rounding issue in MathUtils unit test: 10^logFactorial(4)) was 23.999999... which if cast directly yielded 23 - so, do pre-rounding to ensure correct integer result if caller will cast value.	2012-05-02 09:57:06 -04:00
Guillermo del Angel	76a95fdedf	Full implementation of multiallelic exact model for pools. Still super-linear so not useable at scale but it should be a gold standard to compare to. Unit tests are not exhaustive yet, will be expanded to provide better test coverage. Small inconsequential optimization in MathUtils: we're already caching log10(factorial(n)) for large n, so might as well use the cached values to compute binomial and multinomial coefficients instead of the log-gamma approximation which is more expensive (doesn't seem to save much time either in PoolCaller nor in UG though).	2012-05-02 09:24:28 -04:00
Joel Thibault	4d732fa586	Move all MongoDB files into private/java/src/org/broadinstitute/sting/mongodb	2012-05-01 18:23:51 -04:00
Eric Banks	619a69a5f1	As promised in the release notes for 1.6, I am removing the old deprecated genotyping framework revolving around the misordering of alleles and have moved the fixed version in its place in preparation for release 1.7 (or 2.0?).	2012-05-01 16:18:24 -04:00
Joel Thibault	c255dd5917	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:10:38 -04:00
Ryan Poplin	51af61b5d7	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-05-01 16:07:23 -04:00
Ryan Poplin	fc55dcec3c	Unfortunately the reverse trimming of alleles still doesn't work with mixed records in some corner cases. Turning it off for now.	2012-05-01 16:02:36 -04:00
Ryan Poplin	20a0078f23	Merging active regions across shard boundaries if they are contiguous, have the same active status and don't grow too big.	2012-05-01 15:51:36 -04:00
Eric Banks	0f3af9555b	Adding an option to SelectVariants which allows the user to re-genotype through the exact model (if PLs are present) the samples in order to recalculate the QUAL and genotypes. This is really the correct way to select a subset of samples, especially when originally called from low coverage data. Also added integration test to cover this case.	2012-05-01 14:58:06 -04:00
Joel Thibault	aa4d41cce0	Minor cleanup before push	2012-05-01 14:16:44 -04:00
Joel Thibault	b101b9c30b	Add Mongo switch	2012-05-01 14:00:48 -04:00
Joel Thibault	1b609e9075	Move Mongo to server couchdb	2012-05-01 13:59:47 -04:00
Joel Thibault	fd57d27f45	Move MongoDB connection handling to a separate class	2012-05-01 13:59:37 -04:00
Joel Thibault	db3cd1abd5	Use 2 MongoDB collections (tables): one for INFO/attributes, one for samples/genotypes.	2012-05-01 13:57:23 -04:00
Joel Thibault	04e1be9106	Better handling of Mongo errors + exceptions	2012-05-01 13:57:23 -04:00
Joel Thibault	ca737479cf	Query for stop locations because we don't have that information in the reference	2012-05-01 13:57:23 -04:00
Joel Thibault	1cda87a4ad	Set ROD priority list to input	2012-05-01 13:57:23 -04:00
Joel Thibault	a7fe847faf	Set the priority list and don't bother combining if not needed	2012-05-01 13:57:23 -04:00
Joel Thibault	f739305f43	Combine the variants found at a location	2012-05-01 13:57:23 -04:00
Joel Thibault	020f884d5a	Use new key of source ROD plus alleles	2012-05-01 13:57:23 -04:00
Joel Thibault	221ce9c3d6	Add alleles to the primary key	2012-05-01 13:57:23 -04:00
Joel Thibault	3198ce5471	Can have multiple variants at a location	2012-05-01 13:57:22 -04:00
Joel Thibault	11ed8e61c9	Add referenceBaseForIndel to the Mongo VariantContext objects	2012-05-01 13:53:44 -04:00
Joel Thibault	7ed0ee7ed0	Skip locations with no genotypes instead of throwing a NPE	2012-05-01 13:53:44 -04:00
Joel Thibault	4bdfeacdaa	Handle multiple samples/genotypes per location TODO: sample selection	2012-05-01 13:53:43 -04:00
Joel Thibault	1f7c628796	Insert the ROD filename into MongoDB as part of the primary key	2012-05-01 13:53:43 -04:00
Joel Thibault	bb8a6e9b0a	Initial test of write and read from MongoDB	2012-05-01 13:53:43 -04:00
David Roazen	c0084c741b	Pilot BCF2 Implementation: Checkpointing the code * Not working yet, still very much a work-in-progress with lots of placeholders * Needed to check this in to enable possible collaboration, since it's going slower than anticipated and the conference deadline looms.	2012-05-01 12:23:10 -04:00
Christopher Hartl	7d029b9a28	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-30 12:16:30 -04:00
Christopher Hartl	944a7d815e	Bringing VQSRV3 up to date. Lots of new features (un-classifying the worst-performing training sites, treating the x% best/worst sites as positive/negative points, ability to pass in a monomorphic track to see ROC curves output). Minor changes to AlleleBalance: weighted average was incorrectly specified (using logscale actually biased the average towards the AB of low-quality genotypes), and breaking out AB by het, hom, and diploid to bring it in line with some (private) changes to the indel likelihood model that (correctly) computes these values for indels.	2012-04-28 11:31:03 -04:00
Ryan Poplin	54a9bc2da2	Bug fix in reverse trim alleles for the case of mixed records that become non-mixed after subsetting the alleles.	2012-04-28 09:12:26 -04:00
Ryan Poplin	e332aeaf70	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-27 16:21:21 -04:00
Ryan Poplin	2b5dd28550	Bug fix in reverse trim alleles for the case of mixed records.	2012-04-27 16:21:02 -04:00
Mauricio Carneiro	1db2d1ba82	Do not add the first and last 4 cycles to the recalibration tables.	2012-04-27 15:18:07 -04:00
Mauricio Carneiro	08dbd756f3	Quick QC walkers to look at the error profile of indels in the read	2012-04-27 15:18:07 -04:00
Guillermo del Angel	730208133b	Several fixes and improvements to Pool caller with ancillary test functions (not done yet): a) Utility class called Probability Vector that holds a log-probability vector and has the ability to clip ends that deviate largely from max value. b) Used this class to hold site error model, since likelihoods of error model away from peak are so far down that it's not worth computing with them and just wastes time. c) Expand unit tests and add an exhaustive test for ErrorModel class. d) Corrected major math bug in ErrorModel uncovered by exhaustive test: log(e^x) is NOT x if log's base = 10. e) Refactored utility functions that created artificial pileups for testing into separate class ArtificialPileupTestProvider. Right now functionality is limited (one artificial contig of 10 bp; can only specify pileups in one position with a given number of matches and mismatches to ref) but functionality will be expanded in future to cover more test cases. f) Use this utility class for IndelGenotypeLikelihoods unit test and for PoolGenotypeLikelihoods unit test (the latter testing functionality still not done). g) Linearized implementation of biallelic exact model (very simple approach, similar to diploid exact model, just abort if we're past the max value of AC distribution and below a threshold). Still need to add unit tests for this and to expand to multiallelic model. h) Update integration test md5's due to minor differences stemming from linearized exact model and better error model math	2012-04-27 14:41:17 -04:00
Eric Banks	0439047269	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-27 10:49:45 -04:00
Eric Banks	05b44dd017	The genotypeCounts array wasn't always being initialized before it was accessed, leading to a NPE (which got caught and thrown as a JEXL expression when used in selection). Added unit test to cover all genotype count methods.	2012-04-27 10:49:36 -04:00
Khalid Shakir	9801dd114f	Bug fix for: https://getsatisfaction.com/gsa/topics/problem_with_indelrealigner_and_l_unmapped The GATK -L unmapped is for GenomeLocs with SAMRecord.NO_ALIGNMENT_REFERENCE_NAME, not SAMRecord.getReadUnmappedFlag() Previously unmapped flag reads in the last bin were being printed while also seeking for the reads without a reference contig.	2012-04-27 09:58:38 -04:00
Guillermo del Angel	972d6531b6	Corner case fix for indel GL computation: sometimes (depending on surrounding context) reads which are not informative of two candidate haplotypes end up having marginally higher likelihoods with one haplotype as opposed to another, depending on uncertainty on alignments in surrounding regions. So, a sample whose GL is -0.0001,-0.0005,-0.001 may have its genotype set to 1/1 due to this statistical noise. We already have a tolerance comparing max(gl)-min(gl) to avoid genotyping, so this tolerance is now increased from 0.001 to 0.1 (equivalent to 1 PL unit) to avoid genotyping a sample if all PLs are within this threshold. Changed 2 integration test md5s that hit this case.	2012-04-26 10:15:26 -04:00
Laurent Francioli	219b0a128b	PED support for ChromosomeCounts annotation Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-04-25 12:50:04 -04:00
Laurent Francioli	19d5213d5a	Added function to get founders IDs in SampleDB Signed-off-by: Eric Banks <ebanks@broadinstitute.org>	2012-04-25 12:49:36 -04:00
Mauricio Carneiro	902277856e	fix for RBP getPileupsForSamples() do not differentiate per sample pileups from generic pileups. Do the same for both -- it's O(n) either way.	2012-04-24 17:20:30 -04:00
Mauricio Carneiro	82b4798913	CountBasesWalker -- a quick QC walker.	2012-04-24 17:20:30 -04:00
Mauricio Carneiro	e440d0ce69	BQSR triage #4 * fixed queue script plot file names * updated the ReadGroupCovariate to use the platform unit instead of sample + lane. * fixed plotting of marginalized reported qualities	2012-04-24 17:19:54 -04:00
Eric Banks	d6277b70d8	Forgot to consider the optimized case in hasAllele	2012-04-24 11:32:28 -04:00
Eric Banks	91bad244d5	Using a VCF whose ALT is the reference in GGA mode is a User Error	2012-04-24 11:08:37 -04:00
Eric Banks	74ad008163	Adding VariantContext.hasAlternateAllele functionality	2012-04-24 11:07:46 -04:00
Eric Banks	66f3315548	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-24 09:39:55 -04:00
Eric Banks	bcb93dda5f	Fixing docs (rank sum test values are not phred-scaled)	2012-04-24 09:39:42 -04:00
Mauricio Carneiro	e39a59594a	BQSR triage and test routines * updated BQSR queue script for faster turnaround * implemented plot generation for scatter/gathered runs * adjusted output file names to be cooperative with the queue script * added the recalibration report file to the argument table in the report * added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read * added RecalibrationReport unit test -- guarantees the integrity of the delta tables	2012-04-23 11:23:00 -04:00
Eric Banks	a733723439	Merged bug fix from Stable into Unstable	2012-04-23 10:30:30 -04:00
Eric Banks	2761da975e	Handle null VCs (which can arise when indels are present in the file)	2012-04-23 10:30:00 -04:00
Eric Banks	63aa79df82	Slightly better error message	2012-04-23 09:37:28 -04:00
Eric Banks	7b5fbf9567	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-23 09:34:08 -04:00
Eric Banks	4edb005411	Catch poorly formatted PL/GL fields	2012-04-23 09:33:50 -04:00
Ryan Poplin	35bb55f562	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-22 13:23:36 -04:00
Ryan Poplin	18e4532d10	Turning down the amount of assembly graph pruning slightly in the case of low coverage.	2012-04-22 13:23:24 -04:00
Eric Banks	1f23d99dfa	If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did.	2012-04-20 17:00:05 -04:00
Eric Banks	4b81c75642	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-20 14:30:19 -04:00
Eric Banks	f1c5510ec0	When running SelectVariants with the excludeNonVariants option, remove alleles from the ALT field that are no longer polymorphic.	2012-04-20 14:30:04 -04:00
Ryan Poplin	a1596791af	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-20 14:03:04 -04:00
Ryan Poplin	a57295eb75	Fixing a bug when breaking up active regions where the resulting regions would overlap by one base. Adding quality score manipulation from the UG into the haplotype caller (qual capped by mapping quality, min qual threshold).	2012-04-20 14:02:55 -04:00
Guillermo del Angel	de68363c23	Removed experimental feature (aka hack) that was meant for 1000G consensus but remained in VQSR data manager - QD was being scaled by indel length. There's no evidence any more that QD is length-dependent, neither in CEU trio data nor in latest 1000G P2 calls	2012-04-20 10:58:34 -04:00
Guillermo del Angel	d2488dfb81	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 19:40:03 -04:00
Guillermo del Angel	c44c7b9a97	Restored optimization in Pair HMM only to compute HMM matrices starting in index where haplotypes start to diverge - saves about 15-20% of runtime which is what we lost by disabling banding in latest version, so runtime should be now about the same as what it was before refactoring. Output is bit-true to previous commit	2012-04-19 19:39:43 -04:00
Mauricio Carneiro	0f8c77391d	BQSR bug triage #3 * fixed context covariate famous "off by one" error * reduced maximum quality score to Q50 (following Eric/Ryan's suggestion) * remove context downsampling in BQSR R script	2012-04-19 17:31:04 -04:00
Khalid Shakir	df5dd841af	AC strat now checks if evals will be merged before throwing an error on multiple eval files. Minor tweaks to WGP script based on new recal VCF format.	2012-04-19 16:08:55 -04:00
Guillermo del Angel	1ae2ab5b63	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 12:50:29 -04:00
Guillermo del Angel	0e6e0cb907	Merging bug fixes	2012-04-19 12:49:30 -04:00
Eric Banks	79272c5e15	Thanks to Menachem for pointing out that the docs for genotyping_mode and output_mode were the same (and unclear). Fixed.	2012-04-19 12:48:09 -04:00
Guillermo del Angel	02ff930f6a	My changes	2012-04-19 12:45:18 -04:00
Eric Banks	2485cef5b8	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 11:46:06 -04:00
Eric Banks	76a6e37f4f	Don't output callability metrics by default anymore; one can still have them output to the 'metrics' file (which is now @Hidden because they are really for GSA use). Added a TODO to move UG from @By reference to reads and rods once LIBS is cleaned up.	2012-04-19 11:45:56 -04:00
Ryan Poplin	1ea4e48a27	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 11:32:32 -04:00
Ryan Poplin	11001ab9a2	Adding option to HaplotypeCaller to genotype the events on the chosen haplotypes as independent events. The filtered reads are now kept around so they can be passed to the variant annotations. Unfortunately the filtered reads aren't assigned a likelihood yet so they are all thrown in the Allele.NO_CALL bin.	2012-04-19 11:32:10 -04:00
Mauricio Carneiro	eb22cd7222	Unit test to guarantee BQSR sequential calculation accuracy This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.	2012-04-19 09:33:40 -04:00
Mauricio Carneiro	68d0211fa1	Improved BQSR plotting and some new parameters * Refactored CycleCovariate to be a fragment covariate instead of a per read covariate * Refactored the CycleCovariateUnitTest to test the pairing information * Updated BQSR Integration tests accordingly * Made quantization levels parameter not hidden anymore * Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted) * Added hidden option not to generate the plots automatically (important for scatter/gathering)	2012-04-19 09:31:41 -04:00
Guillermo del Angel	143e92b797	Rebasing	2012-04-18 20:05:43 -04:00
Guillermo del Angel	82efd4457e	Revert some bad merge changes	2012-04-18 16:35:09 -04:00
Guillermo del Angel	31c394d588	Resolve merge conflicts	2012-04-18 16:25:03 -04:00
Ryan Poplin	4999ae87ad	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-18 15:02:42 -04:00
Ryan Poplin	dcc4871468	minor misc optimizations to PairHMM	2012-04-18 15:02:26 -04:00
Eric Banks	d3c84e7b1f	This should be a User Error since it's provided from the DoC command-line arguments	2012-04-18 13:09:23 -04:00
Eric Banks	392f1903f7	Handling some of the NumberFormatExceptions seen via Tableau that are really user errors.	2012-04-18 12:57:37 -04:00
Ryan Poplin	8a84456626	Following Eric's awesome update to change the VQSR recal file into a VCF file, the ApplyRecalibration step is now scatter/gather-able and tree reducible.	2012-04-18 11:24:04 -04:00
Eric Banks	4448a3ea76	Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position.	2012-04-17 23:54:10 -04:00
Eric Banks	c1f52b773a	Minor tweaks and updated integration tests MD5s	2012-04-17 23:17:28 -04:00
Eric Banks	6d03bce0d3	Important refactoring of the VQSR recal file format: we now use a VCF instead of a CSV file. The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration. For 1000G calling this was prohibitive in terms of memory requirements. Now we go through the rod system and pull in just the records we need at a given position. As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).	2012-04-17 22:38:18 -04:00
Mauricio Carneiro	46a212d8e9	Added "simplify reads" option to PrintReads.	2012-04-17 19:32:34 -04:00
Mauricio Carneiro	f0c81b59b0	Implementation of the new BQSR plotting infrastructure * removed low quality bases from the recalibration report. * refactored the Datum (Recal and Accuracy) class structure * created a new plotting csv table for optimized performance with the R script * added a datum object that carries the accuracy information (AccuracyDatum) for plotting * added mean reported quality score to all covariates * added QualityScore as a covariate for plotting purposes * added unit test to the key manager to operate with one required covariate and multiple optional covariates * integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)	2012-04-17 19:23:55 -04:00
Ryan Poplin	952280bef1	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-17 17:00:14 -04:00
Ryan Poplin	cf705f6c62	Adding read position rank sum test to the list of annotations that get produced with the HaplotypeCaller	2012-04-17 17:00:00 -04:00
Eric Banks	13c800417e	Handle NPE in UG indel code: deletions immediately preceding insertions were not handled well in the code.	2012-04-17 15:51:23 -04:00
Guillermo del Angel	c78b0eee3a	Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller.	2012-04-17 14:22:48 -04:00
Khalid Shakir	91cb654791	AggregateMetrics: - By porting from jython to java now accessible to Queue via automatic extension generation. - Better handling for problematic sample names by using PicardAggregationUtils. GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name. CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering. Added SelectHeaders walker for filtering headers for dbGAP submission. Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter. Latest WholeGenomePipeline. Other minor cleanup to utility methods.	2012-04-17 11:45:32 -04:00
Ryan Poplin	1a2e92f8db	Merged bug fix from Stable into Unstable	2012-04-17 10:23:05 -04:00
Ryan Poplin	adad76b36f	Fixing NPE in VQSR for the case of very small callsets.	2012-04-17 10:20:43 -04:00
Mark DePristo	23ccf772d4	IndelSummary now emits all of the underlying counts for ratios, percentages, etc it computes	2012-04-13 17:00:36 -04:00
Mark DePristo	84d1e8713a	Infrastructure for combining VariantEvaluations -- Not hooked up yet, so the output of VariantEval should be the same as before -- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines -- Better docs throughout	2012-04-13 17:00:36 -04:00
Mark DePristo	38986e4240	Documentation for StratificationManager	2012-04-13 17:00:36 -04:00
Mark DePristo	ab06d53867	Useful test constructor for unit tests in RefMetaDataTracker	2012-04-13 17:00:36 -04:00
Mark DePristo	285e61a227	Bugfix for IndelSummary -- multi allelic count should be % not ratio	2012-04-13 17:00:35 -04:00
Mark DePristo	e6d5cb46d2	Improvements and bugfixes to IndelSummary -- Now properly includes both bi and multi-allelic variants. These are actually counted as well, and emitted as counts and % of sites with multiple alleles -- Bug fix for gold standard rate	2012-04-13 17:00:35 -04:00
Mark DePristo	bfa966a4e9	Bugfix for OneBPIndel -- Previously was only including 1 bp insertions in stratification	2012-04-13 17:00:35 -04:00
Mark DePristo	2aa2d9aec0	Merged bug fix from Stable into Unstable	2012-04-13 09:25:43 -04:00
Mark DePristo	27e7e17dc7	New way to handle exceptions in multi-threaded GATK -- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now. -- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer -- Better printing of stack traces in WalkerTest	2012-04-13 09:23:33 -04:00
Eric Banks	818e8c2fb9	Resolving merge conflicts	2012-04-12 15:19:44 -04:00
Eric Banks	0dd571928d	Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones).	2012-04-12 15:16:29 -04:00
Eric Banks	f77a6d18b8	Bad conflict merge before	2012-04-12 09:56:49 -04:00
Eric Banks	33a8bdd75f	Resolving merge conflicts	2012-04-12 09:51:55 -04:00
Eric Banks	b659b16b31	Generate User Error for bad POS value	2012-04-12 09:49:35 -04:00
Eric Banks	cc71baf691	Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing.	2012-04-12 09:18:44 -04:00
Eric Banks	5bf9dd2def	A framework to get annotations working in the HaplotypeCaller (and ART walkers in general). Adding support for active-region-based annotation for most standard annotations. I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets) like e.g. the ReadPosRankSumTest. IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller). The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system. When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.	2012-04-11 16:22:12 -04:00
Guillermo del Angel	f9f8589692	Refactoring/fixing up UG HMM code: a) Make code use PairHMM class instead of having duplicated code. That way UG and HaplotypeCaller now use same core code. Changes to be able to do this: 1. Compute context-dependent GOP as a function of read, not of haplotype, b) Extracted code to initialize HMM arrays into separate method, c) Move PairHMM class and unit test to public, d) Reenable banded code in PairHMM, inverted sense of flag (true=enable feature) but leave off in HaplotypeCaller.	2012-04-11 13:56:51 -04:00
Eric Banks	7aa654d13f	New interface for some dev work that Ryan and I are doing; only accessible from private walkers right now	2012-04-11 13:49:09 -04:00
Eric Banks	dc90508104	Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful.	2012-04-11 13:47:10 -04:00
Eric Banks	f560611fe8	Merged bug fix from Stable into Unstable	2012-04-10 22:26:53 -04:00
Eric Banks	f46f7d0590	Fix the stats coming out of FlagStat. I will add an integration test in unstable	2012-04-10 22:26:10 -04:00
Mauricio Carneiro	cd842b650e	Optimizing DiagnoseTargets * Fixed output format to get a valid vcf * Optimized the per sample pileup routine O(n^2) => O(n) pileup for samples * Added support for overlapping intervals * Removed expand target functionality (for now) * Removed total depth (pointless metric)	2012-04-10 17:43:59 -04:00
Ryan Poplin	e3cc7cc59c	Resolving merge conflict.	2012-04-10 14:50:27 -04:00
Ryan Poplin	a4634624b7	There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function.	2012-04-10 14:48:23 -04:00
Eric Banks	10e74a71eb	We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior.	2012-04-10 12:30:35 -04:00
Mark DePristo	b43d21056b	Merged bug fix from Stable into Unstable	2012-04-10 09:42:09 -04:00
Mark DePristo	6885e2d065	UserException fixes for GATK_logs recent errors -- SamFileReader.java:525 -- BlockCompressedInputStream:376 These were both instances where we weren't catching and rethrowing picard exceptions as UserExceptions.	2012-04-10 07:37:42 -04:00
Mark DePristo	8507cd7440	Throw UserException for bad dict / chain files	2012-04-10 07:22:43 -04:00
Ryan Poplin	cd9bf1bfc3	Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs.	2012-04-10 00:22:40 -04:00
Roger Zurawicki	9ece93ae9c	DiagnoseTargets now outputs a VCF file - refactored the statistics classes - concurrent callable statuses by sample are now available. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-04-09 16:40:20 -04:00
Guillermo del Angel	719ec9144a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-09 14:53:19 -04:00
Guillermo del Angel	550179a1f7	Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors.	2012-04-09 14:53:05 -04:00
Eric Banks	ea4300d583	Refactoring so that Unified Argument Collection doesn't use deprecated classes.	2012-04-09 13:45:17 -04:00
Eric Banks	6ddf2170b6	More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice.	2012-04-09 11:46:16 -04:00
Mauricio Carneiro	87e6bea6c1	Adding engine capability to quantize qualities. * Added parameter -qq to quantize qualities using a recalibration report * Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization. * Updated BQSR scripts to make use of the new parameters	2012-04-08 21:07:51 -04:00
Mark DePristo	45fc0ea98d	Improvements to indel analysis capabilities of VariantEval -- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites -- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly: // - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a // downstream frameshift, if we make the simplifying assumptions that 3 bp ins // and 3bp del (adding/subtracting 1 AA in general) are roughly comparably // selected against, we should see a consistent 1+2 : 3 bp ratio for insertions // as for deletions, and certainly would expect consistency between in/dels that // multiple methods find and in/dels that are unique to one method (since deletions // are more common and the artifacts differ, it is probably worth looking at the totals, // overlaps and ratios for insertions and deletions separately in the methods // comparison and in this case don't even need to make the simplifying in = del functional assumption -- Added a new VEW argument to bind a gold standard track -- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do -- Deleted random unused functions in IndelUtils	2012-04-06 16:07:46 -04:00
Mark DePristo	52ef4a3e26	Function to compute whether a VariantContext indel is part of a TandemRepeat Returns true iff VC is a non-complex indel where every allele represents an expansion or contraction of a series of identical bases in the reference. The logic of this function is pretty simple. Take all of the non-null alleles in VC. For each insertion allele of n bases, check if that allele matches the next n reference bases. For each deletion allele of n bases, check if this matches the reference bases at n - 2 n, as it must necessarily match the first n bases. If this test returns true for all alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the base difference between the ref and alt alleles	2012-04-06 16:07:46 -04:00
Mark DePristo	08fab49d30	Added function to get bases from the current base forward in the window in ReferenceContext	2012-04-06 16:07:46 -04:00
Ryan Poplin	c77104b815	Adding function call in HaplotypeCaller right before the VariantContext gets written out to disk which partitions all the reads by which allele gave the read the highest likelihood. This will allow variants to be annotated by the refactored VariantAnnotator. Uninformative reads are mapped to Allele.NO_CALL	2012-04-06 00:22:52 -04:00
Mauricio Carneiro	a19c27297f	continuing the BQSR triage... * fixed the loading of the new reduced size reports * reduced BQSR scala script memory to 2Gb * removed dcov parameter from BQSR scala script * fixed estimatedQReported calculation from -log10(pe) to -10log10(pe). updated md5's with the proper PHRED scaled EstimatedQReported	2012-04-05 14:34:15 -04:00
Eric Banks	3561056a9c	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-05 10:49:26 -04:00
Eric Banks	5c3ddec4c2	Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update.	2012-04-05 10:49:08 -04:00
Mauricio Carneiro	7c3b3650bb	BQSR bug triage * fixed bug where some keys were using the same recal datum objects * fixed quantization qual calculations when combining multiple reports * fixed rounding error with empirical quality reported when combining reports * fixed combine routine in the gatk reports due to the primary keys being out of order * added auto-recalibration option to BQSR scala script * reduced the size of the recalibration report by ~15% * updated md5's	2012-04-05 09:32:18 -04:00
Eric Banks	2c956efa53	Minor fixups to GenotypeLikelihoods	2012-04-05 09:14:37 -04:00
Mauricio Carneiro	1e65474fec	Added utility to get the reference coordinate given the read coordinate	2012-04-05 09:04:20 -04:00
Guillermo del Angel	6913710e89	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-04 20:17:18 -04:00
Mark DePristo	76e4100d89	By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots -- Updated integration tests as well	2012-04-04 18:48:03 -04:00
Guillermo del Angel	820216dc68	More pool caller cleanups: Move common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation	2012-04-04 16:23:10 -04:00
Ryan Poplin	bfad26353a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-04 16:04:50 -04:00
Ryan Poplin	dda2173c66	Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned.	2012-04-04 16:04:29 -04:00
Mark DePristo	fcdd65a0f4	Bugfix for IndelLengthHistogram -- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.	2012-04-04 15:37:43 -04:00
Mark DePristo	1ccea866d8	VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses -- Updated EvalModules to work with new parameter -- adding test file for keepAC0 to public/testdata and integration tests	2012-04-04 15:37:12 -04:00
Eric Banks	9e32a975f8	Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore.	2012-04-04 13:47:59 -04:00
Eric Banks	337ff7887a	When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.	2012-04-04 10:57:05 -04:00
Guillermo del Angel	05d8400468	Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)	2012-04-03 20:51:24 -04:00
Guillermo del Angel	5abb07da5d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-03 17:00:45 -04:00
Christopher Hartl	a6837d31d4	Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting; or at transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it.	2012-04-03 16:13:16 -04:00
Guillermo del Angel	63b1e737c6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-03 15:43:50 -04:00
Guillermo del Angel	9e11b4f9a7	Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.	2012-04-03 15:43:32 -04:00
Eric Banks	f9ce9962c4	Minor changes to verbose mode	2012-04-03 10:53:48 -04:00
Eric Banks	f6aa95685d	OutOfMemory exceptions are User Errors	2012-04-02 22:46:56 -04:00
Eric Banks	659b82e74d	Old -B syntax is long gone at this point. Safe to remove the warning.	2012-04-02 22:25:16 -04:00
Eric Banks	99d27ddcc4	Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now.	2012-04-02 14:27:36 -04:00
Mark DePristo	6b7a00061a	VariantsToTable now works with multiple input VCFs	2012-04-02 09:13:35 -04:00
Mark DePristo	fbbb8509ad	Final commits to VariantEval -- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to. -- Cleanup code, reorganize a bit more. -- Fix for broken integrationtests	2012-03-30 20:11:06 -04:00
Mark DePristo	4b45a2c99d	Final version of new VariantEval infrastructure. * WAY FASTER * -- 3x performance for multiple sample analysis with 1000 samples -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2 -- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval -- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also. -- No longer allow Evaluators to use private and protected variables at @DataPoints. You get an error if you do. -- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter. -- Commented out GenotypePhasingEvaluator, as it uses the retired TableType -- Stratifications are all fully typed, so it's easy for GATKReports to format them. -- Removed old VE work around from GATKReportColumn -- General code cleanup throughout -- Updated integration tests	2012-03-30 15:31:56 -04:00
Mark DePristo	8c0718a7c9	Fixed missing import	2012-03-30 15:31:55 -04:00
Mark DePristo	097ed4ecc4	Memory usage optimizations and safety improvements to StratNode and StratificationManager -- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable Hashmaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users. -- Added ability of a stratification to specify incompatible evaluation. The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals	2012-03-30 15:31:55 -04:00
Mark DePristo	b335c22f6d	Fully refactored, mostly cleaned up version of VariantEval using StratificationManager	2012-03-30 15:31:55 -04:00
Mark DePristo	c8086a79e3	New StratificationManager based VariantEval passes unmodified integration tests -- Now needs cleanup and optimizations	2012-03-30 15:31:55 -04:00
Mark DePristo	d37f31e349	First version of VariantEval that runs (approximately correctly) with new StratificationManager	2012-03-30 15:31:54 -04:00
Mark DePristo	8971b54b21	Phase II of Stratification manager -- Renamed and reorganized infrastructure -- StratificationManager now a Map from List<Object> -> V. All key functions are implemented. Less commonly used TODO -- Ready for hookup to VE	2012-03-30 15:31:54 -04:00
Mark DePristo	9f1cd0ff66	Lots of new functionality for StratificationStates manager -- Really working according to unit tests -- A nCombination utils	2012-03-30 15:31:54 -04:00
Mark DePristo	a3d896d80e	Part I of creating a fast state space lookup for VE -- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvalutionContexts for update in map. -- Minor code cleanup throughout VE (removing unused headers, for example)	2012-03-30 15:31:53 -04:00
Eric Banks	533c283783	Deprecating AlignmentContext.getExtendedEventPileup(). At this point the only walkers left with any reliance on extended events are Guillermo's pooled code (he'll update soon) and the Pileup walker. David, I'll leave that last one for you (it should be easy). We can now officially rip the extended event code from the engine.	2012-03-30 10:37:14 -04:00
Eric Banks	6b49af253b	Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there.	2012-03-30 10:33:30 -04:00
Eric Banks	b467cd1dae	Removing dependence on extended events for the remaining Variant Annotator modules.	2012-03-30 09:05:26 -04:00
Eric Banks	b21889812d	Removing some more usages of extended events. Not done yet, but almost there.	2012-03-30 01:51:37 -04:00
Eric Banks	ad6ace2439	Resolving merge conflicts	2012-03-30 01:51:09 -04:00
Eric Banks	f4d4969f23	Don't ever return null for the list of GL models	2012-03-30 00:22:40 -04:00
Eric Banks	44ac49aa34	Removing dependencies in the annotations on extended events. Some refactoring involved in this.	2012-03-30 00:17:02 -04:00
Mauricio Carneiro	cbd21c6339	Nasty, nasty..... VariantEval is overly abusive of the GATKReport (lack of) spec. 1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers. 2. Worse, the stratification modules somehow instead of sending the actual values to the report table, sends a string with the value "unknown" and then abuses the GATKReport spec to convert those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings. Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport. No wonder, the printing needed to be O(n^2).	2012-03-29 17:49:53 -04:00
Eric Banks	c2e27729c7	Renaming PileupElement.isBeforeDeletion() to PileupElement.isBeforeDeletedBase() so that it's more clear that it can still be true while inside a deletion. Added PileupElement.isBeforeDeletionStart() to cover the case that I want where we only trigger before the actual deletion event. Similarly for after a deletion. Updated counting code in ConsensusAlleleCounter accordingly.	2012-03-29 17:08:25 -04:00
Ryan Poplin	6da9571829	resolving merge conflicts.	2012-03-29 16:16:28 -04:00
Ryan Poplin	ca96544ed0	All the zero quality N bases in the solid reads are adding lots of extra paths in the assembly graph. We now require a minimum base quality for every base in the kmer before adding it to the graph. The large number of solid reads with unmapped mates was also triggering the active region traversal at every base. We now ignore that check for solid reads.	2012-03-29 16:14:29 -04:00
Eric Banks	e4469a83ee	First attempt at removing all traces of extended events from UG; integration tests are expected to fail.	2012-03-29 14:59:29 -04:00
Eric Banks	e61e162c81	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-29 12:33:13 -04:00
Mauricio Carneiro	cf364f26a0	Fixing alignment issue with the GATKReportColumn algorithm Numeric columns were being left-aligned when they should be right-aligned. Fixed it.	2012-03-29 12:28:49 -04:00
Mauricio Carneiro	f80bd4276a	fixed estimated Q reported calculation in the gatherer	2012-03-29 12:28:43 -04:00
Mauricio Carneiro	8a9fb514b6	simplifying GATKReportColumn constructor logic	2012-03-29 12:28:37 -04:00
Eric Banks	e861106398	Accidentally erased important line	2012-03-29 11:08:54 -04:00
Eric Banks	e4a225ed09	Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally.	2012-03-29 11:07:37 -04:00
Guillermo del Angel	c9c3f6b0fc	Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity	2012-03-29 11:05:42 -04:00
Ryan Poplin	9684a2efb0	HaplotypeCaller: Variants found on the same haplotype are now written out with phased genotypes. There are serious eval issues with MNPs so disabling them for now.	2012-03-29 09:41:29 -04:00
Guillermo del Angel	250adca350	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-28 21:01:49 -04:00
Guillermo del Angel	e0ab4e4b30	Refactoring so that ConsensusAlleleCounter can use regular pileups and can operate correctly. This involved adding utility functions to ReadBackedPileup to count # of insertions/deletions right after current position. Added unit test for IndelGenotypeLikelihoods, esp. ConsensusAlleleCounter logic	2012-03-28 21:01:31 -04:00
Mauricio Carneiro	8f0e9d74ce	GATKReportTable output refactor writing out a GATKReportTable was O(n^2)!!!!! New implementation is O(n). What a difference, when N = 2^16...	2012-03-28 17:19:12 -04:00
Guillermo del Angel	62ee31afba	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-28 16:00:38 -04:00
Guillermo del Angel	1eee9d512d	a) Make computeConsensusAlleles protected inside IndelGenotypeLikelihoodsCalculationModel so we can use it in unit tests, b) make ConsensusAlleleCounter work if no extended event pileup is present (necessary for ext. event removal)	2012-03-28 15:41:39 -04:00
Mauricio Carneiro	bb36cd4adf	Quick fixes to BQSRGatherer and GATKReportTable * when gathering, be aware that some keys will be missing from some tables. * when a gatktable has no elements, it should still output the header so we know it had no records	2012-03-28 09:07:54 -04:00
Roger Zurawicki	63cf7ec7ec	Added more primitives to GATK Report Column Type - The Integer column type now accepts byte and shorts - Updated Unit Tests and added a new testParse() test Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-28 09:07:54 -04:00
Guillermo del Angel	08f7d47d7c	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-28 07:42:09 -04:00
Mark DePristo	12aa72f200	Merged bug fix from Stable into Unstable	2012-03-27 22:43:00 -04:00
Mark DePristo	979a84a252	Bugfix for thread unsafe PL cache -- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic -- Solution is to use a fixed cache that's never updated on the fly. My changes limit us to having no more than 500 alleles at a site, which I hope is ok but easy enough to up to a ridiculously large number.	2012-03-27 22:42:30 -04:00
Guillermo del Angel	8f34412fb8	First Pool Caller exact model: silly straightforward math implementation of biallelic pool caller exact likelihood model, no attempt at any smartness or optimization, no support yet for generalized multiallelic form, just hooking up for testing	2012-03-27 20:59:44 -04:00
Guillermo del Angel	ed322bd73f	Fix again merge issues	2012-03-27 15:03:13 -04:00
Guillermo del Angel	b4a7c0d98d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-27 15:01:03 -04:00
Guillermo del Angel	343a061b1c	Fix merge issues when incorporating new AF calculations changes	2012-03-27 15:00:44 -04:00
Mauricio Carneiro	1b75663178	BQSR Gatherer implementation and integration tests * restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers * optimized empirical qual calculation when merging recalibration reports * centralized the quality score quantization functionalities * unified the creating/loading of all the key manager/hash table structures. * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing) * added integration tests for BQSR and on-the-fly recalibration	2012-03-27 13:50:22 -05:00
Ryan Poplin	5dbd3625cd	Initial algorithm for choosing best alternate haplotypes to genotype based on the likelihoods from all samples instead of choosing for each sample independently. Simple tradeoff of penalty for increasing model complexity and likelihood of the data.	2012-03-27 13:38:52 -04:00
Eric Banks	c112e0824a	I was adding verbose output to the Pileup output for a one-off and decided that I might as well commit it as an option. Updated deprecated calls while I was in there.	2012-03-27 11:09:03 -05:00
Mark DePristo	a638996fe2	Cleanup of VariantEval, diatribe about performance problems with StateKey -- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear -- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.	2012-03-27 11:56:24 -04:00
Mark DePristo	679bb03014	Simple utility function for converting an Iterable<T> to Collection<T>	2012-03-27 11:54:58 -04:00
Mark DePristo	1f5f737c8b	Optimizing the GATKReportTable.write -- Better iteration, caching of strings, better printf calls, to improve the writing performance of GATKReportTables	2012-03-27 11:54:35 -04:00
Mark DePristo	913c8b231f	Fix ErrorRatePerCycle to overload equals and hashcode -- Fixes failing integration tests	2012-03-27 10:35:32 -04:00
Eric Banks	c07a577ba3	Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously).	2012-03-27 00:27:44 -05:00
Mark DePristo	34ea443cdb	Better algorithm for choosing which indel alleles are present in samples -- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors. -- This breakdown is producing spurious clustered indels (lots of these!) around real common indels -- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the counting towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted. -- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale. -- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP	2012-03-26 16:28:49 -04:00
Mark DePristo	11b6fd990a	GATKReportColumn optimizations -- Was TreeMap even though the sorting wasn't used. Replaced with LinkedHashMap.	2012-03-26 16:28:49 -04:00
Mark DePristo	6be5e82860	VariantEval scalability optimizations -- StateKey no longer extends TreeMap. It's now a final immutable data structure that caches its toString and hashcode values. TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted tostring function. -- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils -- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects. Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE -- VariantEvaluator base class has a cached getSimpleName() function -- VEUtils: general cleanup due to refactoring of StateKey -- VEWalker: much better iteration of map data structures. If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet. This is far better than iterating over the keys and calling get() on each key.	2012-03-26 16:28:48 -04:00
Guillermo del Angel	1c424c0daf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-26 15:15:50 -04:00
Ryan Poplin	019145175b	Major optimizations to graph construction through better use of built in graph.containsVertex and vertex.equals methods. Minor optimizations to MathUtils.approximateLog10SumLog10 method	2012-03-26 11:32:44 -04:00
Ryan Poplin	1fa66f76c9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-25 23:04:47 -04:00
Guillermo del Angel	ce617b2dfc	Bug fix to previous UnifiedGenotyperEngine refactoring, removed debug code	2012-03-25 10:20:21 -04:00
Guillermo del Angel	db54c2625f	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-25 09:53:35 -04:00
Guillermo del Angel	deb4586559	Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGenotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1	2012-03-24 21:49:43 -04:00
Mark DePristo	b063bcd38d	Removing update0 support in VariantEval -- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations. -- This allows us to avoid calling update0 at every genomic base in 100ks of evaluates when there are a lot of stratifications. -- No need to modify the integration tests, this optimization doesn't change the result of the calculation	2012-03-23 21:02:21 -04:00
Mauricio Carneiro	0509d316d9	More information in the recalibration report * added empirical quality counts to allow quantization during on-the-fly recalibration to any level * added number of observations and errors to all tables to enable plotting of all covariates	2012-03-23 16:15:19 -04:00
Mauricio Carneiro	9f74969e3a	BQSR with GATKReport implementation * restructured BQSR to report recalibrated tables. * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration) * linked quality score quantization to the BQSR stage, outputting a quantization histogram * included the arguments used in BQSR to the GATK Report * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities On-the-fly recalibration with GATK Report * loads all tables from the GATKReport using existing infrastructure (with minor updates) * implemented initialization of the covariates using BQSR's argument list * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key * applied quality quantization to the base recalibration * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions	2012-03-23 15:42:32 -04:00
Mauricio Carneiro	f421062b55	Updated read group covariate to use sample.lane instead of the id Added Unit test.	2012-03-23 15:24:07 -04:00
Mauricio Carneiro	539da9e3e1	Fixing GATKReport exception handling when loading a report * allowing tables with no description to go through * GATKReportTable should be more lenient with the format requirements (added to-dos for roger)	2012-03-23 15:23:13 -04:00
Eric Banks	2511839068	Merged bug fix from Stable into Unstable	2012-03-23 13:51:33 -04:00
Eric Banks	d3f2bc4361	Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles.	2012-03-23 13:51:00 -04:00
Mark DePristo	e4ec90cfce	Merged bug fix from Stable into Unstable	2012-03-23 11:27:34 -04:00
Mark DePristo	ff26f2bf68	HierarchicalMicroScheduler no longer attempts to wrap exceptions -- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors. Now HMS simply checks whether the throwable it received on error was a RuntimeException. If so, it is stored and rethrown without wrapping later. If it isn't, only in this case is the exception wrapped in a ReviewedStingException. -- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line -- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled. -- I believe this will finally put to rest all of these annoying HMS captures.	2012-03-23 11:27:21 -04:00
Ryan Poplin	9d22471b79	Merged bug fix from Stable into Unstable	2012-03-23 10:48:34 -04:00
Ryan Poplin	ab288354e9	Better error message for malformed input recal file.	2012-03-23 10:47:01 -04:00
Mark DePristo	fee8d86f63	VariantEval optimization -- Use a LinkedHashMap not a TreeMap so iteration is faster. -- Note that with a lot of stratifications the update0 is taking up a lot of time. For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call	2012-03-22 22:13:24 -04:00
Mark DePristo	6df96644d9	Unified, standard IndelSummary metrics for VariantEval -- Now you always get SNP and indel metrics with VariantEval! -- Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions -- Updated VE integration tests as appropriate	2012-03-22 21:24:37 -04:00
Mark DePristo	bcf80cc7b3	Cleanup in VariantEval. Example of molten VariantEval output -- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvalator.java so everyone can share. Code updated to use these routines where appropriate -- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF -- TableType, which used to be an interface, is now an abstract class, allowing us to implement some general functionality and avoid duplication. -- This included creating a getRowName() function that used to be hardcoded as "row" but now can be overridden. -- #### This allows us to implement molten tables, which are vastly easier to use than multi-row data sets. See IndelHistogram class (in later commit) for example of molten VE output	2012-03-22 21:24:37 -04:00
Mark DePristo	9ddd5aec93	More eval modules being removed from VariantEval -- IndelStatistics is superseded by IndelStatistics	2012-03-22 21:24:36 -04:00
Mark DePristo	bd5b6d1aba	Remove no longer in use Eval modules from VariantEval -- No more IndelLengthHistogram (superseded by IndelSummary in subsequent commit) -- No more SamplePreviousGenotypes or PhaseStats -- No more MultiallelicAFs	2012-03-22 21:24:36 -04:00
Menachem Fromer	7faa9938b1	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-22 17:43:44 -04:00
Menachem Fromer	b9b9219ac7	Added respectPhaseInInput flag to RBP and integration tests	2012-03-22 17:40:21 -04:00
Guillermo del Angel	f198cec5e2	Temp commit: new structure for pool caller, now all work is in the same framework as in UG. There's a new genotype calculation model, PoolGenotypeCalculationModel, that does all the work and plugs into UnifiedGenotyperEngine. A new AF module for pools is upcoming. Old pool caller will be removed once all work is migrated	2012-03-22 15:46:39 -04:00
Menachem Fromer	1dfaacfeb5	Check for consistency of the BAM and VCF sample names, with a command line disable to throw if you know what you are doing	2012-03-22 12:40:15 -04:00
Guillermo del Angel	b02ef95bcf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-22 12:14:12 -04:00
Guillermo del Angel	92676c63ca	Make constructor of IndelGenotypeLikelihoodsCalculationModel public so it can be used in unit tests	2012-03-22 12:13:59 -04:00
Guillermo del Angel	58965d6a6e	Merged bug fix from Stable into Unstable	2012-03-22 11:04:11 -04:00
Guillermo del Angel	b8cd959461	Potential corner condition bug fix: protect against null pointer exceptions when computing consensus indel bases when UG is discovering alt alleles. If an alt allele has non-standard bases, skip allele gracefully instead of adding null object into list	2012-03-22 10:06:22 -04:00
Ryan Poplin	a29fc6311a	New debug option to output the assembly graph in dot format. Merge nodes in assembly graph when possible.	2012-03-21 15:48:55 -04:00
Eric Banks	8c09ff9459	Merged bug fix from Stable into Unstable	2012-03-21 12:44:43 -04:00
Eric Banks	58245bfa2f	Bug fix: check to see whether there's a BasePileup before asking for one.	2012-03-21 12:44:09 -04:00
Eric Banks	07c3bd32b3	Bug fix: merge NO_VARIATION records with those of another type. The sad part is that this WAS covered by integration tests but someone updated the MD5s without actually paying attention...	2012-03-21 12:42:13 -04:00
Eric Banks	dcf2fa361d	Minor cleanup	2012-03-21 12:14:31 -04:00
Eric Banks	ab1c48745b	Need to catch RuntimeExceptions coming out of Picard too so that they show up as UserErrors (some BAM errors are thrown as REs).	2012-03-21 12:13:52 -04:00
Ryan Poplin	9e10779fa7	Caching log calculations cut the non-Map runtime of HaplotypeCaller in half. Moved the qual log cache used in HC and PairHMM into a common place and added unit tests.	2012-03-21 08:45:42 -04:00
Mauricio Carneiro	0e93cf5297	Taking care of bad cigars in the GATK * fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events. * added Unit tests for BadCigarFilter * updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar * added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)	2012-03-20 14:32:57 -04:00
Eric Banks	5e79046c98	Minor change but I realized from Mark's commit that the code I stole it from was flawed	2012-03-20 08:55:56 -04:00
Eric Banks	ade1971581	Since we allow any generic header types, there's no longer any reason to check for supported types	2012-03-20 00:12:17 -04:00
Eric Banks	2324c5a74f	Simplified the interface for simple VCF header lines by making the VCFSimpleHeaderLine not abstract anymore - now any arbitrary header line with an ID (e.g. the contig and ALT lines) can be part of this class without having to define new classes. Also, renamed the 'named' header line to 'id' since that's more accurate.	2012-03-19 21:29:24 -04:00
Roger Zurawicki	7afb333811	GATK Report code cleanup - Updated the documentation on the code - Made the table.write() method private and updated necessary files. - Added a constructor to GATKReport that takes GATKReportTables - Optimized my code Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-19 11:53:57 -04:00
Mauricio Carneiro	0d4ea30d6d	Updating the BQSR Gatherer to the new file format This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.	2012-03-19 09:02:27 -04:00
Eric Banks	9223e451a3	Merged bug fix from Stable into Unstable	2012-03-18 00:54:19 -04:00
Eric Banks	344a938a70	When checking to make sure that we have cached enough data in the PL array, use the converted index value since that's what will be used as an index into the array.	2012-03-18 00:36:30 -04:00
Eric Banks	be9e48ba29	Merged bug fix from Stable into Unstable	2012-03-16 14:33:53 -04:00
Mauricio Carneiro	ec4a870a0f	Added @PG tag to ReduceReads. Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's lives easier if they want to include the PG tag in their walkers.	2012-03-16 14:09:07 -04:00
Mauricio Carneiro	3bfca0ccfd	BitSet implementation of the on-the-fly recalibration using the CSV format file. Infrastructure: * Added static interface to all different clipping algorithms of low quality tail clipping * Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState * Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration) * EventType is now an independent enum with added capabilities. All functionality is now centralized. BQSR and RecalibrateBases: * On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint * Refactored the object creation to take advantage of the compact key structure * Replaced nested hash maps with single hash maps indexed by bitsets * Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm). * Excluded contexts with N's from the output file. * Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!) * Redefined error for indels to look at the previous base in negative strand reads (using new PE functionality) * Added the covariate ID (for optional covariates) to the output for disambiguation purposes * Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum. * Reduced memory usage of the BQSR script to 4 Tests: * Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager * Added tests for keys without optional covariates * Added tests for on-the-fly recalibration (but more tests are necessary)	2012-03-16 13:02:15 -04:00
Mauricio Carneiro	ca11ab39e7	BitSets keys to lower BQSR's memory footprint Infrastructure: * Generic BitSet implementation with any precision (up to long) * Two's complement implementation of the bit set handles negative numbers (cycle covariate) * Memoized implementation of the BitSet utils for better performance. * All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow. * Replace log/sqrt with bitwise logic to get rid of numerical issues BQSR: * All covariates output BitSets and have the functionality to decode them back into Object values. * Covariates are responsible for determining the size of the key they will use (number of bits). * Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type * No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead Tests: * Unit tests added to every method of BitSetUtils * Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager) * Unit tests added to the cycle and context covariates (will add unit tests to all covariates)	2012-03-16 13:01:48 -04:00
Eric Banks	dce6b91f7d	Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appropriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing.	2012-03-16 11:14:37 -04:00
Eric Banks	41068b6985	The commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code.	2012-03-15 16:08:58 -04:00
Ryan Poplin	0c6b34e9df	Fixing a bug identified by the ActivityProfile unit tests	2012-03-15 14:24:30 -04:00
Ryan Poplin	252b830aa8	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-15 11:56:04 -04:00
Ryan Poplin	1429ddcf55	Adding contracts and unit tests for HaplotypeCaller LikelihoodCalculationEngine	2012-03-14 21:25:43 -04:00
Mark DePristo	7c5cdb51c2	UnitTests for ActivityProfile and minor ART cleanup -- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-( -- UnitTesting framework for ActivityProfile -- needs to be expanded -- Minor helper functions for ActiveRegion to help with unit tests	2012-03-14 17:26:37 -04:00
Mark DePristo	e440c9be98	Clean up logic for adding reads to ART cache -- No longer has duplicate code	2012-03-14 17:26:37 -04:00
Mark DePristo	5bcb5c7433	Preliminary refactoring of ART -- Refactored ART into clearer, simpler procedures. Attempted to merge shared code into utility classes. -- Added some docs -- Created a new, testable ActivityProfile that represents as a class the probability of a base being active or inactive -- Separated band-pass filtering from creation of active regions. Now you can band pass filter a profile to make another profile, and then that is explicitly converted to active regions -- Misc. utility functions in ActiveRegionWalker such as hasPresetActiveRegions() -- Many TODOs in ActivityProfile.	2012-03-14 17:26:37 -04:00
Ryan Poplin	1da8928407	HC GenotypingEngine marginalizes over haplotypes when outputting events that were found on a subset of the called haplotypes.	2012-03-14 15:22:21 -04:00
Guillermo del Angel	eca055ccad	Add option in ValidationAmplicons to only output SNPs and INDELs, ignoring complex variants (or SVs, etc.)	2012-03-14 14:26:40 -04:00
Eric Banks	f7c2c818fe	Exact model memory optimization: instead of having a later matrix column pull in data from earlier ones (requiring us to keep them around until all dependencies are hit), the earlier columns push data into their dependents immediately and then are removed. This does trade off speed a little bit (because we need to call approximateLog10Sum each time we add to a dependent instead of once in an array at the end). Note that this commit would normally not get pushed into stable, but I'm about to make a very disruptive push into stable that would make merging this from unstable a nightmare.	2012-03-14 14:02:36 -04:00
Mark DePristo	6a40ca6bec	Merged bug fix from Stable into Unstable	2012-03-14 12:19:33 -04:00
Mark DePristo	bb2c10b785	Capture the class of the exception in GATKRunReport -- As suggested by David.	2012-03-14 12:16:22 -04:00
Ryan Poplin	78a4e7e45e	Major restructuring of HaplotypeCaller's LikelihoodCalculationEngine and GenotypingEngine. We no longer create an ugly event dictionary and genotype events found on haplotypes independently by finding the haplotype with the max likelihood. Lots of code has been rewritten to be much cleaner.	2012-03-14 12:05:05 -04:00
Eric Banks	77243d0df1	Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me	2012-03-13 16:31:51 -04:00
Eric Banks	568a1362f5	Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me	2012-03-13 16:19:15 -04:00
Eric Banks	5d7c761784	Merged bug fix from Stable into Unstable	2012-03-13 11:01:03 -04:00
Eric Banks	5200f7f919	When creating a synthetic VC based on the passed in alleles, set the reference base for indel.	2012-03-13 10:59:58 -04:00
Eric Banks	1675bd4dd7	When creating a synthetic VC based on the passed in alleles, set the length correctly.	2012-03-13 10:55:52 -04:00
Roger Zurawicki	7887a06703	GATKReport v1.0 GATKReport format changes: - All non-data header lines are preceded with a single pound ( #:) - Every report now has a report header containing the version number and number of tables - Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description. - This new format will allow reports in the future to be gatherable. - Changed the header format to include an end-of-line string ":;" Added features: - Simplified GATK Reports: The constructor for a simplified GATK Report. Simplified GATK reports are designed for reports that do not need the advanced functionality of a full GATK Report. A simple GATK Report consists of: - A single table - No primary key ( it is hidden ) Optional: - Only untyped columns. As long as the data is an Object, it will be accepted. - Default column values being empty strings. Limitations: - A simple GATK report cannot contain multiple tables. - It cannot contain typed columns, which prevents arithmetic gathering. - Added a constructor to generate simplified GATK reports. - Added a method to easily add data to simple GATK reports. - Upgraded the input parser to take advantage of the new file format (v1). - Added the GATKReportGatherer, more usability coming in next version of GATK Report. Currently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports, It is very conservative and will only gather if the table columns, as well as everything else matches. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data. - Made some GATKReport methods public, and added more setters and getters. - Added method that compares formats of two GATKReports, and added an equals method to verify all data inside. - The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.) - Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map. - Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered. Test changes: - Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1. - Updated the MD5 hashes in integration tests throughout the GATK. Other changes: - Added the gatherer functions to CoverageByRG - Also added the scatterCount parameter in the Interval Coverage script - Dropped support for reading in legacy GATKReport formats ( v0.) - Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints. - Rewrote the read file method for GATK report files. - Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods. Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-12 23:09:19 -04:00
Eric Banks	10995d349e	Fix old error message	2012-03-12 22:56:08 -04:00
Eric Banks	2314787767	Generalizing to avoid JDK 1.7 incompatibilities	2012-03-12 22:50:59 -04:00
Ryan Poplin	03223029e3	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-12 09:42:37 -04:00
Eric Banks	b4749757f8	Fixes for SLOD: 1) didn't work properly for multi-allelics (randomly chose an allele, possibly one that wasn't genotyped in the full context); 2) in cases when there were more alt alleles than the max allowed and the user is calculating SB, we would recompute the best alt allele(s); 3) for some reason, we were recomputing the LOD for the full context when we'd already done that. Given that this passes integration tests on my end, this should be the last commit before the release.	2012-03-12 01:07:07 -04:00
Ryan Poplin	2836c161ee	Moving trimToVariableRegion out of reduced reads and into a public static ReadClipper function. HaplotypeCaller clips reads to the active region boundaries before passing to the HMM. The philosophy of the HC is moving towards genotyping the entire haplotype sequence contained within the active region as a single allele.	2012-03-11 14:45:59 -04:00
Mark DePristo	1ee46e5c06	Collect only the bare essentials in the GATKRunReport Now looks like: <GATK-run-report> <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id> <start-time>2012/03/10 20.21.19</start-time> <end-time>2012/03/10 20.21.19</end-time> <run-time>0</run-time> <walker-name>CountReads</walker-name> <svn-version>1.4-483-g63ecdb2</svn-version> <total-memory>85000192</total-memory> <max-memory>129957888</max-memory> <user-name>depristo</user-name> <host-name>10.0.1.10</host-name> <java>Apple Inc.-1.6.0_26</java> <machine>Mac OS X-x86_64</machine> <iterations>105</iterations> </GATK-run-report> No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy	2012-03-10 20:27:14 -05:00
Mark DePristo	3ba2e5667c	CalibrateGenotypesLikelihoods include pOfDGivenD now	2012-03-09 16:00:07 -05:00
David Roazen	91d10431d3	BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier. Later on we might want to address in a more general way the fact that we validate user intervals against the reference but not against the merged BAM header produced by the engine at runtime.	2012-03-09 15:20:16 -05:00
David Roazen	bc65f6326f	Detect incomplete reads from BAM schedule file in BAMSchedule before they become buffer underflows This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a ReviewedStingException (since this is our problem, not the user's) rather than allowing a cryptic buffer underflow error to occur. Note that this change does not fix the underlying problem in the engine, if there is one (there may be an as-yet-undetected bug in the code that writes the bam schedule). It will just make it easier for us to identify what's going wrong in the future.	2012-03-09 12:33:48 -05:00
David Roazen	32dee7ed9b	Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be thrown when a BAM index file was truncated or corrupt. Now, a UserException is thrown in this situation instructing the user to re-index the BAM. Added a unit test for this case as well.	2012-03-08 15:30:44 -05:00
Guillermo del Angel	c04853eae6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-08 12:30:04 -05:00
Guillermo del Angel	858acf8616	Hidden mode in ValidationAmplicons to support ILMN output format (same as Sequenom, with just shuffled columns)	2012-03-08 12:29:44 -05:00
Andrey Sivachenko	56f074b520	docs updated	2012-03-07 18:47:15 -05:00
Andrey Sivachenko	117ea605ac	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-07 18:35:07 -05:00
Andrey Sivachenko	497a1b059e	transition to JEXL completed, old parameters setting individual cutoffs now deprecated	2012-03-07 18:34:11 -05:00
Andrey Sivachenko	fbd2f04a04	JEXL support added; intermediate commit, not yet functional	2012-03-07 17:29:42 -05:00
Mark DePristo	0376d73ece	Improved, public version of ErrorRateByCycle -- A cleaner table output (molten). For those interested in seeing how this can be done with GATKReports look here for a nice clean example -- Integration tests -- Minor improvements to GATKReportTable with methods to getPrimaryKeys	2012-03-07 13:10:08 -05:00
Christopher Hartl	a6a8fc0521	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable	2012-03-07 10:05:43 -05:00
Mark DePristo	569be953b9	Bugfix for VariantEval -- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp. These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicuously as incorrect false negatives in the validation report. -- Updating misc. integration tests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.	2012-03-06 16:56:59 -05:00
Christopher Hartl	67def6acc8	Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable	2012-03-06 14:23:14 -05:00
Christopher Hartl	20c1fbaf0f	Fixing a merge (turning off downsampling on DoC)	2012-03-06 14:22:45 -05:00
David Roazen	0702ee1587	Public-key authorization scheme to restrict use of NO_ET -Running the GATK with the -et NO_ET or -et STDOUT options now requires a key issued by us. Our reasons for doing this, and the procedure for our users to request keys, are documented here: http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home -A GATK user key is an email address plus a cryptographic signature signed using our private key, all wrapped in a GZIP container. User keys are validated using the public key we now distribute with the GATK. Our private key is kept in a secure location. -Keys are cryptographically secure in that valid keys definitely came from us and keys cannot be fabricated, however keys are not "copy-protected" in any way. -Includes private, standalone utilities to create a new GATK user key (GenerateGATKUserKey) and to create a new master public/private key pair (GenerateKeyPair). Usage of these tools will be documented on the internal wiki shortly. -Comprehensive unit/integration tests, including tests to ensure the continued integrity of the GATK master public/private key pair. -Generation of new user keys and the new unit/integration tests both require access to the GATK private key, which can only be read by members of the group "gsagit".	2012-03-06 00:09:43 -05:00
Lechu	027843d791	I've simply added a "library(grid)" call at the beginning of the R script generation since R 2.14.2 doesn't seem to load the "grid" package as default. I haven't tested it on previous R versions (you may edit the R version comment to be more precise if desired), but I'm almost certain that this library call shouldn't do any harm on them. Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>	2012-03-05 21:27:03 -05:00
Ryan Poplin	9b53250bef	Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode.	2012-03-05 21:07:36 -05:00
Ryan Poplin	b37461587d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-05 17:54:59 -05:00
Ryan Poplin	c6ded4d23c	Bug fix for hard clipping reads when base insertion and base deletion qualities are present in the read. Updating HaplotypeCaller integration tests to reflect all the recent changes.	2012-03-05 17:54:42 -05:00
Ryan Poplin	14a77b1e71	Getting rid of redundant methods in MathUtils. Adding unit tests for approximateLog10SumLog10 and normalizeFromLog10. Increasing the precision of the Jacobian approximation used by approximateLog10SumLog which changes the UG+HC integration tests ever so slightly.	2012-03-05 12:28:32 -05:00
Mauricio Carneiro	e9ad382e74	unifying the BQSR argument collection	2012-03-05 10:48:26 -05:00
Ryan Poplin	f879daa7d0	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-05 08:29:08 -05:00
Ryan Poplin	d6871967ae	Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype there is no reason to ensure all haplotypes have the same length, which simplifies the code considerably.	2012-03-05 08:28:42 -05:00
Guillermo del Angel	3b5a7c34d7	Added argument to ValidationAmplicons to only output valid sequences - useful for not having to post-filter or grep resulting files before delivering downstream	2012-03-04 10:24:29 -05:00
Mark DePristo	69611af7d3	Workaround for bug in Picard in ReadGroupProperties -- NPE caused when you call getRunDate on a read group without a date.	2012-03-02 18:53:45 -05:00
Mark DePristo	ba71b0aee4	ReadGroupProperties mk3 -- Includes sequencing date	2012-03-02 16:12:42 -05:00
Eric Banks	1e07e97b58	Optimization: create allele list just once, not for each genotype	2012-03-02 13:30:17 -05:00
Ryan Poplin	0ad7d5fbc1	Standalone common Pair HMM utility class with associated unit tests.	2012-03-01 22:41:13 -05:00
Mark DePristo	2f334a57c2	ReadGroupProperties mk2 -- Includes paired end status (T/F) -- Includes count of reads used in calculation -- Includes simple read type (2x76 for example) -- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0	2012-03-01 18:43:53 -05:00
Mauricio Carneiro	486712bfc2	ugly RG encoding	2012-03-01 17:56:45 -05:00
Mark DePristo	aff508e091	ReadGroupProperties walker and associated infrastructure -- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file. -- Median tool that collects up to a given maximum number of elements and returns the median of the elements. -- Unit and integration tests for everything. -- Making name of TestProvider protected so subclasses can override name more easily	2012-03-01 15:01:11 -05:00
Mauricio Carneiro	9e95b10789	Context covariate now operates as a highly compressed bitset * All contexts with 'N' bases are now collapsed as uninformative * Context size is now represented internally as a BitSet but output as a dna string * Temporarily disabled sorted outputs because of null objects	2012-02-29 19:25:21 -05:00
Mauricio Carneiro	d379c3763a	DNA Sequence to BitSet and vice-versa conversion tools * Turns DNA sequences (for context covariates) into bit sets for maximum compression * Allows variable context size representation guaranteeing uniqueness. * Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary). * Unit Tests added	2012-02-29 19:25:20 -05:00
Eric Banks	129b5e7f6b	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-28 10:09:34 -05:00
Eric Banks	a4a279ce80	Damn you, Mark	2012-02-28 10:09:09 -05:00
Khalid Shakir	0681bea5a5	Changed DoC from PartitionType.INTERVAL to PartitionType.NONE since it doesn't have a way to gather scattered outputs. Added MultiallelicSummary to HSP eval.	2012-02-28 09:27:27 -05:00
Eric Banks	bd398e30fd	Another quick optimization	2012-02-28 09:25:35 -05:00
Eric Banks	40bdadbda5	Minor optimization as per Mark	2012-02-28 09:24:07 -05:00
Eric Banks	d7928ad669	Drat, missed one: handle null alleles being passed in.	2012-02-27 21:31:54 -05:00
Mark DePristo	24356f11b7	Merged bug fix from Stable into Unstable -- Resolved conflict Conflicts: public/java/src/org/broadinstitute/sting/gatk/datasources/reads/SAMDataSource.java	2012-02-27 17:13:17 -05:00
Mark DePristo	0b29d54937	Changed most BAMSchedule ReviewedStingExceptions to UserExceptions -- As these represent the bulk of the StingExceptions coming from BAMSchedule and are caused by simple problems like the user providing bad input tmp directories, etc.	2012-02-27 17:08:41 -05:00
Mark DePristo	f9e8e82e33	Removed unused class variable from VCFHeaderLineTranslator	2012-02-27 17:07:19 -05:00
Mark DePristo	100ddef930	Fix typo in VariantContextBuilder	2012-02-27 17:06:45 -05:00
Mark DePristo	5f7ccdcc01	Avoid calling getBasePileup when there's no pileup in NBaseCount annotation	2012-02-27 15:12:25 -05:00
Mark DePristo	729bb954e2	Throws ReviewedStingException for a bug when parent VariantContext argument is null	2012-02-27 15:09:00 -05:00
Eric Banks	998ed8fff3	Bug fix to deal with VCF records that don't have GTs. While in there, optimized a bunch of related functions (including removing a copy of the method calculateChromosomeCounts(); why did we have 2 copies? very dangerous).	2012-02-27 14:56:10 -05:00
Mark DePristo	4d9582de77	More general catching of Exceptions in interval reading to throw MalformedFile exception in all cases -- Now throws UserException no matter what happens during the reading of the intervals file.	2012-02-27 14:02:26 -05:00
Mark DePristo	9712fed7a5	Trap SAMFormatException and rethrow as MalformatedBAM exception -- Trap errors in header and rethrow -- Wrap underlying iterator in MalformatedBAMErrorReformattingIterator	2012-02-27 13:52:50 -05:00
Eric Banks	64754e7870	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-02-27 11:31:41 -05:00
Eric Banks	850c5d0db2	Enabling Rank Sum Tests for multi-allelics: use ref vs any alt allele.	2012-02-27 09:59:36 -05:00

2974 Commits (7dcafe8b8194ce8a9d0b8825812fd11c8f9a0612)