gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Valentin Ruano-Rubio	0f99778a59	Adding Graph-based likelihood ratio calculation to HC To activate this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line. New HC Options (both Advanced and Hidden): ========================================== --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM) Specifies what engine should be used to generate read vs haplotype likelihoods. PairHMM : standard full-PairHMM approach. GraphBased : using the assembly graph to accelerate the process. Random : generate random likelihoods - used for benchmarking purposes only. --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN) It indicates how to merge haplotypes produced using different kmerSizes. Only has effect when used in combination with (--likelihoodCalculationEngine GraphBased) COMBO_MIN : use the smallest kmerSize with all haplotypes. COMBO_MAX : use the larger kmerSize with all haplotypes. MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it. MAX_ONLY : use the larger kmerSize with haplotypes assembled using it. Major code changes: =================== * Introduce multiple likelihood calculation engines (before there was just one). * Assembly results from different kmerSizes are now packed together using the AssemblyResultSet class. * Added yet another PairHMM implementation with a different API in order to support local PairHMM calculations (e.g. a segment of the read vs a segment of the haplotype). Major components: ================ * FastLoglessPairHMM: New pair-hmm implementation using some heuristics to speed up partial PairHMM calculations * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the execution of the graph-based likelihood approach. * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals to calculate the likelihoods using the graph as a scaffold. * HaplotypeGraph: haplotype threading graph built from the assembly haplotypes. This structure is the one used by GraphBasedLikelihoodCalculationEngineInstance to do its work. * ReadAnchoring and KmerSequenceGraphMap: contain information on how a read maps onto the HaplotypeGraph, which is used by GraphBasedLikelihoodCalculationEngineInstance to do its work. Remove mergeCommonChains from HaplotypeGraph creation Fixed bamboo issues with HaplotypeGraphUnitTest Fixed problems with HaplotypeCallerIntegrationTest Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest Fixed ReadThreadingLikelihoodCalculationEngine issues Moved event-block iteration outside GraphBasedEngineInstance Removed unnecessary parameter from ReadAnchoring constructor. Fixed test problem Added a bit more documentation to EventBlockSearchEngine Fixing some private - protected dependency issues Further refactoring making GraphBasedInstance and HaplotypeGraph slimmer. Addressed last pull request commit comments Fixed FastLoglessPairHMM public -> protected dependency Fixed problem with HaplotypeGraph unit test	2013-12-02 19:37:19 -05:00
Chris Hartl	1f777c4898	Introducing the latest-and-greatest in genotyping: CalculatePosteriors. CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution. Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows: while not converged: * AC = [external AC] + [sample AC] * Prior = DirichletMultinomial[AC] * Posteriors = [sample GL + Prior] * sample AC = MLEAC(Posteriors) This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common use case. Fully unit tested. Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.	2013-11-27 13:00:45 -05:00
Ryan Poplin	3503050a39	Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller -- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller. -- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median -- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs -- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model -- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions -- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings -- We only realign reads in the reference model if the read is informative for a particular haplotype over another -- GVCF blocks will now track and output the minimum PLs over the block -- MD5 changes! -- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny -- GVCF tests: from HC changes above and adding in active region trimming	2013-09-06 16:56:34 -04:00
David Roazen	42d771f748	Remove org.apache.commons.collections.IteratorUtils dependency from the test suite -This was a dependency of the test suite, but not the GATK proper, which caused problems when running the test suite on the packaged GATK jar at release time -Use GATKVCFUtils.readVCF() instead	2013-08-21 19:44:02 -04:00
Michael McCowan	c3a933ce84	Adaptations to accommodate Tribble API changes, consisting mostly of the following. * Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable. * Galvanizing of generic types. * Test fixups, mostly to pass around LineIterators instead of LineReaders. * New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances. * New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).	2013-08-19 15:52:47 -04:00
Yossi Farjoun	284176cd7b	moved SnpEffUtilUnitTest to public tree	2013-07-30 17:51:40 -04:00
Eric Banks	6df43f730a	Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes. Having them as bytes caused problems for downstream programmers who had data with high MQs.	2013-07-23 23:47:15 -04:00
David Roazen	605a5ac2e3	GATK engine: add ability to do on-the-fly BAM file sample renaming at runtime -User must provide a mapping file via new --sample_rename_mapping_file argument. Mapping file must contain a mapping from absolute bam file path to new sample name (format is described in the docs for the argument). -Requires that each bam file listed in the mapping file contain only one sample in their headers (they may contain multiple read groups for that sample, however). The engine enforces this, and throws a UserException if on-the-fly renaming is requested for a multi-sample bam. -Not all bam files for a traversal need to be listed in the mapping file. -On-the-fly renaming is done as the VERY first step after creating the SAMFileReaders in SAMDataSource (before the headers are even merged), to prevent possible consistency issues. -Renaming is done ONCE at traversal start for each SAMReaders resource creation in the SAMResourcePool; this effectively means once per -nt thread -Comprehensive unit/integration tests Known issues: -if you specify the absolute path to a bam in the mapping file, and then provide a path to that same bam to -I using SYMLINKS, the renaming won't work. The absolute paths will look different to the engine due to the symlink being present in one path and not in the other path. GSA-974 #resolve	2013-07-18 15:48:42 -04:00
David Roazen	c15751e41e	SAMReaderID: fix bug with hash code and equals() method -Two SAMReaderIDs that pointed at the same underlying bam file through a relative vs. an absolute path were not being treated as equal, and had different hash codes. This was causing problems in the engine, since SAMReaderIDs are often used as the keys of HashMaps. -Fix: explicitly use the absolute path to the encapsulated bam file in hashCode() and equals() -Added tests to ensure this doesn't break again	2013-07-15 13:57:00 -04:00
Eric Banks	b16c7ce050	A whole slew of improvements to the Haplotype Caller and related code. 1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout. 2. Updates to the KB assessment functionality: a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call. b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling. 3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived the DeBruijn assembler. 4. Improvements to the likelihoods for the HC a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need to divide e by 3 because the observed base could have come from any of the non-observed alleles. b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are turned into Q0s for now. c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real sites with off-kilter allele balances and we occasionally lose them to contamination removal. 5. Improved the dangling tail merging implementation.	2013-07-12 10:09:10 -04:00
Mark DePristo	e3e8631ff5	Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences -- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction. Fixing an annoying bug where you'd perfectly assemble the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because null was also used to indicate assembly failure -- Output format looks like: 20 10026072 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026073 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119 20 10026074 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,121 20 10026075 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119 20 10026076 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026077 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120 20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217 20 10026079 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,240 20 10026080 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,268 20 10026081 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,267 We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values. Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty. -- Can be enabled for single samples with --emitRefConfidence (-ERC). -- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval. 
The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads -- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures. Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class. -- Includes GVCF writer -- Add 1 mb of WEx data to private/testdata -- Integration tests for reference model output for WGS and WEx data -- Emit GQ block information into VCF header for GVCF mode -- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC -- Control max indel size for the reference confidence model from the command line. Increase default to 10 -- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest -- Unittests for ReferenceConfidenceModel -- Unittests for new MathUtils functions	2013-07-02 15:46:38 -04:00
Mark DePristo	41aba491c0	Critical bugfix for adapter clipping in HaplotypeCaller -- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference. Terrible. -- Removed the broken logic of determining whether a read adaptor is too long. -- Doesn't require isProperPairFlag to be set for a read to be adapter clipped -- Update integration tests for new adapter clipping code	2013-07-02 15:46:36 -04:00
David Roazen	31827022db	Fix pipeline tests that were not respecting the pipeline test dry run setting There are a few pipeline test classes that do not run Queue, but are classified as pipeline tests because they submit farm jobs. Make these unconventional pipeline tests respect the pipeline test dry run setting.	2013-06-28 15:27:17 -04:00
Mark DePristo	fdfe4e41d5	Better GATK version and command line output -- Previous version emitted command lines that look like: ##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..." the new version provides additional information on when the GATK was run and the GATK version in a nicer format: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ..."> -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test: ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff"> ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff"> -- Removed the ProtectedEngineFeaturesIntegrationTest -- Actual unit tests for these features!	2013-06-20 11:19:13 -04:00
Mark DePristo	0672ac5032	Fix public / protected dependency	2013-06-19 19:42:09 -04:00
David Roazen	51ec5404d4	SAMDataSource: always consolidate cigar strings into canonical form -Collapses zero-length and repeated cigar elements, neither of which can necessarily be handled correctly by downstream code (like LIBS). -Consolidation is done before read filters, because not all read filters behave correctly with non-consolidated cigars. -Examined other uses of consolidateCigar() throughout the GATK, and found them to not be redundant with the new engine-level consolidation (they're all on artificially-created cigars in the HaplotypeCaller and SmithWaterman classes) -Improved comments in SAMDataSource.applyDecoratingIterators() -Updated MD5s; differences were examined and found to be innocuous -Two tests: -Unit test for ReadFormattingIterator -Integration test for correct handling of zero-length cigar elements by the GATK engine as a whole	2013-06-19 13:29:01 -04:00
Mark DePristo	7b22467148	Bugfix: defaultBaseQualities actually works now -- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly. Added integration tests to ensure this continues to work. -- [delivers #49538319]	2013-06-17 14:37:27 -04:00
Mark DePristo	b69d210255	Bugfix: allow gzip VCF output in multi-threaded GATK output -- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is. Added integration test to ensure this behavior works going forward. -- [delivers #47399279]	2013-06-17 12:39:18 -04:00
David Roazen	d167292688	Reduce number of leftover temp files in GATK runs -WalkerTest now deletes .idx files on exit -ArtificialBAMBuilder now deletes .bai files on exit -VariantsToBinaryPed walker now deletes its temp files on exit	2013-06-14 15:56:03 -04:00
Ryan Poplin	c4e508a71f	Merge pull request #275 from broadinstitute/md_fragment_with_pcr Improvements to HaplotypeCaller and NA12878 KB	2013-06-14 09:32:26 -07:00
Mark DePristo	74f311c973	Emit the GATK version number in the VCF header -- Looks like ##GATKVersion=2.5-159-g3f91d93 in the VCF header line -- delivers [#51595305]	2013-06-13 15:46:16 -04:00
Mark DePristo	6232db3157	Remove STANDARD option from GATKRunReport -- AWS is now the default. Removed old code that referred to the STANDARD type. Deleted unused variables and functions.	2013-06-13 15:18:28 -04:00
Mark DePristo	dd5674b3b8	Add genotyping accuracy assessment to AssessNA12878 -- Now table looks like: Name VariantType AssessmentType Count variant SNPS TRUE_POSITIVE 1220 variant SNPS FALSE_POSITIVE 0 variant SNPS FALSE_NEGATIVE 1 variant SNPS TRUE_NEGATIVE 150 variant SNPS CALLED_NOT_IN_DB_AT_ALL 0 variant SNPS HET_CONCORDANCE 100.00 variant SNPS HOMVAR_CONCORDANCE 99.63 variant INDELS TRUE_POSITIVE 273 variant INDELS FALSE_POSITIVE 0 variant INDELS FALSE_NEGATIVE 15 variant INDELS TRUE_NEGATIVE 79 variant INDELS CALLED_NOT_IN_DB_AT_ALL 2 variant INDELS HET_CONCORDANCE 98.67 variant INDELS HOMVAR_CONCORDANCE 89.58 -- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB. Add lots of unit tests for these functions (from 0 previously) -- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record: 20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6 Indicating that the call was a HET but the expected result was HOM_VAR -- Forbid subsetting of diploid genotypes to just a single allele. -- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG	2013-06-13 15:05:32 -04:00
Mark DePristo	b2dc7095ab	Merge pull request #267 from broadinstitute/dr_reducereads_downsampling_fix Exclude reduced reads from elimination during downsampling	2013-06-11 13:52:28 -07:00
David Roazen	95b5f99feb	Exclude reduced reads from elimination during downsampling Problem: -Downsamplers were treating reduced reads the same as normal reads, with occasionally catastrophic results on variant calling when an entire reduced read happened to get eliminated. Solution: -Since reduced reads lack the information we need to do position-based downsampling on them, best available option for now is to simply exempt all reduced reads from elimination during downsampling. Details: -Add generic capability of exempting items from elimination to the Downsampler interface via new doNotDiscardItem() method. Default inherited version of this method exempts all reduced reads (or objects encapsulating reduced reads) from elimination. -Switch from interfaces to abstract classes to facilitate this change, and do some minor refactoring of the Downsampler interface (push implementation of some methods into the abstract classes, improve names of the confusing clear() and reset() methods). -Rewrite TAROrderedReadCache. This class was incorrectly relying on the ReservoirDownsampler to preserve the relative ordering of items in some circumstances, which was behavior not guaranteed by the API and only happened to work due to implementation details which no longer apply. Restructured this class around the assumption that the ReservoirDownsampler will not preserve relative ordering at all. -Add disclaimer to description of -dcov argument explaining that coverage targets are approximate goals that will not always be precisely met. -Unit tests for all individual downsamplers to verify that reduced reads are exempted from elimination	2013-06-11 16:16:26 -04:00
Eric Banks	dadcfe296d	Reworking of the dangling tails merging code. We now run Smith-Waterman on the dangling tail against the corresponding reference tail. If we can generate a reasonable, low entropy alignment then we trigger the merge to the reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs and don't let those pass through. Added tests for this implementation that check exact SW results and correct edges added.	2013-06-11 12:53:04 -04:00
Mark DePristo	1c03ebc82d	Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP -- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure -- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map -- Cleanup misc. code as it came up -- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC	2013-06-10 16:20:31 -04:00
Valentin Ruano-Rubio	96073c3058	This commit addresses JIRA issue GSA-948: Prevent users from doing the wrong thing with RNA-Seq data and the GATK. The previous behavior was to process reads with N CIGAR operators as they are, even though many of the tools do not actually support this operator and results become unpredictable. Now if there is a read with the N operator, the engine throws a user exception. The error message indicates what the problem is (including the offending read and mapping position) and gives a couple of alternatives that the user can take in order to move forward: a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC) b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL) Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring. Implementation: * Added filterReadsWithNCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work. * Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants. * Added ReadFilterTest class as a common parent for ReadFilter test cases. * Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class. * Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality. * Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used. * Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used. * Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior. * Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior	2013-06-10 10:44:42 -04:00
Michael McCowan	00c06e9e52	Performance improvements: - Memoized MathUtil's cumulative binomial probability function. - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.	2013-06-09 11:26:52 -04:00
Mark DePristo	e19c24f3ee	Bugfix for HaplotypeCaller error: Only one of refStart or refStop must be < 0, not both -- This occurred because we were reverting reads with soft clips that would produce reads with negative (or 0) alignment starts. From such reads we could end up with adaptor starts that were negative and that would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which would revert and adaptor clip reads). -- We now hard clip away reverted soft-clipped bases that fall before the 1-based contig start in revertSoftClippedBases. -- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString) -- Added unit tests for reverting soft clipped bases that create a read before the contig -- [delivers #50892431]	2013-06-04 10:33:46 -04:00
Mark DePristo	6555361742	Fix error in merging code in HC -- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested. -- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases. -- delivers #50655421	2013-05-31 16:29:29 -04:00
Mark DePristo	4b206a3540	Check that -compress arguments are within range 0-9 -- Although the original bug report was about SplitSamFile it actually was an engine-wide error. The two places that provide compression to the BAM writer now check the validity of the compress argument via a static method in ReadUtils -- delivers #49531009	2013-05-31 15:29:02 -04:00
Mark DePristo	b16de45ce4	Command-line read filters are now applied before Walker default filters -- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 before the BQSR filters them out with MappingQualityUnassignedFilter. -- delivers #50222251	2013-05-30 16:54:18 -04:00
Ryan Poplin	61af37d0d2	Create a new normalDistributionLog10 function that is unit tested for use in the VQSR.	2013-05-30 16:00:08 -04:00
David Roazen	a7cb599945	Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage -Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200, since low dcov values can result in problematic downsampling artifacts for locus-based traversals. -Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals controls the number of reads per alignment start position, and even a dcov value of 1 might be safe/desirable in some circumstances. -Also reorganize the global downsampling defaults so that they are specified as annotations to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the DownsamplingMethod class. -The default downsampling settings have not been changed: they are still -dcov 1000 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.	2013-05-29 12:07:12 -04:00
delangel	925232b0fc	Merge pull request #236 from broadinstitute/md_simple_hc_performance_improvements 3 simple performance improvements for HaplotypeCaller	2013-05-22 07:58:28 -07:00
Eric Banks	881b2b50ab	Optimized counting of filtered records by filter. Don't map class to counts in the ReadMetrics (necessitating 2 HashMap lookups for every increment). Instead, wrap the ReadFilters with a counting version and then set those counts only when updating global metrics.	2013-05-21 21:54:49 -04:00
Mark DePristo	a1093ad230	Optimization for ActiveRegion.removeAll -- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element. New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one. -- Update HaplotypeCaller filterReadsInRegion to use a Set not a List. -- Expanded the unit tests a bit for ActiveRegion.removeAll	2013-05-21 16:18:57 -04:00
Eric Banks	20c7a89030	Fixes to get accurate read counts for Read traversals 1. Don't clone the dataSource's metrics object (because then the engine won't continue to get updated counts) 2. Use the dataSource's metrics object in the CountingFilteringIterator and not the first shard's object! 3. Synchronize ReadMetrics.incrementMetrics to prevent race conditions. Also: * Make sure users realize that the read counts are approximate in the print outs. * Removed a lot of unused cruft from the metrics object while I was in there. * Added test to make sure that the ReadMetrics read count does not overflow ints. * Added unit tests for traversal metrics (reads, loci, and active region traversals); these test counts of reads and records.	2013-05-21 15:24:07 -04:00
Eric Banks	58f4b81222	Count Reads should use a Long instead of an Integer for counts to prevent overflows. Added unit test.	2013-05-21 15:23:51 -04:00
Mark DePristo	62fc88f92e	CombineVariants no longer adds PASS to unfiltered records -- [Delivers #49876703] -- Add integration test and test file -- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!	2013-05-20 16:53:51 -04:00
Yossi Farjoun	3e2a0b15ed	- Added a @Hidden option ( -outputInsertLength ) to PileupWalker that causes it to emit insert sizes together with the pileup (to assist Mark Daly's investigation of the contamination dependence on insert length) - Converted my old GATKBAMIndexText (within PileupWalkerIntegrationTest) to use a dataProvider - Added two integration tests to test -outputInsertLength option	2013-05-16 12:47:16 -04:00
Mark DePristo	371f3752c1	Subshard timeouts in the GATK -- The previous implementation of the maxRuntime would require us to wait until all of the work was completed within a shard, which can be a substantial amount of work in the case of a locus walker with 16kb shards. -- This implementation ensures that we exit from the traversal very soon after the max runtime is exceeded, without completing all of our work within the shard. This is done by updating all of the traversal engines to return false for hasNext() in the nano scheduled input provider. So as soon as the timeout is exceeded, we stop generating additional data to process, and we only have to wait until the currently executing data processing unit (locus, read, active region) completes. -- In order to implement this timeout efficiently at this fine scale, the progress meter now lives in the genome analysis engine, and the exceedsTimeout() call in the engine looks at a periodically updated runtime variable in the meter. This variable contains the elapsed runtime of the engine, but is updated by the progress meter daemon thread so that the engine doesn't call System.nanoTime() in each cycle of the engine, which would be very expensive. Instead we basically wait for the daemon to update this variable, and so our precision of timing out is limited by the update frequency of the daemon, which is on the order of every few hundred milliseconds, totally fine for a timeout. -- Added integration tests to ensure that subshard timeouts are working properly	2013-05-15 07:00:39 -04:00
Mark DePristo	7d78a77f17	Trivial update to ceutrio.ped file to make it really the CEU trio sample names	2013-05-14 17:08:13 -04:00
Mark DePristo	39e4396de0	New ActiveRegionShardBalancer allows efficient NanoScheduling -- Previously we used the LocusShardBalancer for the haplotype caller, which meant that TraverseActiveRegions saw its shards grouped in chunks of 16kb bits on the genome. These locus shards are useful when you want to use the HierarchicalMicroScheduler, as they provide fine-grained access to the underlying BAM, but they have two major drawbacks (1) we have to fairly frequently reset our state in TAR to handle moving between shard boundaries and (2) with the nano scheduled TAR we end up blocking at the end of each shard while our threads all finish processing. -- This commit changes the system over to using an ActiveRegionShardBalancer that combines all of the shard data for a single contig into a single combined shard. This ensures that TAR, and by extension the HaplotypeCaller, gets all of the data on a single contig together so that the NanoScheduler runs efficiently instead of blocking over and over at shard boundaries. This simple change allows us to scale efficiently to around 8 threads in the nano scheduler: -- See https://www.dropbox.com/s/k7f280pd2zt0lyh/hc_nano_linear_scale.pdf -- See https://www.dropbox.com/s/fflpnan802m2906/hc_nano_log_scale.pdf -- Misc. changes throughout the codebase so we use the ActiveRegionShardBalancer where appropriate. -- Added unit tests for ActiveRegionShardBalancer to confirm it does the merging as expected. -- Fix bad toString in FilePointer	2013-05-13 11:09:02 -04:00
Mark DePristo	b4f482a421	NanoScheduled ActiveRegionTraversal and HaplotypeCaller -- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results -- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk -- Tests for NanoScheduling -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data -- Some misc. code cleanup of HaplotypeCaller -- Analysis scripts to assess performance of nano scheduled HC -- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.	2013-05-13 11:09:02 -04:00
Eric Banks	2f5ef6db44	New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points. * This version inherits from the original SW implementation so it can use the same matrix creation method. * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the right thing for indels at the edges of the alignments. * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy. * Lots of systematic testing added for this implementation. * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.	2013-05-13 09:36:39 -04:00
David Roazen	639030bd6d	Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements -Diff engine output is now included in the actual exception message thrown as a result of an MD5 mismatch, which allows it to be conveniently viewed on the main page of a build in Bamboo. Minor Additional Improvements: -WalkerTestSpec now auto-detects test class name via new JVMUtils.getCallingClass() method, and the test class name is now included as a regular part of integration test output for each test. -Fix race condition in MD5DB.ensureMd5DbDirectory() -integrationtests dir is now cleaned by "ant clean" GSA-915 #resolve	2013-05-10 19:00:33 -04:00
Mark DePristo	fa8a47ceef	Replace DeBruijnAssembler with ReadThreadingAssembler Problem ------- The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp). This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was a chain of kmers of longer length that started before and after the variant. Solution -------- The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers. The algorithm operates as follows: identify all non-unique kmers of size K among all reads and the reference for each sequence (ref and read): find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if none exist for each base in the sequence from the starting vertex kmer: look at the existing outgoing nodes of current vertex V. If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue If no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well with even very short kmers. 
This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes; the key changes are highlighted below: -- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes) -- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path is < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph. -- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that depends only on the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used. -- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway. -- The old DebruijnAssembler is still available with a command line option -- The number of paths we take forward from each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples. 
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes -- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler Infrastructure changes / improvements ------------------------------------- -- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler -- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement an assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. 
This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers -- Refactored active region trimming code into ActiveRegionTrimmer class -- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default -- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions. -- Misc. updates to Smith Waterman code. Added a generic interface called, not surprisingly, SmithWaterman, making it easier to have alternative implementations. -- Many many more unit tests throughout the entire assembler, and in random utilities	2013-05-08 21:41:42 -04:00
Mark DePristo	f42bb86bdd	Only try to clip adaptors when both reads of the pair are on opposite strands -- Read pairs that have unusual alignments, such as two reads both oriented like: <----- <----- were previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs. This caused us to clip extra good bases from reads. -- Update MD5s due to the change in adaptor clipping, which adds some coverage in some places	2013-05-03 11:19:14 -04:00
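
The global mismapping cap described in commit fa8a47ceef can be illustrated with a small worked example. This is only a sketch in Python, not GATK's Java code: the function name `cap_read_likelihoods` and the exact capping rule (each haplotype's log10 likelihood for a read is floored at the best haplotype's likelihood minus Q/10) are assumptions chosen to reproduce the arithmetic in the commit message.

```python
def cap_read_likelihoods(log10_likelihoods, phred_scaled_mismapping_rate):
    """Floor each haplotype's log10 likelihood for a single read.

    Hypothetical helper (an assumption, not the HaplotypeCaller API):
    no haplotype may be worse than the best one by more than
    phred_scaled_mismapping_rate / 10 log10 units.
    """
    max_difference = phred_scaled_mismapping_rate / 10.0
    floor = max(log10_likelihoods) - max_difference
    return [max(lik, floor) for lik in log10_likelihoods]


# One read with 5 mismatches against the reference, each worth about Q30
# (3 log10 units): the reference haplotype scores -15.0, the 5-mismatch
# haplotype scores 0.0.
read_likelihoods = [-15.0, 0.0]  # [reference, alternate]

# Uncapped evidence against the reference, in phred units: 5 * Q30 = Q150.
uncapped_evidence_phred = 10.0 * (read_likelihoods[1] - read_likelihoods[0])

# With the parameter set to Q30, the same read contributes at most Q30.
capped = cap_read_likelihoods(read_likelihoods, 30)
capped_evidence_phred = 10.0 * (capped[1] - capped[0])
```

Under these assumptions the cap applies per read, so a single highly divergent read cannot by itself supply enough evidence to call all five events.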


1362 Commits (0f99778a594de2ab0d219e11a020e050cd4809c3)
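
The ActiveRegion.removeAll optimization described in commit a1093ad230 replaces a list-against-list removal with a single pass over the reads that checks membership in a set. A rough sketch of that idea follows, in Python rather than GATK's Java. The `(name, start, end)` tuple layout for a read and the function name `remove_all` are assumptions made for illustration only.

```python
def remove_all(reads, reads_to_remove):
    """Drop every read in `reads_to_remove` (a set) from `reads` (a list).

    The list is filtered in one pass, and the span of the surviving reads
    is recomputed in that same pass. Returns (start, end), or None if no
    reads remain. Reads are hypothetical (name, start, end) tuples.
    """
    kept = []
    span_start, span_end = None, None
    for read in reads:
        if read in reads_to_remove:  # expected O(1) lookup in a set
            continue
        kept.append(read)
        _, start, end = read
        span_start = start if span_start is None else min(span_start, start)
        span_end = end if span_end is None else max(span_end, end)
    reads[:] = kept  # update the caller's list in place
    return (span_start, span_end) if kept else None


region_reads = [("r1", 100, 150), ("r2", 120, 170), ("r3", 300, 350)]
filtered_out = {("r3", 300, 350)}
new_span = remove_all(region_reads, filtered_out)
```

With n reads in the region and m reads to remove, this pass costs roughly O(n), whereas removing against a list costs roughly O(n * m) because every membership check is a linear scan.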