gatk-3.8

Commit Graph

Author	SHA1	Message	Date
David Roazen	8b29030467	Change default downsampling coverage target for the HaplotypeCaller to 250 -was previously set to 30, which seems far too aggressive given that with ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any pileup returned by LIBS -250 is a more conservative default used by the UG -can adjust down/up later based on further experiments (GSA-699 will remain open) -verified with Ryan that all integration test differences are either innocent or represent an improvement GSA-699	2013-02-26 09:33:25 -05:00
Ryan Poplin	89e2943dd1	The maximum kmer length is derived from the reads. -- This is done to take advantage of longer reads which can produce less ambiguous haplotypes -- Integration tests change for HC and BiasedDownsampling	2013-02-25 14:40:25 -05:00
Ryan Poplin	6a639c8ffc	Replace Smith-Waterman alignment with the bubble traversal. -- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph. -- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime. -- Refactoring graph functions into a new DeBruijnAssemblyGraph class. -- Bug fix in path.getBases(). -- Adding validation code to the assembly engine. -- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class. -- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes. -- Added ability to ignore bubbles that are too divergent from the reference -- Max kmer can't be bigger than the extension size. -- Reverse the order that we create the assembly graphs so that the bigger kmers are used first. -- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment. -- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation. -- Updating HaplotypeCaller and BiasedDownsampling integration tests. -- Rebased everything into one commit as requested by Eric -- improvements to the bubble traversal are coming as a separate push	2013-02-22 15:42:16 -05:00
Mauricio Carneiro	e3f01673e1	Implementation of the find and diagnose Queue script -- Added 'uncovered intervals' output for FindCoveredIntervals -- updated scala script to make use of it.	2013-02-22 10:19:01 -05:00
Ryan Poplin	62e14f5b58	Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores.	2013-02-21 14:34:17 -05:00
Eric Banks	6996a953a8	Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample). These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons): * Don't unnecessarily overload Allele.getBases() in the Haplotype class. * Haplotype.getBases() was calling clone() on the byte array. * Added a constructor to Allele (and Haplotype) that takes in an Allele as input. * It makes a copy of the given allele without having to go through the validation of the bases (since the Allele has already been validated). * Rev'ed the variant jar accordingly. For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.	2013-02-21 10:14:11 -05:00
Eric Banks	551d33686c	Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761 Reduce memory footprint of SyntheticRead by replacing several Lists with...	2013-02-20 04:49:07 -08:00
Eric Banks	9dfdb9528b	Merge pull request #49 from broadinstitute/gda_hidden_ug_args Hide arguments related to reference sample operation in UG - for interna...	2013-02-19 16:18:32 -08:00
Eric Banks	0055a6f1cd	Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774 Fix to the Indel Realigner bug described in GSA-774	2013-02-19 16:16:48 -08:00
Guillermo del Angel	5a0a9bc488	Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished.	2013-02-19 19:06:42 -05:00
Mauricio Carneiro	371ea2f24c	Fixed IndelRealigner reference length bug (GSA-774) -- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases -- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases -- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them) -- pulled out ReadBin class so it can be testable -- added unit tests for ReadBin with soft-clips -- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads GSA-774 #resolve	2013-02-19 16:00:36 -05:00
Alec Wysoker	ab75e053da	Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static class that contains the attributes that were scattered across the several Lists.	2013-02-19 15:33:33 -05:00
Ryan Poplin	c025e84c8b	Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping. -- Updated a few integration tests for HC, UG, and UG general ploidy	2013-02-19 14:09:24 -05:00
Ryan Poplin	b7e9c342c7	Reducing the size of the reference padding in the HaplotypeCaller.	2013-02-17 11:09:00 -05:00
Mark DePristo	3231031c1a	Bugfix for FisherStrand -- FisherStrand pValues can sum to slightly greater than 1.0, so they need to be capped to convert to a Phred-scaled quality score	2013-02-16 07:31:38 -08:00
Mark DePristo	9a29d6d4be	Fix a catastrophic bug (WoW!) in the reference calculation of the UG -- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and as a side effect other utils converted this to a meaningless 0.0. This is all because there wasn't a unit test. -- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.	2013-02-16 07:31:38 -08:00
Mark DePristo	9e28d1e347	Cleanup and unit tests for QualityUtils -- Fixed a few conversion bugs with edge case quals (ones that were very high) -- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s -- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK -- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly. -- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate. -- Renamed probToQual to trueProbToQual -- Added goodProbability and log10OneMinusX to MathUtils -- Went through the GATK and cleaned up many uses of QualityUtils -- Cleanup constants in QualityUtils -- Added full docs for all of the constants -- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity -- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's a BQSR specific feature -- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x) -- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils	2013-02-16 07:31:37 -08:00
MauricioCarneiro	d80b99143f	Merge pull request #37 from broadinstitute/rp_left_alignment_hc_contract_GSA-771	2013-02-15 08:32:45 -08:00
Ryan Poplin	871c8b3866	No need to consider haplotypes which Smith-Waterman aligns off the end of the large padded reference.	2013-02-14 11:18:10 -05:00
Geraldine Van der Auwera	6208742f7c	Refactored GATKDocs categories some more ( GSATDG-62 ) -- Renamed ValidatePileup to CheckPileup since validation is reserved word -- Renamed AlignmentValidation to CheckAlignment (same as above) -- Refactored category definitions to use constants defined in HelpConstants -- Fixed a couple of minor typos and an example error -- Reorganized the GATKDocs index template to use supercategories -- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)	2013-02-13 16:49:18 -05:00
Guillermo del Angel	4308b27f8c	Fixed non-determinism in HaplotypeCaller and some UG calls - -- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads. -- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast) -- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling). -- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test -- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).	2013-02-12 15:43:29 -05:00
Geraldine Van der Auwera	dff5ef562b	Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details) -- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools -- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities -- Reworded some category names and descriptions to be more explicit and user-friendly	2013-02-12 13:36:15 -05:00
Ryan Poplin	3f2f837b6a	Optimization to ReadPosRankSumTest: Don't do the work of parsing through the cigar string for non-informative reads.	2013-02-11 11:36:09 -05:00
Mark DePristo	35139cf990	HaplotypeScore only annotates SNPs -- With the new HMM edge conditions the likelihoods are offset by log10(n possible starts) so the results don't really mean "fits the haplotype well" any longer. This results in grossly inflated HaplotypeScores for indels and with the HaplotypeCaller. So I'm simply not going to emit this annotation value any longer for indels and for the HC	2013-02-09 19:19:28 -05:00
Mark DePristo	e40d83f00e	Final version of PairHMMs with correct edge conditions -- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites -- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done -- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow -- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact -- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC -- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now -- Added lots of docs -- Generalize unit tests with multiple equivalent matches of read to haplotype -- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10 -- Functions to dumpMatrices for debugging -- Fix nasty bug (without original unit tests) in LoglessPairHMM -- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again. -- Added dupString(string, n) method to Utils -- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes -- Unit tests for the hapStartIndex functionality of PairHMM -- Moved computeFirstDifferingPosition to PairHMM, and added unit tests -- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10 -- Still TODOs left in the code that I'll fix up -- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so -- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal -- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors	2013-02-09 19:19:22 -05:00
Mark DePristo	09595cdeb9	Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments	2013-02-09 13:06:54 -05:00
Mark DePristo	2d802e17a4	Delete the CachingPairHMM	2013-02-09 13:06:54 -05:00
Mark DePristo	7dcafe8b81	Preliminary version of LoglessCachingPairHMM that avoids positive likelihoods -- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs -- Actual working unit tests for PairHMMUnitTest -- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results -- PairHMM has protected variables used throughout the subclasses	2013-02-09 13:06:54 -05:00
Mauricio Carneiro	d004bfbe6f	walker to calculate per base coverage distribution -- Base distribution optionally includes deletions -- Implemented an optional filtered coverage distribution option -- Integration tests added for every feature of the traversal This walker is specially fast for the task due to the ability to calculate uncovered bases without having to visit the loci. This capability should be made generic in the future for the advantage of DiagnoseTargets and DepthOfCoverage. GSATDG-45 #resolve	2013-02-07 16:33:05 -05:00
Mauricio Carneiro	5f49c95cc1	Added distance across contigs calculation to GenomeLocs -- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader) -- unit test added GSATDG-45	2013-02-07 16:31:41 -05:00
depristo	cd4aec177a	Merge pull request #20 from broadinstitute/aw_reduceread_perf_1_GSA-761 Aw reduceread perf 1 gsa 761	2013-02-07 12:11:05 -08:00
Eric Banks	9826192854	Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class! * AlignmentUtils.getMismatchCount() * AlignmentUtils.calcAlignmentByteArrayOffset() * AlignmentUtils.readToAlignmentByteArray(). * AlignmentUtils.leftAlignIndel()	2013-02-07 13:04:24 -05:00
Alec Wysoker	e88bc753aa	Replace map.containsKey followed by map.get with map.get followed by null check.	2013-02-07 11:58:41 -05:00
Alec Wysoker	72e496d6f3	Eliminate unnecessary zeroing out of primitive arrays immediately after new.	2013-02-07 11:57:43 -05:00
Mark DePristo	59df329776	Fast path for biallelic variants in IndependentAllelesDiploidExactAFCalc -- If the VariantContext is a bi-allelic variant already, don't split up the VC (it doesn't do anything) and then combine it back together. This saves us a lot of work on average -- Be more protective of calls to AFCalc with a VariantContext that might only have ref allele, throwing an exception	2013-02-06 10:34:09 -05:00
Eric Banks	562f2406d7	Added check that BaseRecalibrator is not being run on a reduced bam. - Throws user exception if it is. - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument. - Added test to check this is working. - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end. - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process. - Added comment about not changing the program record name, as per reviewer comments. - Removed unused variable.	2013-02-06 10:14:27 -05:00
Eric Banks	4e5ff3d6f1	Bug fix for NPE in HC with --dbsnp argument. - I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null. - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.	2013-02-05 21:59:19 -05:00
Eric Banks	e7c35a907f	Fixes to BQSR for the --maximum_cycle_value argument. - It's now written into the recal report so that it can be used in the PrintReads step. - Note that we also now write the --deletions_default_quality value which accidentally wasn't being written before! - Added tests to make sure that the value of the --maximum_cycle_value is being used properly by PR with -BQSR. (This is my last non-branch commit; all future pushes will follow new GATK practices)	2013-02-05 17:38:03 -05:00
MauricioCarneiro	050c4794a5	Merge pull request #11 from yfarjoun/per_sample2 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @A...	2013-02-05 08:04:29 -08:00
Eric Banks	23c6aee236	Added in some basic unit tests for polyploid consensus creation in RR. - Uncovered small bug in the fix that I added yesterday, which is now fixed properly. - Uncovered massive general bug: polyploid consensus is totally busted for deletions (because of call to read.getReadBases()[readPos]). - Need to consult Mauricio on what to do here: are we supporting het compression for deletions? (Insertions are definitely not supported.)	2013-02-05 10:35:45 -05:00
Yossi Farjoun	de03f17be4	-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling. -Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixture of two broadly consented samples) were added -includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test. -(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.) -New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map). -Unit tests to the new class and function are provided. -Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly. -Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or UnifiedGenotyper, into a new IntegrationTest class.	2013-02-04 18:24:36 -05:00
Eric Banks	70f3997a38	More RR tests and fixes. * Fixed implementation of polyploid (het) compression in RR. * The test for a usable site was all wrong. Worked out details with Mauricio to get it right. * Added comprehensive unit tests in HeaderElement class to make sure this is done right. * Still need to add tests for the actual polyploid compression. * No longer allow non-diploid het compression; I don't want to test/handle it, do you? * Added nearly full coverage of tests for the BaseCounts class.	2013-02-04 15:55:15 -05:00
Ryan Poplin	79ef41e7b1	Added some docs, unit test, and contracts to SimpleDeBruijnAssembler. -- Testing that cycles in the reference graph fail graph construction appropriately. -- Minor bug fix in assembly with reduced reads. Added some docs and contracts to SimpleDeBruijnAssembler Added a unit test to SimpleDeBruijnAssembler	2013-02-04 15:17:22 -05:00
Geraldine Van der Auwera	43e3a040b6	Updated UnifiedGenotyper GATKDoc (note on ploidy model)	2013-02-04 14:18:56 -05:00
Eric Banks	2d518f3063	More RR-related updates and tests. - ReduceReads by default now sets up-front ReadWalker downsampling to 40x per start position. - This is the value I used in my tests with Picard to show that memory issues pretty much disappeared. - This should hopefully take care of the memory issues being reported on the forum. - Added javadocs to SlidingWindow (the main RR class) to follow GATK conventions. - Added more unit tests to increase coverage of BaseCounts class. - Added more unit tests to test I/D operators in the SlidingWindow class.	2013-02-04 12:57:43 -05:00
Ryan Poplin	2fee000dba	Adding unit tests for KBestPaths class and fixing edge case bugs.	2013-02-01 13:51:31 -05:00
Guillermo del Angel	a520058ef6	Add option to specify maximum STR length to RepeatCovariates from command line to ease testing	2013-02-01 13:51:31 -05:00
Mark DePristo	22f7fe0d52	Expanded unit tests for AlignmentUtils -- Added JIRA entries for the remaining capabilities to be fixed up and unit tested	2013-02-01 13:51:31 -05:00
Ryan Poplin	ac033ce41a	Intermediate commit of new bubble assembly graph traversal algorithm for the HaplotypeCaller. Adding functionality for a path from an assembly graph to calculate its own cigar string from each of the bubbles instead of doing a massive Smith-Waterman alignment between the path's full base composition and the reference.	2013-01-31 11:32:19 -05:00
Eric Banks	75ceddf9e5	Adding new unit tests for RR. These tests took a frustratingly long time to get to pass, but now we have a framework for testing the adding of reads into the SlidingWindow plus consensus creation. Will flesh these out more after I take care of some other items on my plate.	2013-01-31 09:46:38 -05:00
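
Commit 62e14f5b58 above describes a mapping quality being cast to a byte and overflowing for large values. The following is a minimal sketch of that class of bug and one possible guard. It is not the GATK code: the class and method names are hypothetical, and capping at `Byte.MAX_VALUE` is an assumed remedy chosen for illustration; the log does not say how the actual fix was made.

```java
// Illustrative sketch only; names are hypothetical and not taken from GATK.
final class MappingQualityCast {
    /** Naive narrowing cast: int values above 127 wrap around to negative bytes. */
    static byte naiveCast(int mappingQuality) {
        return (byte) mappingQuality;
    }

    /**
     * Caps the value at Byte.MAX_VALUE (127) before narrowing so it cannot wrap.
     * Assumes the input is non-negative, as mapping qualities are.
     */
    static byte cappedCast(int mappingQuality) {
        return (byte) Math.min(mappingQuality, Byte.MAX_VALUE);
    }

    public static void main(String[] args) {
        int mq = 255; // a large mapping quality that does not fit in a signed byte
        System.out.println("naive:  " + naiveCast(mq));
        System.out.println("capped: " + cappedCast(mq));
    }
}
```

Whether capping or widening the type to `int` is the better choice depends on how the value is consumed downstream.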


389 Commits (65d31ba4adfecb5cfa7efbb4e30e60c7a7975c71)