gatk-3.8

Commit Graph

Author	SHA1	Message	Date
depristo	6204e6ccc9	Merge pull request #76 from broadinstitute/md_kb_bugfix_GSA-795 Bug fixes and optimizations for NA12878 KB	2013-03-01 10:52:16 -08:00
Eric Banks	ebd5404124	Fixed the add functionality of GenomeLocSortedSet. * Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown. * Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases. * Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs. * Added more unit tests for add functionality of GLSS. * Resolves GSA-775.	2013-02-28 23:31:00 -05:00
Mark DePristo	4095a9ef32	Bugfixes for AssessNA12878 -- Refactor initialization routine into BadSitesWriter. This now adds the GQ and DP genotype header lines which are necessary if the input VCF doesn't have proper headers -- GATKVariantContextUtils subset to biallelics now tolerates samples with bad GL values for multi-allelics, where it just removes the PLs and issues a warning.	2013-02-28 10:35:06 -05:00
depristo	92d6a4f441	Merge pull request #75 from broadinstitute/eb_missing_rg_error_GSA-407 Added better error message for BAMs with bad read groups.	2013-02-28 05:20:39 -08:00
Eric Banks	12fc198b80	Added better error message for BAMs with bad read groups. * Split the cases into reads that don't have a RG at all vs. those with a RG that's not defined in the header. * Added integration tests to make sure that the correct error is thrown. * Resolved GSA-407.	2013-02-27 16:02:56 -05:00
Eric Banks	69b8173535	Replace uses of NestedHashMap with NestedIntegerArray. * Removed from codebase NestedHashMap since it is unused and untested. * Integration tests change because the BQSR CSV is now sorted automatically. * Resolves GSA-732	2013-02-27 14:03:39 -05:00
David Roazen	752f4335a5	Merged bug fix from Stable into Unstable	2013-02-27 05:20:41 -05:00
David Roazen	2a7af43164	Fix improper dependencies in QScripts used by pipeline tests, and attempt to fix the flawed MisencodedBaseQualityUnitTest -Some QScripts used by public pipeline tests unnecessarily used the (now protected) UnifiedGenotyper. Changed them to use PrintReads instead. -Moved ExampleUnifiedGenotyperPipelineTest to protected -Attempt to fix the flawed and sporadically failing MisencodedBaseQualityUnitTest: After looking at this class a bit, I think the problem was the use of global arrays for the quals shared across all reads in all tests (BAMRecord class definitely does not make a separate copy for each read!). One test (testFixBadQuals) modifies the bad quals array, and if this happens to run before the testBadQualsThrowsError test the bad quals array will have been "fixed" and no exception will be thrown.	2013-02-27 04:45:53 -05:00
David Roazen	a53b4a7521	Merged bug fix from Stable into Unstable	2013-02-26 21:41:13 -05:00
David Roazen	65d31ba4ad	Fix runtime public -> protected dependencies in the test suite -replace unnecessary uses of the UnifiedGenotyper by public integration tests with PrintReads -move NanoSchedulerIntegrationTest to protected, since it's completely dependent on the UnifiedGenotyper	2013-02-26 21:19:12 -05:00
depristo	93205154b5	Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776 Eb fix pairhmm unittest gsa 776	2013-02-26 11:56:58 -08:00
Mauricio Carneiro	711cbd3b5a	Archiving CoverageBySample This walker was not updated since 2009, and users were getting wrong answers when running it with ReduceReads. I don't want to deal with this because DiagnoseTargets does everything this walker does.	2013-02-26 13:49:00 -05:00
depristo	51d618de97	Merge pull request #62 from broadinstitute/rp_increase_max_kmer_in_assembly The maximum kmer length is derived from the reads.	2013-02-26 05:37:02 -08:00
Eric Banks	7519484a38	Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods.	2013-02-25 15:04:23 -05:00
Ryan Poplin	89e2943dd1	The maximum kmer length is derived from the reads. -- This is done to take advantage of longer reads which can produce less ambiguous haplotypes -- Integration tests change for HC and BiasedDownsampling	2013-02-25 14:40:25 -05:00
David Roazen	3645ea9bb6	Sequence dictionary validation: detect problematic contig indexing differences The GATK engine does not behave correctly when contigs are indexed differently in the reads sequence dictionaries vs. the reference sequence dictionary, and the inconsistently-indexed contigs are included in the user's intervals. For example, given the dictionaries: Reference dictionary = { chrM, chr1, chr2, ... } BAM dictionary = { chr1, chr2, ... } and the interval "-L chr1", the engine would fail to correctly retrieve the reads from chr1, since chr1 has a different index in the two dictionaries. With this patch, we throw an exception if there are contig index differences between the dictionaries for reads and reference, AND the user's intervals include at least one of the mismatching contigs. The user can disable this exception via -U ALLOW_SEQ_DICT_INCOMPATIBILITY In all other cases, dictionary validation behaves as before. I also added comprehensive unit tests for the (previously-untested) SequenceDictionaryUtils class. GSA-768 #resolve	2013-02-25 11:14:22 -05:00
Ryan Poplin	6a639c8ffc	Replace Smith-Waterman alignment with the bubble traversal. -- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph. -- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime. -- Refactoring graph functions into a new DeBruijnAssemblyGraph class. -- Bug fix in path.getBases(). -- Adding validation code to the assembly engine. -- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class. -- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes. -- Added ability to ignore bubbles that are too divergent from the reference -- Max kmer can't be bigger than the extension size. -- Reverse the order that we create the assembly graphs so that the bigger kmers are used first. -- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment. -- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation. -- Updating HaplotypeCaller and BiasedDownsampling integration tests. -- Rebased everything into one commit as requested by Eric -- improvements to the bubble traversal are coming as a separate push	2013-02-22 15:42:16 -05:00
depristo	2ad559cf58	Merge pull request #59 from broadinstitute/mc_reving_testng_GSA-695 Updating TestNG to the latest version	2013-02-22 10:39:04 -08:00
Mauricio Carneiro	4ac50c89ad	Updating TestNG to the latest version -- changed SkipException constructors that are now private in TestNG -- Updated build.xml to use the latest testng -- Added guice dependency to ivy -- Fixed broken SampleDBUnitTest The SampleDBUnitTest was only passing before because the map comparison in the old TestNG was broken. It was comparing two DIFFERENT samples and testing for "equals" GSA-695 #resolve	2013-02-22 09:40:23 -05:00
Mark DePristo	182c32a2b7	Relax bounds checking in QualityUtils.boundQual -- Previous version did runtime checking that qual >= 0 but BQSR was relying on boundQual to restore -1 to 1. So relax the bound.	2013-02-22 08:46:59 -05:00
Mark DePristo	8ac6d3521f	Vast improvements to AssessNA12878 code and functionality -- AssessNA12878 now breaks out multi-allelics into bi-allelic components. This means that we can properly assess multi-allelic calls against the bi-allelic KB -- Refactor AssessNA12878, moving into assess package in KB. Split out previously private classes in the walker itself into separate classes. Added real docs for all of the classes. -- Vastly expand (from 0) unit tests for NA12878 assessments -- Allow sites only VCs to be evaluated by Assessor -- Move utility for creating simple VCs from a list of string alleles from GATKVariantContextUtilsUnitTest to GATKVariantContextUtils -- Assessor bugfix for discordant records at a site. Previous version didn't handle properly the case where one had a non-matching call in the callset w.r.t. the KB, so that the KB element was eaten during the analysis. Fixed. UnitTested -- See GSA-781 -- Handle multi-allelic variants in KB for more information -- Bugfix for missing site counting in AssessNA12878. Previous version would count N misses for every missed value at a site. Not that this has much impact but it's worth fixing -- UnitTests for BadSitesWriter -- UnitTests for filtered and filtering sites in the Assessor -- Cleanup end report generation code (simplify the code). Note that instead of "indel" the new code will print out "INDELS" -- Assessor DoC calculations now use LIBS and RBPs for the depth calculation. The previous version was broken for reduced reads. Added unit test that reads a complex reduced read example and matches the DoC of this BAM with the output of the GATK DoC tool here. -- Added convenience constructor for LIBS using just SAMFileReader and an iterator. It's now easy to create a LIBS from a BAM at a locus. Added advanceToLocus function that moves the LIBS to a specific position. UnitTested via the assessor (which isn't ideal, but is a proper test)	2013-02-21 20:43:12 -05:00
Mark DePristo	29319bf222	Improved allele trimming code in GATKVariantContextUtils -- Now supports trimming the alleles from both the reverse and forward direction. -- Added lots of unit tests for forward allele trimming, as well as creating VC from forward and reverse trimming. -- Added docs and tests for the code, to bring it up to GATK spec	2013-02-21 12:01:43 -05:00
Eric Banks	6996a953a8	Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample). These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons): * Don't unnecessarily overload Allele.getBases() in the Haplotype class. * Haplotype.getBases() was calling clone() on the byte array. * Added a constructor to Allele (and Haplotype) that takes in an Allele as input. * It makes a copy of the given allele without having to go through the validation of the bases (since the Allele has already been validated). * Rev'ed the variant jar accordingly. For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.	2013-02-21 10:14:11 -05:00
Geraldine Van der Auwera	c3e01fea40	Added several more info types / annotations to GATKDocs -- top-level walker type (locus, read etc) -- parallelism options (nt or nct) -- annotation type (for Variant Annotations) -- downsampling settings that override engine defaults -- reference window size -- active region settings -- partitionBy info	2013-02-21 03:12:40 -05:00
Geraldine Van der Auwera	e674b4a524	Added new ReadFilter that allows users to specifically reassign one single mapping quality to a different value. Useful for TopHat and other RNA-seq software users.	2013-02-20 01:24:45 -05:00
MauricioCarneiro	76810465aa	Merge pull request #40 from broadinstitute/gg_retrieve_readfilters_GSATDG-63	2013-02-19 19:42:35 -08:00
Mark DePristo	910d966428	Extend timeout of NanoScheduler deadlock tests -- The previous timeout of 1 second was just dangerously short. Increase the timeout to 10 seconds	2013-02-19 20:25:25 -05:00
Eric Banks	0055a6f1cd	Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774 Fix to the Indel Realigner bug described in GSA-774	2013-02-19 16:16:48 -08:00
Geraldine Van der Auwera	faef85841b	Added GATKDocs fct to indicate default Read Filters for each tool -- Added getClazzAnnotations() as hub to retrieve various annotations values and class properties through reflection -- Added getReadFilters() method to retrieve Read Filter annotations -- getReadFilters() uses recursion to walk up the inheritance to also capture superclass annotations -- getClazzAnnotations() stores collected info in doc handler root, which is unit.forTemplate in Doclet -- Modified FreeMarker template to use the Readfilters info (displayed after arg table, before additional capabilities) -- Tadaaa :-) #GSATDG-63 resolve	2013-02-19 16:12:29 -05:00
Mauricio Carneiro	371ea2f24c	Fixed IndelRealigner reference length bug (GSA-774) -- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases -- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases -- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them) -- pulled out ReadBin class so it can be testable -- added unit tests for ReadBin with soft-clips -- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads GSA-774 #resolve	2013-02-19 16:00:36 -05:00
Mauricio Carneiro	815028edd4	Added verbose error message to the PluginManager -- added a logger.error with a more descriptive message of what the most likely cause of the error is Typical error happens when a walker's global variable is not initialized properly (usually in test conditions). The old error message was very hard to understand "Could not create module because of an exception of type NullPointerException ocurred caused by exception null"	2013-02-19 16:00:35 -05:00
Ryan Poplin	c025e84c8b	Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping. -- Updated a few integration tests for HC, UG, and UG general ploidy	2013-02-19 14:09:24 -05:00
Mark DePristo	be45edeff2	ActivityProfile and ActiveRegions respects engine interval boundaries -- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present. -- UnitTests for ActiveRegion.splitAndTrimToIntervals -- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals -- UnitTesting overlap function in GenomeLocSortedSet -- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet. Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775 -- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals -- Added docs for the constructors in GLSS -- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s -- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser	2013-02-18 10:40:25 -05:00
Mark DePristo	3b67aa8aee	Final edge case bug fixes to QualityUtil routines -- log10 functions in QualityUtils allow -Infinity to allow log10(0.0) values -- Fix edge condition of log10OneMinusX failing with Double.MIN_VALUE -- Fix another edge condition of log10OneMinusX failing with a small but not min_value double	2013-02-16 07:31:38 -08:00
Mark DePristo	b393c27f07	QualityUtils now uses runtime argument checks instead of contract -- There's some runtime cost for these tests, but it's not big enough to outweigh the value of catching errors quickly	2013-02-16 07:31:38 -08:00
Mark DePristo	9a29d6d4be	Fix a catastrophic bug (WoW!) in the reference calculation of the UG -- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and as a side effect other utils converted this to a meaningless 0.0. This is all because there wasn't a unit test. -- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.	2013-02-16 07:31:38 -08:00
Mark DePristo	9e28d1e347	Cleanup and unit tests for QualityUtils -- Fixed a few conversion bugs with edge case quals (ones that were very high) -- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s -- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK -- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly. -- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate. -- Renamed probToQual to trueProbToQual -- Added goodProbability and log10OneMinusX to MathUtils -- Went through the GATK and cleaned up many uses of QualityUtils -- Cleanup constants in QualityUtils -- Added full docs for all of the constants -- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity -- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's a BQSR-specific feature -- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x) -- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils	2013-02-16 07:31:37 -08:00
Yossi Farjoun	aa99a5f47c	Added an option to print out the version string @argument (-)-version (should this be @hidden?) Prints out the version to System.out and quit(0) No tests. (any ideas on how to test this would be happily accepted)	2013-02-15 12:42:59 -05:00
droazen	664960373d	Merge pull request #31 from broadinstitute/yf_fast_BAM_index_traversal -re-enables fast BAM indexing	2013-02-15 09:12:32 -08:00
MauricioCarneiro	1dd284a5bb	Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720 PrintReads writes a header when used with -BQSR	2013-02-15 07:18:28 -08:00
MauricioCarneiro	b58a0eca6b	Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62 Refactored GATKDocs categories some more ( GSATDG-62 )	2013-02-14 22:35:07 -08:00
Tad Jordan	6cb80591e3	PrintReads writes a header when used with -BQSR	2013-02-14 22:19:14 -05:00
Yossi Farjoun	3a7c8c13e2	Re-enabled fastBAMindexing by replacing the FileChannel with a SeekableBufferedStream This helps a lot since FileChannel is very low-level and traversing the BAMIndex involves lots of short reads. - Fixed a deterioration in BAMIndex due to rev'ed picard (see below) - Added unit tests for SeekableBufferedStream - Added integrationTests for GATKBAMIndex (in PileupWalkerIntegrationTest) - Added a runtime-test to verify that the amount read equals the amount requested. - Added failing tests with expectedExceptions - Used a DataProvider to make code nicer	2013-02-14 17:51:15 -05:00
Mark DePristo	f92328a1a1	Extend default timeout to 20 minutes -- The default of 10 minutes is right on the edge for some tests, and we really want a default not to enforce a max time (test should be short) but to stop testng from failing to terminate ever in the case where some test is truly hung	2013-02-13 17:43:40 -08:00
Geraldine Van der Auwera	6208742f7c	Refactored GATKDocs categories some more ( GSATDG-62 ) -- Renamed ValidatePileup to CheckPileup since validation is reserved word -- Renamed AlignmentValidation to CheckAlignment (same as above) -- Refactored category definitions to use constants defined in HelpConstants -- Fixed a couple of minor typos and an example error -- Reorganized the GATKDocs index template to use supercategories -- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)	2013-02-13 16:49:18 -05:00
Guillermo del Angel	4308b27f8c	Fixed non-determinism in HaplotypeCaller and some UG calls - -- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads. -- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast) -- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling). -- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test -- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).	2013-02-12 15:43:29 -05:00
Geraldine Van der Auwera	dff5ef562b	Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details) -- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools -- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities -- Reworded some category names and descriptions to be more explicit and user-friendly	2013-02-12 13:36:15 -05:00
Mark DePristo	e40d83f00e	Final version of PairHMMs with correct edge conditions -- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites -- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done -- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow -- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact -- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC -- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now -- Added lots of docs -- Generalize unit tests with multiple equivalent matches of read to haplotype -- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10 -- Functions to dumpMatrices for debugging -- Fix nasty bug (without original unit tests) in LoglessPairHMM -- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again. -- Added dupString(string, n) method to Utils -- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes -- Unit tests for the hapStartIndex functionality of PairHMM -- Moved computeFirstDifferingPosition to PairHMM, and added unit tests -- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10 -- Still TODOs left in the code that I'll fix up -- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so -- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal -- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors	2013-02-09 19:19:22 -05:00
Mark DePristo	09595cdeb9	Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments	2013-02-09 13:06:54 -05:00
Mark DePristo	2d802e17a4	Delete the CachingPairHMM	2013-02-09 13:06:54 -05:00

3438 Commits (a0be74c2ef145ca784691cf7bdc33ae260c23cf7)