gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	1b36db8940	Make ActiveRegionTraversal robust to excessive coverage -- Add a maximum per sample and overall maximum number of reads held in memory by the ART at any one time. Does this in a new TAROrderedReadCache data structure that uses a reservoir downsampler to limit the total number of reads to a constant amount. This constant is set to be by default 3000 reads * nSamples to a global maximum of 1M reads, all controlled via the ActiveRegionTraversalParameters annotation. -- Added an integration test and associated excessively covered BAM excessiveCoverage.1.121484835.bam (private/testdata) that checks that the system is operating correctly. -- #resolves GSA-921	2013-04-08 15:48:19 -04:00
Mark DePristo	317dc4c323	Add size() method to Downsampler interface -- This method provides clients with the current number of elements, without having to retrieve the underlying list<T>. Added unit tests for LevelingDownsampler and ReservoirDownsampler as these are the only two complex ones. All of the others are trivially obviously correct.	2013-04-08 15:48:13 -04:00
Ryan Poplin	0c2f795fa5	Merge pull request #147 from broadinstitute/md_hc_symbolic_allele Large number of fundamental improvements to the HaplotypeCaller	2013-04-08 10:29:07 -07:00
Mark DePristo	469dc7f22c	Update KMerErrorCorrectorUnitTest license text	2013-04-08 12:48:20 -04:00
Mark DePristo	21410690a2	Address reviewer comments	2013-04-08 12:48:20 -04:00
Mark DePristo	caf15fb727	Update MD5s to reflect new HC algorithms and parameter values	2013-04-08 12:48:16 -04:00
Mark DePristo	6d22485a4c	Critical bugfix to ReduceRead functionality of the GATKSAMRecord -- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5]. The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding. Now GATKSAMRecord decodes the tag into the proper depths vector so that getReducedCounts() returns what one reasonably expects it to, and getReducedCount(i) merely looks up the value at i. Added unit test to ensure this behavior going forward. -- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag format.	2013-04-08 12:47:50 -04:00
Mark DePristo	3097936a3d	Update GeneralCallingPipeline -- Bugfix to put all files in the subdirectory, regardless of whether the outputDir is provided with an ending / or not -- UG now runs single threaded in GeneralCallingPipeline -- GCP HC only needs 2 GB now	2013-04-08 12:47:50 -04:00
Mark DePristo	5a54a4155a	Change key Haplotype default parameter values -- Extension increased to 200 bp -- Min prune factor defaults to 0 -- LD merging enabled by default for complex variants, only when there are 10+ samples for SNP + SNP merging -- Active region trimming enabled by default	2013-04-08 12:47:50 -04:00
Mark DePristo	3a19266843	Fix residual merge conflicts	2013-04-08 12:47:50 -04:00
Mark DePristo	9c7a35f73f	HaplotypeCaller no longer creates haplotypes that involve cycles in the SeqGraph -- The kbest paths algorithm now takes an explicit set of starting and ending vertices, which is conceptually cleaner and works for either the cycle or no-cycle models. Allowing cycles can be re-enabled with an HC command line switch.	2013-04-08 12:47:50 -04:00
Mark DePristo	5545c629f5	Rename Utils to GraphUtils to avoid conflicts with the sting.Utils class; fix broken unit test in SharedVertexSequenceSplitterUnitTest	2013-04-08 12:47:49 -04:00
Mark DePristo	15461567d7	HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype -- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes. This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype. All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).	2013-04-08 12:47:49 -04:00
Mark DePristo	9b5c55a84a	LikelihoodCalculationEngine will now only use reads longer than the minReadLength, which is currently fixed at 20 bp	2013-04-08 12:47:49 -04:00
Mark DePristo	af593094a2	Major improvements to HC that trim down active regions before genotyping -- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension). Radically speeds up calculations when using large active region extensions. The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extent as possible. The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes -- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter. The previous error corrector was just broken (conceptually) and was disabled by default in the engine. Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java -- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler	2013-04-08 12:47:49 -04:00
Mark DePristo	4d389a8234	Optimizations for HC infrastructure -- outgoingVerticesOf and incomingVerticesOf return a list not a set now, as the corresponding values must be unique since our super directed graph doesn't allow multiple edges between vertices -- Make DeBruijnGraph, SeqGraph, SeqVertex, and DeBruijnVertex all final -- Cache HashCode calculation in BaseVertex -- Better docs before the pruneGraph call	2013-04-08 12:47:49 -04:00
Mark DePristo	e916998784	Bugfix for head and tail merging code in SeqGraph -- The previous version of the head merging (and tail merging to a lesser degree) would inappropriately merge sources and sinks without sufficient evidence to do so. This would introduce large deletion events at the start / end of the assemblies. Refactored code to require 20 bp of overlap in the head or tail nodes, as well as unit tested functions to support this.	2013-04-08 12:47:48 -04:00
Mark DePristo	2aac9e2782	More efficient ZipLinearChains algorithm -- Goes through the graph looking for chains to zip, accumulates the vertices of the chains, and then finally goes through and updates the graph in one big go. Vastly more efficient than the previous version, but unfortunately doesn't actually work now -- Also incorporate edge weight propagation into SeqGraph zipLinearChains. The edge weights for all incoming and outgoing edges are now their previous value, plus the sum of the internal chain edges / n such edges	2013-04-08 12:47:48 -04:00
Mark DePristo	7105ad65a6	Remove the capability of EventMap to emit symbolic alleles for unassembled events -- These events always occur on the very edge of the haplotypes, and are intrinsically dodgy. So instead of emitting them and then potentially having to deal with merging real basepair events into them we just no longer emit those events.	2013-04-08 12:47:48 -04:00
Mark DePristo	f1d772ac25	LD-based merging algorithm for nearby events in the haplotypes -- Moved R^2 LD haplotype merging system to the utils.haplotype package -- New LD merging only enabled with HC argument. -- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enable it via a static variable -- A few misc. bug fixes in LDMerger itself -- Refactoring of Haplotype event splitting and merging code -- Renamed EventExtractor to EventMap -- EventMap has a static method that computes the event maps among n haplotypes -- Refactor Haplotype score and base comparators into their own classes and unit tested them -- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it. -- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator. For those who haven't seen this testing idiom, have a look; it's very useful -- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population. -- Fixed fundamental bug in the way the previous R^2 implementation worked -- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are -- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed. Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events -- Bugfixes for EventMap in the HaplotypeCaller as well. Previous version was overly restrictive, requiring that the first event to make it into a block substitution was a snp. In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M. The new code supports this. UnitTested and documented as well. LDMerger handles case where merging two alleles results in a no-op event. Merging CA/C + A/AA -> CAA/CAA -> no op. Handles this case by removing the two events. UnitTested -- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled -- This new version does a much more specific test (that's actually right). Here's the new algorithm: Compute probability that two variants are in phase with each other and that no compound hets exist in the population. Implemented as a likelihood ratio test of the hypothesis: x11 and x22 are the only haplotypes in the populations vs. all four haplotype combinations (x11, x12, x21, and x22) all exist in the population. Now, since we have to have both variants in the population, we exclude the x11 & x11 state. So the p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22). Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are: P(x11 & x12 & x21) (we have hom-ref and both hets); P(x22 & x12 & x21) (we have hom-alt and both hets); P(x22 & x12) (one haplotype is 22 and the other is het 12); P(x22 & x21) (one haplotype is 22 and the other is het 21)	2013-04-08 12:47:48 -04:00
Mark DePristo	167cd49e71	Added -forceActive argument to ActiveRegionWalkers -- Causes the ART tool to treat all bases as active. Useful for debugging	2013-04-08 12:47:48 -04:00
Mark DePristo	8656bd5e29	Haplotype now consolidates cigars in setCigar -- This fixes edge case bugs where non-consolidated cigars are causing problems in users of the Haplotype object. Input arguments are now checked (let's see if we blow up)	2013-04-08 12:47:47 -04:00
Mark DePristo	67cd407854	The GenotypingEngine now uses the samples from the mapping of Samples -> PerReadAllele likelihoods instead of passing around a redundant list of samples	2013-04-08 12:47:47 -04:00
Mark DePristo	0310499b65	System to merge multiple nearby alleles into block substitutions -- Block substitution algorithm that merges nearby events based on distance. -- Also does some cleanup of GenotypingEngine	2013-04-08 12:47:47 -04:00
Mark DePristo	bff13bb5c5	Move Haplotype class to its own package in utils	2013-04-08 12:47:47 -04:00
Mark DePristo	b7d59ea13b	LIBS unit test debugging should be false	2013-04-08 12:47:47 -04:00
Mark DePristo	7cd804f97c	Merge pull request #149 from broadinstitute/gda_ancient_dna_newPipeline Small Queue/scala improvements, and committing pipeline scripts developed...	2013-04-08 09:32:46 -07:00
Guillermo del Angel	c9d3c67a9b	Small Queue/scala improvements, and committing pipeline scripts developed for ancient DNA processing for posterity: -- Picard extension so Queue scripts can use FastqToSam -- Single-sample BAM processing: merge/trim reads + BWA + IR + MD + BQSR. Mostly identical to standard pipeline, except for the adaptor trimming/merging which is critical for short-insert libraries. -- Single-sample calling (experimental, work in progress): standard UG run but outputting at all sites, meant for deep whole genomes. New scripts	2013-04-08 11:52:13 -04:00
delangel	4cc4bb36aa	Merge pull request #148 from broadinstitute/mc_fix_hmm_caching Fix caching indices in the PairHMM	2013-04-08 08:09:13 -07:00
Mauricio Carneiro	ebe2edbef3	Fix caching indices in the PairHMM. Problem: PairHMM was generating positive likelihoods (even after the re-work of the model). Solution: The caching indices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM. Summarized Changes: * Redesign the matrices to have only 1 col/row of padding instead of 2. * PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start) * Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version) * Adjust the prior and probability matrices to be one based (logless) * Update Log10PairHMM to work with prior and probability matrices as well * Move prior and probability matrices to parent class * Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations * Simple cleanup of PairHMMUnitTest class for a little speedup * Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base). * Create static indices for the transition probabilities (for better readability) [fixes #47399227]	2013-04-08 11:05:12 -04:00
Mark DePristo	56f4529ef3	Merge pull request #145 from broadinstitute/eb_various_minor_fixes Eb various minor fixes	2013-04-05 06:06:54 -07:00
Eric Banks	6253ba164e	Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs * This occurred when one or more alleles were lost from the record after selection * Discussed here: http://gatkforums.broadinstitute.org/discussion/comment/4718#Comment_4718 * Added some integration tests for --keepOriginalAC (there were none before)	2013-04-05 00:53:28 -04:00
Eric Banks	7897d52f32	Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec). * As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270 * This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF * Added integration test to confirm failure with User Error * Removed illegal header line in KB test VCF that was causing related tests to fail.	2013-04-05 00:52:32 -04:00
Eric Banks	14bbba0980	Optimization to method for getting values in ArgumentMatch * Very trivial, but I happened to see this code and it drove me nuts so I felt compelled to refactor it. * Instead of iterating over keys in map to get the values, just iterate over the values...	2013-04-04 23:30:47 -04:00
Mark DePristo	03811b4274	Merge pull request #146 from broadinstitute/dr_fix_NA12878KBUnitTestBase NA12878KBUnitTestBase: use @BeforeSuite method instead of unannotated static initializer block	2013-04-04 17:40:48 -07:00
David Roazen	2f42aa5d5d	NA12878KBUnitTestBase: use @BeforeSuite method instead of unannotated static initializer block -TestNG fails to report errors that occur in static initializer blocks before any tests are run in its XML reports. This was causing Bamboo to claim that tests had passed even though there were pre-test errors. -This is a temporary fix until we can find a way to get TestNG to report errors that occur both outside of test methods and outside of @Before* methods.	2013-04-04 16:03:48 -04:00
MauricioCarneiro	78f32ee048	Merge pull request #144 from broadinstitute/dr_disable_contracts_for_tests quick tests, here we come!	2013-04-04 09:44:28 -07:00
David Roazen	6197078c5d	Disable Contracts for Java for tests -cofoja is not compatible with Java 7, so we're forced to disable it for now until a replacement can be found	2013-04-04 11:56:17 -04:00
Mark DePristo	5e622557a7	Merge pull request #142 from broadinstitute/rp_duplicate_active_regions_GSA-912 Critical bug fix for the case of duplicate map calls in ActiveRegionWalk...	2013-04-03 12:00:26 -07:00
Ryan Poplin	8a93bb687b	Critical bug fix for the case of duplicate map calls in ActiveRegionWalkers with exome interval lists. -- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create duplicate active regions. -- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside of the current interval. -- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.	2013-04-03 13:15:30 -04:00
Mark DePristo	09edee2c97	Merge pull request #141 from broadinstitute/md_linkedhashsets Use LinkedHashSets in incoming/outgoing vertex functions in BaseGraph	2013-04-02 17:42:07 -07:00
droazen	95be16dca7	Merge pull request #143 from broadinstitute/dr_disable_auto_fai_dict_creation_GSA-866 Remove auto-creation of fai/dict files for fasta references	2013-04-02 15:50:47 -07:00
David Roazen	2eac97a76c	Remove auto-creation of fai/dict files for fasta references -A UserException is now thrown if either the fai or dict file for the reference does not exist, with pointers to instructions for creating these files. -Gets rid of problematic file locking that was causing intermittent errors on our farm. -Integration tests to verify that correct exceptions are thrown in the case of a missing fai / dict file. GSA-866 #resolve	2013-04-02 18:34:08 -04:00
Mark DePristo	bb42c90f2b	Use LinkedHashSets in incoming and outgoing vertex functions in BaseGraph -- Using a LinkedHashSet changed the md5 for HCTestComplexVariants.	2013-04-02 17:58:20 -04:00
Mark DePristo	e7a8e6e8ee	Merge pull request #140 from broadinstitute/dr_interval_intersection_bug_GSA-909 Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists	2013-04-02 11:59:01 -07:00
David Roazen	b4b58a3968	Fix unprintable character in a comment from the BaseEdge class Compiler warnings about this were starting to get to me...	2013-04-02 14:24:23 -04:00
David Roazen	5baf906c28	Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists -The algorithm for finding the intersection of two sets of intervals relies on the sortedness of the intervals within each set, but the engine was not sorting the intervals before attempting to find the intersection. -The result was that if one or both interval lists was unsorted / lexicographically sorted, we would often fail to find the intersection correctly. -Now the IntervalBinding sorts all sets of intervals before returning them, solving the problem. -Added an integration test for this case. GSA-909 #resolve	2013-04-02 14:01:52 -04:00
Ryan Poplin	d412605c9b	Merge pull request #139 from broadinstitute/md_bugfix_suffix_splitter Critical bugfix for CommonSuffixSplitter	2013-04-02 07:11:45 -07:00
Mark DePristo	c191d7de8c	Critical bugfix for CommonSuffixSplitter -- Graphs with cycles from the bottom node to one of the middle nodes would introduce an infinite cycle in the algorithm. Created unit test that reproduced the issue, and then fixed the underlying issue.	2013-04-02 09:22:33 -04:00
Ryan Poplin	a58a3e7e1e	Merge pull request #134 from broadinstitute/mc_phmm_experiments PairHMM rework	2013-04-01 12:10:43 -07:00
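
Commit 6d22485a4c above describes a reduced read tag of [10, 5, -1, -5] that corresponds to depths of [10, 15, 9, 5]. The sketch below illustrates one decoding consistent with that example, assuming the first element is an absolute depth and every later element is an offset from it. The function names are hypothetical and are not the GATKSAMRecord API; any range capping the real implementation may apply is not modeled.

```python
from typing import List


def decode_reduced_counts(tag: List[int]) -> List[int]:
    """Decode an offset-encoded tag into absolute depths.

    Assumption: tag[0] is the absolute depth at the first position and
    each later entry is an offset relative to tag[0].
    """
    if not tag:
        return []
    first = tag[0]
    # The first depth is stored as-is; the rest are first + offset.
    return [first] + [first + offset for offset in tag[1:]]


def encode_reduced_counts(depths: List[int]) -> List[int]:
    """Inverse of decode_reduced_counts under the same assumption."""
    if not depths:
        return []
    first = depths[0]
    return [first] + [depth - first for depth in depths[1:]]


if __name__ == "__main__":
    # Example values taken from the commit message.
    print(decode_reduced_counts([10, 5, -1, -5]))
```

Under this reading, returning the raw tag (the bug the commit describes) would report depths of 5, -1, and -5 where the true values are 15, 9, and 5.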

12188 Commits (1b36db8940dcfae5a85cfb501cac4f671ef4f28a)
