Commit Graph

12188 Commits (1b36db8940dcfae5a85cfb501cac4f671ef4f28a)

Author SHA1 Message Date
Mark DePristo 1b36db8940 Make ActiveRegionTraversal robust to excessive coverage
-- Add a maximum per sample and overall maximum number of reads held in memory by the ART at any one time.  Does this in a new TAROrderedReadCache data structure that uses a reservior downsampler to limit the total number of reads to a constant amount.  This constant is set to be by default 3000 reads * nSamples to a global maximum of 1M reads, all controlled via the ActiveRegionTraversalParameters annotation.
-- Added an integration test and associated excessively covered BAM excessiveCoverage.1.121484835.bam (private/testdata) that checks that the system is operating correctly.
-- #resolves GSA-921
2013-04-08 15:48:19 -04:00
Mark DePristo 317dc4c323 Add size() method to Downsampler interface
-- This method provides client with the current number of elements, without having to retreive the underlying list<T>.  Added unit tests for LevelingDownsampler and ReservoirDownsampler as these are the only two complex ones.  All of the others are trivially obviously correct.
2013-04-08 15:48:13 -04:00
Ryan Poplin 0c2f795fa5 Merge pull request #147 from broadinstitute/md_hc_symbolic_allele
Large number of fundamental improvements to the HaplotypeCaller
2013-04-08 10:29:07 -07:00
Mark DePristo 469dc7f22c Update KMerErrorCorrectorUnitTest license text 2013-04-08 12:48:20 -04:00
Mark DePristo 21410690a2 Address reviewer comments 2013-04-08 12:48:20 -04:00
Mark DePristo caf15fb727 Update MD5s to reflect new HC algorithms and parameter values 2013-04-08 12:48:16 -04:00
Mark DePristo 6d22485a4c Critical bugfix to ReduceRead functionality of the GATKSAMRecord
-- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5].  The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding.  Now GATKSAMRecord decodes the tag into the proper depths vector so that getReduceCounts() returns what one reasonably expects it to, and getReduceCount(i) merely looks up the value at i.  Added unit test to ensure this behavior going forward.
-- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.
2013-04-08 12:47:50 -04:00
Mark DePristo 3097936a3d Update GeneralCallingPipeline
-- Bugfix to puts all files in the subdirectory, regardless of whether the outputDir is provided with a ending / or not
-- UG now runs single threaded in GeneralCallingPipeline
-- GCP HC only needs 2 GB now
2013-04-08 12:47:50 -04:00
Mark DePristo 5a54a4155a Change key Haplotype default parameter values
-- Extension increased to 200 bp
-- Min prune factor defaults to 0
-- LD merging enabled by default for complex variants, only when there are 10+ samples for SNP + SNP merging
-- Active region trimming enabled by default
2013-04-08 12:47:50 -04:00
Mark DePristo 3a19266843 Fix residual merge conflicts 2013-04-08 12:47:50 -04:00
Mark DePristo 9c7a35f73f HaplotypeCaller no longer creates haplotypes that involve cycles in the SeqGraph
-- The kbest paths algorithm now takes an explicit set of starting and ending vertices, which is conceptually cleaner and works for either the cycle or no-cycle models.  Allowing cycles can be re-enabled with an HC command line switch.
2013-04-08 12:47:50 -04:00
Mark DePristo 5545c629f5 Rename Utils to GraphUtils to avoid conflicts with the sting.Utils class; fix broken unit test in SharedVertexSequenceSplitterUnitTest 2013-04-08 12:47:49 -04:00
Mark DePristo 15461567d7 HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype
-- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes.  This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype.  All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).
2013-04-08 12:47:49 -04:00
Mark DePristo 9b5c55a84a LikelihoodCalculationEngine will now only use reads longer than the minReadLength, which is currently fixed at 20 bp 2013-04-08 12:47:49 -04:00
Mark DePristo af593094a2 Major improvements to HC that trims down active regions before genotyping
-- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension).  Radically speeds up calculations when using large active region extensions.  The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extend as possible.  The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes
-- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter.  The previous error corrector was just broken (conceptually) and was disabled by default in the engine.  Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java
-- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler
2013-04-08 12:47:49 -04:00
Mark DePristo 4d389a8234 Optimizations for HC infrastructure
-- outgoingVerticesOf and incomingVerticesOf return a list not a set now, as the corresponding values must be unique since our super directed graph doesn't allow multiple edges between vertices
-- Make DeBruijnGraph, SeqGraph, SeqVertex, and DeBruijnVertex all final
-- Cache HashCode calculation in BaseVertex
-- Better docs before the pruneGraph call
2013-04-08 12:47:49 -04:00
Mark DePristo e916998784 Bugfix for head and tail merging code in SeqGraph
-- The previous version of the head merging (and tail merging to a lesser degree) would inappropriately merge source and sinks without sufficient evidence to do so.  This would introduce large deletion events at the start / end of the assemblies.  Refcatored code to require 20 bp of overlap in the head or tail nodes, as well as unit tested functions to support this.
2013-04-08 12:47:48 -04:00
Mark DePristo 2aac9e2782 More efficient ZipLinearChains algorithm
-- Goes through the graph looking for chains to zip, accumulates the vertices of the chains, and then finally go through and updates the graph in one big go.  Vastly more efficient than the previous version, but unfortunately doesn't actually work now
-- Also incorporate edge weight propagation into SeqGraph zipLinearChains.  The edge weights for all incoming and outgoing edges are now their previous value, plus the sum of the internal chain edges / n such edges
2013-04-08 12:47:48 -04:00
Mark DePristo 7105ad65a6 Remove the capability of EventMap to emit symbolic alleles for unassembled events
-- These events always occur on the very edge of the haplotypes, and are intrinsically dodgy.  So instead of emitting them and then potentially having to deal with merging real basepair events into them we just no longer emit those events.
2013-04-08 12:47:48 -04:00
Mark DePristo f1d772ac25 LD-based merging algorithm for nearby events in the haplotypes
-- Moved R^2 LD haplotype merging system to the utils.haplotype package
-- New LD merging only enabled with HC argument.
-- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enabled it via a static variable
-- A few misc. bug fixes in LDMerger itself
-- Refactoring of Haplotype event splitting and merging code
-- Renamed EventExtractor to EventMap
-- EventMap has a static method that computes the event maps among n haplotypes
-- Refactor Haplotype score and base comparators into their own classes and unit tested them
-- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it.
-- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator.  For those who haven't seen this testing idiom, have a look, and very useful
-- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population.
-- Fixed fundamental bug in the way the previous R^2 implementation worked
-- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are
-- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed.  Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events
-- Bugfixes for EventMap in the HaplotypeCaller as well.  Previous version was overly restrictive, requiring that the first event to make into a block substitution was a snp.  In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M.  The new code supports this.  UnitTested and documented as well.  LDMerger handles case where merging two alleles results in a no-op event.  Merging CA/C + A/AA -> CAA/CAA -> no op.  Handles this case by removing the two events.  UnitTested
-- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled
-- This new version does a much more specific test (that's actually right).  Here's the new algorithm:

     * Compute probability that two variants are in phase with each other and that no
     * compound hets exist in the population.
     *
     * Implemented as a likelihood ratio test of the hypothesis:
     *
     * x11 and x22 are the only haplotypes in the populations
     *
     * vs.
     *
     * all four haplotype combinations (x11, x12, x21, and x22) all exist in the population.
     *
     * Now, since we have to have both variants in the population, we exclude the x11 & x11 state.  So the
     * p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22).
     *
     * Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are:
     *
     * - P(x11 & x12 & x21) -- we have hom-ref and both hets
     * - P(x22 & x12 & x21) -- we have hom-alt and both hets
     * - P(x22 & x12) -- one haplotype is 22 and the other is het 12
     * - P(x22 & x21) -- one haplotype is 22 and the other is het 21
2013-04-08 12:47:48 -04:00
Mark DePristo 167cd49e71 Added -forceActive argument to ActiveRegionWalkers
-- Causes the ART tool to treat all bases as active.   Useful for debugging
2013-04-08 12:47:48 -04:00
Mark DePristo 8656bd5e29 Haplotype now consolidates cigars in setCigar
-- This fixes edge base bugs where non-consolidated cigars are causing problems in users of the Haplotype object.  Input arguments are now checks (let's see if we blow up)
2013-04-08 12:47:47 -04:00
Mark DePristo 67cd407854 The GenotypingEngine now uses the samples from the mapping of Samples -> PerReadAllele likelihoods instead of passing around a redundant list of samples 2013-04-08 12:47:47 -04:00
Mark DePristo 0310499b65 System to merge multiple nearby alleles into block substitutions
-- Block substitution algorithm that merges nearby events based on distance.
-- Also does some cleanup of GenotypingEngine
2013-04-08 12:47:47 -04:00
Mark DePristo bff13bb5c5 Move Haplotype class to its own package in utils 2013-04-08 12:47:47 -04:00
Mark DePristo b7d59ea13b LIBS unit test debugging should be false 2013-04-08 12:47:47 -04:00
Mark DePristo 7cd804f97c Merge pull request #149 from broadinstitute/gda_ancient_dna_newPipeline
Small Queue/scala improvements, and commiting pipeline scripts developed...
2013-04-08 09:32:46 -07:00
Guillermo del Angel c9d3c67a9b Small Queue/scala improvements, and commiting pipeline scripts developed for ancient DNA processing for posterity:
-- Picard extension so Queue scripts can use FastqToSam
-- Single-sample BAM processing: merge/trim reads + BWA + IR + MD + BQSR. Mostly identical to standard pipeline,
except for the adaptor trimming/merging which is critical for short-insert libraries.
-- Single-sample calling (experimental, work in progress): standard UG run but outputting at all sites, meant for
deep whole genomes.

New scripts
2013-04-08 11:52:13 -04:00
delangel 4cc4bb36aa Merge pull request #148 from broadinstitute/mc_fix_hmm_caching
Fix caching indices in the PairHMM
2013-04-08 08:09:13 -07:00
Mauricio Carneiro ebe2edbef3 Fix caching indices in the PairHMM
Problem:
--------
PairHMM was generating positive likelihoods (even after the re-work of the model)

Solution:
---------
The caching idices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM.

Summarized Changes:
------------------
* Redesign the matrices to have only 1 col/row of padding instead of 2.
* PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start)
* Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version)
* Adjust the prior and probability matrices to be one based (logless)
* Update Log10PairHMM to work with prior and probability matrices as well
* Move prior and probability matrices to parent class
* Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations
* Simple cleanup of PairHMMUnitTest class for a little speedup
* Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base).
* Create static indices for the transition probabilities (for better readability)

[fixes #47399227]
2013-04-08 11:05:12 -04:00
Mark DePristo 56f4529ef3 Merge pull request #145 from broadinstitute/eb_various_minor_fixes
Eb various minor fixes
2013-04-05 06:06:54 -07:00
Eric Banks 6253ba164e Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs
* This occurred when one or more alleles were lost from the record after selection
  * Discussed here: http://gatkforums.broadinstitute.org/discussion/comment/4718#Comment_4718
  * Added some integration tests for --keepOriginalAC (there were none before)
2013-04-05 00:53:28 -04:00
Eric Banks 7897d52f32 Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec).
* As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270
  * This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF
  * Added integration test to confirm failure with User Error
  * Removed illegal header line in KB test VCF that was causing related tests to fail.
2013-04-05 00:52:32 -04:00
Eric Banks 14bbba0980 Optimization to method for getting values in ArgumentMatch
* Very trivial, but I happened to see this code and it drove me nuts so I felt compelled to refactor it.
  * Instead of iterating over keys in map to get the values, just iterate over the values...
2013-04-04 23:30:47 -04:00
Mark DePristo 03811b4274 Merge pull request #146 from broadinstitute/dr_fix_NA12878KBUnitTestBase
NA12878KBUnitTestBase: use @BeforeSuite method instead of unannotated static initializer block
2013-04-04 17:40:48 -07:00
David Roazen 2f42aa5d5d NA12878KBUnitTestBase: use @BeforeSuite method instead of unannotated static initializer block
-TestNG fails to report errors that occur in static initializer blocks before any tests are run
 in its XML reports. This was causing Bamboo to claim that tests had passed even though there
 were pre-test errors.

-This is a temporary fix until we can find a way to get TestNG to report errors that occur both
 outside of test methods and outside of @Before* methods.
2013-04-04 16:03:48 -04:00
MauricioCarneiro 78f32ee048 Merge pull request #144 from broadinstitute/dr_disable_contracts_for_tests
quick tests, here we come!
2013-04-04 09:44:28 -07:00
David Roazen 6197078c5d Disable Contracts for Java for tests
-cofoja is not compatible with Java 7, so we're forced to disable
 it for now until a replacement can be found
2013-04-04 11:56:17 -04:00
Mark DePristo 5e622557a7 Merge pull request #142 from broadinstitute/rp_duplicate_active_regions_GSA-912
Critical bug fix for the case of duplicate map calls in ActiveRegionWalk...
2013-04-03 12:00:26 -07:00
Ryan Poplin 8a93bb687b Critical bug fix for the case of duplicate map calls in ActiveRegionWalkers with exome interval lists.
-- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create
duplicate active regions.
-- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside
of the current interval.
-- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.
2013-04-03 13:15:30 -04:00
Mark DePristo 09edee2c97 Merge pull request #141 from broadinstitute/md_linkedhashsets
Use LinkedHashSets in incoming/outgoing vertex functions in BaseGraph
2013-04-02 17:42:07 -07:00
droazen 95be16dca7 Merge pull request #143 from broadinstitute/dr_disable_auto_fai_dict_creation_GSA-866
Remove auto-creation of fai/dict files for fasta references
2013-04-02 15:50:47 -07:00
David Roazen 2eac97a76c Remove auto-creation of fai/dict files for fasta references
-A UserException is now thrown if either the fai or dict file for the
 reference does not exist, with pointers to instructions for creating
 these files.

-Gets rid of problematic file locking that was causing intermittent
 errors on our farm.

-Integration tests to verify that correct exceptions are thrown in
 the case of a missing fai / dict file.

GSA-866 #resolve
2013-04-02 18:34:08 -04:00
Mark DePristo bb42c90f2b Use LinkedHashSets in incoming and outgoing vertex functions in BaseGraph
-- Using a LinkedHashSet changed the md5 for HCTestComplexVariants.
2013-04-02 17:58:20 -04:00
Mark DePristo e7a8e6e8ee Merge pull request #140 from broadinstitute/dr_interval_intersection_bug_GSA-909
Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists
2013-04-02 11:59:01 -07:00
David Roazen b4b58a3968 Fix unprintable character in a comment from the BaseEdge class
Compiler warnings about this were starting to get to me...
2013-04-02 14:24:23 -04:00
David Roazen 5baf906c28 Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists
-The algorithm for finding the intersection of two sets of intervals
 relies on the sortedness of the intervals within each set, but the engine
 was not sorting the intervals before attempting to find the intersection.

-The result was that if one or both interval lists was unsorted / lexicographically
 sorted, we would often fail to find the intersection correctly.

-Now the IntervalBinding sorts all sets of intervals before returning them,
 solving the problem.

-Added an integration test for this case.

GSA-909 #resolve
2013-04-02 14:01:52 -04:00
Ryan Poplin d412605c9b Merge pull request #139 from broadinstitute/md_bugfix_suffix_splitter
Critical bugfix for CommonSuffixSplitter
2013-04-02 07:11:45 -07:00
Mark DePristo c191d7de8c Critical bugfix for CommonSuffixSplitter
-- Graphs with cycles from the bottom node to one of the middle nodes would introduce an infinite cycle in the algorithm.  Created unit test that reproduced the issue, and then fixed the underlying issue.
2013-04-02 09:22:33 -04:00
Ryan Poplin a58a3e7e1e Merge pull request #134 from broadinstitute/mc_phmm_experiments
PairHMM rework
2013-04-01 12:10:43 -07:00