Commit Graph

519 Commits (7d5a710ea6bcea4d5b71bb8cd1f284020a9b8f4e)

Author SHA1 Message Date
Valentin Ruano Rubio 71bbb25c9e Merge pull request #231 from broadinstitute/md_combinevariants_bugfix
CombineVariants no longer adds PASS to unfiltered records
2013-05-20 14:28:20 -07:00
Mark DePristo 62fc88f92e CombineVariants no longer adds PASS to unfiltered records
-- [Delivers #49876703]
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!
2013-05-20 16:53:51 -04:00
Ryan Poplin 507853c583 Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result.
-- The troule stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.
2013-05-20 14:29:04 -04:00
Eric Banks 8a442d3c9f @Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating. 2013-05-17 10:04:10 -04:00
Yossi Farjoun 9234a0efcd Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs
Bug fixes and missing interval functionality for Diagnose Targets

While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred that unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule).

It is with these cavets that I accept the pull request.
2013-05-16 09:25:54 -07:00
Chris Hartl 6da0aed30f Update GCIT md5s to account for trivial changes to description strings 2013-05-14 19:45:30 -04:00
Yossi Farjoun 409a202492 Merge pull request #214 from broadinstitute/chartl_genotype_concordance_diploid_and_OGC
Add overall genotype concordance to the genotype concordance tool. In ad...
2013-05-14 14:19:54 -07:00
Mauricio Carneiro adcbf947bf Update MD5s and the Diagnose Target scala script 2013-05-13 12:06:17 -04:00
Mark DePristo b4f482a421 NanoScheduled ActiveRegionTraversal and HaplotypeCaller
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
  -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
  -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
2013-05-13 11:09:02 -04:00
Eric Banks 2f5ef6db44 New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points.
* This version inherits from the original SW implementation so it can use the same matrix creation method.
   * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
     right thing for indels at the edges of the alignments.
     * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
   * Lots of systematic testing added for this implementation.
   * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
2013-05-13 09:36:39 -04:00
Mark DePristo 111e8cef0f Merge pull request #219 from broadinstitute/eb_rr_multisample_fix
Fix bug in Reduce Reads that arises in multi-sample mode.
2013-05-09 15:31:14 -07:00
Eric Banks 8b9c6aae3e Fix bug in Reduce Reads that arises in multi-sample mode.
* bitset could legitimately be in an unfinished state but we were trying to access it without finalizing.
  * added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode.
  * tests were easiest to implement as integration tests (this was a really complicated case).
2013-05-08 23:23:51 -04:00
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
Eric Banks d242f1bba3 Secondary alignments were not handled correctly in IndelRealigner
* This is emerging now because BWA-MEM produces lots of reads that are not primary alignments
 * The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it
     was getting confused by these secondary alignments
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Eric Banks b53336c2d0 Added hidden mode for BQSR to force all read groups to be the same one.
* Very useful for debugging sample-specific issues
 * This argument got lost in the transition from BQSR v1 to v2
 * Added unit test to cover this case
2013-05-06 19:09:10 -04:00
Chris Hartl 6ff74deac7 Add overall genotype concordance to the genotype concordance tool. In addition, protect from non-diploid genotypes, which can cause very strange behavior.
Update MD5 sums. As expected, md5 changes are consistent with the genotype concordance field being added to each output.
2013-05-06 13:06:30 -04:00
chartl 98021db264 Merge pull request #208 from broadinstitute/yf_fix_molten_GenotypeConcordance
- Fixed a small bug in the printout of molten data in GenotypeConcordanc...
2013-05-06 08:42:06 -07:00
Guillermo del Angel 874dc8f9c1 Re-fix md5's that changed due to conflicting pushes 2013-05-03 14:59:16 -04:00
Mark DePristo f42bb86bdd e# This is a combination of 2 commits.
Only try to clip adaptors when both reads of the pair are on opposite strands

-- Read pairs that have unusual alignments, such as two reads both oriented like:

  <-----
     <-----

where previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs.  This caused us to clip extra good bases from reads.
-- Update MD5s due change in adaptor clipping, which add some coverage in some places
2013-05-03 11:19:14 -04:00
Mark DePristo 2bcbdd469f leftAlignCigarSequentially now supports haplotypes with insertions and deletions where the deletion allele was previously removed by the leftAlignSingleIndel during it's cleanup phase. 2013-05-03 09:32:05 -04:00
Guillermo del Angel 0c30a5ebc6 Rev'd up Picard to get PL fix: PLs were saturated to 32767 (Short.MAX_VALUE) when converting from GL to integers. Increase capping to Integer.MAX_VALUE (2^31-1) which should be enough for reasonable sites now. Integration tests change because some tests have some hyper-deep pileups where this case was hit 2013-05-02 16:31:43 -04:00
Yossi Farjoun 4b8b411b92 - Fixed a small bug in the printout of molten data in GenotypeConcordance
Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.

- Updated MD5's after looking at diffs to verify that the change is what I expected.
2013-05-02 09:16:07 -04:00
David Roazen f3c94a3c87 Update expected test output for Java 7
-Changes in Java 7 related to comparators / sorting produce a large number
 of innocuous differences in our test output. Updating expectations now
 that we've moved to using Java 7 internally.

-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
 intermittent failures.
2013-05-01 16:18:01 -04:00
Eric Banks 58424e56be Setting the reduce reads count tag was all wrong in a previous commit; fixing.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly.  Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.

The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly.  Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).

Also:
1. counts are now maintained as ints whenever possible.  Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
2013-04-30 13:45:42 -04:00
Guillermo del Angel 20d3137928 Fix for indel calling with UG in presence of reduced reads: When a read is long enough so that there's no reference context available, the reads gets clipped so that it falls again within the reference context range. However, the clipping is incorrect, as it makes the read end precisely at the end of the reference context coordinates. This might lead to a case where a read might span beyond the haplotype if one of the candidate haplotypes is shorter than the reference context (As in the case e.g. with deletions). In this case, the HMM will not work properly and the likelihood will be bad, since "insertions" at end of reads when haplotype is done will be penalized and likelihood will be much lower than it should.
-- Added check to see if read spans beyond reference window MINUS padding and event length. This guarantees that read will always be contained in haplotype.
-- Changed md5's that happen when long reads from old 454 data have their likelihoods changed because of the extra base clipping.
2013-04-29 19:33:02 -04:00
Mark DePristo 0387ea8df9 Bugfix for ReadClipper with ReducedReads
-- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts.  Updated the hardClip operation to handle reduce reads, and added a unit test to make sure this works properly.  Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read
-- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously
2013-04-29 11:12:09 -04:00
Guillermo del Angel 4168aaf280 Add feature to specify Allele frequency priors by command line when calling variants.
Use case:
The default AF priors used (infinite sites model, neutral variation) is appropriate in the case where the reference allele is ancestral, and the called allele is a derived allele.
Most of the times this is true but in several population studies and in ancient DNA analyses this might introduce reference biases, and in some other cases it's hard to ascertain what the ancestral allele is (normally requiring to look up homologous chimp sequence).
Specifying no prior is one solution, but this may introduce a lot of artifactual het calls in shallower coverage regions.
With this option, users can specify what the prior for each AC should be according to their needs, subject to the restrictions documented in the code and in GATK docs.
-- Updated ancient DNA single sample calling script with filtering options and other cleanups.
-- Added integration test. Removed old -noPrior syntax.
2013-04-26 19:06:39 -04:00
Mark DePristo 759c531d1b Merge pull request #197 from broadinstitute/dr_disable_snpeff_version_check
Add support for snpEff "GATK compatibility mode" (-o gatk)
2013-04-26 13:55:14 -07:00
David Roazen 7d90bbab08 Add support for snpEff "GATK compatibility mode" (-o gatk)
-Do not throw an exception when parsing snpEff output files
 generated by not-officially-supported versions of snpEff,
 PROVIDED that snpEff was run with -o gatk

-Requested by the snpEff author

-Relevant integration tests updated/expanded
2013-04-26 15:47:15 -04:00
Mark DePristo 071fd67d55 Merge pull request #193 from broadinstitute/eb_contamination_fixing_for_reduced_reads
Eb contamination fixing for reduced reads
2013-04-26 09:48:45 -07:00
Mark DePristo 92a6c7b561 Merge pull request #195 from broadinstitute/eb_exclude_sample_file_bug_in_select_variants
Fixed bug reported on the forum where using the --exclude_sample_file ar...
2013-04-26 09:47:38 -07:00
Eric Banks 360e2ba87e Fixed bug reported on the forum where using the --exclude_sample_file argument in SV was giving bad results.
Added integration test.
https://www.pivotaltracker.com/s/projects/793457/stories/47399245
2013-04-26 12:23:11 -04:00
Eric Banks 021adf4220 WTF - I thought we had disabled the randomized dithering of rank sum tests for integration tests?!
Well, it wasn't done so I went ahead and did so.  Lots of MD5 changes accordingly.
2013-04-26 11:24:05 -04:00
Eric Banks ba2c3b57ed Extended the allele-biased down-sampling functionality to handle reduced reads.
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.

Added lots of unit tests for new functionality.
2013-04-26 11:23:17 -04:00
Mark DePristo d20be41fee Bugfix for FragmentUtils.mergeOverlappingPairedFragments
-- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences.  If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference.  Updated the code to first clip the adapter sequences from the incoming fragments
-- Update MD5s
2013-04-25 11:11:15 -04:00
Eric Banks 379a9841ce Various bug fixes for recent Reduce Reads additions plus solution implemented for low MQ reads.
1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly
got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites
use binomial prob and at high coverage sites revert to using the old base proportions.  Then we get the
best of both worlds.  As a note, coverage refers to just the individual base counts and not the entire pileup.

2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had
the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We
now use a PriorityQueue instead of a TreeSet to allow for such cases.

3. Each consensus keeps track of its own number of softclipped bases.  There was no reason that that number
should be shared between them.

4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now.  Don't lose that
information.  Maybe we'll decide to change this in the future, but for now we are conservative.

5. Also implemented various small performance optimizations based on profiling.

Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.
2013-04-24 18:18:50 -04:00
MauricioCarneiro 45fec382e7 Merge pull request #180 from broadinstitute/mc_diagnosetargets_missing_targets
DiagnoseTargets Global Refactor
2013-04-24 14:54:55 -07:00
Mauricio Carneiro 367f0c0ac1 Split class names into stratification and metrics
Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has it's own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric.

 Metrics are generalized by the Metric interface. (for generic access)
 Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)
2013-04-24 14:15:49 -04:00
Mauricio Carneiro 38662f1d47 Limiting access to the DT classes
* Make most classes final, others package local
    * Move to diagnostics.diagnosetargets package
    * Aggregate statistics and walker classes on the same package for simplified visibility.
    * Make status list a LinkedList instead of a HashSet
2013-04-23 14:01:43 -04:00
Ryan Poplin cb4ec3437a After debate reverting SW parameter changes temporarily while we explore global SW plans. 2013-04-23 13:32:06 -04:00
Mauricio Carneiro fdd16dc6f9 DiagnoseTargets refactor
A plugin enabled implementation of DiagnoseTargets

Summarized Changes:
-------------------
   * move argument collection into Thresholder object
   * make thresholder object private member of all statistics classes
   * rework the logic of the mate pairing thresholds
   * update unit and integration tests to reflect the new behavior
   * Implements Locus Statistic plugins
   * Extend Locus Statistic plugins to determine sample status
   * Export all common plugin functionality into utility class
   * Update tests accordingly

[fixes #48465557]
2013-04-22 23:53:10 -04:00
Mauricio Carneiro eb6308a0e4 General DiagnoseTargets documentation cleanup
* remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps.
   * add gatkdocs to all the parameters
   * clean up the logic on callable status a bit (still need to be re-worked into a plugin system)
   * update integration tests
2013-04-22 23:53:09 -04:00
Mauricio Carneiro b3c0abd9e8 Remove REF_N status from DiagnoseTargets
This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply.

[fixes #48442663]
2013-04-22 23:53:09 -04:00
Mauricio Carneiro 2b923f1568 fix for DiagnoseTargets multiple filter output
Problem
-------
Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval

Solution
--------
Only allow low median coverage check if there are reads

[fixes #48442675]
2013-04-22 23:53:09 -04:00
Mauricio Carneiro cf7afc1ad4 Fixed "skipped intervals" bug on DiagnoseTargets
Problem
-------
Diagnose targets was skipping intervals when they were not covered by any reads.

Solution
--------
Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of teh BAM file.

Summarized Changes
------------------
   * Outputs all intervals it iterates over, even if uncovered
   * Outputs leftover intervals in the end of the traversal
   * Updated integration tests

[fixes #47813825]
2013-04-22 23:53:09 -04:00
Mark DePristo be66049a6f Bugfix for CommonSuffixSplitter
-- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases.  In this case we'd eliminate the reference source vertex.  Fixed by checking for this condition and aborting the simplification
-- Update MD5s, including minor improvements
2013-04-21 19:37:01 -04:00
Mark DePristo f0e64850da Two sensitivity / specificity improvements to the haplotype caller
-- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC.  Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data
-- Reduce the min usable base qual to 8 by default in the HC.  In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL.  This is a bit too aggressive of a requirement, so I lowered it to 8.
-- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller

 NA12878 MEM chr20:10-11
 Name    VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
 branch  SNPS                  1216               0               2            194                        0
 branch  INDELS                 312               2              13             71                        7
 master  SNPS                  1214               0               4            194                        1
 master  INDELS                 309               2              16             71                       10

-- Update MD5s in the integration tests to reflect these two new changes
2013-04-17 12:32:31 -04:00
Eric Banks 5bce0e086e Refactored binomial probability code in MathUtils.
* Moved redundant code out of UGEngine
  * Added overloaded methods that assume p=0.5 for speed efficiency
  * Added unit test for the binomialCumulativeProbability method
2013-04-16 18:19:07 -04:00
Eric Banks df189293ce Improve compression in Reduce Reads by incorporating probabilistic model and global het compression
The Problem:
  Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
  regions in an exome) causes RR (with default settings) to consider it a variant region.  This
  seriously hurts compression performance.

The Solution:
  1. We now use a probabilistic model for determining whether we can create a consensus (in other
  words, whether we can error correct a site) instead of the old ratio threshold.  We calculate
  the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
  that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
  2. We also allow het compression globally, not just at known sites.  So if we cannot create a
  consensus at a given site then we try to perform het compression; and if we cannot perform het
  compression that we just don't reduce the variant region.  This way very wonky regions stay
  uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
  locus get het compressed.

Details:
  1. -minvar is now deprecated in favor of -min_pvalue.
  2. Added integration test for bad pvalue input.
  3. -known argument still works to force het compression only at known sites; if it's not included
     then we allow het compression anywhere.  Added unit tests for this.
  4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
     Before finalizing het compression, we now check for insertions or other variant regions (usually due
     to multi-allelics) which can render a region incompressible (and we back out if we find one).  We
     were checking for excessive softclips before, but now we add these tests too.
  5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
     consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
     out instead of backing out all of them.
  6. We no longer create a mini read at the stop of the variant window for het compression.  Instead, we
     allow it to be part of the next global consensus.
  7. The coverage test is no longer run systematically on all integration tests because the quals test
     supercedes it.  The systematic quals test is now much stricter in order to catch bugs and edge cases
     (very useful!).
  8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
     for a consensus was affected by good and bad bases/reads).
  9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
     This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
  10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
     with insertions from a header.

Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
2013-04-16 18:19:06 -04:00
Ryan Poplin e0dfe5ca14 Restore the read filter function in the HaplotypeCaller. 2013-04-16 12:01:30 -04:00
Geraldine Van der Auwera e176fc3af1 Merge pull request #159 from broadinstitute/md_bqsr_ion
Trivial BQSR bug fixes and improvement
2013-04-16 08:54:47 -07:00
Ryan Poplin 936f4da1f6 Merge pull request #166 from broadinstitute/md_hc_persample_haplotypes
Select the haplotypes we move forward for genotyping per sample, not poo...
2013-04-16 08:46:56 -07:00
Mark DePristo 17982bcbf8 Update MD5s for VQSR header change 2013-04-16 11:45:45 -04:00
Mark DePristo 067d24957b Select the haplotypes we move forward for genotyping per sample, not pooled
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples.  This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes.  The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects.  After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model.  It's worse than the current version in both single and multiple samples:

1000G EUR samples:

10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                    50               0               5              8                        1
per_sample  INDELS                   6               0               7              2                        1
pooled      SNPS                    49               0               6              8                        1
pooled      INDELS                   5               0               8              2                        1

100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes

Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS                   144               0              22             28                        1
per_sample  INDELS                  28               1              16              9                       11
pooled      SNPS                   143               0              23             28                        1
pooled      INDELS                  27               1              17              9                       11

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug

haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
2013-04-16 09:42:03 -04:00
Mark DePristo 564fe36d22 VariantRecalibrator's VQSR.vcf now contains NEG/POS labels
-- It's useful to know which sites have been used in the training of the model.  The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE
-- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels
2013-04-16 09:09:47 -04:00
Mark DePristo 3144eae51c UnifiedGenotyper bugfix: don't create haplotypes with 0 bases
-- The PairHMM no longer allows us to create haplotypes with 0 bases.  The UG indel caller used to create such haplotypes.  Now we assign -Double.MAX_VALUE likelihoods to such haplotypes.
-- Add integration test to cover this case, along with private/testdata BAM
-- [Fixes #47523579]
2013-04-13 14:57:55 -04:00
Mauricio Carneiro f11c8d22d4 Updating java 7 md5's to java 6 md5's 2013-04-13 08:21:48 -04:00
Mark DePristo b32457be8d Merge pull request #163 from broadinstitute/mc_hmm_caching_again
Fix another caching issue with the PairHMM
2013-04-12 12:34:49 -07:00
Mauricio Carneiro 403f9de122 Fix another caching issue with the PairHMM
The Problem
----------
Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result.

Solution
--------
The HaplotypeCaller only initializes the PairHMM once and then feed it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes.

Summarized Changes
------------------
   * Added the recacheReadValue check to flush the matrix (hapStartIndex = 0)
   * Updated related MD5's

Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9
2013-04-12 14:52:45 -04:00
Mark DePristo 0e627bce93 Slight update to Path SW parameters.
-- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.
2013-04-12 12:43:52 -04:00
Mark DePristo 50cdffc61f Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:

ref:               ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

in which the alt and ref haplotypes differ in having indel at both the start and end of the bubble.  The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in the either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble.

ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls.  I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.

-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat sysmatically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all

General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly.  Depreciated old call to inline constants.  This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default.  Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
2013-04-11 18:22:55 -04:00
Mark DePristo 74196ff7db Trivial BQSR bug fixes and improvement
-- Ensure that BQSR works properly for an Ion Torrent BAM.  (Added integration test and bam)
-- Improve the error message when a unknown platform is found (integration test added)
2013-04-11 17:08:35 -04:00
Ryan Poplin a507381a33 Updating BQSR RecalibrationEngine to work correctly with empty BQSR tables.
-- Previously would crash when a scatter/gather interval contained no usable data.
-- Added unit test to cover this case.
2013-04-11 16:27:59 -04:00
Mark DePristo fb86887bf2 Fast algorithm for determining which kmers are good in a read
-- old algorithm was O(kmerSize * readLen) for each read.  New algorithm is O(readLen)
-- Added real unit tests for the addKmersFromReads to the graph.  Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.
2013-04-11 09:54:22 -04:00
Mark DePristo bf42be44fc Fast DeBruijnGraph creation using the kmer counter
-- The previous creation algorithm used the following algorithm:

for each kmer1 -> kmer2 in each read
  add kmers 1 and 2 to the graph
  add edge kmer1 -> kmer2 in the graph, if it's not present (does check)
  update edge count by 1 if kmer1 -> kmer2 already existed in the graph

-- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)).  This is actually pretty expensive because get and add edges is expensive in jgrapht.
-- The new approach uses the following algorithm:

for each kmer1 -> kmer2 in each read
  add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap

for each kmer pair 1 and 2 in the hash counter
  add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map
  update edge count by count from map if kmer1 -> kmer2 already existed in the graph

-- This algorithm ensures that we add very much fewer edges
-- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself.
-- Overall runtimes are greatly reduced using this algorith
2013-04-10 17:10:59 -04:00
Ryan Poplin 850be5e9da Bug fix in SWPairwiseAlignment.
-- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.
2013-04-10 16:04:37 -04:00
Mark DePristo b115e5c582 Critical bugfix for CommonSuffixSplitter to avoid infinite loops
-- The previous version would enter into an infinite loop in the case where we have a graph that looks like:

X -> A -> B
Y -> A -> B

So that the incoming vertices of B all have the same sequence.  This would cause us to remodel the graph endless by extracting the common sequence A and rebuilding exactly the same graph.  Fixed and unit tested

-- Additionally add a max to the number of simplification cycles that are run (100), which will throw an error and write out the graph for future debugging.  So the GATK will always error out, rather than just go on forever
-- After 5 rounds of simplification we start keeping a copy of the previous graph, and then check if the current graph is actually different from the previous graph.  Equals here means that all vertices have equivalents in both graphs, as do all edges.  If the two graphs are equal we stop simplifying.  It can be a bit expensive but it only happens when we end up cycling due to the structure of the graph.
-- Added a unittest that goes into an infinite loop (found empirically in running the CEU trio) and confirmed that the new approach aborts out correctly
-- #resolves GSA-924
-- See https://jira.broadinstitute.org/browse/GSA-924 for more details
-- Update MD5s due to change in assembly graph construction
2013-04-09 16:19:26 -04:00
Mark DePristo 33ecec535d Turn off the LD merging code by default
-- It's just too hard to interpret the called variation when we merge variants via LD.
-- Can now be turned on with -mergeVariantsViaLD
-- Update MD5s
2013-04-09 10:08:06 -04:00
Mark DePristo caf15fb727 Update MD5s to reflect new HC algorithms and parameter values 2013-04-08 12:48:16 -04:00
Mark DePristo 9c7a35f73f HaplotypeCaller no longer creates haplotypes that involve cycles in the SeqGraph
-- The kbest paths algorithm now takes an explicit set of starting and ending vertices, which is conceptually cleaner and works for either the cycle or no-cycle models.  Allowing cycles can be re-enabled with an HC command line switch.
2013-04-08 12:47:50 -04:00
Mark DePristo 5545c629f5 Rename Utils to GraphUtils to avoid conflicts with the sting.Utils class; fix broken unit test in SharedVertexSequenceSplitterUnitTest 2013-04-08 12:47:49 -04:00
Mark DePristo 15461567d7 HaplotypeCaller no longer uses reads with poor likelihoods w.r.t. any haplotype
-- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes.  This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype.  All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).
2013-04-08 12:47:49 -04:00
Mark DePristo af593094a2 Major improvements to HC that trims down active regions before genotyping
-- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension).  Radically speeds up calculations when using large active region extensions.  The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extend as possible.  The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes
-- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter.  The previous error corrector was just broken (conceptually) and was disabled by default in the engine.  Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java
-- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler
2013-04-08 12:47:49 -04:00
Mark DePristo 4d389a8234 Optimizations for HC infrastructure
-- outgoingVerticesOf and incomingVerticesOf return a list not a set now, as the corresponding values must be unique since our super directed graph doesn't allow multiple edges between vertices
-- Make DeBruijnGraph, SeqGraph, SeqVertex, and DeBruijnVertex all final
-- Cache HashCode calculation in BaseVertex
-- Better docs before the pruneGraph call
2013-04-08 12:47:49 -04:00
Mark DePristo e916998784 Bugfix for head and tail merging code in SeqGraph
-- The previous version of the head merging (and tail merging to a lesser degree) would inappropriately merge source and sinks without sufficient evidence to do so.  This would introduce large deletion events at the start / end of the assemblies.  Refcatored code to require 20 bp of overlap in the head or tail nodes, as well as unit tested functions to support this.
2013-04-08 12:47:48 -04:00
Mark DePristo 2aac9e2782 More efficient ZipLinearChains algorithm
-- Goes through the graph looking for chains to zip, accumulates the vertices of the chains, and then finally go through and updates the graph in one big go.  Vastly more efficient than the previous version, but unfortunately doesn't actually work now
-- Also incorporate edge weight propagation into SeqGraph zipLinearChains.  The edge weights for all incoming and outgoing edges are now their previous value, plus the sum of the internal chain edges / n such edges
2013-04-08 12:47:48 -04:00
Mark DePristo f1d772ac25 LD-based merging algorithm for nearby events in the haplotypes
-- Moved R^2 LD haplotype merging system to the utils.haplotype package
-- New LD merging only enabled with HC argument.
-- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enabled it via a static variable
-- A few misc. bug fixes in LDMerger itself
-- Refactoring of Haplotype event splitting and merging code
-- Renamed EventExtractor to EventMap
-- EventMap has a static method that computes the event maps among n haplotypes
-- Refactor Haplotype score and base comparators into their own classes and unit tested them
-- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it.
-- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator.  For those who haven't seen this testing idiom, have a look, and very useful
-- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population.
-- Fixed fundamental bug in the way the previous R^2 implementation worked
-- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are
-- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed.  Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events
-- Bugfixes for EventMap in the HaplotypeCaller as well.  Previous version was overly restrictive, requiring that the first event to make into a block substitution was a snp.  In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M.  The new code supports this.  UnitTested and documented as well.  LDMerger handles case where merging two alleles results in a no-op event.  Merging CA/C + A/AA -> CAA/CAA -> no op.  Handles this case by removing the two events.  UnitTested
-- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled
-- This new version does a much more specific test (that's actually right).  Here's the new algorithm:

     * Compute probability that two variants are in phase with each other and that no
     * compound hets exist in the population.
     *
     * Implemented as a likelihood ratio test of the hypothesis:
     *
     * x11 and x22 are the only haplotypes in the populations
     *
     * vs.
     *
     * all four haplotype combinations (x11, x12, x21, and x22) all exist in the population.
     *
     * Now, since we have to have both variants in the population, we exclude the x11 & x11 state.  So the
     * p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22).
     *
     * Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are:
     *
     * - P(x11 & x12 & x21) -- we have hom-ref and both hets
     * - P(x22 & x12 & x21) -- we have hom-alt and both hets
     * - P(x22 & x12) -- one haplotype is 22 and the other is het 12
     * - P(x22 & x21) -- one haplotype is 22 and the other is het 21
2013-04-08 12:47:48 -04:00
Mark DePristo 0310499b65 System to merge multiple nearby alleles into block substitutions
-- Block substitution algorithm that merges nearby events based on distance.
-- Also does some cleanup of GenotypingEngine
2013-04-08 12:47:47 -04:00
Mark DePristo bff13bb5c5 Move Haplotype class to its own package in utils 2013-04-08 12:47:47 -04:00
Mauricio Carneiro ebe2edbef3 Fix caching indices in the PairHMM
Problem:
--------
PairHMM was generating positive likelihoods (even after the re-work of the model)

Solution:
---------
The caching idices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM.

Summarized Changes:
------------------
* Redesign the matrices to have only 1 col/row of padding instead of 2.
* PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start)
* Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version)
* Adjust the prior and probability matrices to be one based (logless)
* Update Log10PairHMM to work with prior and probability matrices as well
* Move prior and probability matrices to parent class
* Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations
* Simple cleanup of PairHMMUnitTest class for a little speedup
* Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base).
* Create static indices for the transition probabilities (for better readability)

[fixes #47399227]
2013-04-08 11:05:12 -04:00
Eric Banks 6253ba164e Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs
* This occurred when one or more alleles were lost from the record after selection
  * Discussed here: http://gatkforums.broadinstitute.org/discussion/comment/4718#Comment_4718
  * Added some integration tests for --keepOriginalAC (there were none before)
2013-04-05 00:53:28 -04:00
Eric Banks 7897d52f32 Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec).
* As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270
  * This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF
  * Added integration test to confirm failure with User Error
  * Removed illegal header line in KB test VCF that was causing related tests to fail.
2013-04-05 00:52:32 -04:00
Ryan Poplin 8a93bb687b Critical bug fix for the case of duplicate map calls in ActiveRegionWalkers with exome interval lists.
-- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create
duplicate active regions.
-- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside
of the current interval.
-- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.
2013-04-03 13:15:30 -04:00
Mark DePristo bb42c90f2b Use LinkedHashSets in incoming and outgoing vertex functions in BaseGraph
-- Using a LinkedHashSet changed the md5 for HCTestComplexVariants.
2013-04-02 17:58:20 -04:00
Mark DePristo c191d7de8c Critical bugfix for CommonSuffixSplitter
-- Graphs with cycles from the bottom node to one of the middle nodes would introduce an infinite cycle in the algorithm.  Created unit test that reproduced the issue, and then fixed the underlying issue.
2013-04-02 09:22:33 -04:00
Ryan Poplin a58a3e7e1e Merge pull request #134 from broadinstitute/mc_phmm_experiments
PairHMM rework
2013-04-01 12:10:43 -07:00
Eric Banks 7dd58f671f Merge pull request #132 from broadinstitute/gda_filter_unmasked_sites
Added small feature to VariantFiltration to filter sites outside of a gi...
2013-03-31 06:27:26 -07:00
Guillermo del Angel 9686e91a51 Added small feature to VariantFiltration to filter sites outside of a given mask:
-- Sometimes it's desireable to specify a set of "good" regions and filter out other stuff (like say an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. New argument -filterNotInMask will reverse default logic and filter outside of a given mask.
-- Added integration test, and made sure we also test with a BED rod.
2013-03-31 08:48:16 -04:00
Eric Banks 8e2094d2af Updated AssessReducedQuals and applied it systematically to all ReduceReads integration tests.
* Moved to protected for packaging purposes.
  * Cleaned up and removed debugging output.
  * Fixed logic for epsilons so that we really only test significant differences between BAMs.
  * Other small fixes (e.g. don't include low quality reduced reads in overall qual).
  * Most RR integration tests now automatically run the quals test on output.
    * A few are disabled because we expect them to fail in various locations (e.g. due to downsampling).
2013-03-31 00:27:14 -04:00
Guillermo del Angel 6b8bed34d0 Big bad bug fix: feature added to LeftAlignAndTrimVariants to left align multiallelic records didn't work.
-- Corrected logic to pick biallelic vc to left align.
-- Added integration test to make sure this feature is tested and feature to trim bases is also tested.
2013-03-30 19:31:28 -04:00
Mauricio Carneiro 0de6f55660 PairHMM rework
The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because:
   # Initial conditions were not being set properly
   # Emission probabilities in the last row were not adding up to 1

The following commit fixes both by
   # averaging all potential start locations (giving an equal prior to the state machine in it's first iteration -- allowing the read to start it's alignment anywhere in the haplotype with equal probability)
   # discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state)

Summarized changes:
   * Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm
   * Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1.
   * Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now)
   * Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity
   * Rename metric lengths to read and haplotype lengths for clarity
   * Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity
   * Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix
   * Remove unnecessary parameters from updateCell()
   * Fix the expected probabilities coming from the exact model in PairHMMUnitTest
   * Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables)
   * Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s.

[fix 47164949]
2013-03-30 10:50:06 -04:00
Chris Hartl 74a17359a8 MathUtils.randomSubset() now uses Collections.shuffle() (indirectly, through the other methods
that are tested), resulting in slightly different numbers of calls to the RNG, and ultimately
different sets of selected variants.

This commits updates the md5 values for the validation site selector integration test to reflect
these new random subsets of variants that are selected.
2013-03-29 14:52:10 -04:00
Guillermo del Angel 8fbf9c947f Upgrades and changes to LeftAlignVariants, motivated by 1000G consensus indel production:
-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken by complext records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-aligneable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.
2013-03-29 10:02:06 -04:00
Chris Hartl 73d1c319bf Rarely-occurring logic bugfix for GenotypeConcordance, streamlining and testing of MathUtils
Currently, the multi-allelic test is covering the following case:

Eval   A   T,C
Comp   A   C

reciprocate this so that the reverse can be covered.

Eval   A   C
Comp   A   T,C

And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.

This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine:

Eval:   A  G,T      0/0   2/0   2/2   1/1
Comp:   A  C,T      0/0   1/0   0/0   0/0

Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:

Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)

Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.

Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed:
 - dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
 - vectorSum
 - array shuffle, random subset
 - countOccurances (general forms, the char form is used in the codebase)
 - getNMaxElements
 - array permutation
 - sorted array permutation
 - compare floats
 - sum() (for integer arrays and lists).

Final keyword was extensively added to MathUtils.

The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).

The PaperGenotyper still made use of one of these array methods, since it was the only walker it was migrated into the genotyper itself.

In addition, more extensive tests were added for
 - logBinomialCoefficient (Newton's identity should always hold)
 - logFactorial
 - log10sumlog10 and its approximation

All unit tests pass
2013-03-28 23:25:28 -04:00
MauricioCarneiro a2b69790a6 Merge pull request #128 from broadinstitute/eb_rr_polyploid_compression_GSA-639 2013-03-28 06:39:43 -07:00
Mark DePristo fde7d36926 Updating md5s due to changes in assembly graph creation algorithms and default parameter 2013-03-27 15:31:24 -04:00
Mark DePristo 66910b036c Added new and improved suffix and node merging algorithms
-- These new algorithms are more powerful than the restricted diamond merging algoriths, in that they can merge nodes with multiple incoming and outgoing edges.  Together the splitter + merger algorithms will correctly merge many more cases than the original headless and tailless diamond merger.
-- Refactored haplotype caller infrastructure into graphs package, code cleanup
-- Cleanup new merging / splitting algorithms, with proper docs and unit tests
-- Fix bug in zipping of linear chains.  Because the multiplicity can be 0, protect ourselves with a max function call
-- Fix BaseEdge.max unit test
-- Add docs and some more unit tests
-- Move error correct from DeBruijnGraph to DeBruijnAssembler
-- Replaced uses of System.out.println with logger.info
-- Don't make multiplicity == 0 nodes look like they should be pruned
-- Fix toString of Path
2013-03-27 15:31:18 -04:00
Mark DePristo 78c672676b Bugfix for pruning and removing non-reference edges in graph
-- Previous algorithms were applying pruneGraph inappropriately on the raw sequence graph (where each vertex is a single base).  This results in overpruning of the graph, as prunegraph really relied on the zipping of linear chains (and the sharing of weight this provides) to avoid over-pruning the graph.  Probably we should think hard about this.  This commit fixes this logic, so we zip the graph between pruning
-- In this process ID's a fundamental problem with how we were trimming away vertices that occur on a path from the reference source to sink.  In fact, we were leaving in any vertex that happened to be accessible from source, any vertices in cycles, and any vertex that wasn't the absolute end of a chain going to a sink.  The new algorithm fixes all of this, using a BaseGraphIterator that's a general approach to walking the base graph.  Other routines that use the same traversal idiom refactored to use this iterator.  Added unit tests for all of these capabilities.
-- Created new BaseGraphIterator, which abstracts common access patterns to graph, and use this where appropriate
2013-03-26 14:27:18 -04:00
Mark DePristo ad04fdb233 PerReadAlleleLikelihoodMap getMostLikelyAllele returns an MostLikelyAllele objects now
-- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users.  The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not.  That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative.
-- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves.  There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second.  For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads.   All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as they reads only span a single event.  In this case our annotations will all fall apart, returning their default values.  Added a JIRA to address this (should be discussed in group meeting)
2013-03-26 14:27:13 -04:00
Mark DePristo 2472828e1c HC bug fixes: no longer create reference graphs with cycles
-- Though not intended, it was possible to create reference graphs with cycles in the case where you started the graph with a homopolymer of length > the kmer.  The previous test would fail to catch this case.  Now its not possible
-- Lots of code cleanup and refactoring in this push.  Split the monolithic createGraphFromSequences into simple calls to addReferenceKmersToGraph and addReadKmersToGraph which themselves share lower level functions like addKmerPairFromSeqToGraph.
-- Fix performance problem with reduced reads and the HC, where we were calling add kmer pair for each count in the reduced read, instead of just calling it once with a multiplicity of count.
-- Refactor addKmersToGraph() to use things like addOrUpdateEdge, now the code is very clear
2013-03-26 10:12:24 -04:00
Eric Banks 593d3469d4 Refactored the het (polyploid) consensus creation in ReduceReads.
* It is now cleaner and easier to test; added tests for newly implemented methods.
 * Many fixes to the logic to make it work
   * The most important change was that after triggering het compression we actually need to back it out if it
      creates reads that incorporated too many softclips at any one position (because they get unclipped).
   * There was also an off-by-one error in the general code that only manifested itself with het compression.
 * Removed support for creating a het consensus around deletions (which was broken anyways).
   * Mauricio gave his blessing for this.
 * Het compression now works only against known sites (with -known argument).
    * The user can pass in one or more VCFs with known SNPs (other variants are ignored).
    * If no known SNPs are provided het compression will automatically be disabled.
 * Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
   strandedness from normal reduced reads.
    * GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
    * This allows us to update the FisherStrand annotation to count het compressed reduced reads
       towards the FS calculation.
    * [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
       backwards compatible.]
    * Updated integration tests accordingly with new het compressed bams (both for RR and UG).
 * In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
   RR properly, so I fixed it too.
    * Also, the test in the UG engine for determining whether there are too many overlapping
       deletions is updated to handle RR.
 * I added a special hook in the RR integration tests to additionally run the systematic
   coverage checking tool I wrote earlier.
    * AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
       not lost from original to reduced bam.
    * This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
       from all but 1 sample (now fixed).
    * AssessReducedCoverage moved from private to protected for packaging reasons.
 * #resolve GSA-639

At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
2013-03-25 09:34:54 -04:00
Mark DePristo 965043472a Vastly more powerful, cleaner graph simplification approach
-- Generalizes previous node merging and splitting approaches.  Can split common prefixes and suffices among nodes, build a subgraph representing this new structure, and incorporate it into the original graph.  Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons).  Fully UnitTested.  These new algorithms can now really simplify diamond configurations as well as ones sources and sinks that arrive / depart linearly at a common single root node.
-- This new suite of algorithms is fully integrated into the HC, replacing previous approaches
-- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph.  This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.
2013-03-23 17:40:55 -04:00
Mark DePristo 7ae15dadbe HC now by default only uses reads with MAPQ >= 20 for assembly and calling
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements.  The new version simply uses a read filter to remove them upfront.  This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
2013-03-21 13:10:50 -04:00
Mark DePristo aa7f172b18 Cap the computational cost of the kmer based error correction in the DeBruijnGraph
-- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph.  If the number of ops would exceed this threshold, the original graph is used.
-- Overall the algorithm is just extremely computational expensive, and actually doesn't implement the correct correction.  So we live with this limitations while we continue to explore better algorithms
-- Updating MD5s to reflect changes in assembly algorithms
2013-03-21 09:21:35 -04:00
Mark DePristo 6d7d21ca47 Bugfix for incorrect branch diamond merging algorithm
-- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices.  Fixed.  In the process simplified the entire algorithm and codebase
-- Resolves https://jira.broadinstitute.org/browse/GSA-884
2013-03-20 22:54:37 -04:00
Mark DePristo 3a8f001c27 Misc. fixes upon pull request review
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful
2013-03-20 22:54:37 -04:00
Mark DePristo 5226b24a11 HaplotypeCaller instructure cleanup and unit testing
-- UnitTest for isRootOfDiamond along with key bugfix detected while testing
-- Fix up the equals methods in BaseEdge.  Now called hasSameSourceAndTarget and seqEquals.  A much more meaningful naming
-- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs
-- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence
-- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself
-- Massive expansion of the SeqGraph unit tests.  We now really test out the zipping and branch merging code.
-- Near final cleanup of the current codebase
-- DeBruijnVertex cleanup and optimizations.  Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[].  Optimize the code to make use of this constraint
2013-03-20 22:54:37 -04:00
Mark DePristo 2e36f15861 Update md5s to reflect new downsampling and assembly algorithm output
-- Only minor differences, with improvement in allele discovery where the sites differ.  The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome
2013-03-20 22:54:37 -04:00
Mark DePristo 1fa5050faf Cleanup, unit test, and optimize KBestPaths and Path
-- Split Path from inner class of KBestPaths
-- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation
-- Path now properly typed throughout the code
-- Path maintains a on-demand hashset of BaseEdges so that path.containsEdge is fast
2013-03-20 22:54:36 -04:00
Mark DePristo 98c4cd060d HaplotypeCaller now uses SeqGraph instead of kmer graph to build haplotypes.
-- DeBruijnAssembler functions are no longer static.  This isn't the right way to unit test your code
-- An a HaplotypeCaller command line option to use low-quality bases in the assembly
-- Refactored DeBruijnGraph and associated libraries into base class
-- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents.  These DeBruijn versions now inherit from these base classes.  Added some reasonable unit tests for the base and Debruijn edges and vertex classes.
-- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct
-- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split
-- Moved generic methods in DeBruijnAssembler into BaseGraph
-- Created a simple SeqGraph that contains SeqVertex objects
-- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs
-- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -Y -> a, a -> Z
-- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges
-- Fix all unit tests so they work with the new SeqGraph system.  All tests passed without modification.
-- Debugging makes it easier to tell which kmer graph contributes to a haplotype
-- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector
-- Remove unnecessary printing of cleaning info in BaseGraph
-- Turn off kmer graph creation in DeBruijnAssembler.java
-- Only print SeqGraphs when debugGraphTransformations is set to true
-- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest.  Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality.
-- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs
-- DebruijnVertex now longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now
2013-03-20 22:54:36 -04:00
Mark DePristo 0f4328f6fe Basic kmer error correction algorithm xfor the HaplotypeCaller
-- Error correction algorithm for the assembler.  Only error correct reads to others that are exactly 1 mismatch away
-- The assembler logic is now: build initial graph, error correct*, merge nodes*, prune dead nodes, merge again, make haplotypes.  The * elements are new
-- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing.
-- Easier way to control the testing of the graph assembly algorithms
-- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler
-- Simple protected parsing function for making DeBruijnAssemblyGraph
-- Change the default prune factor for the graph to 1, from 2
-- debugging graph transformations are controllable from command line
2013-03-20 22:54:36 -04:00
Eric Banks 1fae750ebe Merge pull request #120 from broadinstitute/aw_reduce_reads_clear_name_cache
Clear ReduceReads name cache after each set of reads produced by ReduceR...
2013-03-20 19:47:42 -07:00
Guillermo del Angel ea01dbf130 Fix to issue encountered when running HaplotypeCaller in GGA mode with data from other 1000G callers.
In particular, someone produced a tandem repeat site with 57 alt alleles (sic) which made the caller blow up.
Inelegant fix is to detect if # of alleles is > our max cached capacity, and if so, emit an informative warning and skip site.
-- Added unit test to UG engine to cover this case.
-- Commit to posterity private scala script currently used for 1000G indel consensus (still very much subject to changes).
GSA-878 #resolve
2013-03-20 14:30:37 -04:00
Alec Wysoker bccc9d79e5 Clear ReduceReads name cache after each set of reads produced by ReduceReadsStash.
Name cache was filling up with names of all reads in entire file, which for large file eventually
consumes all of memory.  Only keep read name cache for the reads that are together in one variant
region, so that a pair of reads within the same variant region will still be joined via read name.
Otherwise the ability to connect a read to its mate is lost.

Update MD5s in integration test to reflect altered output.
Add new integration test that confirms that pair within variant region is joined by read name.
2013-03-19 14:12:33 -04:00
Ryan Poplin 0cf5d30dac Bug fix in assembly for edge case in which the extendPartialHaplotype function was filling in deletions in the middle of haplotypes. 2013-03-15 14:20:25 -04:00
Ryan Poplin b8991f5e98 Fix for edge case bug of trying to create insertions/deletions on the edge of contigs.
-- Added integration test using MT that previously failed
2013-03-15 12:32:13 -04:00
Mark DePristo 2d35065238 QualityByDepth remaps QD values > 40 to a gaussian around 30
-- This is a temporarily fix / hack to deal with the very high QD values that are generated by the haplotype caller when nearby events occur within reads.  In that case, the QUAL field can be many fold higher than normal, and results in an inflated QD value.  This hack projects such high QD values back into the good range (as these are good variants in general) so they aren't filtered away by VQSR.
-- The long-term solution to this problem is to move the HaplotypeCaller to the full bubble calling algorithm
-- Update md5s
2013-03-14 16:09:41 -04:00
Mark DePristo b5b63eaac7 New GATKSAMRecord concept of a strandless read, update to FS
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads.  Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands.  This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called.  Added new GATKSAMRecord method setReducedCounts() that does the right thing.  Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation.  Differences are just minor updates to the FS
2013-03-13 11:16:36 -04:00
MauricioCarneiro 4403e3572a Merge pull request #94 from broadinstitute/gg_gatkdoc_docfixes_GSATDG-111 2013-03-12 13:02:35 -07:00
MauricioCarneiro 3a16ba04d4 Merge pull request #97 from broadinstitute/eb_refactor_sliding_window
Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug
2013-03-12 12:27:26 -07:00
Geraldine Van der Auwera f972963918 Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
2013-03-12 10:57:14 -04:00
Eric Banks 05e69b6294 Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug.
* Allow RR to write its BAM to stdout by setting required=true for @Output.
  * Fixed bug in sliding window where a break in coverage after a long stretch without
     a variant region was causing a doubling of all the reads before the break.
  * Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods.
  * Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().
2013-03-12 09:06:55 -04:00
Ryan Poplin c96fbcb995 Use the indel heterozygosity prior when calling indels with the HC 2013-03-11 14:12:43 -04:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
Eric Banks 78721ee09b Added new walker to split MNPs into their allelic primitives (SNPs).
* Can be extended to complex alleles at some point.
  * Currently only works for bi-allelics (documented).
  * Added unit and integration tests.
2013-03-05 23:16:42 -05:00
Mark DePristo 42d3919ca4 Expanded functionality for writing BAMs from HaplotypeCaller
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment.  Refactored out the big main and supplementary static methods.  Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode.  This test only ensures that the code runs and doesn't exception out.  It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype.  Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made.  Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
2013-03-03 12:07:29 -05:00
David Roazen c5c99c8339 Split long-running integration test classes into multiple classes
This is to facilitate the current experiment with class-level test
suite parallelism. It's our hope that with these changes, we can get
the runtime of the integration test suite down to 20 minutes or so.

-UnifiedGenotyper tests: these divided nicely into logical categories
 that also happened to distribute the runtime fairly evenly

-UnifiedGenotyperPloidy: these had to be divided arbitrarily into two
 classes in order to halve the runtime

-HaplotypeCaller: turns out that the tests for complex and symbolic
 variants make up half the runtime here, so merely moving these into
 a separate class was sufficient

-BiasedDownsampling: most of these tests use excessively large intervals
 that likely can't be reduced without defeating the goals of the tests. I'm
 disabling these tests for now until they can either be redesigned to use smaller
 intervals around the variants of interest, or refactored into unit tests
 (creating a JIRA for Yossi for this task)
2013-03-01 13:55:23 -05:00
Eric Banks 69b8173535 Replace uses of NestedHashMap with NestedIntegerArray.
* Removed from codebase NestedHashMap since it is unused and untested.
 * Integration tests change because the BQSR CSV is now sorted automatically.
 * Resolves GSA-732
2013-02-27 14:03:39 -05:00
David Roazen 6466463d5a Merged bug fix from Stable into Unstable 2013-02-26 21:54:54 -05:00
David Roazen 12a3d7ecad Fix licenses on files modified in 2.4-1 2013-02-26 21:53:17 -05:00
David Roazen a53b4a7521 Merged bug fix from Stable into Unstable 2013-02-26 21:41:13 -05:00
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00
depristo 93205154b5 Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776
Eb fix pairhmm unittest gsa 776
2013-02-26 11:56:58 -08:00
Eric Banks 734353e9df Merge pull request #60 from broadinstitute/mc_fastutil_GSATDG-83
Brought all of ReduceReads to fastutils
2013-02-26 11:56:41 -08:00
David Roazen 8b29030467 Change default downsampling coverage target for the HaplotypeCaller to 250
-was previously set to 30, which seems far too aggressive given that with
 ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any
 pileup returned by LIBS

-250 is a more conservative default used by the UG

-can adjust down/up later based on further experiments (GSA-699 will
 remain open)

-verified with Ryan that all integration test differences are either
 innocent or represent an improvement

GSA-699
2013-02-26 09:33:25 -05:00
Eric Banks 396b7e0933 Fixed the intermittent PairHMM unit test failure.
The issue here is that the OptimizedLikelihoodTestProvider uses the same basic underlying class as the
BasicLikelihoodTestProvider and we were using the BasicTestProvider functionality to pull out tests of
that class; so if the optimized tests were run first we were unintentionally running those same tests
again with the basic ones (but expecting different results).
2013-02-25 15:05:13 -05:00
Eric Banks 7519484a38 Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods. 2013-02-25 15:04:23 -05:00
Ryan Poplin 89e2943dd1 The maximum kmer length is derived from the reads.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
2013-02-25 14:40:25 -05:00
Mauricio Carneiro 9e5a31b595 Brought all of ReduceReads to fastutils
-- Added unit tests to ReduceReads name compression
-- Updated reduce reads walker for unit testing

GSATDG-83
2013-02-23 22:53:23 -05:00
Ryan Poplin 6a639c8ffc Replace Smith-Waterman alignment with the bubble traversal.
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
2013-02-22 15:42:16 -05:00
Eric Banks 0055a6f1cd Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774
Fix to the Indel Realigner bug described in GSA-774
2013-02-19 16:16:48 -08:00
Mauricio Carneiro 371ea2f24c Fixed IndelRealigner reference length bug (GSA-774)
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads

GSA-774 #resolve
2013-02-19 16:00:36 -05:00
Ryan Poplin c025e84c8b Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping.
-- Updated a few integration tests for HC, UG, and UG general ploidy
2013-02-19 14:09:24 -05:00
Mark DePristo be45edeff2 ActivityProfile and ActiveRegions respects engine interval boundaries
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet.  Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
2013-02-18 10:40:25 -05:00
Mark DePristo 73a363b166 Update MD5s due to new QualityUtils calculations
-- Increase the allowed runtime of one UG integration test
-- The GGA indels mode runs two UG commands, and was barely under the 10 minute limit before.  Some updates can push this right over the edge.  Increased limit
-- CalibrateGenotypeLikelihoods runs on a small data set now, so it's faster
-- Updating MD5s due to more correct quality utils.  DuplicatesWalkers quality estimates have changed.  One UG test has different FS and rank sum tests because the conversion to phred scores are slightly (second decimal place) different
2013-02-16 07:31:38 -08:00
Mark DePristo 9a29d6d4be Fix an catastrophic bug (WoW!) in the reference calculation of the UG
-- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and was as a side effect other utils converted this to a meaningless 0.0.  This is all because there wasn't a unit test.
-- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.
2013-02-16 07:31:38 -08:00
Mark DePristo 9e28d1e347 Cleanup and unit tests for QualityUtils
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value.  Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual.  Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils.  Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
2013-02-16 07:31:37 -08:00
MauricioCarneiro 1dd284a5bb Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720
PrintReads writes a header when used with -BQSR
2013-02-15 07:18:28 -08:00
Tad Jordan 6cb80591e3 PrintReads writes a header when used with -BQSR 2013-02-14 22:19:14 -05:00