Commit Graph

602 Commits (89c4e57478a022d822d2791e8ffd68be55a98f2c)

Author SHA1 Message Date
Valentin Ruano-Rubio 89c4e57478 gVCF <NON_REF> in all VCF lines, including variant ones, when -ERC gVCF is requested.
Changes:
-------

  <NON_REF> likelihood in variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for each read it is calculated as the second-best likelihood among the reported alleles.

  With -ERC gVCF, stand_conf_emit and stand_conf_call are forced to 0. dontGenotype is also set to false for consistency's sake.

  Integration test MD5s have been changed accordingly.
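
  The per-read rule above can be sketched as follows. This is only an illustration in Python, not the GATK code: the function name and the list-of-lists input layout are assumptions, and how the per-read values are later combined into genotype likelihoods is not shown.

```python
def non_ref_read_likelihoods(read_allele_log10):
    """For each read, return the second-best log10 likelihood among the
    reported alleles, used here as a stand-in for an unseen allele.

    read_allele_log10: one inner list per read, one log10 likelihood per
    reported allele (hypothetical layout chosen for this sketch).
    """
    result = []
    for per_allele in read_allele_log10:
        if len(per_allele) < 2:
            raise ValueError("need at least two reported alleles per read")
        # Sort descending and take the runner-up.
        second_best = sorted(per_allele, reverse=True)[1]
        result.append(second_best)
    return result
```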

Additional fix:
--------------

  Especially after adding the <NON_REF> allele, but also without it, QUAL values tended to go to 0 (a very large integer in log10 scale) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that, combineGLs has been replaced by combineGLsPrecise, which uses the log-sum-exp trick.

  In just a few cases this change results in genotype changes in integration tests, but after double-checking with unit tests and comparing combineGLs against combineGLsPrecise in the affected integration tests, the previous GT calls were either borderline cases or due to the underflow.
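
  For reference, the log-sum-exp trick in log10 space looks roughly like this. It is a generic Python sketch of the technique, not the body of combineGLsPrecise; both function names are made up for the illustration.

```python
import math

def naive_log10_sum(log10_values):
    """Sum in linear space; underflows to 0 for very negative inputs."""
    total = sum(10.0 ** v for v in log10_values)
    return math.log10(total) if total > 0.0 else float("-inf")

def log10_sum_log10(log10_values):
    """log10(sum(10**v)) computed by factoring out the maximum first."""
    m = max(log10_values)
    if m == float("-inf"):
        return m
    # Every exponent is now <= 0 and the largest term is exactly 1.0,
    # so the sum cannot underflow to zero.
    return m + math.log10(sum(10.0 ** (v - m) for v in log10_values))
```

  With two inputs of -400 the naive version returns -inf, while the shifted version returns -400 + log10(2).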
2014-01-30 11:23:33 -05:00
Valentin Ruano-Rubio 9e7bf75e89 Fix for the PairHMM transition probability miscalculation.
Problem:

The matchToMatch transition calculation was wrong, resulting in transition probabilities out of the Match state that summed to more than 1.

Reports:

https://www.pivotaltracker.com/s/projects/793457/stories/62471780
https://www.pivotaltracker.com/s/projects/793457/stories/61082450

Changes:

The transition matrix update code has been moved to a common place in PairHMMModel to DRY out its multiple copies.

The MatchToMatch transition calculation has been fixed and implemented in PairHMMModel.

Affected integration test MD5s have been updated. There were no differences in GT fields, and the differences examined always amounted to
small changes in likelihoods, which is what is expected.
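
The constraint behind the fix is that the three transitions leaving the Match state must sum to 1. A minimal Python sketch, assuming the insertion and deletion open probabilities come from phred-scaled qualities; the function names and the dictionary layout are invented here and are not the PairHMMModel API:

```python
def qual_to_error_prob(qual):
    """Convert a phred-scaled quality to a probability."""
    return 10.0 ** (-qual / 10.0)

def match_state_transitions(ins_open_qual, del_open_qual):
    """Transition probabilities out of the Match state.

    Match-to-match takes whatever probability mass is left after the
    two gap-open transitions, so the three values sum to 1.
    """
    p_ins = qual_to_error_prob(ins_open_qual)
    p_del = qual_to_error_prob(del_open_qual)
    p_match = 1.0 - (p_ins + p_del)
    if p_match < 0.0:
        raise ValueError("gap-open probabilities exceed 1")
    return {"match": p_match, "insertion": p_ins, "deletion": p_del}
```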
2014-01-26 16:30:36 -05:00
Eric Banks 0323caefc8 Added some bug fixes to the gVCF merging code after finally getting some real data to play with.
Still under construction, awaiting more test data from Valentin.
2014-01-08 08:34:35 -05:00
Ami Levy-Moonshine 6da53aea09 Write a new tool for splitting reads that have N elements in their cigar string.
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elements is split into k+1 reads. The split is done by hard clipping the rest of the bases.

In order to do this, a few changes were introduced to some other clipping methods:
- make a significant change in ClippingOp.hardClip() that prevents the splitting of a read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns
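
A rough Python sketch of the k+1 split, working on cigar strings only. It is not the walker's code: the function names are hypothetical, the input is assumed to contain no hard clips already, and alignment starts and base/quality arrays are ignored.

```python
import re

# Cigar operators that consume read bases.
READ_CONSUMING = set("MIS=X")

def parse_cigar(cigar):
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def split_on_n(cigar):
    """Split a cigar at each N element, hard clipping the rest of the read."""
    segments, current = [], []
    for length, op in parse_cigar(cigar):
        if op == "N":
            segments.append(current)
            current = []
        else:
            current.append((length, op))
    segments.append(current)

    # Number of read bases covered by each segment.
    lens = [sum(l for l, op in seg if op in READ_CONSUMING) for seg in segments]
    result = []
    for i, seg in enumerate(segments):
        before, after = sum(lens[:i]), sum(lens[i + 1:])
        parts = [f"{before}H"] if before else []
        parts += [f"{l}{op}" for l, op in seg]
        if after:
            parts.append(f"{after}H")
        result.append("".join(parts))
    return result
```

For instance, split_on_n("2M1N3M1N1M") gives ["2M4H", "2H3M1H", "5H1M"]: two N elements, three reads.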

create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils

create integration test for that class

small change in a comment in FullProcessingPipeline

last commit:

Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficient
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)

- fix ReadUtilsUnitTest.testReadWithNs according to the default behaviour of getReadCoordinateForReferenceCoordinate (in case of a reference index that falls into a deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs

- Allow the user to print the split positions (not working properly currently)
2014-01-01 22:21:36 -05:00
Eric Banks 83e09b1f64 Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine

In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process.  So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants.  I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.

Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
2013-12-31 12:07:56 -05:00
Joel Thibault fd0a02e52e New VCF engine arguments to specify an alternate IndexCreator
- CatVariants updates to use custom VCF indices
- Scala scripts for VCF index testing
2013-12-03 13:31:02 -05:00
Eric Banks 6bee6a1b53 Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles.
Previously, we would strip out the PLs and AD values since they were no longer accurate.  However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.

Now the PLs and AD get correctly selected down.

While I was in there I also refactored some related code in subsetDiploidAlleles().  There were no real changes there - I just
broke it out into smaller chunks as per our best practices.

Added unit tests and updated integration tests.
Addressed reviews.
2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio 0f99778a59 Adding Graph-based likelihood ratio calculation to HC
To activate this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelerate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

It indicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihoodCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the larger kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the larger kmerSize with haplotypes assembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSizes are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to support
   local PairHMM calculations (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: New pair-hmm implementation using some heuristics to speed up partial PairHMM calculations

 * GraphBasedLikelihoodCalculationEngine: delegates to GraphBasedLikelihoodCalculationEngineInstance the execution
     of the graph-based likelihood approach.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
     to calculate the likelihoods using the graph as a scaffold.

 * HaplotypeGraph: haplotype threading graph built from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information on how a read maps onto the HaplotypeGraph, which is
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed problems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unnecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed problem with HaplotypeGraph unit test

2013-12-02 19:37:19 -05:00
Chris Hartl 1f777c4898 Introducing the latest-and-greatest in genotyping: CalculatePosteriors.
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.

Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
 while not converged:
  * AC = [external AC] + [sample AC]
  * Prior = DirichletMultinomial[AC]
  * Posteriors = [sample GL + Prior]
  * sample AC = MLEAC(Posteriors)

This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common use case.
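
As a worked illustration of the ploidy=2 case, here is a small Python sketch. It is not the CalculatePosteriors code: the function names and the dictionary layout are assumptions, and allele counts are taken to be strictly positive (any pseudocount is left to the caller). With Dirichlet parameters a_i summing to A, two independent draws give a prior of a_i(a_i+1)/(A(A+1)) for a homozygote and 2*a_i*a_j/(A(A+1)) for a heterozygote.

```python
from itertools import combinations_with_replacement

def genotype_priors(allele_counts):
    """Dirichlet-multinomial prior over diploid genotypes (i, j), i <= j."""
    total = sum(allele_counts)
    denom = total * (total + 1)
    priors = {}
    for i, j in combinations_with_replacement(range(len(allele_counts)), 2):
        if i == j:
            priors[(i, j)] = allele_counts[i] * (allele_counts[i] + 1) / denom
        else:
            priors[(i, j)] = 2 * allele_counts[i] * allele_counts[j] / denom
    return priors

def genotype_posteriors(log10_gls, allele_counts):
    """Posterior proportional to likelihood * prior, normalized to sum to 1.

    log10_gls: dict mapping each genotype (i, j) to its log10 likelihood.
    """
    priors = genotype_priors(allele_counts)
    unnormalized = {g: (10.0 ** log10_gls[g]) * p for g, p in priors.items()}
    z = sum(unnormalized.values())
    return {g: v / z for g, v in unnormalized.items()}
```

With panel counts [3, 1] the priors are 0.6 (hom-ref), 0.3 (het) and 0.1 (hom-var).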

Fully unit tested.

Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
2013-11-27 13:00:45 -05:00
Ryan Poplin 3503050a39 Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block

-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
2013-09-06 16:56:34 -04:00
David Roazen 42d771f748 Remove org.apache.commons.collections.IteratorUtils dependency from the test suite
-This was a dependency of the test suite, but not the GATK proper,
 which caused problems when running the test suite on the packaged
 GATK jar at release time

-Use GATKVCFUtils.readVCF() instead
2013-08-21 19:44:02 -04:00
Michael McCowan c3a933ce84 Adaptations to accommodate Tribble API changes, consisting mostly of the following.
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing of generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
2013-08-19 15:52:47 -04:00
Eric Banks 6df43f730a Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes.
Having them as bytes caused problems for downstream programmers who had data with high MQs.
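
To see why a signed byte is a problem, the following Python snippet reinterprets a value the way an 8-bit signed type would. The helper name is made up for the illustration.

```python
def as_signed_byte(value):
    """Reinterpret the low 8 bits of value as a signed (two's complement) byte."""
    return int.from_bytes(bytes([value & 0xFF]), "big", signed=True)
```

A mapping quality of 60 survives unchanged, but 200 comes back as -56 and 255 as -1.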
2013-07-23 23:47:15 -04:00
Eric Banks b16c7ce050 A whole slew of improvements to the Haplotype Caller and related code.
1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout.

2. Updates to the KB assessment functionality:
   a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
   b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.

3. Make the HC consistent in how it treats the pruning factor.  As part of this I removed and archived
   the DeBruijn assembler.

4. Improvements to the likelihoods for the HC
   a. We now include a "tristate" correction in the PairHMM (just like we do with UG).  Basically, we need
      to divide e by 3 because the observed base could have come from any of the non-observed alleles.
   b. We now correct overlapping read pairs.  Note that the fragments are not merged (which we know is
      dangerous).  Rather, the overlapping bases are just down-weighted so that their quals are not more
      than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
      turned into Q0s for now.
   c. We no longer run contamination removal by default in the UG or HC.  The exome tends to have real
      sites with off kilter allele balances and we occasionally lose them to contamination removal.

5. Improved the dangling tail merging implementation.
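
Points 4a and 4b can be sketched in Python as below. This is an illustration only, not the HC code: the function names are hypothetical, the overlapping segments of the two mates are assumed to be already aligned position by position, and the default PCR error quality of 40 is inferred from the Q20 cap mentioned above.

```python
def emission_probability(read_base, hap_base, qual):
    """Base emission with the tristate correction: on a mismatch the error
    probability e is split across the three non-observed bases (e / 3)."""
    error = 10.0 ** (-qual / 10.0)
    return 1.0 - error if read_base == hap_base else error / 3.0

def adjust_overlapping_quals(bases1, quals1, bases2, quals2, pcr_error_qual=40):
    """Down-weight the overlapping bases of a read pair without merging them.

    Agreeing bases are capped at half the phred-scaled PCR error rate;
    disagreeing bases are set to Q0.
    """
    cap = pcr_error_qual // 2
    out1, out2 = [], []
    for b1, q1, b2, q2 in zip(bases1, quals1, bases2, quals2):
        if b1 == b2:
            out1.append(min(q1, cap))
            out2.append(min(q2, cap))
        else:
            out1.append(0)
            out2.append(0)
    return out1, out2
```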
2013-07-12 10:09:10 -04:00
Mark DePristo e3e8631ff5 Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction.  This fixes an annoying bug where you'd perfectly assemble the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because null was also used to indicate assembly failure
--
-- Output format looks like:
20      10026072        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026073        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026074        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,121
20      10026075        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026076        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026077        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026078        .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217
20      10026079        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,240
20      10026080        .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,268
20      10026081        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,267

We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values.  Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty.
-- Can be enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval.  The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures.  Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line.  Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
2013-07-02 15:46:38 -04:00
Mark DePristo 41aba491c0 Critical bugfix for adapter clipping in HaplotypeCaller
-- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference.  Terrible.
-- Removed the broken logic of determining whether a read adaptor is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
2013-07-02 15:46:36 -04:00
Mark DePristo fdfe4e41d5 Better GATK version and command line output
-- Previous version emitted command lines that look like:

##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."

the new version provides additional information on when the GATK was run and the GATK version in a nicer format:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">

 -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
 ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">

 -- Removed the ProtectedEngineFeaturesIntegrationTest
 -- Actual unit tests for these features!
2013-06-20 11:19:13 -04:00
Mark DePristo dd5674b3b8 Add genotyping accuracy assessment to AssessNA12878
-- Now table looks like:

Name     VariantType  AssessmentType           Count
variant  SNPS         TRUE_POSITIVE              1220
variant  SNPS         FALSE_POSITIVE                0
variant  SNPS         FALSE_NEGATIVE                1
variant  SNPS         TRUE_NEGATIVE               150
variant  SNPS         CALLED_NOT_IN_DB_AT_ALL       0
variant  SNPS         HET_CONCORDANCE          100.00
variant  SNPS         HOMVAR_CONCORDANCE        99.63
variant  INDELS       TRUE_POSITIVE               273
variant  INDELS       FALSE_POSITIVE                0
variant  INDELS       FALSE_NEGATIVE               15
variant  INDELS       TRUE_NEGATIVE                79
variant  INDELS       CALLED_NOT_IN_DB_AT_ALL       2
variant  INDELS       HET_CONCORDANCE           98.67
variant  INDELS       HOMVAR_CONCORDANCE        89.58

-- Rewrote / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles.  So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A.  This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB.  Added lots of unit tests for these functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:

20      10769255        .       A       ATGTG   165.73  .       ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE     GT:AD:DP:GQ:PL  0/1:1,9:10:6:360,0,6

Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function.  Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
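
The BEST_MATCH examples above (A/B kept as A/B, A/B subset to A/C giving A/A, B/C becoming A/B or A/C) are consistent with replacing every dropped allele by the reference allele. A Python sketch of that reading, with a hypothetical function name and covering the GT only (PLs, AD and the actual GATKVariantContextUtils logic are not modeled):

```python
def best_match_genotype(genotype, kept_alleles, ref_allele):
    """Map a diploid genotype onto a subset of alleles.

    Alleles that were dropped are replaced by the reference allele, and
    the result is ordered by position in kept_alleles.
    """
    kept = list(kept_alleles)
    if ref_allele not in kept:
        raise ValueError("the reference allele must be kept")
    replaced = [a if a in kept else ref_allele for a in genotype]
    order = {allele: index for index, allele in enumerate(kept)}
    return tuple(sorted(replaced, key=lambda a: order[a]))
```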
2013-06-13 15:05:32 -04:00
Eric Banks dadcfe296d Reworking of the dangling tails merging code.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort.  Also, we put in a check for low-complexity of graphs
and don't let those pass through.

Added tests for this implementation that checks exact SW results and correct edges added.
2013-06-11 12:53:04 -04:00
Michael McCowan 00c06e9e52 Performance improvements:
- Memoized MathUtil's cumulative binomial probability function.
 - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
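
Memoization of a cumulative binomial probability can be sketched as follows in Python. The signature is an assumption for illustration (in particular the p parameter and its default), not the MathUtils signature.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def binomial_cumulative_probability(n, k_start, k_end, p=0.5):
    """P(k_start <= X <= k_end) for X ~ Binomial(n, p).

    Results are cached per argument tuple, so repeated calls with the
    same arguments skip the summation.
    """
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(k_start, k_end + 1))
```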
2013-06-09 11:26:52 -04:00
Mark DePristo e19c24f3ee Bugfix for HaplotypeCaller error: Only one of refStart or refStop must be < 0, not both
-- This occurred because we were reverting reads with soft clips that would produce reads with negative (or 0) alignment starts.  From such reads we could end up with adaptor starts that were negative and that would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which would revert and adaptor clip reads).
-- We now hard clip away reverted soft-clipped bases that fall before the 1-based contig start in revertSoftClippedBases.
-- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString)
-- Added unit tests for reverting soft clipped bases that create a read before the contig
-- [delivers #50892431]
2013-06-04 10:33:46 -04:00
Mark DePristo 6555361742 Fix error in merging code in HC
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper.  The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H.  This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
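
A simplified Python sketch of reverting soft clips on a cigar string, including the 10H2S5H => 10H2M5H case. It is not the ReadClipper code: the function name is hypothetical, adjacent M elements are merged, and the new alignment start is simply shifted left by the leading soft clip (it may fall below 1, which the real code has to handle separately).

```python
import re

def revert_soft_clips(cigar, alignment_start):
    """Turn soft clips (S) into matches (M) and shift the alignment start."""
    elements = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

    # Length of the leading soft clip: the first non-H element, if it is an S.
    leading = 0
    for length, op in elements:
        if op == "H":
            continue
        if op == "S":
            leading = length
        break

    merged = []
    for length, op in elements:
        op = "M" if op == "S" else op
        if merged and merged[-1][1] == op:
            merged[-1] = (merged[-1][0] + length, op)
        else:
            merged.append((length, op))
    new_cigar = "".join(f"{length}{op}" for length, op in merged)
    return new_cigar, alignment_start - leading
```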
2013-05-31 16:29:29 -04:00
Ryan Poplin 61af37d0d2 Create a new normalDistributionLog10 function that is unit tested for use in the VQSR.
2013-05-30 16:00:08 -04:00
Mark DePristo a1093ad230 Optimization for ActiveRegion.removeAll
-- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion.  This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element.   New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set.  Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
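
The idea translates to any language: one pass over the list with constant-time membership checks against a set, instead of a linear search per removed element. A Python analogue (names are illustrative, and the read span update is left out):

```python
def remove_all(reads, reads_to_remove):
    """Remove, in place and in a single pass, every read found in reads_to_remove."""
    to_remove = set(reads_to_remove)
    # Set membership is O(1) on average, so the whole pass is O(len(reads)).
    reads[:] = [read for read in reads if read not in to_remove]
```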
2013-05-21 16:18:57 -04:00
Mark DePristo 371f3752c1 Subshard timeouts in the GATK
-- The previous implementation of the maxRuntime would require us to wait until all of the work was completed within a shard, which can be a substantial amount of work in the case of a locus walker with 16kb shards.
-- This implementation ensures that we exit from the traversal very soon after the max runtime is exceeded, without completing all of our work within the shard.  This is done by updating all of the traversal engines to return false for hasNext() in the nano scheduled input provider.  So as soon as the timeout is exceeded, we stop generating additional data to process, and we only have to wait until the currently executing data processing unit (locus, read, active region) completes.
-- In order to implement this timeout efficiently at this fine scale, the progress meter now lives in the genome analysis engine, and the exceedsTimeout() call in the engine looks at a periodically updated runtime variable in the meter.  This variable contains the elapsed runtime of the engine, but is updated by the progress meter daemon thread so that the engine doesn't call System.nanoTime() in each cycle of the engine, which would be very expensive.  Instead we basically wait for the daemon to update this variable, and so our precision of timing out is limited by the update frequency of the daemon, which is on the order of every few hundred milliseconds, totally fine for a timeout.
-- Added integration tests to ensure that subshard timeouts are working properly
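
The pattern can be sketched in Python as below: a meter owns an elapsed-time variable that a daemon thread refreshes periodically, and the traversal loop only reads that variable. Class and function names are invented for the sketch and do not mirror the GATK classes; the injectable clock is there only to make the behaviour easy to test.

```python
import threading
import time

class ProgressMeter:
    """Holds an elapsed-runtime value that is refreshed periodically."""

    def __init__(self, clock=time.monotonic, interval=0.2):
        self._clock = clock
        self._start = clock()
        self._interval = interval
        self._stop = threading.Event()
        self.elapsed = 0.0

    def refresh(self):
        self.elapsed = self._clock() - self._start

    def start_daemon(self):
        def loop():
            # wait() returns False on timeout, True once stop() was called.
            while not self._stop.wait(self._interval):
                self.refresh()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()

def traverse(units, meter, max_runtime, process):
    """Process units until the meter reports that max_runtime is exceeded.

    The loop never reads the clock itself; it only looks at meter.elapsed.
    """
    done = 0
    for unit in units:
        if meter.elapsed > max_runtime:
            break
        process(unit)
        done += 1
    return done
```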
2013-05-15 07:00:39 -04:00
Eric Banks 2f5ef6db44 New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points.
* This version inherits from the original SW implementation so it can use the same matrix creation method.
   * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
     right thing for indels at the edges of the alignments.
     * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
   * Lots of systematic testing added for this implementation.
   * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
2013-05-13 09:36:39 -04:00
David Roazen 639030bd6d Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements
-Diff engine output is now included in the actual exception message thrown as a
 result of an MD5 mismatch, which allows it to be conveniently viewed on the
 main page of a build in Bamboo.

Minor Additional Improvements:

-WalkerTestSpec now auto-detects test class name via new JVMUtils.getCallingClass()
 method, and the test class name is now included as a regular part of integration
 test output for each test.

-Fix race condition in MD5DB.ensureMd5DbDirectory()

-integrationtests dir is now cleaned by "ant clean"

GSA-915 #resolve
2013-05-10 19:00:33 -04:00
Mark DePristo fa8a47ceef Replace DeBruijnAssembler with ReadThreadingAssembler
Problem
-------
The DeBruijn assembler was too slow.  The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp).  This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference, and (2) needed shorter kmers to be sensitive to differences from the reference near the edge of reads, which would often be lost when there was a chain of kmers of longer length that started before and after the variant.

Solution
--------
The read threading assembler uses a fixed kmer size; by default this implementation builds two graphs, with kmer sizes 10 and 25.  The algorithm operates as follows:

identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
  find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if none exist
  for each base in the sequence from the starting vertex kmer:
    look at the existing outgoing nodes of current vertex V.  If the base in sequence matches the suffix of outgoing vertex N, thread the sequence to N, and continue
    If no matching next vertex exists, find a unique vertex with kmer K.  If one exists, merge the sequence into this vertex, and continue
    If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue

This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size.  This allows us to assemble well with even very short kmers.
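
The first step of the algorithm above (identifying non-unique kmers) can be sketched roughly as follows.  This is only an illustration, not the ReadThreadingAssembler code: the class and method names are hypothetical, and it assumes a kmer counts as non-unique when it occurs more than once within any single sequence.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NonUniqueKmers {
    /**
     * Returns the kmers of size k that occur more than once within any one sequence.
     * Assumption: "non-unique" means repeated inside a single read or the reference.
     */
    public static Set<String> find(final List<String> sequences, final int k) {
        final Set<String> nonUnique = new HashSet<>();
        for (final String seq : sequences) {
            final Set<String> seen = new HashSet<>();
            for (int i = 0; i + k <= seq.length(); i++) {
                final String kmer = seq.substring(i, i + k);
                // add() returns false when the kmer was already seen in this sequence
                if (!seen.add(kmer)) {
                    nonUnique.add(kmer);
                }
            }
        }
        return nonUnique;
    }
}
```

Kmers in the returned set would not be usable as starting or merge points, since they do not identify a single position in the sequence.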

This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:

-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples.  This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path is < pruningFactor.  This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization.  This change is essential for us to get good variants from our paths when the kmer size is small.  It also ensures that we produce a cigar from a path that depends only on the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation.  Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary.  This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5-mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence that this (or any) read could contribute against the reference is Q30. Controllable via a command line argument, defaulting to a Q60 rate. Results from 20:10-11 mb for this branch are consistent with the previous behavior, but this does help in cases where you have rare, very divergent haplotypes
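
The effect of the global mismapping rate can be illustrated with a small sketch.  This is not the HaplotypeCaller implementation; the class and method below are hypothetical, and the sketch assumes the cap is applied by flooring each haplotype's log10 likelihood for a read at the best likelihood plus log10 of the mismapping rate.

```java
public class MismappingCap {
    /**
     * Floors per-haplotype log10 likelihoods of one read so that no haplotype is
     * worse than the best one by more than the global mismapping rate.
     * Assumption: the rate is given on the Phred scale (Q30 means 10^-3).
     */
    public static double[] cap(final double[] log10Likelihoods,
                               final int phredScaledGlobalReadMismappingRate) {
        final double log10Rate = -phredScaledGlobalReadMismappingRate / 10.0;
        double best = Double.NEGATIVE_INFINITY;
        for (final double l : log10Likelihoods) {
            best = Math.max(best, l);
        }
        final double floor = best + log10Rate;
        final double[] capped = new double[log10Likelihoods.length];
        for (int i = 0; i < capped.length; i++) {
            capped[i] = Math.max(log10Likelihoods[i], floor);
        }
        return capped;
    }
}
```

For the example in the text, a read with likelihoods {-15.0 (reference), 0.0 (5-mismatch haplotype)} capped at Q30 would have its reference likelihood raised to -3.0, so the read contributes at most Q30 of evidence against the reference.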
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler

Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement an assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations.  This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate.  Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles intervals from 10-11 mb using synthetic reads carrying only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added a generic interface called, not surprisingly, SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
2013-05-08 21:41:42 -04:00
Mark DePristo f42bb86bdd e# This is a combination of 2 commits.
Only try to clip adaptors when both reads of the pair are on opposite strands

-- Read pairs that have unusual alignments, such as two reads both oriented like:

  <-----
     <-----

were previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs.  This caused us to clip extra good bases from reads.
-- Update MD5s due to the change in adaptor clipping, which adds some coverage in some places
2013-05-03 11:19:14 -04:00
Mark DePristo f5a301fb63 Bugfix for AlignmentUtils.trimCigarByBases
-- Previous version would trim down 2M2D2M into 2M if you asked for the first 2 bases, but this can result in incorrect alignment of the bases to the reference as the bases no longer span the full reference interval expected.  Fixed and added unit tests
2013-05-03 09:32:05 -04:00
David Roazen f3c94a3c87 Update expected test output for Java 7
-Changes in Java 7 related to comparators / sorting produce a large number
 of innocuous differences in our test output. Updating expectations now
 that we've moved to using Java 7 internally.

-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
 intermittent failures.
2013-05-01 16:18:01 -04:00
Eric Banks 58424e56be Setting the reduce reads count tag was all wrong in a previous commit; fixing.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly.  Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.

The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly.  Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).

Also:
1. counts are now maintained as ints whenever possible.  Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
2013-04-30 13:45:42 -04:00
Mark DePristo 0387ea8df9 Bugfix for ReadClipper with ReducedReads
-- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts.  Updated the hardClip operation to handle reduce reads, and added a unit test to make sure this works properly.  Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read
-- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously
2013-04-29 11:12:09 -04:00
Eric Banks ba2c3b57ed Extended the allele-biased down-sampling functionality to handle reduced reads.
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.

Added lots of unit tests for new functionality.
2013-04-26 11:23:17 -04:00
Mark DePristo 528c3d083a Merge pull request #191 from broadinstitute/dr_fix_rod_system_locking
Detect stuck lock-acquisition calls, and disable file locking for tests
2013-04-25 09:32:54 -07:00
Mark DePristo d20be41fee Bugfix for FragmentUtils.mergeOverlappingPairedFragments
-- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences.  If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference.  Updated the code to first clip the adapter sequences from the incoming fragments
-- Update MD5s
2013-04-25 11:11:15 -04:00
David Roazen 4d56142163 Detect stuck lock-acquisition calls, and disable file locking for tests
-Acquire file locks in a background thread with a timeout of 30 seconds,
 and throw a UserException if a lock acquisition call times out

    * should solve the locking issue for most people provided they
      RETRY failed farm jobs

    * since we use NON-BLOCKING lock acquisition calls, any call that
      takes longer than a second or two indicates a problem with the
      underlying OS file lock support

    * use daemon threads so that stuck lock acquisition tasks don't
      prevent the JVM from exiting

-Disable both auto-index creation and file locking for integration tests
 via a hidden GATK argument --disable_auto_index_creation_and_locking_when_reading_rods

    * argument not safe for general use, since it allows reading from
      an index file without first acquiring a lock

    * this is fine for the test suite, since all index files already
      exist for test files (or if they don't, they should!)

-Added missing indices for files in private/testdata

-Had to delete most of RMDTrackBuilderUnitTest, since it mostly tested auto-index
 creation, which we can't test with locking disabled, but I replaced the deleted
 tests with some tests of my own.

-Unit test for FSLockWithShared to test the timeout feature
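
The timeout approach described above can be sketched roughly as follows.  This is a generic illustration rather than the FSLockWithShared code: the class name TimedCall and its method are hypothetical, and an IllegalStateException stands in for the UserException mentioned in the text.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCall {
    // Daemon threads: a stuck task must not prevent the JVM from exiting.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(new ThreadFactory() {
        @Override
        public Thread newThread(final Runnable r) {
            final Thread t = new Thread(r, "timed-call");
            t.setDaemon(true);
            return t;
        }
    });

    /** Runs the task in a background thread and fails if it does not finish in time. */
    public static <T> T callWithTimeout(final Callable<T> task, final long timeout, final TimeUnit unit) {
        final Future<T> future = POOL.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true);
            throw new IllegalStateException("Timed out after " + timeout + " " + unit, e);
        } catch (ExecutionException e) {
            throw new RuntimeException("Task failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting", e);
        }
    }
}
```

Under these assumptions, a caller would wrap a non-blocking call such as FileChannel.tryLock(0L, Long.MAX_VALUE, true) in a Callable and pass it in with a 30 second timeout.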
2013-04-24 22:49:02 -04:00
Eric Banks 5bce0e086e Refactored binomial probability code in MathUtils.
* Moved redundant code out of UGEngine
  * Added overloaded methods that assume p=0.5 for speed efficiency
  * Added unit test for the binomialCumulativeProbability method
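
A rough sketch of what a p=0.5 specialization might look like is given below.  The class name and signatures are hypothetical and are not taken from MathUtils; the point is only that with p=0.5 the term p^k * (1-p)^(n-k) collapses to 0.5^n, so it no longer depends on k.

```java
public class BinomialHalf {
    /** log10 of the binomial coefficient C(n, k), computed as a sum of log ratios. */
    public static double log10BinomialCoefficient(final int n, final int k) {
        double sum = 0.0;
        for (int i = 1; i <= k; i++) {
            sum += Math.log10((double) (n - k + i) / i);
        }
        return sum;
    }

    /** P(X = k) for X ~ Binomial(n, 0.5). */
    public static double binomialProbability(final int n, final int k) {
        return Math.pow(10.0, log10BinomialCoefficient(n, k) + n * Math.log10(0.5));
    }

    /** P(kStart <= X <= kEnd) for X ~ Binomial(n, 0.5), both ends inclusive. */
    public static double binomialCumulativeProbability(final int n, final int kStart, final int kEnd) {
        double total = 0.0;
        for (int k = kStart; k <= kEnd; k++) {
            total += binomialProbability(n, k);
        }
        return total;
    }
}
```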
2013-04-16 18:19:07 -04:00
Mark DePristo 50cdffc61f Slightly improved Smith-Waterman parameter values for HaplotypeCaller Path comparisons
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:

ref:               ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

in which the alt and ref haplotypes differ in having an indel at both the start and end of the bubble.  The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in either the above alignment or the following alignment, depending on exactly how many GA units were present in the bubble.

ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA

The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls.  I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.

-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well, somewhat systematically, that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all

General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly.  Deprecated old call to inline constants.  This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default.  Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
2013-04-11 18:22:55 -04:00
Ryan Poplin 850be5e9da Bug fix in SWPairwiseAlignment.
-- When the alignments are sufficiently far apart from each other, all the scores in the sw matrix could be negative, which screwed up the max score calculation since it started at zero.
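
The class of bug described here can be shown with a small, generic sketch; it is not the SWPairwiseAlignment code, and the names are hypothetical.  A running maximum initialized to zero never selects any cell when every score is negative, whereas initializing it to the smallest possible value does.

```java
public class MaxScore {
    /** Returns the index of the largest score, even when all scores are negative. */
    public static int argMax(final int[] scores) {
        int best = Integer.MIN_VALUE; // starting at 0 would miss all-negative inputs
        int bestIndex = -1;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > best) {
                best = scores[i];
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```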
2013-04-10 16:04:37 -04:00
Mark DePristo 6d22485a4c Critical bugfix to ReduceRead functionality of the GATKSAMRecord
-- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5].  The only function that actually gave the real counts was getReducedCount(int i), which did the proper decoding.  Now GATKSAMRecord decodes the tag into the proper depths vector so that getReducedCounts() returns what one reasonably expects it to, and getReducedCount(i) merely looks up the value at i.  Added unit test to ensure this behavior going forward.
-- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.
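
The offset encoding in the example above can be sketched as follows.  This is a simplified illustration with hypothetical names, not the GATKSAMRecord code: it assumes every value after the first is stored as an offset from the first count, and it ignores any handling of values that do not fit in a byte.

```java
public class ReducedCountsCodec {
    /** Decodes a tag such as [10, 5, -1, -5] into depths [10, 15, 9, 5]. */
    public static int[] decode(final byte[] tag) {
        final int[] counts = new int[tag.length];
        if (tag.length == 0) {
            return counts;
        }
        counts[0] = tag[0];
        for (int i = 1; i < tag.length; i++) {
            counts[i] = tag[0] + tag[i]; // offset relative to the first count
        }
        return counts;
    }

    /** Encodes depths as the first count followed by offsets from it. */
    public static byte[] encode(final int[] counts) {
        final byte[] tag = new byte[counts.length];
        if (counts.length == 0) {
            return tag;
        }
        tag[0] = (byte) counts[0];
        for (int i = 1; i < counts.length; i++) {
            tag[i] = (byte) (counts[i] - counts[0]);
        }
        return tag;
    }
}
```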
2013-04-08 12:47:50 -04:00
Mark DePristo af593094a2 Major improvements to HC that trims down active regions before genotyping
-- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension).  Radically speeds up calculations when using large active region extensions.  The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extent as possible.  The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes
-- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter.  The previous error corrector was just broken (conceptually) and was disabled by default in the engine.  Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java
-- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler
2013-04-08 12:47:49 -04:00
Mark DePristo f1d772ac25 LD-based merging algorithm for nearby events in the haplotypes
-- Moved R^2 LD haplotype merging system to the utils.haplotype package
-- New LD merging only enabled with HC argument.
-- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enable it via a static variable
-- A few misc. bug fixes in LDMerger itself
-- Refactoring of Haplotype event splitting and merging code
-- Renamed EventExtractor to EventMap
-- EventMap has a static method that computes the event maps among n haplotypes
-- Refactor Haplotype score and base comparators into their own classes and unit tested them
-- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it.
-- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator.  For those who haven't seen this testing idiom, have a look; it's very useful
-- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population.
-- Fixed fundamental bug in the way the previous R^2 implementation worked
-- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are
-- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed.  Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events
-- Bugfixes for EventMap in the HaplotypeCaller as well.  Previous version was overly restrictive, requiring that the first event to make it into a block substitution was a snp.  In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M.  The new code supports this.  UnitTested and documented as well.  LDMerger handles the case where merging two alleles results in a no-op event.  Merging CA/C + A/AA -> CAA/CAA -> no op.  Handles this case by removing the two events.  UnitTested
-- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled
-- This new version does a much more specific test (that's actually right).  Here's the new algorithm:

     * Compute probability that two variants are in phase with each other and that no
     * compound hets exist in the population.
     *
     * Implemented as a likelihood ratio test of the hypothesis:
     *
     * x11 and x22 are the only haplotypes in the populations
     *
     * vs.
     *
     * all four haplotype combinations (x11, x12, x21, and x22) all exist in the population.
     *
     * Now, since we have to have both variants in the population, we exclude the x11 & x11 state.  So the
     * p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22).
     *
     * Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are:
     *
     * - P(x11 & x12 & x21) -- we have hom-ref and both hets
     * - P(x22 & x12 & x21) -- we have hom-alt and both hets
     * - P(x22 & x12) -- one haplotype is 22 and the other is het 12
     * - P(x22 & x21) -- one haplotype is 22 and the other is het 21
2013-04-08 12:47:48 -04:00
Mark DePristo 0310499b65 System to merge multiple nearby alleles into block substitutions
-- Block substitution algorithm that merges nearby events based on distance.
-- Also does some cleanup of GenotypingEngine
2013-04-08 12:47:47 -04:00
Mark DePristo bff13bb5c5 Move Haplotype class to its own package in utils 2013-04-08 12:47:47 -04:00
Mark DePristo b7d59ea13b LIBS unit test debugging should be false 2013-04-08 12:47:47 -04:00
David Roazen 5baf906c28 Intervals: fix bug where we could fail to find the intersection of unsorted/missorted interval lists
-The algorithm for finding the intersection of two sets of intervals
 relies on the sortedness of the intervals within each set, but the engine
 was not sorting the intervals before attempting to find the intersection.

-The result was that if one or both interval lists was unsorted / lexicographically
 sorted, we would often fail to find the intersection correctly.

-Now the IntervalBinding sorts all sets of intervals before returning them,
 solving the problem.

-Added an integration test for this case.
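
Why sortedness matters can be seen in a small sketch of a sweep-style intersection.  This is not the engine's code; the class name is hypothetical, and the sketch assumes intervals on a single contig, inclusive coordinates stored as {start, end}, and no overlaps within each list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class IntervalIntersection {
    private static List<int[]> sortedCopy(final List<int[]> intervals) {
        final List<int[]> copy = new ArrayList<>(intervals);
        Collections.sort(copy, new Comparator<int[]>() {
            @Override
            public int compare(final int[] a, final int[] b) {
                return Integer.compare(a[0], b[0]);
            }
        });
        return copy;
    }

    /** Intersects two interval lists; both are sorted first, since the sweep relies on order. */
    public static List<int[]> intersect(final List<int[]> first, final List<int[]> second) {
        final List<int[]> x = sortedCopy(first);
        final List<int[]> y = sortedCopy(second);
        final List<int[]> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < x.size() && j < y.size()) {
            final int start = Math.max(x.get(i)[0], y.get(j)[0]);
            final int end = Math.min(x.get(i)[1], y.get(j)[1]);
            if (start <= end) {
                result.add(new int[]{start, end});
            }
            // advance whichever interval ends first
            if (x.get(i)[1] < y.get(j)[1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }
}
```

Without the sorting step, the sweep would advance past intervals that appear later in the list but earlier on the contig, and overlaps would be silently missed.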

GSA-909 #resolve
2013-04-02 14:01:52 -04:00
Chris Hartl 73d1c319bf Rarely-occurring logic bugfix for GenotypeConcordance, streamlining and testing of MathUtils
Currently, the multi-allelic test is covering the following case:

Eval   A   T,C
Comp   A   C

reciprocate this so that the reverse can be covered.

Eval   A   C
Comp   A   T,C

And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.

This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are a special case is as follows. Imagine:

Eval:   A  G,T      0/0   2/0   2/2   1/1
Comp:   A  C,T      0/0   1/0   0/0   0/0

Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:

Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)

Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.

Also in this commit, major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase; these methods were removed:
 - dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
 - vectorSum
 - array shuffle, random subset
 - countOccurances (general forms, the char form is used in the codebase)
 - getNMaxElements
 - array permutation
 - sorted array permutation
 - compare floats
 - sum() (for integer arrays and lists).

Final keyword was extensively added to MathUtils.

The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).
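
The revised contract for ratio() might look roughly like the sketch below.  The class name and exact signature are assumptions made for illustration and are not copied from MathUtils.

```java
public class Ratios {
    /**
     * Returns num / denom, treating 0/0 as 0.0 and rejecting any other
     * non-positive denominator.
     */
    public static double ratio(final long num, final long denom) {
        if (denom > 0) {
            return (double) num / denom;
        }
        if (num == 0 && denom == 0) {
            return 0.0;
        }
        throw new IllegalArgumentException(
                "Non-positive denominator: " + num + "/" + denom);
    }
}
```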

The PaperGenotyper still made use of one of these array methods; since it was the only walker using it, the method was migrated into the genotyper itself.

In addition, more extensive tests were added for
 - logBinomialCoefficient (Newton's identity should always hold)
 - logFactorial
 - log10sumlog10 and its approximation

All unit tests pass
2013-03-28 23:25:28 -04:00
Eric Banks 593d3469d4 Refactored the het (polyploid) consensus creation in ReduceReads.
* It is now cleaner and easier to test; added tests for newly implemented methods.
 * Many fixes to the logic to make it work
   * The most important change was that after triggering het compression we actually need to back it out if it
      creates reads that incorporated too many softclips at any one position (because they get unclipped).
   * There was also an off-by-one error in the general code that only manifested itself with het compression.
 * Removed support for creating a het consensus around deletions (which was broken anyways).
   * Mauricio gave his blessing for this.
 * Het compression now works only against known sites (with -known argument).
    * The user can pass in one or more VCFs with known SNPs (other variants are ignored).
    * If no known SNPs are provided het compression will automatically be disabled.
 * Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
   strandedness from normal reduced reads.
    * GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
    * This allows us to update the FisherStrand annotation to count het compressed reduced reads
       towards the FS calculation.
    * [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
       backwards compatible.]
    * Updated integration tests accordingly with new het compressed bams (both for RR and UG).
 * In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
   RR properly, so I fixed it too.
    * Also, the test in the UG engine for determining whether there are too many overlapping
       deletions is updated to handle RR.
 * I added a special hook in the RR integration tests to additionally run the systematic
   coverage checking tool I wrote earlier.
    * AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
       not lost from original to reduced bam.
    * This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
       from all but 1 sample (now fixed).
    * AssessReducedCoverage moved from private to protected for packaging reasons.
 * #resolve GSA-639

At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
2013-03-25 09:34:54 -04:00
Mark DePristo 3a8f001c27 Misc. fixes upon pull request review
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful
2013-03-20 22:54:37 -04:00