gatk-3.8/protected/java/test/org/broadinstitute/sting/gatk/walkers
Eric Banks df189293ce Improve compression in Reduce Reads by incorporating probabilistic model and global het compression
The Problem:
  Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
  regions in an exome) causes RR (with default settings) to consider it a variant region.  This
  seriously hurts compression performance.

The Solution:
  1. We now use a probabilistic model for determining whether we can create a consensus (in other
  words, whether we can error correct a site) instead of the old ratio threshold.  We calculate
  the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
  that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
  2. We also allow het compression globally, not just at known sites.  So if we cannot create a
  consensus at a given site then we try to perform het compression; and if we cannot perform het
  compression that we just don't reduce the variant region.  This way very wonky regions stay
  uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
  locus get het compressed.

Details:
  1. -minvar is now deprecated in favor of -min_pvalue.
  2. Added integration test for bad pvalue input.
  3. -known argument still works to force het compression only at known sites; if it's not included
     then we allow het compression anywhere.  Added unit tests for this.
  4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
     Before finalizing het compression, we now check for insertions or other variant regions (usually due
     to multi-allelics) which can render a region incompressible (and we back out if we find one).  We
     were checking for excessive softclips before, but now we add these tests too.
  5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
     consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
     out instead of backing out all of them.
  6. We no longer create a mini read at the stop of the variant window for het compression.  Instead, we
     allow it to be part of the next global consensus.
  7. The coverage test is no longer run systematically on all integration tests because the quals test
     supercedes it.  The systematic quals test is now much stricter in order to catch bugs and edge cases
     (very useful!).
  8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
     for a consensus was affected by good and bad bases/reads).
  9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
     This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
  10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
     with insertions from a header.

Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
2013-04-16 18:19:06 -04:00
..
annotator Updated all JAVA file licenses accordingly 2013-01-10 17:06:41 -05:00
beagle Updated all JAVA file licenses accordingly 2013-01-10 17:06:41 -05:00
bqsr Trivial BQSR bug fixes and improvement 2013-04-11 17:08:35 -04:00
compression/reducereads Improve compression in Reduce Reads by incorporating probabilistic model and global het compression 2013-04-16 18:19:06 -04:00
diagnostics walker to calculate per base coverage distribution 2013-02-07 16:33:05 -05:00
diffengine Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs) 2013-03-12 10:57:14 -04:00
fasta Updated all JAVA file licenses accordingly 2013-01-10 17:06:41 -05:00
filters Don't allow users to specify keys and IDs that contain angle brackets or equals signs (not allowed in VCF spec). 2013-04-05 00:52:32 -04:00
genotyper UnifiedGenotyper bugfix: don't create haplotypes with 0 bases 2013-04-13 14:57:55 -04:00
haplotypecaller Restore the read filter function in the HaplotypeCaller. 2013-04-16 12:01:30 -04:00
indels Fixed IndelRealigner reference length bug (GSA-774) 2013-02-19 16:00:36 -05:00
phasing Updated all JAVA file licenses accordingly 2013-01-10 17:06:41 -05:00
validation MathUtils.randomSubset() now uses Collections.shuffle() (indirectly, through the other methods 2013-03-29 14:52:10 -04:00
varianteval Move some VCF/VariantContext methods back to the GATK based on feedback 2013-01-29 16:56:55 -05:00
variantrecalibration Update MD5s for VQSR header change 2013-04-16 11:45:45 -04:00
variantutils Using --keepOriginalAC in SelectVariants was causing it to emit bad VCFs 2013-04-05 00:53:28 -04:00