Commit Graph

12 Commits (55bab9fa87ab5410d20caf889f9afa85fcb6130c)

Author SHA1 Message Date
Geraldine Van der Auwera 4d20c71e09 Improvements to various gatkdocs
- Make -rod required
    - Document that contaminationFile is currently not functional with HC
    - Document liftover process more clearly
    - Document VariantEval combinations of ST and VE that are incompatible
    - Added a caveat about using MVLR from HC and UG.
    - Added caveat about not using -mte with -nt
    - Clarified masking options
    - Fixed docs based on Eric's comments
2013-08-10 10:01:31 -07:00
Mark DePristo 318f7e74e4 Better docs on the meaning of heterozygosity
-- [delivers #53522209]
2013-08-07 07:27:45 -04:00
Eric Banks b16c7ce050 A whole slew of improvements to the Haplotype Caller and related code.
1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout.

2. Updates to the KB assessment functionality:
   a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
   b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.

3. Make the HC consistent in how it treats the pruning factor.  As part of this I removed and archived
   the DeBruijn assembler.

4. Improvements to the likelihoods for the HC
   a. We now include a "tristate" correction in the PairHMM (just like we do with UG).  Basically, we need
      to divide e by 3 because the observed base could have come from any of the non-observed alleles.
   b. We now correct overlapping read pairs.  Note that the fragments are not merged (which we know is
      dangerous).  Rather, the overlapping bases are just down-weighted so that their quals are not more
      than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
      turned into Q0s for now.
   c. We no longer run contamination removal by default in the UG or HC.  The exome tends to have real
      sites with off kilter allele balances and we occasionally lose them to contamination removal.

5. Improved the dangling tail merging implementation.
2013-07-12 10:09:10 -04:00
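The likelihood changes in items 4a and 4b of the commit above reduce to a little arithmetic, sketched below in Python. This is an illustration of the description in the message, not the GATK code: the function names are hypothetical, and the default PCR error rate of Q40 is only inferred from the message's statement that the cap works out to Q20.

```python
def phred_to_error(qual):
    """Convert a phred-scaled quality into an error probability e."""
    return 10.0 ** (-qual / 10.0)


def tristate_mismatch_prob(qual):
    """Probability of observing one specific wrong base.

    An error could have produced any of the 3 non-true bases,
    so the error probability e is divided by 3.
    """
    return phred_to_error(qual) / 3.0


def adjust_overlapping_quals(base1, qual1, base2, qual2, pcr_error_phred=40):
    """Down-weight the two mates' qualities at one overlapping position.

    pcr_error_phred=40 is an assumed default (half of it gives the Q20 cap
    mentioned in the commit message). The reads themselves are not merged.
    """
    cap = pcr_error_phred // 2
    if base1 != base2:
        # Mismatching overlapping bases are turned into Q0.
        return 0, 0
    # Agreeing bases keep their quality, but never above the cap.
    return min(qual1, cap), min(qual2, cap)
```

For example, `adjust_overlapping_quals("A", 35, "A", 12)` caps only the first quality, while any disagreement between the mates zeroes both.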
Mark DePristo e3e8631ff5 Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction.  This fixes an annoying bug where you'd perfectly assemble the sequence into the reference graph but then return a null graph, and because null was also used to indicate assembly failure, you'd increase your kmer size.
--
-- Output format looks like:
20      10026072        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026073        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026074        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,121
20      10026075        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026076        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026077        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026078        .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217
20      10026079        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,240
20      10026080        .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,268
20      10026081        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,267

We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values.  Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but don't yet account properly for alignment uncertainty.
-- Can be enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval.  The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures.  Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line.  Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
2013-07-02 15:46:38 -04:00
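The reference-confidence records shown in the commit above can be read with a few lines of Python. This is a rough sketch that assumes a single-sample record with exactly the GT:AD:DP:GQ:PL format from the example output; `parse_ref_conf_record` is a hypothetical helper, not part of GATK or any VCF library.

```python
def parse_ref_conf_record(line):
    """Parse one single-sample record like those in the commit message.

    Splits on any whitespace, so it accepts both real tab-delimited VCF
    lines and the space-padded copies shown in the commit message.
    """
    fields = line.split()
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    call = dict(zip(format_keys, sample_values))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,  # the symbolic <NON_REF> allele
        "GT": call["GT"],
        "AD": [int(x) for x in call["AD"].split(",")],
        "DP": int(call["DP"]),
        "GQ": int(call["GQ"]),
        "PL": [int(x) for x in call["PL"].split(",")],
    }


record = parse_ref_conf_record(
    "20      10026078        .       C       <NON_REF>       .       .       ."
    "       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217"
)
```

In the sample records, GQ equals the second-smallest PL value (here 15), which gives a quick sanity check on parsed rows.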
Guillermo del Angel 4168aaf280 Add feature to specify Allele frequency priors by command line when calling variants.
Use case:
The default AF priors used (infinite sites model, neutral variation) are appropriate in the case where the reference allele is ancestral, and the called allele is a derived allele.
Most of the time this is true, but in several population studies and in ancient DNA analyses this might introduce reference biases, and in some other cases it's hard to ascertain what the ancestral allele is (normally requiring a lookup of the homologous chimp sequence).
Specifying no prior is one solution, but this may introduce a lot of artifactual het calls in shallower coverage regions.
With this option, users can specify what the prior for each AC should be according to their needs, subject to the restrictions documented in the code and in GATK docs.
-- Updated ancient DNA single sample calling script with filtering options and other cleanups.
-- Added integration test. Removed old -noPrior syntax.
2013-04-26 19:06:39 -04:00
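As background for the allele-frequency prior commit above, the textbook form of an infinite-sites prior over allele counts can be sketched as follows. This is an assumption-laden illustration, not the GATK implementation: the function name is hypothetical, `theta=0.001` is an assumed heterozygosity value, and the restrictions GATK places on user-supplied priors are not reproduced here.

```python
def infinite_sites_priors(num_chromosomes, theta=0.001):
    """Prior probability of each alternate allele count AC = 0..num_chromosomes.

    Under the neutral infinite-sites model, P(AC = i) is proportional to
    theta / i for i >= 1; AC = 0 (hom-ref) takes the remaining mass.
    """
    non_ref = [theta / i for i in range(1, num_chromosomes + 1)]
    p_ref = 1.0 - sum(non_ref)
    if p_ref <= 0.0:
        raise ValueError("theta is too large for this many chromosomes")
    return [p_ref] + non_ref


# One diploid sample has two chromosomes, so AC ranges over 0, 1, 2.
diploid_priors = infinite_sites_priors(2)
```

Because almost all of the mass sits on AC = 0, a prior of this shape favors the reference allele, which is the reference bias the commit message describes for cases where the reference is not ancestral.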
Eric Banks ba2c3b57ed Extended the allele-biased down-sampling functionality to handle reduced reads.
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.

Added lots of unit tests for new functionality.
2013-04-26 11:23:17 -04:00
Ryan Poplin c96fbcb995 Use the indel heterozygosity prior when calling indels with the HC
2013-03-11 14:12:43 -04:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (on the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be alignable. There are third-party tools that remove adaptors and potentially merge read pairs, but they are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors if such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
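The idea behind overlap-based adaptor trimming in the commit above can be sketched in Python: when the insert is shorter than the read length, the start of read 1 is the reverse complement of the start of read 2, and everything past that shared stretch is adaptor. This is a simplified illustration under stated assumptions (exact matching, no sequencing errors, a hypothetical `min_overlap` parameter); it is not how ReadAdaptorTrimmer is actually implemented.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_adaptor(read1, read2, min_overlap=10):
    """Trim read-through adaptor from a pair using only the pair overlap.

    Tries each candidate insert length, longest first. If the first
    insert_len bases of read1 equal the reverse complement of the first
    insert_len bases of read2, the pair spans the whole insert and the
    remaining bases are treated as adaptor. Unmatched pairs are returned
    unchanged.
    """
    max_len = min(len(read1), len(read2))
    for insert_len in range(max_len, min_overlap - 1, -1):
        if read1[:insert_len] == reverse_complement(read2[:insert_len]):
            return read1[:insert_len], read2[:insert_len]
    return read1, read2
```

No adaptor sequence is supplied anywhere, which mirrors the "auto-detected" aspect of the walker; a production version would also need to tolerate mismatches in the overlap.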
Yossi Farjoun de03f17be4 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Integration tests to UnifiedGenotyper (using artificially contaminated BAMs created from a mixture of two broadly consented samples) were added
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is
   _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.)
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests to the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or UnifiedGenotyper, into a new IntegrationTest class.
2013-02-04 18:24:36 -05:00
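The two-column Sample/Fraction file and the defaulting map described in the commit above could look roughly like this in Python. It is a sketch, not a port of loadContaminationFile or DefaultHashMap: the function name, the blank-line handling, the 0 to 1 range check, and the duplicate check are all assumptions made for illustration.

```python
from collections import defaultdict


def load_contamination(lines, default_fraction=0.0):
    """Build a sample -> contamination fraction map with a default value.

    Columns may be separated by spaces or tabs. Samples that are not
    listed fall back to default_fraction when looked up.
    """
    table = defaultdict(lambda: default_fraction)
    for line_number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        parts = line.split()
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected 2 columns, got {len(parts)}")
        sample, fraction_text = parts
        try:
            fraction = float(fraction_text)
        except ValueError:
            raise ValueError(f"line {line_number}: bad fraction {fraction_text!r}")
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"line {line_number}: fraction {fraction} outside [0, 1]")
        if sample in table:
            raise ValueError(f"line {line_number}: duplicate sample {sample!r}")
        table[sample] = fraction
    return table
```

Passing an open file object works as well as a list of strings, since the function only iterates over lines.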
Eric Banks f7b80116d6 Don't let users play with the different exact model implementations.
2013-01-25 10:52:02 -05:00
Mauricio Carneiro 2a4ccfe6fd Updated all JAVA file licenses accordingly
GSATDG-5
2013-01-10 17:06:41 -05:00
Eric Banks f0bd1b5ae5 Okay, all public->protected dependencies are gone except for the BQSR arguments. I'll need to think through this but should be able to make that work too.
2013-01-08 15:46:32 -05:00