Commit Graph

839 Commits (3841635fcb4d711df818b6d62df722ef70cddffc)

Author SHA1 Message Date
Eric Banks 69e78efeae Merge pull request #366 from broadinstitute/gg_gatkdocfixes
More gatkdoc fixes
2013-08-13 04:52:03 -07:00
Eric Banks bcf9a1cda5 Merge pull request #370 from broadinstitute/rp_dont_output_filtered_variants_in_VQSR
Adding mode to VQSR to not output variant records that are filtered out ...
2013-08-12 12:01:50 -07:00
Ryan Poplin a45011d7e7 Adding mode to VQSR to not output variant records that are filtered out after applying the recalibration. Necessary for 1000G calling. 2013-08-12 11:22:59 -04:00
Ryan Poplin 59f56bef30 Cleaning up help text for the -numBad argument. 2013-08-12 09:51:56 -04:00
Geraldine Van der Auwera 4d20c71e09 Improvements to various gatkdocs
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Eric's comments
2013-08-10 10:01:31 -07:00
Mark DePristo b7d1096ced Added onlyEmitSamples argument to UnifiedGenotyper
-- When provided, this argument causes us to only emit the selected samples into the VCF.  No INFO field annotations (AC for example) or other features are modified.  Its current primary use is for efficiently evaluating joint calling.
-- Add integration test for onlyEmitSamples
2013-08-09 11:00:15 -04:00
Mark DePristo ccf0df0fea Misc. debugging functionality to FS calculation (disabled by default) 2013-08-08 12:06:23 -04:00
Mark DePristo 00f4d767e4 Merge pull request #364 from broadinstitute/md_vqsr_improvements
Separate num Gaussians for + and - GMM in VQSR
2013-08-07 04:37:45 -07:00
Mark DePristo c21402d4af Separate num Gaussians for + and - GMM in VQSR
-- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models.  However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model.
-- Now the maxGaussians argument only applies to the positive model
-- This update builds a GMM for the negative model with a default 4 max gaussians (though this can be controlled via command line parameter)
-- Removes the percentBadVariants argument.  The only way to control how many variants are included in the negative model is with minNumBad
-- Reduced the minNumBad argument default to 1000 from 2500
-- Update MD5s for VQSR.  md5s changed significantly due to underlying changes in the default GMM model.  Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected.
-- minNumBad is now numBad
-- Plot all negative training points as well, since this significantly changes our view of the GMM PDF
2013-08-07 07:36:50 -04:00
Mark DePristo 318f7e74e4 Better docs on the meaning of heterozygosity
-- [delivers #53522209]
2013-08-07 07:27:45 -04:00
Mark DePristo 40bc7d6a9c Bugfix for ReferenceConfidenceModel
-- In the case where there's some variation to assemble and evaluate but the resulting haplotypes don't result in any called variants, the reference model would exception out with "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype".  Now we detect this case and emit the standard no variation output.
-- [delivers #54625060]
2013-08-06 16:00:32 -04:00
Ryan Poplin a46f633bd6 Fix for the VQSR visualization script with the new ordering of annotations. 2013-08-02 19:10:45 -04:00
Mauricio Carneiro 285ab2ac62 Better caching for the HaplotypeCaller
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.

Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes puts haplotypes of different lengths next to each other, therefore wasting *tons* of re-compute.

Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
2013-08-02 01:27:29 -04:00
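
The ordering described in 285ab2ac62 (length first, then lexicographic) can be sketched as below. This is a minimal illustration, not the HaplotypeCaller code: haplotypes are modeled as plain base strings, and the class and method names (`HaplotypeOrder`, `sortForCache`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class HaplotypeOrder {
    // Hypothetical helper: returns a copy sorted by length, then lexicographically,
    // so that equal-length haplotypes end up adjacent to each other.
    static List<String> sortForCache(List<String> haplotypes) {
        List<String> sorted = new ArrayList<>(haplotypes);
        Comparator<String> byLengthThenLex =
                Comparator.comparingInt(String::length)
                          .thenComparing(Comparator.<String>naturalOrder());
        sorted.sort(byLengthThenLex);
        return sorted;
    }

    public static void main(String[] args) {
        // Haplotypes of the same length are grouped; within a group they share prefixes.
        System.out.println(sortForCache(List.of("ACGT", "AC", "ACG", "AAT", "AA")));
    }
}
```

With this ordering, a length change (and the cache wipe the commit message describes) happens at most once per distinct length rather than potentially between every pair of neighbours.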
Eric Banks 1e396af4d0 Two reduce reads updates/fixes:
1. Removing old legacy code that was capping the positional depth for reduced reads to 127.

Unfortunately this cap effectively performs biased down-sampling and throws off e.g. FS numbers.
Added end to end unit test that depth counts in RR can be higher than max byte.

Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127.

2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads.

Refactored it so that it can only remove non-consensus reads.
2013-08-01 14:34:59 -04:00
Ryan Poplin 4f3411f3d4 Max number of haplotypes to evaluate no longer grows unbounded with the number of samples. This is necessary for multi-sample calling projects with over 100 samples. 2013-07-31 10:48:55 -04:00
Yossi Farjoun 284176cd7b moved SnpEffUtilUnitTest to public tree 2013-07-30 17:51:40 -04:00
droazen b8709b1942 Merge pull request #332 from broadinstitute/st_fpga_hmm
FPGA support for PairHMM
2013-07-30 14:21:21 -07:00
Joseph Rose d2860a5486 Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose) 2013-07-30 15:38:32 -04:00
Mauricio Carneiro 7b731dd596 Removed native method call
and fixed indentation.
2013-07-30 13:59:58 -04:00
Geraldine Van der Auwera edbd17b8e0 Added note of caution to VQSR gatkdocs for option BOTH of recalibration mode 2013-07-26 15:51:29 -04:00
Ryan Poplin f52196496d Merge pull request #347 from broadinstitute/eb_more_dnagling_tail_improvements
More specific fix for the dangling tail edge case with a single leading deletion.
2013-07-26 07:25:47 -07:00
Ryan Poplin 8c205dda1b Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line. 2013-07-26 10:22:43 -04:00
Eric Banks 9372c5ef41 Merge pull request #334 from broadinstitute/mc_generic_input_for_qualify_missing_intervals
QualifyMissingIntervals: support different formats
2013-07-25 12:39:26 -07:00
sathibault 71eb944e62 Adding CnyPairHMMUnitTest 2013-07-25 14:19:50 -05:00
Eric Banks 5dfa863caa Fully stranded implementation of RR (plus bug fix for insertions and het compression).
Now only filtered reads are unstranded.  All consensus reads have strand, so that we
emit 2 consensus reads in general now: one for each strand.

This involved some refactoring of the sliding window which cleaned it up a lot.

Also included is a bug fix:
insertions downstream of a variant region weren't triggering a stop to the compression.
2013-07-25 14:48:53 -04:00
Eric Banks 0a2b5ddadf More specific fix for the dangling tail edge case with a single leading deletion.
The previous fix was too general (and therefore incorrect) and caused the HC to exception out.
Added "unit" test for this exact case.
2013-07-25 12:24:46 -04:00
Mauricio Carneiro 31ab0824b1 quick indentation fixes to FPGA code 2013-07-24 14:09:49 -04:00
Eric Banks 6df43f730a Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes.
Having them as bytes caused problems for downstream programmers who had data with high MQs.
2013-07-23 23:47:15 -04:00
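
The problem addressed in 6df43f730a comes from Java's `byte` being signed (range -128 to 127). The sketch below shows what happens to a mapping quality above 127 when it is narrowed to a byte. It is only an illustration of the language behaviour; the class and method names are hypothetical and are not part of ReadBackedPileup.

```java
class MappingQualityWidth {
    // Narrowing an int to a signed byte wraps values above 127 into negatives.
    static byte asSignedByte(int mappingQuality) {
        return (byte) mappingQuality;
    }

    // Masking recovers the 0..255 value, but callers must remember to do it;
    // storing the quality as an int avoids the trap entirely.
    static int asUnsigned(byte stored) {
        return stored & 0xFF;
    }

    public static void main(String[] args) {
        int mq = 200;
        byte stored = asSignedByte(mq);
        System.out.println("stored as byte: " + stored);
        System.out.println("read back unsigned: " + asUnsigned(stored));
    }
}
```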
Guillermo del Angel 9dd109b79a Last feature request from Reich/Paavo labs: the allSitePLs feature in UG worked but did not quite fulfill the requirements. What's needed is the ability to have all 10 PLs for EVERY site, regardless of whether it is variant or not. The previous version only emitted the 10 PLs at reference sites. The problem is that if all PLs are emitted at all sites and every single site is quad-allelic (the only way to have the PLs printed out in a valid way), then the ability to filter variants and to use the INFO fields may be compromised.
So the compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which has the 10 values; all other statistics and regular PLs are computed as before.
Note that the integration test had to be disabled, as the BCF2 codec apparently doesn't support writing into genotype fields other than PL,DP,AD,GQ,FT and GT.
2013-07-18 12:54:52 -04:00
Scott Thibault 5d198d3400 Added write to likelihoods.txt for batch hmm 2013-07-15 10:16:39 -05:00
sathibault 0a8f75b953 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
2013-07-15 08:17:32 -05:00
Mauricio Carneiro 8c07614321 QualifyMissingIntervals: support different formats
Problem
-------
Qualify Missing Intervals only accepted GATK formatted interval files for its coding sequence and bait parameters.

Solution
-------
There is no reason for such a limitation, so I erased all the code that did the parsing and used IntervalUtils to parse it (therefore, it now handles any type of interval file that the GATK can handle).

ps: Also added an average depth column to the output
2013-07-12 17:32:53 -04:00
Yossi Farjoun afcf7b96db - Added per-sample AlleleBiasedDownsampling capability to HaplotypeCaller
- Added integration test to show that providing a contamination value and providing same value via a file results in the same VCF

- overrode default contamination value in test
2013-07-12 16:22:02 -04:00
Eric Banks b16c7ce050 A whole slew of improvements to the Haplotype Caller and related code.
1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout.

2. Updates to the KB assessment functionality:
   a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
   b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.

3. Make the HC consistent in how it treats the pruning factor.  As part of this I removed and archived
   the DeBruijn assembler.

4. Improvements to the likelihoods for the HC
   a. We now include a "tristate" correction in the PairHMM (just like we do with UG).  Basically, we need
      to divide e by 3 because the observed base could have come from any of the non-observed alleles.
   b. We now correct overlapping read pairs.  Note that the fragments are not merged (which we know is
      dangerous).  Rather, the overlapping bases are just down-weighted so that their quals are not more
      than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
      turned into Q0s for now.
   c. We no longer run contamination removal by default in the UG or HC.  The exome tends to have real
      sites with off kilter allele balances and we occasionally lose them to contamination removal.

5. Improved the dangling tail merging implementation.
2013-07-12 10:09:10 -04:00
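
Items 4a and 4b of b16c7ce050 can be illustrated with a small sketch. It is written from the commit message alone, not from the PairHMM source: the names are hypothetical, and the PCR error quality of 40 used in `main` is an assumption chosen so that half of it matches the Q20 cap mentioned above.

```java
class LikelihoodAdjustments {
    // Convert a phred-scaled quality into an error probability e.
    static double errorProb(int phredQual) {
        return Math.pow(10.0, -phredQual / 10.0);
    }

    // "Tristate" correction: given an error, the observed base could have come
    // from any of the 3 non-observed alleles, so each one gets e / 3.
    static double mismatchProb(double e) {
        return e / 3.0;
    }

    // Overlapping read pairs: agreeing bases are capped at half the phred-scaled
    // PCR error rate; mismatching bases are set to Q0.
    static int adjustOverlapQual(int qual, boolean basesAgree, int pcrErrorQual) {
        if (!basesAgree) {
            return 0;
        }
        return Math.min(qual, pcrErrorQual / 2);
    }

    public static void main(String[] args) {
        double e = errorProb(30);
        System.out.println("P(specific wrong base | Q30) = " + mismatchProb(e));
        System.out.println("Q35 overlap, agreeing: " + adjustOverlapQual(35, true, 40));
        System.out.println("Q35 overlap, mismatching: " + adjustOverlapQual(35, false, 40));
    }
}
```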
sathibault 23fe3e449a Revert "Fixed batching bug."
This reverts commit 3e56c83d0eec7c374e5f187d1ef124d42ecc071e.
2013-07-11 11:30:37 -05:00
sathibault 7458b59bb3 Fixed batching bug. 2013-07-11 11:08:46 -05:00
Guillermo del Angel aba55dbb23 Moved some HC parameters related to active region extensions to command line arguments so that they're more easily modified. Some of these parameters need tinkering in order to call some large indels. See GSA-891 and subtasks for particular examples thereof. 2013-07-10 14:31:10 -04:00
Eric Banks 73fc7f6ab1 Reduce Reads output should never be expected to be sorted (hence the need to sort on disk) but for some reason it was with -nwayout mode. 2013-07-08 10:33:36 -04:00
Eric Banks 5f5c90e65c Fix bug introduced recently in the VariantAnnotator where only the last -comp was being annotated at a site.
Trivial fix, added integration test to cover it.
2013-07-05 00:04:52 -04:00
Mark DePristo 5f34054cc1 Remove filtering of MAPQ 0 reads from CalledHaplotypeBAMWriter 2013-07-02 15:46:49 -04:00
Mark DePristo ed0b1c5aba Fix bug in ReadThreadingAssembler in cycle failures causing NPE 2013-07-02 15:46:48 -04:00
Mark DePristo e3e8631ff5 Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction.  Fixing an annoying bug where you'd perfectly assemble the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer size because null was also used to indicate assembly failure
--
-- Output format looks like:
20      10026072        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026073        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026074        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,121
20      10026075        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026076        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026077        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026078        .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217
20      10026079        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,240
20      10026080        .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,268
20      10026081        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,267

We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values.  Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty.
-- Can be enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval.  The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures.  Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line.  Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
2013-07-02 15:46:38 -04:00
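
The sample column in the reference-confidence records shown in e3e8631ff5 follows the layout GT:AD:DP:GQ:PL. The sketch below parses that column and checks the relationship visible in the listed rows, where GQ equals the gap between the two lowest PL values. Treating that as the general rule (with a cap at 99) is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.util.Arrays;

class RefConfidenceRecord {
    // Extract the sample column (last whitespace-separated field) from a record line.
    static String sampleField(String vcfLine) {
        String[] columns = vcfLine.trim().split("\\s+");
        return columns[columns.length - 1];
    }

    // Sample column layout assumed here: GT:AD:DP:GQ:PL
    static int parseGq(String sample) {
        return Integer.parseInt(sample.split(":")[3]);
    }

    static int[] parsePls(String sample) {
        return Arrays.stream(sample.split(":")[4].split(","))
                     .mapToInt(Integer::parseInt)
                     .toArray();
    }

    // Assumed convention: GQ is the second-lowest PL minus the lowest PL, capped at 99.
    static int gqFromPls(int[] pls) {
        int[] sorted = pls.clone();
        Arrays.sort(sorted);
        return Math.min(99, sorted[1] - sorted[0]);
    }

    public static void main(String[] args) {
        String line = "20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217";
        String sample = sampleField(line);
        System.out.println("reported GQ = " + parseGq(sample)
                + ", GQ from PLs = " + gqFromPls(parsePls(sample)));
    }
}
```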
Mark DePristo 41aba491c0 Critical bugfix for adapter clipping in HaplotypeCaller
-- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference.  Terrible.
-- Removed the broken logic of determining whether a read adapter is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
2013-07-02 15:46:36 -04:00
Scott Thibault 82dcdc01c0 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
2013-06-28 10:13:05 -05:00
Scott Thibault e691fa3e19 FPGA null pointer bug fix 2013-06-28 08:52:09 -05:00
Ryan Poplin 825b603acb Merge pull request #298 from broadinstitute/md_likelihood_rank_sum
Md likelihood rank sum
2013-06-27 11:14:25 -07:00
Mark DePristo a514dd0643 Merge pull request #307 from broadinstitute/eb_rr_off_by_one_error
Proper fix for previous RR -cancer_mode fix.
2013-06-26 13:02:23 -07:00
Eric Banks 876e40466a Proper fix for previous RR -cancer_mode fix.
I "fixed" this once before but instead of testing with unit tests I used integration tests.
Bad decision.

The proper fix is in now, with a bona fide unit test included.
2013-06-26 14:48:09 -04:00
Eric Banks f242be12c0 Make this walker @Hidden 2013-06-26 11:45:21 -04:00
Mark DePristo ff76d0c877 Merge pull request #304 from broadinstitute/eb_rr_header_negative_fix_again
Fixing the 'header is negative' problem in Reduce Reads... again.
2013-06-24 11:55:52 -07:00