Commit Graph

869 Commits (5a2ef37ead2360bb87f4d74feb225cdffae54e1f)

Author SHA1 Message Date
Mauricio Carneiro 7b731dd596 Removed native method call
and fixed indentation.
2013-07-30 13:59:58 -04:00
Geraldine Van der Auwera edbd17b8e0 Added note of caution to VQSR gatkdocs for option BOTH of recalibration mode 2013-07-26 15:51:29 -04:00
Ryan Poplin f52196496d Merge pull request #347 from broadinstitute/eb_more_dnagling_tail_improvements
More specific fix for the dangling tail edge case with a single leading deletion.
2013-07-26 07:25:47 -07:00
Ryan Poplin 8c205dda1b Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line. 2013-07-26 10:22:43 -04:00
Eric Banks 9372c5ef41 Merge pull request #334 from broadinstitute/mc_generic_input_for_qualify_missing_intervals
QualifyMissingIntervals: support different formats
2013-07-25 12:39:26 -07:00
sathibault 71eb944e62 Adding CnyPairHMMUnitTest 2013-07-25 14:19:50 -05:00
Eric Banks 5dfa863caa Fully stranded implementation of RR (plus bug fix for insertions and het compression).
Now only filtered reads are unstranded.  All consensus reads have strand, so that we
emit 2 consensus reads in general now: one for each strand.

This involved some refactoring of the sliding window which cleaned it up a lot.

Also included is a bug fix:
insertions downstream of a variant region weren't triggering a stop to the compression.
2013-07-25 14:48:53 -04:00
Eric Banks 0a2b5ddadf More specific fix for the dangling tail edge case with a single leading deletion.
The previous fix was too general (and therefore incorrect) and caused the HC to exception out.
Added "unit" test for this exact case.
2013-07-25 12:24:46 -04:00
Mauricio Carneiro 31ab0824b1 quick indentation fixes to FPGA code 2013-07-24 14:09:49 -04:00
Eric Banks 6df43f730a Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes.
Having them as bytes caused problems for downstream programmers who had data with high MQs.
2013-07-23 23:47:15 -04:00
Guillermo del Angel 9dd109b79a Last feature request from Reich/Paavo labs: the allSitePLs feature in UG worked but not quite filled requirements. What's needed is the ability to have all 10 PLs for EVERY site, regardless of whether they are variant or not. Previous version only emitted the 10 PLs in reference sites. Problem is that, if all PLs are emitted in all sites and every single site is quad-allelic (only way to have the PLs printed out in a valid way) then the ability to filter variants and to use the INFO fields may be compromised.
So, compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which has the 10 values, but all other statistics and regular PLs are computed as before.
Note that integration test had to be disabled, as the BCF2 codec apparently doesn't support writing into genotype fields other than PL,DP,AD,GQ,FT and GT.
2013-07-18 12:54:52 -04:00
Scott Thibault 5d198d3400 Added write to likelihoods.txt for batch hmm 2013-07-15 10:16:39 -05:00
sathibault 0a8f75b953 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
2013-07-15 08:17:32 -05:00
Mauricio Carneiro 8c07614321 QualifyMissingIntervals: support different formats
Problem
-------
Qualify Missing Intervals only accepted GATK formatted interval files for it's coding sequence and bait parameters.

Solution
-------
There is no reason for such limitation, I erased all the code that did the parsing and used IntervalUtils to parse it (therefore, now it handles any type of interval file that the GATK can handle).

ps: Also added an average depth column to the output
2013-07-12 17:32:53 -04:00
Yossi Farjoun afcf7b96db - Added per-sample AlleleBiasedDownsampling capability to HaplotypeCaller
- Added integration test to show that providing a contamination value and providing same value via a file results in the same VCF

- overrode default contamination value in test
2013-07-12 16:22:02 -04:00
Eric Banks b16c7ce050 A whole slew of improvements to the Haplotype Caller and related code.
1. Some minor refactorings and claenup (e.g. removing unused imports) throughout.

2. Updates to the KB assessment functionality:
   a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
   b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.

3. Make the HC consistent in how it treats the pruning factor.  As part of this I removed and archived
   the DeBruijn assembler.

4. Improvements to the likelihoods for the HC
   a. We now include a "tristate" correction in the PairHMM (just like we do with UG).  Basically, we need
      to divide e by 3 because the observed base could have come from any of the non-observed alleles.
   b. We now correct overlapping read pairs.  Note that the fragments are not merged (which we know is
      dangerous).  Rather, the overlapping bases are just down-weighted so that their quals are not more
      than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
      turned into Q0s for now.
   c. We no longer run contamination removal by default in the UG or HC.  The exome tends to have real
      sites with off kilter allele balances and we occasionally lose them to contamination removal.

5. Improved the dangling tail merging implementation.
2013-07-12 10:09:10 -04:00
sathibault 23fe3e449a Revert "Fixed batching bug."
This reverts commit 3e56c83d0eec7c374e5f187d1ef124d42ecc071e.
2013-07-11 11:30:37 -05:00
sathibault 7458b59bb3 Fixed batching bug. 2013-07-11 11:08:46 -05:00
Guillermo del Angel aba55dbb23 Moved some HC parameters related to active region extensions to command line arguments so that they're more easily modified. Some of these parameters need tinkering in order to call some large indels. See GSA-891 and subtasks for particular examples thereof. 2013-07-10 14:31:10 -04:00
Eric Banks 73fc7f6ab1 Reduce Reads output should never be expected to be sorted (hence the need to sort on disk) but for some reason it was with -nwayout mode. 2013-07-08 10:33:36 -04:00
Eric Banks 5f5c90e65c Fix bug introduced recently in the VariantAnnotator where only the last -comp was being annotated at a site.
Trivial fix, added integration test to cover it.
2013-07-05 00:04:52 -04:00
Mark DePristo 5f34054cc1 Remove filtering of MAPQ 0 reads from CalledHaplotypeBAMWriter 2013-07-02 15:46:49 -04:00
Mark DePristo ed0b1c5aba Fix bug in ReadThreadingAssembler in cycle failures causing NPE 2013-07-02 15:46:48 -04:00
Mark DePristo e3e8631ff5 Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was succesfully built but didn't have variation, or truly failed in construction.  Fixing an annoying bug where you'd prefectly assembly the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because it null was also used to indicate assembly failure
--
-- Output format looks like:
20      10026072        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026073        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026074        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,121
20      10026075        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026076        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026077        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026078        .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217
20      10026079        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,240
20      10026080        .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,268
20      10026081        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,267

We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values.  Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty.
-- Can we enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning the each read to its most likley haplotype, and then evaluting the resulting pileups over the active region interval.  The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures.  Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as its no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line.  Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
2013-07-02 15:46:38 -04:00
Mark DePristo 41aba491c0 Critical bugfix for adapter clipping in HaplotypeCaller
-- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference.  Terrible.
-- Removed the broken logic of determining whether a read adaptor is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
2013-07-02 15:46:36 -04:00
Scott Thibault 82dcdc01c0 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
2013-06-28 10:13:05 -05:00
Scott Thibault e691fa3e19 FPGA null pointer bug fix 2013-06-28 08:52:09 -05:00
Ryan Poplin 825b603acb Merge pull request #298 from broadinstitute/md_likelihood_rank_sum
Md likelihood rank sum
2013-06-27 11:14:25 -07:00
Mark DePristo a514dd0643 Merge pull request #307 from broadinstitute/eb_rr_off_by_one_error
Proper fix for previous RR -cancer_mode fix.
2013-06-26 13:02:23 -07:00
Eric Banks 876e40466a Proper fix for previous RR -cancer_mode fix.
I "fixed" this once before but instead of testing with unit tests I used integration tests.
Bad decision.

The proper fix is in now, with a bonafide unit test included.
2013-06-26 14:48:09 -04:00
Eric Banks f242be12c0 Make this walker @Hidden 2013-06-26 11:45:21 -04:00
Mark DePristo ff76d0c877 Merge pull request #304 from broadinstitute/eb_rr_header_negative_fix_again
Fixing the 'header is negative' problem in Reduce Reads... again.
2013-06-24 11:55:52 -07:00
Eric Banks 165b936fcd Fixing the 'header is negative' problem in Reduce Reads... again.
Previous fixes and tests only covered trailing soft-clips.  Now that up front
hard-clipping is working properly though, we were failing on those in the tool.

Added a patch for this as well as a separate test independent of the soft-clips
to make sure that it's working properly.
2013-06-24 14:06:21 -04:00
Valentin Ruano-Rubio b97f9a487d Merged bug fix from Stable into Unstable 2013-06-24 14:00:01 -04:00
Mark DePristo 191e4ca251 Merge pull request #300 from broadinstitute/mc_move_qualify_intervals_to_protected
Few bug fixes to this tool now that it is in protected
2013-06-24 09:35:45 -07:00
Valentin Ruano-Rubio 3e5ff6095f Added the pertinent DocumentedGATKFeature annotation ot AnalyzeCovariates 2013-06-21 17:02:26 -04:00
Eric Banks d976aae2b1 Another fix for the Indel Realigner that arises because of secondary alignments.
This time we don't accidentally drop reads (phew), but this bug does cause us not to
update the alignment start of the mate.  Fixed and added unit test to cover it.
2013-06-21 16:59:22 -04:00
Mark DePristo 8caf39cb65 Experimental LikelihoodRankSum annotation
-- Added experimental LikelihoodRankSum, which required slightly more detailed access to the information managed by the base class, so added an overloaded getElementForRead also provides access to the MostLikelyAllele class
-- Added base class default implementation of getElementForPileupElement() which returns null, indicating that the pileup version isn't supported.
-- Added @Override to many of the RankSum classes for safety's sake

-- Updates to GeneralCallingPipeline: annotate sites with dbSNP IDs,
-- R script to assess the value of annotations for VQSR
2013-06-21 13:57:11 -04:00
Mark DePristo f726d8130a VariantRecalibrator bugfix for bad log10sumlog10 values
-- The VR, when the model is bad, may evaluate log10sumlog10 where some of the values in the vector are NaN. This case is now trapped in VR and handled as previously -- indicating that the model has failed and evaluation continues.
2013-06-21 12:28:53 -04:00
Mark DePristo dee51c4189 Error out when NCT and BAMOUT are used with the HaplotypeCaller
-- Currently we don't support writing a BAM file from the haplotype caller when nct is enabled.  Check in initialize if this is the case, and throw a UserException
2013-06-21 09:25:57 -04:00
Mark DePristo fdfe4e41d5 Better GATK version and command line output
-- Previous version emitted command lines that look like:

##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."

the new version provides additional information on when the GATK was run and the GATK version in a nicer format:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">

 -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
 ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">

 -- Removed the ProtectedEngineFeaturesIntegrationTest
 -- Actual unit tests for these features!
2013-06-20 11:19:13 -04:00
sathibault 3db8908ae8 Remove debug print statement 2013-06-20 08:28:58 -05:00
Mark DePristo 0672ac5032 Fix public / protected dependency 2013-06-19 19:42:09 -04:00
Valentin Ruano-Rubio 1f8282633b Removed plots generation from the BaseRecalibration software
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table

Implementation:

* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
2013-06-19 14:47:56 -04:00
Valentin Ruano-Rubio 08f92bb6f9 Added AnalyzeCovariates tool to generate BQSR assessment quality plots.
Implemtation details:

* Added tool class *.AnalyzeCovariates
* Added convenient addAll method to Utils to be able to add elements of an array.
* Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple imput recalibration report are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
2013-06-19 14:38:02 -04:00
Guillermo del Angel f176c854c6 Swapping in logless Pair HMM for default usage with UG:
-- Changed default HMM model.
-- Removed check.
-- Changed md5's: PL's in the high 100s change by a point or two due to new implementation.
-- Resulting performance improvement is about 30 to 50% less runtime when using -glm INDEL.
2013-06-18 10:06:27 -04:00
Ryan Poplin 8511c4385c Adding new pruning parameter to ReadThreadingAssembler
-- numPruningSamples allows one to specify that the minPruning factor must be met by this many samples for a path to be considered good (e.g. seen twice in three samples). By default this is just one sample.
-- adding unit test to test this new functionality
2013-06-17 16:46:40 -04:00
Guillermo del Angel f6025d25ae Feature requested by Reich lab and Paavo lab in Leipzig for ancient DNA processing:
-- When doing cross-species comparisons and studying population history and ancient DNA data, having SOME measure of confidence is needed at every single site that doesn't depend on the reference base, even in a naive per-site SNP mode. Old versions of GATK provided GQ and some wrong PL values at reference sites but these were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, that, if enabled will:
a) Emit all 3 ALT snp alleles in the ALT column.
b) Emit all corresponding 10 PL values.
It's up to the user to process these PL values downstream to make sense of these. Note that, in order to follow VCF spec, the QUAL field in a reference call when there are non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields.
-- Tweaks and fixes to processing pipelines for Reich lab.
2013-06-17 13:21:09 -04:00
delangel 485ceb1e12 Merge pull request #283 from broadinstitute/md_beagleoutput
Simpler FILTER and info field encoding for BeagleOutputToVCF
2013-06-17 09:31:03 -07:00
Eric Banks e48f754478 Fixes to several of the annotations for reduced reads (and other issues).
1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads.

2. The rank sume tests should not be using reduced reads (because they do not represent distinct observations).

3. Fixed a massive bug in the BaseQualityRankSumTest annotation!  It was not using the base qualities but rather
the read likelihoods?!

Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values.
Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and
makes insignificant distributions look significant (so it can falsely cause the filtering of good sites).

Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.
2013-06-16 01:18:20 -04:00
Mark DePristo 1677a0a458 Simpler FILTER and info field encoding for BeagleOutputToVCF
-- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE.  So if you had a A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field.  This obviously would cause problems for indels, as so the tool was blowing up in this case.  Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead.  This works in general with any type of allele.
 -- Here's an example output line from the previous and current versions:
 old: 20    64150   rs7274499       C       .       3041.68 BGL_RM_WAS_A    AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88
 new: 20    64062   .       G       .       100.39  BGL_SET_TO_MONOMORPHIC  AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15
-- update MD5s to reflect these changes
-- [delivers #50847721]
2013-06-14 15:56:13 -04:00
Mark DePristo dd5674b3b8 Add genotyping accuracy assessment to AssessNA12878
-- Now table looks like:

Name     VariantType  AssessmentType           Count
variant  SNPS         TRUE_POSITIVE              1220
variant  SNPS         FALSE_POSITIVE                0
variant  SNPS         FALSE_NEGATIVE                1
variant  SNPS         TRUE_NEGATIVE               150
variant  SNPS         CALLED_NOT_IN_DB_AT_ALL       0
variant  SNPS         HET_CONCORDANCE          100.00
variant  SNPS         HOMVAR_CONCORDANCE        99.63
variant  INDELS       TRUE_POSITIVE               273
variant  INDELS       FALSE_POSITIVE                0
variant  INDELS       FALSE_NEGATIVE               15
variant  INDELS       TRUE_NEGATIVE                79
variant  INDELS       CALLED_NOT_IN_DB_AT_ALL       2
variant  INDELS       HET_CONCORDANCE           98.67
variant  INDELS       HOMVAR_CONCORDANCE        89.58

-- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does it's best to simply match the genotype after subsetting to a set of alleles.  So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A.  This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB.  Add lots of unit tests for this functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:

20      10769255        .       A       ATGTG   165.73  .       ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE     GT:AD:DP:GQ:PL  0/1:1,9:10:6:360,0,6

Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function.  Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
2013-06-13 15:05:32 -04:00
Mark DePristo 33720b83eb No longer merge overlapping fragments from HaplotypeCaller
-- Merging overlapping fragments turns out to be a bad idea.  In the case where you can safely merge the reads you only gain a small about of overlapping kmers, so the potential gains are relatively small.  That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, and you artificially construct reads that look like the reference but actually may carry a larger true insertion w.r.t. the reference.  Because this problem isn't limited to repetitive sequeuence, but in principle could occur in any sequence, it's just not safe to do this merging.  Best to leave haplotype construction to the assembly graph.
2013-06-13 15:05:32 -04:00
Mark DePristo c837d67b2f Merge pull request #273 from broadinstitute/rp_readIsPoorlyModelled
Relaxing the constraints on the readIsPoorlyModelled function.
2013-06-13 08:40:24 -07:00
Ryan Poplin f44efc27ae Relaxing the constraints on the readIsPoorlyModelled function.
-- Turns out we were aggressively throwing out borderline-good reads.
2013-06-13 11:06:23 -04:00
Ryan Poplin d5f0848bd5 HC bam writer now sets the read to MQ0 if it isn't informative
-- Makes visualization of read evidence easier in IGV.
2013-06-13 10:11:54 -04:00
sathibault 336050ab71 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
2013-06-13 07:28:24 -05:00
Ryan Poplin d1f397c711 Fixing bug with dangling tails in which the tail connects all the way back to the reference source node.
-- List of vertices can't contain a source node.
2013-06-12 12:23:01 -04:00
Ryan Poplin e1fd3dff9a Merge pull request #268 from broadinstitute/eb_calling_accuracy_improvements_to_HC
Eb calling accuracy improvements to hc
2013-06-11 11:18:51 -07:00
Eric Banks 2c3c680eb7 Misc changes and cleanup from all previous commits in this push.
1. By default, do not include the UG CEU callset for assessment.
2. Updated md5s that are different now with all the HC changes.
2013-06-11 12:53:11 -04:00
Eric Banks dadcfe296d Reworking of the dangling tails merging code.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort.  Also, we put in a check for low-complexity of graphs
and don't let those pass through.

Added tests for this implementation that checks exact SW results and correct edges added.
2013-06-11 12:53:04 -04:00
Guillermo del Angel 55d5f2194c Read Error Corrector for haplotype assembly
Principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer but correct sequence will be supported by many reads to correct sequences will look like common k-mers. So, algorithm has 3 main steps:
1. K-mer graph buildup.
For each read in an active region, a map from k-mers to the number of times they have been seen is built.
2. Building correction map.
All "rare" k-mers that are sparse (by default, seen only once), get mapped to k-mers that are good (by default, seen at least 20 times but this is a CL argument), and that lie within a given Hamming distance (by default, =1). This map can be empty (i.e. k-mers can be uncorrectable).
3. Correction proposal
For each constituent k-mer of each read, if this k-mer is rare and maps to a good k-mer, get differing base positions in k-mer and add these to a list of corrections for each base in each read. Then, correct read at positions where correction proposal is unanimous and non-empty.

The algorithm defaults are chosen to be very stringent and conservative in the correction: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in read if all correction proposals are congruent.

By default, algorithm is disabled but can be enabled in HaplotypeCaller via the -readErrorCorrect CL option. However, at this point it's about 3x-10x more expensive so it needs to be optimized if it's to be used.
2013-06-11 12:26:24 -04:00
Eric Banks c0030f3f2d We no longer subset down to the best N haplotypes for the GL calculation.
I explain in comments within the code that this was causing problems with the marginalization over events.
2013-06-11 11:51:26 -04:00
Eric Banks c0e3874db0 Change the HC's phredScaledGlobalReadMismappingRate from 60 to 45, because Ryan and Mark told me to. 2013-06-11 11:51:26 -04:00
Eric Banks 77868d034f Do not allow the use of Ns in reads for graph construction.
Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones.
This was the cause of at least one FN where there were Ns at a SNP position.
2013-06-11 11:51:26 -04:00
Eric Banks e4e7d39e2c Fix FN problem stemming from sequence graphs that contain cycles.
Problem:
The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers.
Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final
sequence graph.  Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions.

Solution:
We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them.  If (and only if) we
can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate
a graph without cycles or attempt too many iterations.
2013-06-11 11:51:26 -04:00
Ryan Poplin 58e354176e Minor changes to docs in the graph pruning. 2013-06-11 10:33:22 -04:00
Mark DePristo 1c03ebc82d Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
2013-06-10 16:20:31 -04:00
Mark DePristo 0d593cff70 Refactor rsID and overlap detection in VariantOverlapAnnotator utility class
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele.  This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
2013-06-10 15:51:13 -04:00
Mauricio Carneiro 1d67d07cf1 better docs for Qualify Missing Intervals
now that it's available to the public, better give'em good docs!
2013-06-10 15:17:40 -04:00
Mauricio Carneiro c84f0deb1d Don't crash if cds file is not provided
CDS file should be optional.
2013-06-10 13:42:00 -04:00
Mauricio Carneiro a95fbd48e5 Moving QualifyMissingIntervals to protected
Making this walker available so we can share it with the CSER group for CLIA analysis.
2013-06-10 13:11:41 -04:00
Valentin Ruano-Rubio 96073c3058 This commit addresses JIRA issue GSA-948: Prevent users from doing the wrong thing with RNA-Seq data and the GATK.
The previous behavior is to process reads with N CIGAR operators as they are despite that many of the tools do not actually support such operator and results become unpredictible.

Now if the there is some read with the N operator, the engine returns a user exception. The error message indicates what is the problem (including the offending read and mapping position) and give a couple of alternatives that the user can take in order to move forward:

a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)

b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)

Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.

Implementation:

* Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class commont parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
2013-06-10 10:44:42 -04:00
Michael McCowan 00c06e9e52 Performance improvements:
- Memoized MathUtil's cumulative binomial probability function.
 - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
2013-06-09 11:26:52 -04:00
Mark DePristo 209dd64268 HaplotypeCaller now emits per-sample DP
-- Created a new annotation DepthPerSampleHC that is by default on in the HaplotypeCaller
-- The depth for the HC is the sum of the informative alleles at this site.  It's not perfect (as we cannot differentiate between reads that align over the event but aren't informative vs. those that aren't even close) but it's a pretty good proxy and it matches with the AD field (i.e., sum(AD) = DP).
-- Update MD5s
-- delivers [#48240601]
2013-06-06 09:47:32 -04:00
Mark DePristo 34bdf20132 Bugfix for bad AD values in UG/HC
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
2013-06-05 17:48:41 -04:00
Mark DePristo c9f5b53efa Bugfix for HC can fail to assemble the correct reference sequence in some cases
-- Ultimately this was caused by overly aggressive merging of CommonSuffixMerger.  In the case where you have this graph:

ACT [ref source] -> C
G -> ACT -> C

we would merge into

G -> ACT -> C

which would linearlize into

GACTC

Causing us to add bases to the reference source node that couldn't be recovered.  The solution was to ensure that CommonSuffixMerger only operates when all nodes to be merged aren't source nodes themselves.

-- Added a convenient argument to the haplotype caller (captureAssemblyFailureBAM) that will write out the exact reads to a BAM file that went into a failed assembly run (going to a file called AssemblyFailure.BAM).  This can be used to rerun the haplotype caller to produce the exact error, which can be hard in regions of deep coverage where the downsampler state determines the exact reads going into assembly and therefore makes running with a sub-interval not reproduce the error
-- Did some misc. cleanup of code while debugging
-- [delivers #50917729]
2013-06-03 16:16:39 -04:00
Ryan Poplin ab40f4af43 Break out the GGA kmers and the read kmers into separate functions for the DeBruijn assembler.
-- Added unit test for new function.
2013-06-03 14:00:35 -04:00
Ryan Poplin 21334e728d Merge pull request #252 from broadinstitute/md_bqsr_index_out_of_bounds
Make BQSR calculateIsIndel robust to indel CIGARs are start/end of read
2013-06-03 07:13:00 -07:00
sathibault de2a2a4cc7 Added command-line flag to disble FPGA
Completed integration with FPGA driver
2013-06-03 07:30:32 -05:00
Mark DePristo 6555361742 Fix error in merging code in HC
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper.  The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H.  This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
2013-05-31 16:29:29 -04:00
Mark DePristo 64b4d80729 Make BQSR calculateIsIndel robust to indel CIGARs are start/end of read
-- The previous implementation attempted to be robust to this, but not all cases were handled properly.  Added a helper function updateInde() that bounds up the update to be in the range of the indel array, and cleaned up logic of how the method works.  The previous behavior was inconsistent across read fwd/rev stand, so that the indel cigars at the end of read were put at the start of reads if the reads were in the forward strand but not if they were in the reverse strand.  Everything is now consistent, as can be seen in the symmetry of the unit tests:

        tests.add(new Object[]{"1D3M",   false, EventType.BASE_DELETION, new int[]{0,0,0}});
        tests.add(new Object[]{"1M1D2M", false, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"2M1D1M", false, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"3M1D",   false, EventType.BASE_DELETION, new int[]{0,0,1}});

        tests.add(new Object[]{"1D3M",   true, EventType.BASE_DELETION, new int[]{1,0,0}});
        tests.add(new Object[]{"1M1D2M", true, EventType.BASE_DELETION, new int[]{0,1,0}});
        tests.add(new Object[]{"2M1D1M", true, EventType.BASE_DELETION, new int[]{0,0,1}});
        tests.add(new Object[]{"3M1D",   true, EventType.BASE_DELETION, new int[]{0,0,0}});

        tests.add(new Object[]{"4M1I",   false, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"3M1I1M", false, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"2M1I2M", false, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
        tests.add(new Object[]{"1M1I3M", false, EventType.BASE_INSERTION, new int[]{1,0,0,0,0}});
        tests.add(new Object[]{"1I4M",   false, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});

        tests.add(new Object[]{"4M1I",   true, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
        tests.add(new Object[]{"3M1I1M", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,1}});
        tests.add(new Object[]{"2M1I2M", true, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
        tests.add(new Object[]{"1M1I3M", true, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
        tests.add(new Object[]{"1I4M",   true, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});

-- delivers #50445353
2013-05-31 13:58:37 -04:00
Ryan Poplin b5b9d745a7 New implementation of the GGA mode in the HaplotypeCaller
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
2013-05-31 10:35:36 -04:00
Ryan Poplin 61af37d0d2 Create a new normalDistributionLog10 function that is unit tested for use in the VQSR. 2013-05-30 16:00:08 -04:00
Eric Banks a5a68c09fa Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads.
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not.  So I refactored that method to include a boolean argument as to whether or not
clips should be ignored.  Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
2013-05-29 16:41:01 -04:00
Mark DePristo 684c91c2e7 Merge pull request #245 from broadinstitute/dr_enforce_min_dcov
Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
2013-05-29 09:52:13 -07:00
David Roazen a7cb599945 Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
 since low dcov values can result in problematic downsampling artifacts for locus-based
 traversals.

-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
 controls the number of reads per alignment start position, and even a dcov value of 1 might
 be safe/desirable in some circumstances.

-Also reorganize the global downsampling defaults so that they are specified as annotations
 to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
 DownsamplingMethod class.

-The default downsampling settings have not been changed: they are still -dcov 1000
 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
2013-05-29 12:07:12 -04:00
Mauricio Carneiro f1affa9fbb Turn off downsampling for DiagnoseTargets
Diagnose targets should never be downsampled. (and I didn't know there was a default downsampling going on for locus walkers)
2013-05-28 14:58:50 -04:00
Ryan Poplin 85905dba92 Bugfix for GGA mode in UG silently ignoring indels
-- Started by Mark. Finished up by Ryan.
-- GGA mode still respected glm argument for SNP and INDEL models, so that you would silently fail to genotype indels at all if the -glm INDEL wasn't provided, but you'd still emit the sites, so you'd see records in the VCF but all alleles would be no calls.
-- https://www.pivotaltracker.com/story/show/48924339 for more information
-- [resolves #48924339]
2013-05-24 13:47:26 -04:00
Mauricio Carneiro da21924b44 Make the missing targets output never use stdout
Problem
--------
Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided

Solution
--------
Make it NOT default to stdout

[Delivers #50386741]
2013-05-22 14:22:54 -04:00
Mark DePristo d167743852 Archived banded logless PairHMM
BandedHMM
---------
-- An implementation of a linear runtime, linear memory usage banded logless PairHMM.  Thought about 50% faster than current PairHMM, this implementation will be superceded by the GraphHMM when it becomes available.  The implementation is being archived for future reference

Useful infrastructure changes
-----------------------------
-- Split PairHMM into a N2MemoryPairHMM that allows smarter implementation to not allocate the double[][] matrices if they don't want, which was previously occurring in the base class PairHMM
-- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for using in unit or performance testing.  Added example of 100kb of data to private/testdata.  Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
2013-05-22 12:24:00 -04:00
Mark DePristo a1093ad230 Optimization for ActiveRegion.removeAll
-- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion.  This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element.   New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set.  Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
2013-05-21 16:18:57 -04:00
Eric Banks 1f3624d204 Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing 2013-05-21 11:35:05 -04:00
Valentin Ruano Rubio 71bbb25c9e Merge pull request #231 from broadinstitute/md_combinevariants_bugfix
CombineVariants no longer adds PASS to unfiltered records
2013-05-20 14:28:20 -07:00
Mark DePristo 62fc88f92e CombineVariants no longer adds PASS to unfiltered records
-- [Delivers #49876703]
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!
2013-05-20 16:53:51 -04:00
Ryan Poplin 507853c583 Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result.
-- The troule stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.
2013-05-20 14:29:04 -04:00
Eric Banks 8a442d3c9f @Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating. 2013-05-17 10:04:10 -04:00
sathibault 195f0c3e98 Disable CnyPairHMM 2013-05-17 08:30:23 -05:00
Yossi Farjoun 9234a0efcd Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs
Bug fixes and missing interval functionality for Diagnose Targets

While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred that unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule).

It is with these cavets that I accept the pull request.
2013-05-16 09:25:54 -07:00
Chris Hartl 6da0aed30f Update GCIT md5s to account for trivial changes to description strings 2013-05-14 19:45:30 -04:00