Commit Graph

1376 Commits (89c4e57478a022d822d2791e8ffd68be55a98f2c)

Author SHA1 Message Date
Valentin Ruano-Rubio 89c4e57478 gVCF <NON_REF> in all vcf lines including variant ones when –ERC gVCF is requested.
Changes:
-------

  <NON_REF> likelihood in variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for reach read is calculated as the second best likelihood amongst the reported alleles.

  When –ERC gVCF, stand_conf_emit and stand_conf_call are forcefully set to 0. Also dontGenotype is set to false for consistency sake.

  Integration test MD5 have been changed accordingly.

Additional fix:
--------------

  Specially after adding the <NON_REF> allele, but also happened without that, QUAL values tend to go to 0 (very large integer number in log 10) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that combineGLs has been substituted by combineGLsPrecise that uses the log-sum-exp trick.

  In just a few cases this change results in genotype changes in integration tests but after double-checking using unit-test and difference between combineGLs and combineGLsPrecise in the affected integration test, the previous GT calls were either border-line cases and or due to the underflow.
2014-01-30 11:23:33 -05:00
Valentin Ruano-Rubio 9e7bf75e89 Fix for the PairHMM transition probability miscalculation.
Problem:

matchToMatch transition calculation was wrong resulting in transition probabilites coming out of the Match state that added more than 1.

Reports:

https://www.pivotaltracker.com/s/projects/793457/stories/62471780
https://www.pivotaltracker.com/s/projects/793457/stories/61082450

Changes:

The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies.

MatchToMatch transtion calculation has been fixed and implemented in PairHMMModel.

Affected integration test md5 have been updated, there were no differences in GT fields and example differences always implied
small changes in likelihoods that is what is expected.
2014-01-26 16:30:36 -05:00
Geraldine Van der Auwera edf5880022 Updated SAMPileup codec and pileup-related docs
Problem: the codec was written to take in consensus pileups produced with pileup -c option (which consists of 10 or 13 fields per line depending on the variant type) but errored out on the basic pileup format (which only has 6 fields per line). This was inconsistent and confusing to users.

	Solution: I added a switch in the parsing to recognize and handle both cases more appropriately, and updated related docs. While I was at it I also improved error messages in CheckPileup, which now emits User Error: Bad Input exceptions when reporting mismatches. Which may not be the best thing to do (ultimately they're not really errors, they're just reporting unwelcome results) but it beats emitting Runtime Exceptions.

	Tested by CheckPileupIntegrationTest which tests both format cases.
2014-01-14 09:14:16 -05:00
Eric Banks 0323caefc8 Added some bug fixes to the gVCF merging code after finally getting some real data to play with.
Still under construction, awaiting more test data from Valentin.
2014-01-08 08:34:35 -05:00
Eric Banks f172c349f6 Adding the functionality to enable users to input a file of VCFs for -V.
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs.  Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.

For VariantContext tags with a list, by default the tags for the -V argument are applied
unless overridden by the individual line.  In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path).  For example:
foo.vcf
VCF,name=bar bar.vcf

Note that a VCF list file name must end with '.list'.

Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
2014-01-08 00:45:00 -05:00
Ami Levy-Moonshine 6da53aea09 Write a new tool for spliting reads that have N cigar string.
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elemments is plit to k+1 reads. The split is done by hard clipping the bases rest of the bases.

In order to do it, few changes were introduced to some other clipping methods:
- make a segnificant change in ClippingOp.hardClip() that prevent the spliting of read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns

create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils

create integration test for that class

small change in a comment in FullProcessingPipeline

last commit:

Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficiant
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)

- fix ReadUtilsUnitTest.testReadWithNs acoording to the defult behaviour of getReadCoordinateForReferenceCoordinate (in case of refernce index that fall into deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs

- Allow the user to print the split positions (not working proparly currently)
2014-01-01 22:21:36 -05:00
Eric Banks 5a1564d1f2 Merge pull request #456 from broadinstitute/eb_unify_hc_combination_steps
Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline.
2013-12-31 18:57:27 -08:00
Eric Banks 83e09b1f64 Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine

In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process.  So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants.  I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.

Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
2013-12-31 12:07:56 -05:00
David Roazen 4a79831adc Add ability to specify min/max required/recommended values for numeric arguments in the @Argument annotation
-You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes
 to @Argument annotations for command-line arguments

-"minValue" and "maxValue" specify hard limits that generate an exception if violated

-"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated

-Works only for numeric arguments (int, double, etc.) with @Argument annotations

-Only considers values actually specified by the user on the command line, not default values
 assigned in the code

As requested by Geraldine
2013-12-18 18:09:08 -05:00
Eric Banks 400e7c1404 Fixed bug in the filtering of lifted over variants where a deletion at the end of a contig could cause it to error out.
Added a unit test.
2013-12-11 14:07:18 -05:00
Joel Thibault 5fe0531b4d Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy 2013-12-03 23:12:14 -05:00
Joel Thibault fd0a02e52e New VCF engine arguments to specify an alternate IndexCreator
- CatVariants updates to use custom VCF indices
- Scala scripts for VCF index testing
2013-12-03 13:31:02 -05:00
Joel Thibault 42f78bdb3a Add a class-based DataProvider 2013-12-03 13:31:01 -05:00
Eric Banks 6bee6a1b53 Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles.
Previously, we would strip out the PLs and AD values since they were no longer accurate.  However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.

Now the PLs and AD get correctly selected down.

While I was in there I also refactored some related code in subsetDiploidAlleles().  There were no real changes there - I just
broke it out into smaller chunks as per our best practices.

Added unit tests and updated integration tests.
Addressed reviews.
2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio 0f99778a59 Adding Graph-based likelihood ratio calculation to HC
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelarate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the larger kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to spport
   local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations

 * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
     of the graph-based likelihood approach.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
     to calcualte the likelihoods using the graph as an scafold.

 * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
     used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed probrems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed probrem with HaplotypeGraph unit test

Adding Graph-based likelihood ratio calculation to HC

  To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelarate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the larger kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to spport
   local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations

 * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
     of the graph-based likelihood approach.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
     to calcualte the likelihoods using the graph as an scafold.

 * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
     used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed probrems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed probrem with HaplotypeGraph unit test
2013-12-02 19:37:19 -05:00
Chris Hartl 1f777c4898 Introducing the latest-and-greatest in genotyping: CalculatePosteriors.
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.

Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
 while not converged:
  * AC = [external AC] + [sample AC]
  * Prior = DirichletMultinomial[AC]
  * Posteriors = [sample GL + Prior]
  * sample AC = MLEAC(Posteriors)

This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase.

Fully unit tested.

Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
2013-11-27 13:00:45 -05:00
Ryan Poplin 3503050a39 Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block

-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
2013-09-06 16:56:34 -04:00
David Roazen 42d771f748 Remove org.apache.commons.collections.IteratorUtils dependency from the test suite
-This was a dependency of the test suite, but not the GATK proper,
 which caused problems when running the test suite on the packaged
 GATK jar at release time

-Use GATKVCFUtils.readVCF() instead
2013-08-21 19:44:02 -04:00
Michael McCowan c3a933ce84 Adaptations to accomodate Tribble API changes, comprising mostly of the following.
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
2013-08-19 15:52:47 -04:00
Yossi Farjoun 284176cd7b moved SnpEffUtilUnitTest to public tree 2013-07-30 17:51:40 -04:00
Eric Banks 6df43f730a Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes.
Having them as bytes caused problems for downstream programmers who had data with high MQs.
2013-07-23 23:47:15 -04:00
David Roazen 605a5ac2e3 GATK engine: add ability to do on-the-fly BAM file sample renaming at runtime
-User must provide a mapping file via new --sample_rename_mapping_file argument.
 Mapping file must contain a mapping from absolute bam file path to new sample name
 (format is described in the docs for the argument).

-Requires that each bam file listed in the mapping file contain only one sample
 in their headers (they may contain multiple read groups for that sample, however).
 The engine enforces this, and throws a UserException if on-the-fly renaming is
 requested for a multi-sample bam.

-Not all bam files for a traversal need to be listed in the mapping file.

-On-the-fly renaming is done as the VERY first step after creating the SAMFileReaders
 in SAMDataSource (before the headers are even merged), to prevent possible consistency
 issues.

-Renaming is done ONCE at traversal start for each SAMReaders resource creation in the
 SAMResourcePool; this effectively means once per -nt thread

-Comprehensive unit/integration tests

Known issues: -if you specify the absolute path to a bam in the mapping file, and then
               provide a path to that same bam to -I using SYMLINKS, the renaming won't
               work. The absolute paths will look different to the engine due to the
               symlink being present in one path and not in the other path.

GSA-974 #resolve
2013-07-18 15:48:42 -04:00
David Roazen c15751e41e SAMReaderID: fix bug with hash code and equals() method
-Two SAMReaderIDs that pointed at the same underlying bam file through
 a relative vs. an absolute path were not being treated as equal, and
 had different hash codes. This was causing problems in the engine, since
 SAMReaderIDs are often used as the keys of HashMaps.

-Fix: explicitly use the absolute path to the encapsulated bam file in
 hashCode() and equals()

-Added tests to ensure this doesn't break again
2013-07-15 13:57:00 -04:00
Eric Banks b16c7ce050 A whole slew of improvements to the Haplotype Caller and related code.
1. Some minor refactorings and claenup (e.g. removing unused imports) throughout.

2. Updates to the KB assessment functionality:
   a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
   b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.

3. Make the HC consistent in how it treats the pruning factor.  As part of this I removed and archived
   the DeBruijn assembler.

4. Improvements to the likelihoods for the HC
   a. We now include a "tristate" correction in the PairHMM (just like we do with UG).  Basically, we need
      to divide e by 3 because the observed base could have come from any of the non-observed alleles.
   b. We now correct overlapping read pairs.  Note that the fragments are not merged (which we know is
      dangerous).  Rather, the overlapping bases are just down-weighted so that their quals are not more
      than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
      turned into Q0s for now.
   c. We no longer run contamination removal by default in the UG or HC.  The exome tends to have real
      sites with off kilter allele balances and we occasionally lose them to contamination removal.

5. Improved the dangling tail merging implementation.
2013-07-12 10:09:10 -04:00
Mark DePristo e3e8631ff5 Working version of HaplotypeCaller ReferenceConfidenceModel that accounts for indels as well as SNP confidences
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was succesfully built but didn't have variation, or truly failed in construction.  Fixing an annoying bug where you'd prefectly assembly the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because it null was also used to indicate assembly failure
--
-- Output format looks like:
20      10026072        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026073        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026074        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,121
20      10026075        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,119
20      10026076        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026077        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:3,0:3:9:0,9,120
20      10026078        .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:5,0:5:15:0,15,217
20      10026079        .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,240
20      10026080        .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:6,0:6:18:0,18,268
20      10026081        .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:7,0:7:21:0,21,267

We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values.  Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty.
-- Can we enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning the each read to its most likley haplotype, and then evaluting the resulting pileups over the active region interval.  The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures.  Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as its no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line.  Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
2013-07-02 15:46:38 -04:00
Mark DePristo 41aba491c0 Critical bugfix for adapter clipping in HaplotypeCaller
-- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference.  Terrible.
-- Removed the broken logic of determining whether a read adaptor is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
2013-07-02 15:46:36 -04:00
David Roazen 31827022db Fix pipeline tests that were not respecting the pipeline test dry run setting
There are a few pipeline test classes that do not run Queue, but are
classified as pipeline tests because they submit farm jobs. Make these
unconventional pipeline tests respect the pipeline test dry run setting.
2013-06-28 15:27:17 -04:00
Mark DePristo fdfe4e41d5 Better GATK version and command line output
-- Previous version emitted command lines that look like:

##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."

the new version provides additional information on when the GATK was run and the GATK version in a nicer format:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">

 -- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:

 ##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
 ##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">

 -- Removed the ProtectedEngineFeaturesIntegrationTest
 -- Actual unit tests for these features!
2013-06-20 11:19:13 -04:00
Mark DePristo 0672ac5032 Fix public / protected dependency 2013-06-19 19:42:09 -04:00
David Roazen 51ec5404d4 SAMDataSource: always consolidate cigar strings into canonical form
-Collapses zero-length and repeated cigar elements, neither of which
 can necessarily be handled correctly by downstream code (like LIBS).

-Consolidation is done before read filters, because not all read filters
 behave correctly with non-consoliated cigars.

-Examined other uses of consolidateCigar() throughout the GATK, and
 found them to not be redundant with the new engine-level consolidation
 (they're all on artificially-created cigars in the HaplotypeCaller
 and SmithWaterman classes)

-Improved comments in SAMDataSource.applyDecoratingIterators()

-Updated MD5s; differences were examined and found to be innocuous

-Two tests: -Unit test for ReadFormattingIterator
            -Integration test for correct handling of zero-length
             cigar elements by the GATK engine as a whole
2013-06-19 13:29:01 -04:00
Mark DePristo 7b22467148 Bugfix: defaultBaseQualities actually works now
-- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly.  Added integration tests to ensure this continues to work.
-- [delivers #49538319]
2013-06-17 14:37:27 -04:00
Mark DePristo b69d210255 Bugfix: allow gzip VCF output in multi-threaded GATK output
-- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now tmp. VCFs aren't compressed, even if the final VCF is.  Added integrationtest to ensure this behavior works going forward.
-- [delivers #47399279]
2013-06-17 12:39:18 -04:00
David Roazen d167292688 Reduce number of leftover temp files in GATK runs
-WalkerTest now deletes *.idx files on exit

-ArtificialBAMBuilder now deletes *.bai files on exit

-VariantsToBinaryPed walker now deletes its temp files on exit
2013-06-14 15:56:03 -04:00
Ryan Poplin c4e508a71f Merge pull request #275 from broadinstitute/md_fragment_with_pcr
Improvements to HaplotypeCaller and NA12878 KB
2013-06-14 09:32:26 -07:00
Mark DePristo 74f311c973 Emit the GATK version number in the VCF header
-- Looks like ##GATKVersion=2.5-159-g3f91d93 in the VCF header line
-- delivers [#51595305]
2013-06-13 15:46:16 -04:00
Mark DePristo 6232db3157 Remove STANDARD option from GATKRunReport
-- AWS is now the default.  Removed old code the referred to the STANDARD type.  Deleted unused variables and functions.
2013-06-13 15:18:28 -04:00
Mark DePristo dd5674b3b8 Add genotyping accuracy assessment to AssessNA12878
-- Now table looks like:

Name     VariantType  AssessmentType           Count
variant  SNPS         TRUE_POSITIVE              1220
variant  SNPS         FALSE_POSITIVE                0
variant  SNPS         FALSE_NEGATIVE                1
variant  SNPS         TRUE_NEGATIVE               150
variant  SNPS         CALLED_NOT_IN_DB_AT_ALL       0
variant  SNPS         HET_CONCORDANCE          100.00
variant  SNPS         HOMVAR_CONCORDANCE        99.63
variant  INDELS       TRUE_POSITIVE               273
variant  INDELS       FALSE_POSITIVE                0
variant  INDELS       FALSE_NEGATIVE               15
variant  INDELS       TRUE_NEGATIVE                79
variant  INDELS       CALLED_NOT_IN_DB_AT_ALL       2
variant  INDELS       HET_CONCORDANCE           98.67
variant  INDELS       HOMVAR_CONCORDANCE        89.58

-- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does it's best to simply match the genotype after subsetting to a set of alleles.  So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A.  This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB.  Add lots of unit tests for this functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:

20      10769255        .       A       ATGTG   165.73  .       ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE     GT:AD:DP:GQ:PL  0/1:1,9:10:6:360,0,6

Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function.  Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
2013-06-13 15:05:32 -04:00
Mark DePristo b2dc7095ab Merge pull request #267 from broadinstitute/dr_reducereads_downsampling_fix
Exclude reduced reads from elimination during downsampling
2013-06-11 13:52:28 -07:00
David Roazen 95b5f99feb Exclude reduced reads from elimination during downsampling
Problem:
-Downsamplers were treating reduced reads the same as normal reads,
 with occasionally catastrophic results on variant calling when an
 entire reduced read happened to get eliminated.

Solution:
-Since reduced reads lack the information we need to do position-based
 downsampling on them, best available option for now is to simply
 exempt all reduced reads from elimination during downsampling.

Details:
-Add generic capability of exempting items from elimination to
 the Downsampler interface via new doNotDiscardItem() method.
 Default inherited version of this method exempts all reduced reads
 (or objects encapsulating reduced reads) from elimination.

-Switch from interfaces to abstract classes to facilitate this change,
 and do some minor refactoring of the Downsampler interface (push
 implementation of some methods into the abstract classes, improve
 names of the confusing clear() and reset() methods).

-Rewrite TAROrderedReadCache. This class was incorrectly relying
 on the ReservoirDownsampler to preserve the relative ordering of
 items in some circumstances, which was behavior not guaranteed by
 the API and only happened to work due to implementation details
 which no longer apply. Restructured this class around the assumption
 that the ReservoirDownsampler will not preserve relative ordering
 at all.

-Add disclaimer to description of -dcov argument explaining that
 coverage targets are approximate goals that will not always be
 precisely met.

-Unit tests for all individual downsamplers to verify that reduced
 reads are exempted from elimination
2013-06-11 16:16:26 -04:00
Eric Banks dadcfe296d Reworking of the dangling tails merging code.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort.  Also, we put in a check for low-complexity of graphs
and don't let those pass through.

Added tests for this implementation that checks exact SW results and correct edges added.
2013-06-11 12:53:04 -04:00
Mark DePristo 1c03ebc82d Implement ActiveRegionTraversal RefMetaDataTracker for map call; HaplotypeCaller now annotates ID from dbSNP
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
2013-06-10 16:20:31 -04:00
Valentin Ruano-Rubio 96073c3058 This commit addresses JIRA issue GSA-948: Prevent users from doing the wrong thing with RNA-Seq data and the GATK.
The previous behavior is to process reads with N CIGAR operators as they are despite that many of the tools do not actually support such operator and results become unpredictible.

Now if the there is some read with the N operator, the engine returns a user exception. The error message indicates what is the problem (including the offending read and mapping position) and give a couple of alternatives that the user can take in order to move forward:

a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)

b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)

Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.

Implementation:

* Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class commont parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
2013-06-10 10:44:42 -04:00
Michael McCowan 00c06e9e52 Performance improvements:
- Memoized MathUtil's cumulative binomial probability function.
 - Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
2013-06-09 11:26:52 -04:00
Mark DePristo e19c24f3ee Bugfix for HaplotypeCaller error: Only one of refStart or refStop must be < 0, not both
-- This occurred because we were reverting reads with soft clips that would produce reads with negative (or 0) alignment starts.  From such reads we could end up with adaptor starts that were negative and that would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which would revert and adaptor clip reads).
-- We now hard clip away bases soft clipped reverted bases that fall before the 1-based contig start in revertSoftClippedBases.
-- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString)
-- Added unit tests for reverting soft clipped bases that create a read before the contig
-- [delivers #50892431]
2013-06-04 10:33:46 -04:00
Mark DePristo 6555361742 Fix error in merging code in HC
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper.  The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H.  This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
2013-05-31 16:29:29 -04:00
Mark DePristo 4b206a3540 Check that -compress arguments are within range 0-9
-- Although the original bug report was about SplitSamFile it actually was an engine wide error.  The two places in the that provide compression to the BAM write now check the validity of the compress argument via a static method in ReadUtils
-- delivers #49531009
2013-05-31 15:29:02 -04:00
Mark DePristo b16de45ce4 Command-line read filters are now applied before Walker default filters
-- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 *before* the BQSR filters them out with MappingQualityUnassignedFilter.
-- delivers #50222251
2013-05-30 16:54:18 -04:00
Ryan Poplin 61af37d0d2 Create a new normalDistributionLog10 function that is unit tested for use in the VQSR. 2013-05-30 16:00:08 -04:00
David Roazen a7cb599945 Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
 since low dcov values can result in problematic downsampling artifacts for locus-based
 traversals.

-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
 controls the number of reads per alignment start position, and even a dcov value of 1 might
 be safe/desirable in some circumstances.

-Also reorganize the global downsampling defaults so that they are specified as annotations
 to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
 DownsamplingMethod class.

-The default downsampling settings have not been changed: they are still -dcov 1000
 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
2013-05-29 12:07:12 -04:00
delangel 925232b0fc Merge pull request #236 from broadinstitute/md_simple_hc_performance_improvements
3 simple performance improvements for HaplotypeCaller
2013-05-22 07:58:28 -07:00