Commit Graph

894 Commits (400e7c14048ae3c408d7d8ac704982f5f9d0f429)

Author SHA1 Message Date
David Roazen 932cd3ada7 Fix 3rd-party library dependency issues in the HC/PairHMM tests
In general, test classes cannot use 3rd-party libraries that are not
also dependencies of the GATK proper without causing problems when,
at release time, we test that the GATK jar has been packaged correctly
with all required dependencies.

If a test class needs to use a 3rd-party library that is not a GATK
dependency, write wrapper methods in the GATK utils/* classes, and
invoke those wrapper methods from the test class.
2013-12-06 13:16:55 -05:00
Eric Banks e022db4690 Added docs for the minPruning argument in the HC 2013-12-05 11:50:56 -05:00
Geraldine Van der Auwera 3ab2f4edb2 Fixed documentation for -deletions argument in the UAC 2013-12-04 19:55:24 -05:00
amilev 0d94019bd6 Merge pull request #434 from broadinstitute/mc_dt_gccontent
Add GC Content to DiagnoseTargets
2013-12-04 09:42:26 -08:00
Joel Thibault 5fe0531b4d Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy 2013-12-03 23:12:14 -05:00
Mauricio Carneiro 701ede2817 Add GC Content to DiagnoseTargets 2013-12-03 23:04:40 -05:00
Eric Banks 6bee6a1b53 Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles.
Previously, we would strip out the PLs and AD values since they were no longer accurate.  However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.

Now the PLs and AD get correctly selected down.

While I was in there I also refactored some related code in subsetDiploidAlleles().  There were no real changes there - I just
broke it out into smaller chunks as per our best practices.

Added unit tests and updated integration tests.
Addressed reviews.
2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio 0f99778a59 Adding Graph-based likelihood ratio calculation to HC
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelarate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the larger kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to spport
   local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations

 * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
     of the graph-based likelihood approach.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
     to calcualte the likelihoods using the graph as an scafold.

 * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
     used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed probrems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed probrem with HaplotypeGraph unit test

Adding Graph-based likelihood ratio calculation to HC

  To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelarate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the larger kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to spport
   local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations

 * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
     of the graph-based likelihood approach.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
     to calcualte the likelihoods using the graph as an scafold.

 * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
     used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed probrems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed probrem with HaplotypeGraph unit test
2013-12-02 19:37:19 -05:00
Eric Banks 84ddfb41b5 Merge pull request #438 from broadinstitute/rp_vqsr_num_bad_stability_fixes_and_runtime_optimizations
Various VQSR optimizations in runtime and stability.
2013-12-02 08:37:37 -08:00
Ryan Poplin 6a922e7aca Merge pull request #435 from broadinstitute/eb_fix_ug_bug_for_long_deletions
Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads.
2013-12-02 08:09:23 -08:00
Ryan Poplin b57054c63c Various VQSR optimizations in both runtime and accuracy.
-- For very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build the Gaussian mixture model.
-- Annotations are ordered by the difference in means between known and novel instead of by their standard deviation.
-- Removed the training set quality score threshold.
-- Now uses 2 gaussians by default for the negative model.
-- Num bad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.
-- Model plots are now generated much faster.
-- Stricter threshold for determining model convergence.
-- All VQSR integration tests change because of these changes to the model.
-- Add test for downsampling of training data.
2013-11-29 13:04:46 -05:00
Eric Banks df6499e58c Bug fix for RR: stop (incorrectly) pulling the MQ out of the SAMRecord as a byte instead of an int.
For reads with high MQs (greater than max byte) the MQ was being treated as negative and failing
the min MQ filter.

Added unit test.

Delivers PT#61567540.
2013-11-27 18:55:03 -05:00
Eric Banks 51d1a26725 Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads.
His code was excessively clipping reads because it was looking at their cigar string instead of just
the read length.  This meant that it was basically impossible to call large deletions in UG even with
perfect evidence in the reads (as reported by Craig D).

Integration tests change because (IMO after looking at sites in IGV) reads with indels similar to the one
being genotyped used to be given too much likelihood and now give less.

Added unit tests for new methods.
2013-11-27 13:54:39 -05:00
Chris Hartl 1f777c4898 Introducing the latest-and-greatest in genotyping: CalculatePosteriors.
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.

Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
 while not converged:
  * AC = [external AC] + [sample AC]
  * Prior = DirichletMultinomial[AC]
  * Posteriors = [sample GL + Prior]
  * sample AC = MLEAC(Posteriors)

This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase.

Fully unit tested.

Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
2013-11-27 13:00:45 -05:00
Eric Banks 0fac4fb3b6 Make the reference model calculation work with reduced reads.
It's just a matter of using PileupElement.getRepresentativeCount() instead of '++'.
2013-11-21 10:53:33 -05:00
Eric Banks adb77b406f Fixed poor implementation of isRefSource() and isRefSink() among others.
There was already a note in the code about how wrong the implementation was.
The bad code was causing a single-node graph to get cleaned up into nothing when pruning tails.
Delivers PT #61069820.
2013-11-21 10:53:27 -05:00
Ami Levy-Moonshine e6ef37de1d Add an option to filter the read bases that are taking into account for the coveraged intervals. For that, new two arguments were added: minBaseQuality and minMappingQuality 2013-11-18 17:29:32 -05:00
MauricioCarneiro 7f08250870 Merge pull request #417 from broadinstitute/bt_pairhmm_api_cleanup2
Improve the PairHMM API for better FPGA integration
2013-11-14 10:47:07 -08:00
bradtaylor e40a07bb58 Improve the PairHMM API for better FPGA integration
Motivation:
The API was different between the regular PairHMM and the FPGA-implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to use account for this. The goal is to change the API to be the same
for all implementations, and make it easier to access.

PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotpes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.

CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.

LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods
Commented out the (unused) system for finding best N haplotypes for genotyping.

PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
2013-11-14 09:45:33 -05:00
Geraldine Van der Auwera dac3dbc997 Improved gatkdocs for InbreedingCoefficient, ReduceReads, ErrorRatePerCycle
Clarified caveat for InbreedingCoefficient
Cleaned up docstrings for ReduceReads
Brushed up doc for ErrorRatePerCycle
2013-11-13 14:33:04 -05:00
Eric Banks 0e3d83d1ef Merge pull request #413 from broadinstitute/rp_qd_and_qual_updates_in_ref_model_pipeline
Improvements to the reference model pipeline.
2013-11-05 06:33:17 -08:00
Eric Banks 09dfaf1a68 Merge pull request #416 from broadinstitute/mc_quick_fixes_to_cser_pipeline
Add interpretation to QualifyMissingIntervals
2013-11-05 06:08:13 -08:00
Ryan Poplin b22c9c2cb4 Improvements to the reference model pipeline.
-- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
2013-11-01 17:58:25 -04:00
Mauricio Carneiro 5ed47988b8 Changed the parameter names from cds to baits
Making the usage more clear since the parameter is being used over and over to define baited
regions. Updated the headers accordingly and made it more readable.
2013-10-24 17:15:56 -04:00
Chris Hartl 9d932e8c60 Merged bug fix from Stable into Unstable 2013-10-10 14:31:33 -04:00
Chris Hartl 6f46d1187a Remember to copy the integration test changes *as well as* the walker changes 2013-10-10 14:30:37 -04:00
Mauricio Carneiro 5d6421494b Fix mismatching number of columns in report
Quick fix the missing column header in the QualifyMissingIntervals
report.

Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.
2013-10-09 14:38:15 -04:00
Mauricio Carneiro 63ace685c9 add unit tests 2013-10-04 11:44:07 -04:00
Mauricio Carneiro 839b918f58 Length metric updates to QualifyMissingIntervals
* add a length of the overlaping interval metric as per CSER request
   * standardized the distance metrics to be positive when fully  overlapping and the longest off-target tail (as a negative number)  when not overlapping
   * add gatkdocs to the tool (finally!)
2013-10-04 10:18:13 -04:00
Geraldine Van der Auwera 9f7fa247f6 Disable VQSR tranche plots in INDEL mode 2013-09-30 17:14:37 -04:00
Ryan Poplin ef1d58b7ff Bugfix for hom ref records that aren't GVCF blocks. 2013-09-29 19:19:26 -04:00
Geraldine Van der Auwera 27808d336a Minor clarifications regarding ignoreFilter argument 2013-09-26 13:13:53 -04:00
Geraldine Van der Auwera a9fa5206ee Merge pull request #399 from broadinstitute/eb_update_docs_for_DepthPerSampleHC
Updated docs for DepthPerSampleHC to deliver PT #54237024.
2013-09-25 15:20:19 -07:00
Ryan Poplin f362597f69 Merge pull request #400 from broadinstitute/mm_bugfix_combine_variants_implicit_casting
Bug fix: annotation values ar parsed as Doubles when they should be parsed as Integers due to implicit conversion.
2013-09-25 11:47:17 -07:00
Michael McCowan 5113e21437 Bug fix: annotation values ar parsed as Doubles when they should be parsed as Integers due to implicit conversion.
* Updated expected test data in which an integer annotation (MQ0) was formatted as a double.
2013-09-25 13:12:02 -04:00
Eric Banks 2783c84c6b Updated docs for DepthPerSampleHC to deliver PT #54237024. 2013-09-24 22:32:19 -04:00
Eric Banks d6992d1263 Updated docs to tell users not to use PCR indel error modeling for PCR free data. 2013-09-23 15:48:47 -04:00
Mauricio Carneiro 5bbad75402 Changing max coverage threshold
Because Integer.maxValue is not unit testable
2013-09-20 18:54:40 -04:00
Geraldine Van der Auwera 175388de1d Merge pull request #396 from broadinstitute/mc_dt_excessive_coverage_defaults
Updating excessive coverage default parameter & docs+test for QualifyMissingIntervals
2013-09-20 15:12:16 -07:00
Mauricio Carneiro 5e2ffc74fc Automated interpretation for QualifyMissingIntervals
* add a new column to do what I have been doing manually for every project, understand why we got no usable coverage in that interval
   * add unit tests -- this tool is now public, we need tests.
   * slightly better docs -- in an effort to produce better docs for this tool
2013-09-20 16:47:12 -04:00
Mauricio Carneiro 74639463b9 Updating excessive coverage default parameter
most people don't care about excessive coverage (unless you're very
particular about your analysis). Therefore the best possible default
value for this is Integer.maxValue so it doesn't get in the way.

Itemized Changes:
   * change maximumCoverage threshold to Integer.maxValue

[delivers #57353620]
2013-09-19 23:07:20 -04:00
MauricioCarneiro 014bc4269e Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation
Add Array Logless PairHMM
2013-09-08 20:16:53 -07:00
Ryan Poplin 3503050a39 Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block

-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
2013-09-06 16:56:34 -04:00
Ryan Poplin add17dc463 Merge pull request #388 from broadinstitute/eb_change_record_size_mismatch_to_user_error
Changed the error for the record size mismatch in the genotyping engine ...
2013-08-30 10:29:54 -07:00
Eric Banks ea0deb1bb2 Changed the error for the record size mismatch in the genotyping engine to be a user error since it is possible
to reach this state with input VCFs that contain the same event multiple times (and it's not something we want to
handle in the code).
2013-08-30 12:18:19 -04:00
Louis Bergelson 4473b0065e adding a check for the UNAVAILABLE case of GenotypeType in CountVariants 2013-08-29 17:27:00 -04:00
bradtaylor 0435bbe38f Retreived PairHMM benchmarks from archive and made improvements
PairHMMSyntheticBenchmark and PairHMMEmpirical benchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely.

Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside with the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords".

I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data.

It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to:
PairHMMBandedEmpiricalBenchmark
PairHMMTestData
BandedLoglessPairHMMUnitTest

I decided against this as the banded benchmark and unit test are archived.
2013-08-28 17:23:35 -04:00
bradtaylor 86fe9fae76 Changes to Array PairHMM to address review comments
Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking

Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.
2013-08-28 17:23:29 -04:00
bradtaylor 3671e41b0c Add Array Logless PairHMM
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.

Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.

This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.

We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)

Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
2013-08-28 17:21:23 -04:00
Ryan Poplin 7479152977 Merged bug fix from Stable into Unstable 2013-08-28 12:40:25 -04:00