Commit Graph

869 Commits (9d932e8c6091f38e124d3c9e78d13287e5960a6d)

Author SHA1 Message Date
Chris Hartl 9d932e8c60 Merged bug fix from Stable into Unstable 2013-10-10 14:31:33 -04:00
Chris Hartl 6f46d1187a Remember to copy the integration test changes *as well as* the walker changes 2013-10-10 14:30:37 -04:00
Mauricio Carneiro 63ace685c9 add unit tests 2013-10-04 11:44:07 -04:00
Mauricio Carneiro 839b918f58 Length metric updates to QualifyMissingIntervals
* add a length of the overlaping interval metric as per CSER request
   * standardized the distance metrics to be positive when fully  overlapping and the longest off-target tail (as a negative number)  when not overlapping
   * add gatkdocs to the tool (finally!)
2013-10-04 10:18:13 -04:00
Geraldine Van der Auwera 9f7fa247f6 Disable VQSR tranche plots in INDEL mode 2013-09-30 17:14:37 -04:00
Ryan Poplin ef1d58b7ff Bugfix for hom ref records that aren't GVCF blocks. 2013-09-29 19:19:26 -04:00
Geraldine Van der Auwera 27808d336a Minor clarifications regarding ignoreFilter argument 2013-09-26 13:13:53 -04:00
Geraldine Van der Auwera a9fa5206ee Merge pull request #399 from broadinstitute/eb_update_docs_for_DepthPerSampleHC
Updated docs for DepthPerSampleHC to deliver PT #54237024.
2013-09-25 15:20:19 -07:00
Ryan Poplin f362597f69 Merge pull request #400 from broadinstitute/mm_bugfix_combine_variants_implicit_casting
Bug fix: annotation values ar parsed as Doubles when they should be parsed as Integers due to implicit conversion.
2013-09-25 11:47:17 -07:00
Michael McCowan 5113e21437 Bug fix: annotation values ar parsed as Doubles when they should be parsed as Integers due to implicit conversion.
* Updated expected test data in which an integer annotation (MQ0) was formatted as a double.
2013-09-25 13:12:02 -04:00
Eric Banks 2783c84c6b Updated docs for DepthPerSampleHC to deliver PT #54237024. 2013-09-24 22:32:19 -04:00
Eric Banks d6992d1263 Updated docs to tell users not to use PCR indel error modeling for PCR free data. 2013-09-23 15:48:47 -04:00
Mauricio Carneiro 5bbad75402 Changing max coverage threshold
Because Integer.maxValue is not unit testable
2013-09-20 18:54:40 -04:00
Geraldine Van der Auwera 175388de1d Merge pull request #396 from broadinstitute/mc_dt_excessive_coverage_defaults
Updating excessive coverage default parameter & docs+test for QualifyMissingIntervals
2013-09-20 15:12:16 -07:00
Mauricio Carneiro 5e2ffc74fc Automated interpretation for QualifyMissingIntervals
* add a new column to do what I have been doing manually for every project, understand why we got no usable coverage in that interval
   * add unit tests -- this tool is now public, we need tests.
   * slightly better docs -- in an effort to produce better docs for this tool
2013-09-20 16:47:12 -04:00
Mauricio Carneiro 74639463b9 Updating excessive coverage default parameter
most people don't care about excessive coverage (unless you're very
particular about your analysis). Therefore the best possible default
value for this is Integer.maxValue so it doesn't get in the way.

Itemized Changes:
   * change maximumCoverage threshold to Integer.maxValue

[delivers #57353620]
2013-09-19 23:07:20 -04:00
MauricioCarneiro 014bc4269e Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation
Add Array Logless PairHMM
2013-09-08 20:16:53 -07:00
Ryan Poplin 3503050a39 Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block

-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
2013-09-06 16:56:34 -04:00
Ryan Poplin add17dc463 Merge pull request #388 from broadinstitute/eb_change_record_size_mismatch_to_user_error
Changed the error for the record size mismatch in the genotyping engine ...
2013-08-30 10:29:54 -07:00
Eric Banks ea0deb1bb2 Changed the error for the record size mismatch in the genotyping engine to be a user error since it is possible
to reach this state with input VCFs that contain the same event multiple times (and it's not something we want to
handle in the code).
2013-08-30 12:18:19 -04:00
Louis Bergelson 4473b0065e adding a check for the UNAVAILABLE case of GenotypeType in CountVariants 2013-08-29 17:27:00 -04:00
bradtaylor 0435bbe38f Retreived PairHMM benchmarks from archive and made improvements
PairHMMSyntheticBenchmark and PairHMMEmpirical benchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely.

Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside with the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords".

I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data.

It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to:
PairHMMBandedEmpiricalBenchmark
PairHMMTestData
BandedLoglessPairHMMUnitTest

I decided against this as the banded benchmark and unit test are archived.
2013-08-28 17:23:35 -04:00
bradtaylor 86fe9fae76 Changes to Array PairHMM to address review comments
Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking

Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.
2013-08-28 17:23:29 -04:00
bradtaylor 3671e41b0c Add Array Logless PairHMM
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.

Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.

This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.

We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)

Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
2013-08-28 17:21:23 -04:00
Ryan Poplin 7479152977 Merged bug fix from Stable into Unstable 2013-08-28 12:40:25 -04:00
Ryan Poplin 6bda569666 One of the log10sumLog10s in the VQSR was missed in a previous bug fix. Thanks to Mike McCowan for spotting this one. 2013-08-28 12:40:08 -04:00
Geraldine Van der Auwera ed465cd2a5 Fixed a few typos and clarified some doc points. 2013-08-26 17:33:17 -04:00
David Roazen 42d771f748 Remove org.apache.commons.collections.IteratorUtils dependency from the test suite
-This was a dependency of the test suite, but not the GATK proper,
 which caused problems when running the test suite on the packaged
 GATK jar at release time

-Use GATKVCFUtils.readVCF() instead
2013-08-21 19:44:02 -04:00
Eric Banks d4dc5ba04a Fixed bug in PhaseByTransmission where it was completely dropping multi-allelic records.
Added test to make sure this is no longer happening.
2013-08-21 15:46:57 -04:00
Eric Banks 6663d48ffe Merge pull request #381 from broadinstitute/mm_rev_picard_to_get_tribble_updates
Adaptations to accomodate Tribble API changes.
2013-08-19 18:31:02 -07:00
Michael McCowan c3a933ce84 Adaptations to accomodate Tribble API changes, comprising mostly of the following.
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
2013-08-19 15:52:47 -04:00
Eric Banks 17eb7b49fe Adding ability to use Ryan's PCR error modeling to the Haplotype Caller.
There is now a command-line option to set the model to use in the HC.
Incorporated Ryan's current (unmerged) branch in for most of these changes.

Because small changes to the math can have drastic effects, I decided not to let users tweak
the calculations themselves.  Instead they can select either NONE, CONSERVATIVE (the default),
or AGGRESSIVE.

Note that any base insertion/deletion qualities from BQSR are still used.

Also, note that the repeat unit x repeat length approach gave very poor results against the KB,
so it is not included as an option here.
2013-08-16 01:53:04 -04:00
Eric Banks 69e78efeae Merge pull request #366 from broadinstitute/gg_gatkdocfixes
More gatkdoc fixes
2013-08-13 04:52:03 -07:00
Eric Banks bcf9a1cda5 Merge pull request #370 from broadinstitute/rp_dont_output_filtered_variants_in_VQSR
Adding mode to VQSR to not output variant records that are filtered out ...
2013-08-12 12:01:50 -07:00
Ryan Poplin a45011d7e7 Adding mode to VQSR to not output variant records that are filtered out after applying the recalibration. Necessary for 1000G calling. 2013-08-12 11:22:59 -04:00
Ryan Poplin 59f56bef30 Cleaning up help text for the -numBad argument. 2013-08-12 09:51:56 -04:00
Geraldine Van der Auwera 4d20c71e09 Improvements to various gatkdocs
- Make -rod required
    - Document that contaminationFile is currently not functional with HC
    - Document liftover process more clearly
    - Document VariantEval combinations of ST and VE that are incompatible
    - Added a caveat about using MVLR from HC and UG.
    - Added caveat about not using -mte with -nt
    - Clarified masking options
    - Fixed docs based on Erics comments
2013-08-10 10:01:31 -07:00
Mark DePristo b7d1096ced Added onlyEmitSamples argument to UnifiedGenotyper
-- When provided, this argument causes us to only emit the selected samples into the VCF.  No INFO field annotations (AC for example) or other features are modified.  It's current primary use is for efficiently evaluating joint calling.
-- Add integration test for onlyEmitSamples
2013-08-09 11:00:15 -04:00
Mark DePristo ccf0df0fea Misc. debugging functionality to FS calculation (disabled by default) 2013-08-08 12:06:23 -04:00
Mark DePristo 00f4d767e4 Merge pull request #364 from broadinstitute/md_vqsr_improvements
Separate num Gaussians for + and - GMM in VQSR
2013-08-07 04:37:45 -07:00
Mark DePristo c21402d4af Separate num Gaussians for + and - GMM in VQSR
-- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models.  However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model.
-- Now the maxGaussians argument only applies to the positive model
-- This update builds a GMM for the negative model with a default 4 max gaussians (though this can be controlled via command line parameter)
-- Removes the percentBadVariants argument.  The only way to control how many variants are included in the negative model is with minNumBad
-- Reduced the minNumBad argument default to 1000 from 2500
-- Update MD5s for VQSR.  md5s changed significantly due to underlying changes in the default GMM model.  Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected.
-- minNumBad is now numBad
-- Plot all negative training points as well, since this significantly changes our view of the GMM PDF
2013-08-07 07:36:50 -04:00
Mark DePristo 318f7e74e4 Better docs on the meaning of heterozygosity
-- [delivers #53522209]
2013-08-07 07:27:45 -04:00
Mark DePristo 40bc7d6a9c Bugfix for ReferenceConfidenceModel
-- In the case where there's some variation to assembly and evaluate but the resulting haplotypes don't result in any called variants, the reference model would exception out with "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype".  Now we detect this case and emit the standard no variation output.
-- [delivers #54625060]
2013-08-06 16:00:32 -04:00
Ryan Poplin a46f633bd6 Fix for the VQSR visualization script with the new ordering of annotations. 2013-08-02 19:10:45 -04:00
Mauricio Carneiro 285ab2ac62 Better caching for the HaplotypeCaller
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.

Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting *tons* of re-compute.

Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
2013-08-02 01:27:29 -04:00
Eric Banks 1e396af4d0 Two reduce reads updates/fixes:
1. Removing old legacy code that was capping the positional depth for reduced reads to 127.

Unfortunately this cap affectively performs biased down-sampling and throws off e.g. FS numbers.
Added end to end unit test that depth counts in RR can be higher than max byte.

Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127.

2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads.

Refactored it so that it can only remove non-consensus reads.
2013-08-01 14:34:59 -04:00
Ryan Poplin 4f3411f3d4 Max number of haplotypes to evaluate no longer grows unbounded with the number of samples. This is necessary for multi-sample calling projects with over 100 samples. 2013-07-31 10:48:55 -04:00
Yossi Farjoun 284176cd7b moved SnpEffUtilUnitTest to public tree 2013-07-30 17:51:40 -04:00
droazen b8709b1942 Merge pull request #332 from broadinstitute/st_fpga_hmm
FPGA support for PairHMM
2013-07-30 14:21:21 -07:00
Joseph Rose d2860a5486 Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose) 2013-07-30 15:38:32 -04:00