Commit Graph

858 Commits (d6992d12632c0cfc6b3bdf5b88872fb2d4625e41)

Author SHA1 Message Date
Eric Banks d6992d1263 Updated docs to tell users not to use PCR indel error modeling for PCR free data. 2013-09-23 15:48:47 -04:00
Mauricio Carneiro 5bbad75402 Changing max coverage threshold
Because Integer.maxValue is not unit testable
2013-09-20 18:54:40 -04:00
Geraldine Van der Auwera 175388de1d Merge pull request #396 from broadinstitute/mc_dt_excessive_coverage_defaults
Updating excessive coverage default parameter & docs+test for QualifyMissingIntervals
2013-09-20 15:12:16 -07:00
Mauricio Carneiro 5e2ffc74fc Automated interpretation for QualifyMissingIntervals
* add a new column to do what I have been doing manually for every project, understand why we got no usable coverage in that interval
   * add unit tests -- this tool is now public, we need tests.
   * slightly better docs -- in an effort to produce better docs for this tool
2013-09-20 16:47:12 -04:00
Mauricio Carneiro 74639463b9 Updating excessive coverage default parameter
most people don't care about excessive coverage (unless you're very
particular about your analysis). Therefore the best possible default
value for this is Integer.maxValue so it doesn't get in the way.

Itemized Changes:
   * change maximumCoverage threshold to Integer.maxValue

[delivers #57353620]
2013-09-19 23:07:20 -04:00
MauricioCarneiro 014bc4269e Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation
Add Array Logless PairHMM
2013-09-08 20:16:53 -07:00
Ryan Poplin 3503050a39 Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block

-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
2013-09-06 16:56:34 -04:00
Ryan Poplin add17dc463 Merge pull request #388 from broadinstitute/eb_change_record_size_mismatch_to_user_error
Changed the error for the record size mismatch in the genotyping engine ...
2013-08-30 10:29:54 -07:00
Eric Banks ea0deb1bb2 Changed the error for the record size mismatch in the genotyping engine to be a user error since it is possible
to reach this state with input VCFs that contain the same event multiple times (and it's not something we want to
handle in the code).
2013-08-30 12:18:19 -04:00
Louis Bergelson 4473b0065e adding a check for the UNAVAILABLE case of GenotypeType in CountVariants 2013-08-29 17:27:00 -04:00
bradtaylor 0435bbe38f Retreived PairHMM benchmarks from archive and made improvements
PairHMMSyntheticBenchmark and PairHMMEmpirical benchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely.

Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside with the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords".

I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data.

It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to:
PairHMMBandedEmpiricalBenchmark
PairHMMTestData
BandedLoglessPairHMMUnitTest

I decided against this as the banded benchmark and unit test are archived.
2013-08-28 17:23:35 -04:00
bradtaylor 86fe9fae76 Changes to Array PairHMM to address review comments
Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking

Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.
2013-08-28 17:23:29 -04:00
bradtaylor 3671e41b0c Add Array Logless PairHMM
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.

Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.

This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.

We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)

Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
2013-08-28 17:21:23 -04:00
Ryan Poplin 7479152977 Merged bug fix from Stable into Unstable 2013-08-28 12:40:25 -04:00
Ryan Poplin 6bda569666 One of the log10sumLog10s in the VQSR was missed in a previous bug fix. Thanks to Mike McCowan for spotting this one. 2013-08-28 12:40:08 -04:00
Geraldine Van der Auwera ed465cd2a5 Fixed a few typos and clarified some doc points. 2013-08-26 17:33:17 -04:00
David Roazen 42d771f748 Remove org.apache.commons.collections.IteratorUtils dependency from the test suite
-This was a dependency of the test suite, but not the GATK proper,
 which caused problems when running the test suite on the packaged
 GATK jar at release time

-Use GATKVCFUtils.readVCF() instead
2013-08-21 19:44:02 -04:00
Eric Banks d4dc5ba04a Fixed bug in PhaseByTransmission where it was completely dropping multi-allelic records.
Added test to make sure this is no longer happening.
2013-08-21 15:46:57 -04:00
Eric Banks 6663d48ffe Merge pull request #381 from broadinstitute/mm_rev_picard_to_get_tribble_updates
Adaptations to accomodate Tribble API changes.
2013-08-19 18:31:02 -07:00
Michael McCowan c3a933ce84 Adaptations to accomodate Tribble API changes, comprising mostly of the following.
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
2013-08-19 15:52:47 -04:00
Eric Banks 17eb7b49fe Adding ability to use Ryan's PCR error modeling to the Haplotype Caller.
There is now a command-line option to set the model to use in the HC.
Incorporated Ryan's current (unmerged) branch in for most of these changes.

Because small changes to the math can have drastic effects, I decided not to let users tweak
the calculations themselves.  Instead they can select either NONE, CONSERVATIVE (the default),
or AGGRESSIVE.

Note that any base insertion/deletion qualities from BQSR are still used.

Also, note that the repeat unit x repeat length approach gave very poor results against the KB,
so it is not included as an option here.
2013-08-16 01:53:04 -04:00
Eric Banks 69e78efeae Merge pull request #366 from broadinstitute/gg_gatkdocfixes
More gatkdoc fixes
2013-08-13 04:52:03 -07:00
Eric Banks bcf9a1cda5 Merge pull request #370 from broadinstitute/rp_dont_output_filtered_variants_in_VQSR
Adding mode to VQSR to not output variant records that are filtered out ...
2013-08-12 12:01:50 -07:00
Ryan Poplin a45011d7e7 Adding mode to VQSR to not output variant records that are filtered out after applying the recalibration. Necessary for 1000G calling. 2013-08-12 11:22:59 -04:00
Ryan Poplin 59f56bef30 Cleaning up help text for the -numBad argument. 2013-08-12 09:51:56 -04:00
Geraldine Van der Auwera 4d20c71e09 Improvements to various gatkdocs
- Make -rod required
    - Document that contaminationFile is currently not functional with HC
    - Document liftover process more clearly
    - Document VariantEval combinations of ST and VE that are incompatible
    - Added a caveat about using MVLR from HC and UG.
    - Added caveat about not using -mte with -nt
    - Clarified masking options
    - Fixed docs based on Erics comments
2013-08-10 10:01:31 -07:00
Mark DePristo b7d1096ced Added onlyEmitSamples argument to UnifiedGenotyper
-- When provided, this argument causes us to only emit the selected samples into the VCF.  No INFO field annotations (AC for example) or other features are modified.  It's current primary use is for efficiently evaluating joint calling.
-- Add integration test for onlyEmitSamples
2013-08-09 11:00:15 -04:00
Mark DePristo ccf0df0fea Misc. debugging functionality to FS calculation (disabled by default) 2013-08-08 12:06:23 -04:00
Mark DePristo 00f4d767e4 Merge pull request #364 from broadinstitute/md_vqsr_improvements
Separate num Gaussians for + and - GMM in VQSR
2013-08-07 04:37:45 -07:00
Mark DePristo c21402d4af Separate num Gaussians for + and - GMM in VQSR
-- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models.  However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model.
-- Now the maxGaussians argument only applies to the positive model
-- This update builds a GMM for the negative model with a default 4 max gaussians (though this can be controlled via command line parameter)
-- Removes the percentBadVariants argument.  The only way to control how many variants are included in the negative model is with minNumBad
-- Reduced the minNumBad argument default to 1000 from 2500
-- Update MD5s for VQSR.  md5s changed significantly due to underlying changes in the default GMM model.  Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected.
-- minNumBad is now numBad
-- Plot all negative training points as well, since this significantly changes our view of the GMM PDF
2013-08-07 07:36:50 -04:00
Mark DePristo 318f7e74e4 Better docs on the meaning of heterozygosity
-- [delivers #53522209]
2013-08-07 07:27:45 -04:00
Mark DePristo 40bc7d6a9c Bugfix for ReferenceConfidenceModel
-- In the case where there's some variation to assembly and evaluate but the resulting haplotypes don't result in any called variants, the reference model would exception out with "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype".  Now we detect this case and emit the standard no variation output.
-- [delivers #54625060]
2013-08-06 16:00:32 -04:00
Ryan Poplin a46f633bd6 Fix for the VQSR visualization script with the new ordering of annotations. 2013-08-02 19:10:45 -04:00
Mauricio Carneiro 285ab2ac62 Better caching for the HaplotypeCaller
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.

Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting *tons* of re-compute.

Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
2013-08-02 01:27:29 -04:00
Eric Banks 1e396af4d0 Two reduce reads updates/fixes:
1. Removing old legacy code that was capping the positional depth for reduced reads to 127.

Unfortunately this cap affectively performs biased down-sampling and throws off e.g. FS numbers.
Added end to end unit test that depth counts in RR can be higher than max byte.

Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127.

2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads.

Refactored it so that it can only remove non-consensus reads.
2013-08-01 14:34:59 -04:00
Ryan Poplin 4f3411f3d4 Max number of haplotypes to evaluate no longer grows unbounded with the number of samples. This is necessary for multi-sample calling projects with over 100 samples. 2013-07-31 10:48:55 -04:00
Yossi Farjoun 284176cd7b moved SnpEffUtilUnitTest to public tree 2013-07-30 17:51:40 -04:00
droazen b8709b1942 Merge pull request #332 from broadinstitute/st_fpga_hmm
FPGA support for PairHMM
2013-07-30 14:21:21 -07:00
Joseph Rose d2860a5486 Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose) 2013-07-30 15:38:32 -04:00
Mauricio Carneiro 7b731dd596 Removed native method call
and fixed indentation.
2013-07-30 13:59:58 -04:00
Geraldine Van der Auwera edbd17b8e0 Added note of caution to VQSR gatkdocs for option BOTH of recalibration mode 2013-07-26 15:51:29 -04:00
Ryan Poplin f52196496d Merge pull request #347 from broadinstitute/eb_more_dnagling_tail_improvements
More specific fix for the dangling tail edge case with a single leading deletion.
2013-07-26 07:25:47 -07:00
Ryan Poplin 8c205dda1b Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line. 2013-07-26 10:22:43 -04:00
Eric Banks 9372c5ef41 Merge pull request #334 from broadinstitute/mc_generic_input_for_qualify_missing_intervals
QualifyMissingIntervals: support different formats
2013-07-25 12:39:26 -07:00
sathibault 71eb944e62 Adding CnyPairHMMUnitTest 2013-07-25 14:19:50 -05:00
Eric Banks 5dfa863caa Fully stranded implementation of RR (plus bug fix for insertions and het compression).
Now only filtered reads are unstranded.  All consensus reads have strand, so that we
emit 2 consensus reads in general now: one for each strand.

This involved some refactoring of the sliding window which cleaned it up a lot.

Also included is a bug fix:
insertions downstream of a variant region weren't triggering a stop to the compression.
2013-07-25 14:48:53 -04:00
Eric Banks 0a2b5ddadf More specific fix for the dangling tail edge case with a single leading deletion.
The previous fix was too general (and therefore incorrect) and caused the HC to exception out.
Added "unit" test for this exact case.
2013-07-25 12:24:46 -04:00
Mauricio Carneiro 31ab0824b1 quick indentation fixes to FPGA code 2013-07-24 14:09:49 -04:00
Eric Banks 6df43f730a Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes.
Having them as bytes caused problems for downstream programmers who had data with high MQs.
2013-07-23 23:47:15 -04:00
Guillermo del Angel 9dd109b79a Last feature request from Reich/Paavo labs: the allSitePLs feature in UG worked but not quite filled requirements. What's needed is the ability to have all 10 PLs for EVERY site, regardless of whether they are variant or not. Previous version only emitted the 10 PLs in reference sites. Problem is that, if all PLs are emitted in all sites and every single site is quad-allelic (only way to have the PLs printed out in a valid way) then the ability to filter variants and to use the INFO fields may be compromised.
So, compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which has the 10 values, but all other statistics and regular PLs are computed as before.
Note that integration test had to be disabled, as the BCF2 codec apparently doesn't support writing into genotype fields other than PL,DP,AD,GQ,FT and GT.
2013-07-18 12:54:52 -04:00