Commit Graph

143 Commits (1cc0b48caab07426a3d54b34db3043ca96a28a4e)

Author SHA1 Message Date
Ryan Poplin 59cef880d1 Updating HC integration tests because experimental, HC-specific annotations have been removed. 2012-11-26 12:20:07 -05:00
Ryan Poplin c3b7dd1374 Misc cleanup in the HaplotypeCaller. Cleaning up unused arguments after recent changes to HC-GenotypingEngine 2012-11-26 12:19:11 -05:00
Ryan Poplin fedc4fde6c Merged bug fix from Stable into Unstable 2012-11-25 21:55:55 -05:00
Ryan Poplin d978cfe835 Soft clipped bases shouldn't be counted in the delocalized BQSR. 2012-11-25 21:55:29 -05:00
Eric Banks 937ac7290f Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results. 2012-11-20 16:13:29 -05:00
Eric Banks ff180a8e02 Significant refactoring of the Haplotype Caller to handle problems with GGA. The main fix is that we now maintain a mapping from 'original' allele to 'Smith-Waterman-based' allele so that we no longer need to do a (buggy) matching throughout the calling process. 2012-11-19 09:09:57 -05:00
Mauricio Carneiro e35fd1c717 Merging CMI-0.5.0 and GATK-2.2 together. 2012-11-14 10:42:03 -05:00
Mauricio Carneiro a17cd54b68 Co-Reduction implementation in ReduceReads
ReduceReads now co-reduces bams if they're passed in toghether with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples.
Also:
  * Added integrationtest for co-reduction
  * Fixed bug with new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5's accordingly)

DEV-200 #resolve #time 8m
2012-11-14 10:33:21 -05:00
Eric Banks e93d461910 Adding integration test to BQSR for the csv file 2012-11-09 09:11:04 -05:00
Eric Banks 2da76db945 Updating integration tests 2012-11-06 22:23:05 -08:00
Eric Banks 0a2dded093 Fixes for bugs uncovered by unit tests 2012-11-06 16:07:40 -08:00
Eric Banks b07106b3a7 Reimplement the allele biased downsampling to be smarter. Now we don't blindly pull n% of reads off of each allele. Instead, we try all possible genotype conformations for the contaminating sample and choose the one that provides the best genotype for the target sample (based heuristically on allele balance). This method allows us to save some of the reads that belong to the target sample, which should make Daniel M happy. Added unit tests to test the biased downsampling functionality. 2012-11-06 14:39:58 -08:00
Mark DePristo 1444cd753b Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome
-- The logic for determining active regions was a bit broken in the HC when intervals were used in the system
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered.  Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determing active / inactive state.  This means that each exon will stand on its own when deciding if it should be active or inactive
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
2012-11-01 15:34:04 -04:00
Eric Banks f8af8a2355 Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite. 2012-10-31 21:28:07 -04:00
Guillermo del Angel 51a9ce28e1 Merge remote-tracking branch 'unstable/master' into develop 2012-10-31 10:29:48 -04:00
Ryan Poplin 4e661847b2 DelocalizedBaseRecalibrator becomes the BaseRecalibrator. 2012-10-29 12:53:39 -04:00
David Roazen 35483a7eef Update MD5s for PrintReads with BQSR Integration Test
The MD5s for these tests were changed in commit 87435f1074615b2cd016f042980109fd53962c8d
to match the output of a broken version of BaseRecalibration. With the patch in
commit c397102ecc1fd1d2cd8f209a8f358ab4a60b50a7, the output once again matches the
*original* MD5s for these tests, and does not vary as you increase -nct.

Final resolution to GSA-632
2012-10-26 14:25:25 -04:00
Eric Banks b06f689d4b Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-26 02:13:26 -04:00
Eric Banks bf3d61ce82 The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it. 2012-10-26 01:04:51 -04:00
Mark DePristo cc8c12b954 Committing a broken version of BaseRecalibration
-- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails.  But that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
2012-10-25 14:46:35 -04:00
David Roazen 02018ca764 Legacy BaseRecalibrator walker is neither TreeReducible nor NanoSchedulable
The old BaseRecalibrator walker is and never will be thread-safe, since it's a
LocusWalker that uses read attributes to track state.

ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe
at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1
for testing purposes, but wait for further testing to be done before using it
for production purposes in multithreaded mode.
2012-10-24 15:22:50 -04:00
Ryan Poplin a27ee26481 updating HC integration test. 2012-10-24 14:08:39 -04:00
Ryan Poplin 094db7bf24 We now require at least 10 samples to merge variants into complex events in the HC. Added a new population based bam for the complex event integration test. 2012-10-24 14:07:36 -04:00
Mauricio Carneiro 4cd1a92358 Updating RR integration tests
Forgot to update the integration tests after merging DEV-117 with optimizations from GATK main repo.
2012-10-23 11:26:26 -04:00
Mauricio Carneiro c210b7cde4 Merge GATK repo into CMI-GATK
Bringing in the following relevant changes:
	* Fixes the indel realigner N-Way out null pointer exception DEV-10
	* Optimizations to ReduceReads that bring the run time to 1/3rd.

Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java

DEV-10 #resolve #time 2m
2012-10-23 10:59:11 -04:00
Mauricio Carneiro bbf7a0fb09 Adding integration test to ReduceReads coreduction
DEV-117 #resolve
2012-10-23 10:56:33 -04:00
Mark DePristo 90f59803fd MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles
-- All constructores, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic.  That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele.  The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating.  This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime.  Might be worth it through
2012-10-22 13:47:56 -04:00
Mark DePristo 9f2851d769 Updating UnifiedGenotyperGeneralPloidyIntegrationTest following rebasing
-- Created a JIRA ticket https://jira.broadinstitute.org/browse/GSA-623 for Guillermo to look at the differences as the multi-allelic nature of many sites seems to change with the new more protected infrastructure.  This may be due to implementation issues in the pooled caller, problems with my interface, or could be a genuine improvement.
2012-10-21 20:23:11 -04:00
Mark DePristo d21e42608a Updating integration tests for minor changes due to switching to EXACT_INDEPENDENT model by default 2012-10-21 12:43:46 -04:00
Mark DePristo 0fcd358ace Original EXACT model implementation lives, providing another reference (bi-allelic only) EXACT model
-- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case
-- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations
-- Added systematic unit tests covering this implementation, and comparing it to others
-- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0, even after summing up multiple likelihoods, which is just not safe to do and was causing us to lose likelihood in some cases
-- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right
2012-10-21 12:42:31 -04:00
Mark DePristo eaffb814d3 IndependentExactAFCalc is now the default EXACT model implementation
-- Changed UG / HC to use this one via the StandardCallerArgumentCollection
-- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums
2012-10-21 12:42:31 -04:00
Mark DePristo 326f429270 Bugfixes to make new AFCalc system pass integrationtests
-- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests
-- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel, return a null VC when the only allele we get from our final alleles to use method is the reference base
-- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF == 0) = X for some meaningful value of X.  Added unit test to ensure this behavior is correct
-- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order.  Add contract to ensure this doesn't ever happen again
-- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model
--
2012-10-21 12:42:31 -04:00
Mark DePristo 99c9031cb4 Merge AFCalcResultTracker into StateTracker, cleanup
-- These two classes were really the same, and now they are actually the same!
-- Cleanuped the interfaces, removed duplicate data
-- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed)
-- Moved goodProbability and goodProbabilityVector utilities to MathUtils.  Very useful for contracts!
2012-10-21 12:42:31 -04:00
Guillermo del Angel e9b7324dc1 Merge branch 'master' of ssh://gsa3/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-21 12:38:49 -04:00
Guillermo del Angel 67b9e7319e Fix for integration tests: new criterion in AF exact calculation model to trim alleles based on likelihoods does produce better results and resulting alleles changed in 2 sites at integration tests (and all subsequent sites after this had minor annotation differences due to RankSum dithering) 2012-10-21 12:38:33 -04:00
Ryan Poplin a647f1e076 Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore. 2012-10-20 16:38:18 -04:00
Ryan Poplin b4e69239dd In order to be considered an informative read in the PerReadAlleleLikelihoodMap it has to be informative compared to all other alleles not just the worst allele. Also, fixing a bug when there is only one allele in the map. 2012-10-18 14:31:15 -04:00
Mauricio Carneiro 32ee2c7dff Refactored the compression interface per sample in ReduceReadsa
The CompressionStash is now responsible for keeping track of all intervals that must be kept uncompressed by all samples. In general this is a list generated by a tumor sample that will enforce all normal samples to abide.
  - Updated ReduceReads integration tests
  - Sliding Window is now using the CompressionStash (single sample).

DEV-104 #resolve #time 3m
2012-10-17 16:40:40 -04:00
Mark DePristo fa93681f51 Scalability test for EXACT models 2012-10-17 14:15:11 -04:00
Mark DePristo c74d7061fe Added AFCalcResultUnitTest
-- Ensures that the posteriors remain within reasonable ranges.  Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good.  Now tests ensure that the normalization process preserves log10 precision where possible
-- Updated MathUtils to make this possible
2012-10-16 08:11:06 -04:00
Mark DePristo 9b0ab4e941 Cleanup IndependentAllelesDiploidExactAFCalc
-- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all
-- Added lots of contracts and docs, still more to come.
-- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't just do the default thing
-- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.
2012-10-16 08:11:06 -04:00
Mark DePristo d1511e38ad Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest
-- Superceded by IndependentAFCalc
-- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it.
-- A few misc. bug fixes discovered during running the performance test
2012-10-16 08:11:06 -04:00
Ryan Poplin 25be94fbb8 Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10. 2012-10-15 13:24:32 -04:00
Mark DePristo dcf8af42a8 Finalizing IndependentAllelesDiploidExactAFCalc
-- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors
-- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef.  Not ideal.  Finally caught by some tests, but good god it almost made it into the code
-- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0
-- Massive new suite of unit tests to ensure that bi-allelic and tri-allele events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly.  ID'd some of the bugs below
-- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests
-- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when no there we no informative GLs.
2012-10-15 08:21:03 -04:00
Mark DePristo 6b639f51f0 Finalizing new exact model and tests
-- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior.
-- Tests that theta^n.alt.alleles is being applied correctly
-- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java
-- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine
2012-10-15 07:53:57 -04:00
Mark DePristo 2d72265f7d AFCalcUnit test a more appropriate name 2012-10-15 07:53:57 -04:00
Mark DePristo cb857d1640 AFCalcs must be made by factory method now
-- AFCalcFactory is the only way to make AFCalcs now.  There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions.  The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles.
-- AFCalc no longer has UAC constructor -- only AFCalcFactory does.  Code cleanup throughout
-- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next)
-- It's now possible to run the UG / HC with any of the exact models currently in the system.
-- Code cleanup throughout the system, reorganizing the unit tests in particular
2012-10-15 07:53:56 -04:00
Mark DePristo 6bbe750e03 Continuing work on IndependentAllelesDiploidExactAFCalc
-- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly.  A long way towards the right answer now, but still not there
-- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan
-- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there
-- New convenience method in VariantContext that looks up the allele index in the alleles
2012-10-15 07:53:56 -04:00
Mark DePristo 176b74095d Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation
-- Still not work, but I know what's wrong
-- Many tests disabled, that need to be reanabled
2012-10-15 07:53:56 -04:00
Mark DePristo 91aeddeb5a Steps on the way to a fully described and semantically meaningful AFCalcResult
-- AFCalcResult now sports a isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold
-- Lots of contracts for AFCalcResult
-- Slowly killing off AFCalcResultsTracker
-- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele
-- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system
-- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.
2012-10-15 07:53:56 -04:00