Commit Graph

10845 Commits (32ee2c7dffde3210e2c3b183f5f2fefd3a49af23)

Author SHA1 Message Date
Guillermo del Angel 1977763b08 Set default parameters for several command line inputs based on refdata content on cloud instances 2012-10-15 12:20:08 -04:00
Mark DePristo 57e231610b New framework for EXACT calculations, with 3 new implementations
-- Before this branch, the EXACT calculation implementation was largely based on historical choices in the UnifiedGenotyper.  The code was badly organized, there were no unit tests, and the Diploid EXACT calculation was super slow O(n.samples ^ n.alt.alleles)
-- Reorganized code into a single AFCalc superclass that carries out the calculation and an AFCalcResult object that contains only the information we should expose to code users, and is well-validated.
-- Implement a new model for the multi-allelic exact calculation that sweeps for each alt allele B all likelihoods into a bi-allelic model XB where X is all alleles != B, and calls these all separately using the reference bi-allelic model.  It produces identical quals for the bi-allelic case but slightly different results for multi-allelics due to a genuine model difference in that this Independent model doesn't penalize fully all genotype configurations as occurs in the Reference multi-allelic implementation.  However, it seems after much debate that the reference model is doing the wrong thing, so in fact the Independent model seems correct.  This code isn't the default implementation yet, simply because I want to do some cleanup and discuss with the methods group before enabling.
-- Constrained search model implemented, but will be deleted in a subsequent code cleanup
-- Massive (40K) suite of unit tests for the exact models, which are passing for the reference and the independent alleles exact model.
-- Restored -- but isn't 100% hooked up -- the original clean bi-allelic model for Ryan to pass his optimized logless version on.
-- The only way to create these AFCalc objects is through an AFCalcFactory, which again validates its arguments.  The AFCalcFactory.Calculation enum exposes calculations to the UG / HC as the AFModel.
-- Separated AFCalc from UG, into its own package that could in principle be pushed into utils now
-- Created a simple main[] function to run performance tests of the EXACT model.
2012-10-15 08:32:32 -04:00
Mark DePristo dcf8af42a8 Finalizing IndependentAllelesDiploidExactAFCalc
-- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors
-- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotyperEngine, where isNonRef really meant isRef.  Not ideal.  Finally caught by some tests, but good god it almost made it into the code
-- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0
-- Massive new suite of unit tests to ensure that bi-allelic and tri-allele events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly.  ID'd some of the bugs below
-- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests
-- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when there were no informative GLs.
2012-10-15 08:21:03 -04:00
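The -0.0 mentioned in the commit above is ordinary IEEE 754 behavior: multiplying a positive zero by a negative factor yields a negative zero. A minimal sketch of the effect, written in Python rather than the project's Java, and assuming the confidence is computed as -10 * log10(p). The function name is hypothetical and this is not the GATK code:

```python
import math

def phred_scaled_confidence(p_error):
    """Hypothetical helper: phred-scale an error probability in (0, 1]."""
    # For p_error == 1.0, log10 is 0.0 and -10.0 * 0.0 evaluates to -0.0.
    raw = -10.0 * math.log10(p_error)
    # Phred values are non-negative for p_error in (0, 1], so abs() only
    # clears the sign bit of a negative zero and leaves other values alone.
    return abs(raw)
```

Without the abs() call, a confidence of exactly zero would print as -0.0.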
Mark DePristo 1ac09ca81e More bugfixes on the way to a final push with new Exact model framework
-- UnifiedGenotyperEngine uses only the alleles used in genotyping, not the original alleles, when considering which alleles to include in output
-- AFCalcFactory has a more informative info message when looking for and selecting an exact model to use in genotyping
2012-10-15 07:53:57 -04:00
Mark DePristo 6b639f51f0 Finalizing new exact model and tests
-- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior.
-- Tests that theta^n.alt.alleles is being applied correctly
-- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java
-- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine
2012-10-15 07:53:57 -04:00
Mark DePristo 2d72265f7d AFCalcUnit test a more appropriate name 2012-10-15 07:53:57 -04:00
Mark DePristo cb857d1640 AFCalcs must be made by factory method now
-- AFCalcFactory is the only way to make AFCalcs now.  There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions.  The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles.
-- AFCalc no longer has UAC constructor -- only AFCalcFactory does.  Code cleanup throughout
-- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next)
-- It's now possible to run the UG / HC with any of the exact models currently in the system.
-- Code cleanup throughout the system, reorganizing the unit tests in particular
2012-10-15 07:53:56 -04:00
Mark DePristo 6bbe750e03 Continuing work on IndependentAllelesDiploidExactAFCalc
-- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly.  A long way towards the right answer now, but still not there
-- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan
-- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there
-- New convenience method in VariantContext that looks up the allele index in the alleles
2012-10-15 07:53:56 -04:00
Mark DePristo 176b74095d Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation
-- Still not working, but I know what's wrong
-- Many tests disabled, that need to be re-enabled
2012-10-15 07:53:56 -04:00
Mark DePristo 91aeddeb5a Steps on the way to a fully described and semantically meaningful AFCalcResult
-- AFCalcResult now sports isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold
-- Lots of contracts for AFCalcResult
-- Slowly killing off AFCalcResultsTracker
-- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele
-- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system
-- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.
2012-10-15 07:53:56 -04:00
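The normalizeFromLog10 change described in the commit above can be sketched as follows. This is a Python illustration under stated assumptions, not the MathUtils implementation: the -10000 floor is taken from the commit message, while the function signature, the constant name, and the max-shift used for numerical stability are assumptions.

```python
import math

LOG10_FLOOR = -10000.0  # floor value quoted in the commit message

def normalize_from_log10(log10_values, keep_in_log_space=True):
    """Normalize log10-scaled weights so that they sum to 1 in linear space."""
    m = max(log10_values)
    if m == float("-inf"):
        raise ValueError("at least one value must be finite")
    # Shift by the maximum so the largest term is 10**0 == 1 (avoids underflow).
    shifted = [v - m for v in log10_values]
    total = sum(10.0 ** s for s in shifted)
    if not keep_in_log_space:
        return [10.0 ** s / total for s in shifted]
    log10_total = math.log10(total)
    # Entries with zero probability would be -inf; clamp them to the floor.
    return [max(s - log10_total, LOG10_FLOOR) for s in shifted]
```

With the clamp in place, a vector such as [0.0, -inf] normalizes to [0.0, -10000.0], so no -Infinity value leaves the function when results are kept in log space.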
Mark DePristo 4f1b1c4228 Intermediate commit II on simplifying AFCalcResult
-- All of the code now uses the AFCalc object, not the package protected AFCalcResultTracker.  Nearly all unit tests pass (except for a contract failing one that will be dealt with in subsequent commit), due to -Infinity values from normalizeLog10.
-- Changed the way that UnifiedGenotyper decides if the best model is non-ref.  Previously looked at the MAP AC, but the MAP AC values are no longer provided by AFCalcResult.  This is on purpose, because the MAP isn't a meaningful quantity for the exact model (i.e., everything is going to go to MLE AC in some upcoming commit).  If you want to understand why come talk to me.  Now uses the isPolymorphic function and the EMIT confidence, so that if pNonRef > EMIT then the site is poly, otherwise it's mono.
2012-10-15 07:53:56 -04:00
Mark DePristo 06687bfaf6 Intermediate commit on simplifying AFCalcResult
-- Renamed old class AFCalcResultTracker.  This object is now allocated by the AFCalc itself, since it is heavy-weight and was badly optimized in the UG with a thread-local variable. Now, since there's already an AFCalc thread-local there, we get that optimization for free.
-- Removed the interface to provide the AFCalcResultTracker to getlog10PNonRef.
-- Wrote new, clean but unused AFCalcResult object that will soon replace the tracker as the external interface to the AFCalc model results, leaving the tracker as an internal tracker structure.  This will allow me to (1) finally test things exhaustively, as the contracts on this class are clear (2) finalize the IndependentAllelesDiploidExactAFCalc class as it can work with a meaningfully defined result across each object
2012-10-15 07:53:56 -04:00
Mark DePristo c82aa01e0e Generalize testing infrastructure to allow us to run specific n.samples calculation 2012-10-15 07:53:55 -04:00
Mark DePristo ec935f76f6 Initial implementation and tests for IndependentAllelesDiploidExactAFCalc
-- This model separates each of N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic case, and runs the exact model on each independently to handle the multi-allelic case.  This is very fast, scaling at O(n.alt.alleles x n.samples)
-- Many outstanding TODOs in order to truly pass unit tests
-- Added proper unit tests for the pNonRef calculation, which all of the models pass
2012-10-15 07:53:55 -04:00
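The collapse into X/X, X/N_i, and N_i/N_i described in the commit above can be sketched as follows. This is an illustration in Python, not the IndependentAllelesDiploidExactAFCalc code. It assumes diploid log10 genotype likelihoods in standard VCF order (genotype j/k with j <= k sits at index k*(k+1)/2 + j) and assumes the collapsed likelihood is the log-space sum of the constituent genotypes; the commit does not say how they are combined, and all names here are hypothetical.

```python
import math
from itertools import combinations_with_replacement

def log10_sum(values):
    """log10 of the sum of 10**v over values, computed stably."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log10(sum(10.0 ** (v - m) for v in values))

def collapse_to_biallelic(log10_gls, n_alleles, alt):
    """Return log10 likelihoods [X/X, X/B, B/B] for alt allele index B = alt.

    X stands for every allele other than B, including the reference.
    """
    buckets = ([], [], [])
    for j, k in combinations_with_replacement(range(n_alleles), 2):
        copies = (j == alt) + (k == alt)   # number of B alleles in genotype j/k
        index = k * (k + 1) // 2 + j       # VCF genotype ordering
        buckets[copies].append(log10_gls[index])
    return [log10_sum(b) for b in buckets]
```

Calling this once per alt allele yields one bi-allelic problem per allele, which is where the O(n.alt.alleles x n.samples) scaling quoted above would come from.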
Mark DePristo 5a4e2a5fa4 Test code to ensure that pNonRef is being computed correctly for at least 1 genotype, bi and tri allelic 2012-10-15 07:53:55 -04:00
Mark DePristo ee2f12e2ac Simpler naming convention for AlleleFrequencyCalculation => AFCalc 2012-10-15 07:53:55 -04:00
Mark DePristo cf3f9d6ee8 Reorganize and cleanup AFCalculations
-- Now contained in a package called afcalc
-- Extracted standalone classes from private static classes in ExactAF
-- Most fields are now private, with accessors
-- Overall cleaner organization now
2012-10-15 07:53:55 -04:00
Mark DePristo 13211231c7 Restructure and cleanup ExactAFCalculations
-- Now there's no duplication between the old exact and constrained models.  The behavior is controlled by an overloaded abstract function
-- No more static function to access the linear exact model -- you have to create the surrounding class.  Updated code in the system
-- Everything passes unit tests
2012-10-15 07:53:54 -04:00
Mark DePristo 99ad7b2d71 GeneralPloidyExact should use indel max alt alleles 2012-10-15 07:53:54 -04:00
Mark DePristo bf276baca0 Don't try to compute full exact model for > 100 samples 2012-10-15 07:53:54 -04:00
Mark DePristo b924e9ebb4 Add OptimizedDiploidExactAF to PerformanceTesting framework 2012-10-15 07:53:54 -04:00
Mark DePristo f800f3fb88 Optimized diploid exact AF calculation uses maxACs to stop the calculation by maxAC by allele
-- Added unit tests to ensure the approximation isn't so far from our reference implementation (DiploidExactAFCalculation)
2012-10-15 07:53:54 -04:00
Mark DePristo efad215edb Greedy version of function to compute the max achievable AC for each alt allele
-- walks over the genotypes in VC, and computes for each alt allele the maximum AC we need to consider in that alt allele dimension.  Does the calculation based on the PLs in each genotype g, choosing to update the max AC for the alt alleles corresponding to that PL.  Only takes the first lowest PL, if there are multiple genotype configurations with the same PL value.  It takes values in the order of the alt alleles.
2012-10-15 07:53:54 -04:00
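One reading of the greedy procedure described in the commit above, sketched in Python rather than the project's Java. The sketch assumes diploid samples, PLs in standard VCF genotype order, and that each sample's best genotype adds one count per alt allele copy to that allele's maximum AC; the function names are hypothetical and the actual ExactAFCalculation code may differ.

```python
def diploid_genotypes(n_alleles):
    """Allele-index pairs (j, k), j <= k, in VCF PL order: 0/0, 0/1, 1/1, 0/2, ..."""
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]

def greedy_max_acs(pls_per_sample, n_alleles):
    """Per alt allele, the largest allele count implied by each sample's best PL."""
    genotypes = diploid_genotypes(n_alleles)
    max_acs = [0] * (n_alleles - 1)
    for pls in pls_per_sample:
        # list.index returns the first position of the minimum, so ties
        # resolve to the earliest genotype in PL order.
        best = pls.index(min(pls))
        for allele in genotypes[best]:
            if allele > 0:                 # allele 0 is the reference
                max_acs[allele - 1] += 1
    return max_acs
```

For example, with three alleles and a sample whose PLs tie between 1/1 and 0/2, the first of the two (1/1) is taken, so the first alt allele receives both counts.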
Mark DePristo 7666a58773 Function to compute the max achievable AC for each alt allele
-- Additional minor cleanup of ExactAFCalculation
2012-10-15 07:53:53 -04:00
Mark DePristo b3cb33a416 simple script to run nano schedule main[] 2012-10-15 07:52:02 -04:00
Guillermo del Angel a4767a20be Bug fixes for temp mutect integration 2012-10-13 22:03:41 -04:00
Guillermo del Angel e3a8ed2151 Further bug fixes to merge cancer/germline fastq-bam pipelines 2012-10-13 11:16:14 -04:00
Guillermo del Angel b961f78f49 Temp fixes 2012-10-12 16:14:43 -04:00
Kristian Cibulskis 661fa5b98c added support for indel calling (with non-VCF format output) 2012-10-12 16:02:05 -04:00
Eric Banks a8efa5451a Protect against bad bases when users have screwy data (or try to use zipped references) 2012-10-12 15:05:03 -04:00
Guillermo del Angel 7e1657d243 Merge branch 'unstable' of github.com:broadinstitute/cmi-gatk into unstable 2012-10-12 14:49:37 -04:00
Mauricio Carneiro 274ac4836f Allowing the GATK to have non-required outputs
Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation).

This change enables the nWayOut parameter for the IndelRealigner and ReduceReads to operate optionally while maintaining the original single way out.

[#DEV-10 transition:31 resolution:1]
2012-10-12 14:49:16 -04:00
Mauricio Carneiro 05111eeaef Making nContigs parameter hidden in ReduceReads
For now, the het reduction should only be performed for diploids (n=2). We haven't really tested it for other ploidy so it should remain hidden until someone braves it out.
2012-10-12 14:49:15 -04:00
Guillermo del Angel 32e377a0db Fix bugs so that we can pass in 2 simultaneous samples in metadata (no co-cleaning yet but at least we don't need to run pipeline twice) to produce 2 bams. Pasted temp mutect so it's also run at the end of the run 2012-10-12 14:39:28 -04:00
David Roazen da1cffbfca Run performance tests in gsa-engineering queue on gsa4 rather than gsa queue
Running the performance tests on the farm wasn't working out very well --
it's been too long since they've run to completion. Switching back to
running them on gsa4 for now.
2012-10-12 14:21:27 -04:00
Guillermo del Angel dc03a09722 Merge branch 'develop' into unstable 2012-10-12 14:19:42 -04:00
Kristian Cibulskis c1706ef0ef upgraded mutation caller with VCF output
raw indel calls (non filtered,non vcf)
2012-10-12 14:18:12 -04:00
Guillermo del Angel 5971006678 Bug fix when running nondiploid mode in UG with EMIT_ALL_SITES: if site was reference-only, QUAL is produced OK but genotypes were being set to no-call because of unnecessary likelihood normalization. May change integration test md5 which I'll fix later today 2012-10-12 12:45:55 -04:00
Eric Banks 81532a0529 Missing files are user errors. 2012-10-12 09:48:12 -04:00
Eric Banks fa77a83783 Update the out of space error to include another permutation 2012-10-12 09:38:12 -04:00
Eric Banks 85525d9e6e Make Geraldine's life easier: from now on we treat problems where a temp file cannot be found when running the GATK with multiple threads as User Errors (since they are 99.9% of the time). This is an extremely large class of errors in Tableau and on the forums. Helpful error message tells users exactly what we tell them on the forums anyways (Geraldine: feel free to edit). 2012-10-12 09:19:50 -04:00
Eric Banks ad60300bee Catch malformed BAM files at the source since this is the largest class of errors in Tableau. 2012-10-12 09:07:57 -04:00
Eric Banks 593c8065d9 Fix docs for BadMateFilter 2012-10-12 08:35:45 -04:00
Christopher Hartl 6b9987cf1b Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2012-10-12 00:48:42 -04:00
Christopher Hartl c1211ad3a1 Full test suite of LD-corrected GRM calculation. The correctness of this code is now largely verified. Matches GCTA when no correction is used (up to 6 decimal places). Bed reading relies on a particular test directory that is still local. The rest is all generated in unit test fashion. 2012-10-12 00:46:02 -04:00
David Roazen 3861212dab Fix inefficiency in FilePointer GenomeLoc validation
Validation of GenomeLocs in the FilePointer class was extremely inefficient
when the GenomeLocs were added one at a time rather than all at once.

Appears to mostly fix GSA-604
2012-10-11 19:55:14 -04:00
Guillermo del Angel 47e9d967fe Merging in from cmi-develop branch - staying in this branch for now 2012-10-11 15:35:43 -04:00
Guillermo del Angel 77949ec740 Some fixes to QC commands in pipeline, and workaround for critical engine bug in GATK that makes it hang when doing small targeted BAMs with a whole exome interval list 2012-10-11 15:08:30 -04:00
Guillermo del Angel af5a6fdace Resolve [DEV-7]: add single-sample VCF calling at end of FASTQ-BAM pipeline. Initial steps of [DEV-4]: queue extensions for Picard QC metrics 2012-10-11 11:09:49 -04:00
Mark DePristo 9b19f5ce99 No longer include stack traces for user exceptions in GATK logs
-- Was taking a shockingly large amount of space on the server, and slowing down Tableau so much all stack traces had to be disabled
2012-10-10 20:41:03 -04:00