gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	27d8d3f51e	RR optimization: don't recalculate the entire bitset of variant sites for every read added to the sliding window. Instead, reuse as much of the previously calculated bitset as you can (basically from the window start until the start of the new read minus the context size). In some awfully performing regions this cuts down the runtime in half, although in others this doesn't seem to help much (so clearly something else is going on). Note that I still need to fix one last bug here, but it's almost done.	2012-10-19 11:59:34 -04:00
Eric Banks	54f698422c	Better implementation for getSoftEnd() in GATKSAMRecord	2012-10-18 09:01:51 -04:00
Eric Banks	20ffbcc86e	RR optimization: profiling was showing that the BaseCounts class was a major bottleneck because the underlying implementation was a HashMap. Given that the map index was an indexable Enum anyways, it makes a lot more sense to implement as a native array. Knocks 30% off the runtime in bad regions.	2012-10-17 21:44:53 -04:00
Eric Banks	33df1afe0e	More BaseCounts optimizations for RR.	2012-10-17 00:55:44 -04:00
Eric Banks	19e2b5f0d5	RR optimization: since total count in BaseCounts is requested so often, don't keep computing it from scratch each time.	2012-10-17 00:44:23 -04:00
David Roazen	b30e2a5b7d	BQSR: tool to profile the effects of more-granular locking on scalability by # of threads	2012-10-16 14:43:16 -04:00
Mark DePristo	9bcefadd4e	Refactor ExactCallLogger into a separate class -- Update minor integration tests with NanoSchedule due to qual accuracy update	2012-10-16 13:30:09 -04:00
Mark DePristo	c74d7061fe	Added AFCalcResultUnitTest -- Ensures that the posteriors remain within reasonable ranges. Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good. Now tests ensure that the normalization process preserves log10 precision where possible -- Updated MathUtils to make this possible	2012-10-16 08:11:06 -04:00
Mark DePristo	9b0ab4e941	Cleanup IndependentAllelesDiploidExactAFCalc -- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all -- Added lots of contracts and docs, still more to come. -- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't just do the default thing -- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.	2012-10-16 08:11:06 -04:00
Mark DePristo	6bd0ec8de4	Proper likelihoods and posterior probability of the joint allele frequency in IndependentAllelesDiploidExactAFCalc -- Fixed minor numerical stability issue in AFCalcResult -- posterior of joint A/B/C is 1 - (1 - P(D | AF_b == 0)) x (1 - P(D | AF_c == 0)), for any number of alleles, obviously. Now computes the joint posterior like this, and then back-calculates likelihoods that generate these posteriors given the priors. It's not pretty but it's the best thing to do	2012-10-16 08:11:06 -04:00
Mark DePristo	d1511e38ad	Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest -- Superseded by IndependentAFCalc -- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it. -- A few misc. bug fixes discovered during running the performance test	2012-10-16 08:11:06 -04:00
Ryan Poplin	31be807664	Updating missed integration test.	2012-10-15 22:31:52 -04:00
Ryan Poplin	d27ae67bb6	Updating the multi-step UG integration test.	2012-10-15 22:30:01 -04:00
David Roazen	cb33f25bfc	Update expected values for HybridSelectionPipelineTest Mark has confirmed that these differences were to be expected given his recent changes.	2012-10-15 18:32:15 -04:00
Ryan Poplin	25be94fbb8	Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10.	2012-10-15 13:24:32 -04:00
Mark DePristo	57e231610b	New framework for EXACT calculations, with 3 new implementations -- Before this branch, the EXACT calculation implementation was largely based on historical choices in the UnifiedGenotyper. The code was badly organized, there were no unit tests, and the Diploid EXACT calculation was super slow O(n.samples ^ n.alt.alleles) -- Reorganized code into a single class AFCalc superclass that carries out the calculation and an AFCalcResult object that contains only the information we should expose to code users, and is well-validated. -- Implement a new model for the multi-allelic exact calculation that sweeps for each alt allele B all likelihoods into a bi-allelic model XB where X is all alleles != B, and calls these all separately using the reference bi-allelic model. It produces identical quals for the bi-allelic case but slightly different results for multi-allelics due to a genuine model difference in that this Independent model doesn't penalize fully all genotype configurations as occurs in the Reference multi-allelic implementation. However, it seems after much debate that the reference model is doing the wrong thing, so in fact the Independent model seems correct. This code isn't the default implementation yet, simply because I want to do some cleanup and discuss with the methods group before enabling. -- Constrained search model implemented, but will be deleted in a subsequent code cleanup -- Massive (40K) suite of unit tests for the exact models, which are passing for the reference and the independent alleles exact model. -- Restored -- but isn't 100% hooked up -- the original clean bi-allelic model for Ryan to pass his optimized logless version on. -- The only way to create these AFCalc objects is through an AFCalcFactory, which again validates its arguments. The AFCalcFactory.Calculation enum exposes calculations to the UG / HC as the AFModel. -- Separated AFCalc from UG, into its own package that could in principle be pushed into utils now -- Created a simple main[] function to run performance tests of the EXACT model.	2012-10-15 08:32:32 -04:00
Mark DePristo	dcf8af42a8	Finalizing IndependentAllelesDiploidExactAFCalc -- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors -- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef. Not ideal. Finally caught by some tests, but good god it almost made it into the code -- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0 -- Massive new suite of unit tests to ensure that bi-allelic and tri-allelic events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly. ID'd some of the bugs below -- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests -- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when there were no informative GLs.	2012-10-15 08:21:03 -04:00
Mark DePristo	1ac09ca81e	More bugfixes on the way to a final push with new Exact model framework -- UnifiedGenotyperEngine uses only the alleles used in genotyping, not the original alleles, when considering which alleles to include in output -- AFCalcFactory has a more informative info message when looking for and selecting an exact model to use in genotyping	2012-10-15 07:53:57 -04:00
Mark DePristo	6b639f51f0	Finalizing new exact model and tests -- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior. -- Tests that theta^n.alt.alleles is being applied correctly -- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java -- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine	2012-10-15 07:53:57 -04:00
Mark DePristo	2d72265f7d	AFCalcUnit test a more appropriate name	2012-10-15 07:53:57 -04:00
Mark DePristo	cb857d1640	AFCalcs must be made by factory method now -- AFCalcFactory is the only way to make AFCalcs now. There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions. The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles. -- AFCalc no longer has UAC constructor -- only AFCalcFactory does. Code cleanup throughout -- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next) -- It's now possible to run the UG / HC with any of the exact models currently in the system. -- Code cleanup throughout the system, reorganizing the unit tests in particular	2012-10-15 07:53:56 -04:00
Mark DePristo	6bbe750e03	Continuing work on IndependentAllelesDiploidExactAFCalc -- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly. A long way towards the right answer now, but still not there -- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan -- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there -- New convenience method in VariantContext that looks up the allele index in the alleles	2012-10-15 07:53:56 -04:00
Mark DePristo	176b74095d	Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation -- Still not working, but I know what's wrong -- Many tests disabled that need to be re-enabled	2012-10-15 07:53:56 -04:00
Mark DePristo	91aeddeb5a	Steps on the way to a fully described and semantically meaningful AFCalcResult -- AFCalcResult now sports a isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold -- Lots of contracts for AFCalcResult -- Slowly killing off AFCalcResultsTracker -- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele -- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system -- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.	2012-10-15 07:53:56 -04:00
Mark DePristo	4f1b1c4228	Intermediate commit II on simplifying AFCalcResult -- All of the code now uses the AFCalc object, not the package-protected AFCalcResultTracker. Nearly all unit tests pass (except for a contract-failing one that will be dealt with in a subsequent commit), due to -Infinity values from normalizeLog10. -- Changed the way that UnifiedGenotyper decides if the best model is non-ref. Previously looked at the MAP AC, but the MAP AC values are no longer provided by AFCalcResult. This is on purpose, because the MAP isn't a meaningful quantity for the exact model (i.e., everything is going to go to MLE AC in some upcoming commit). If you want to understand why come talk to me. Now uses the isPolymorphic function and the EMIT confidence, so that if pNonRef > EMIT then the site is poly, otherwise it's mono.	2012-10-15 07:53:56 -04:00
Mark DePristo	06687bfaf6	Intermediate commit on simplifying AFCalcResult -- Renamed old class AFCalcResultTracker. This object is now allocated by the AFCalc itself, since it is heavy-weight and was badly optimized in the UG with a thread-local variable. Now, since there's already a AFCalc thread-local there, we get that optimization for free. -- Removed the interface to provide the AFCalcResultTracker to getlog10PNonRef. -- Wrote new, clean but unused AFCalcResult object that will soon replace the tracker as the external interface to the AFCalc model results, leaving the tracker as an internal tracker structure. This will allow me to (1) finally test things exhaustively, as the contracts on this class are clear (2) finalize the IndependentAllelesDiploidExactAFCalc class as it can work with a meaningfully defined result across each object	2012-10-15 07:53:56 -04:00
Mark DePristo	c82aa01e0e	Generalize testing infrastructure to allow us to run specific n.samples calculation	2012-10-15 07:53:55 -04:00
Mark DePristo	ec935f76f6	Initial implementation and tests for IndependentAllelesDiploidExactAFCalc -- This model separates each of N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic case, and runs the exact model on each independently to handle the multi-allelic case. This is very fast, scaling at O(n.alt.alleles x n.samples) -- Many outstanding TODOs in order to truly pass unit tests -- Added proper unit tests for the pNonRef calculation, which all of the models pass	2012-10-15 07:53:55 -04:00
Mark DePristo	5a4e2a5fa4	Test code to ensure that pNonRef is being computed correctly for at least 1 genotype, bi and tri allelic	2012-10-15 07:53:55 -04:00
Mark DePristo	ee2f12e2ac	Simpler naming convention for AlleleFrequencyCalculation => AFCalc	2012-10-15 07:53:55 -04:00
Mark DePristo	cf3f9d6ee8	Reorganize and cleanup AFCalculations -- Now contained in a package called afcalc -- Extracted standalone classes from private static classes in ExactAF -- Most fields are now private, with accessors -- Overall cleaner organization now	2012-10-15 07:53:55 -04:00
Mark DePristo	13211231c7	Restructure and cleanup ExactAFCalculations -- Now there's no duplication between exact old and constrained models. The behavior is controlled by an overloaded abstract function -- No more static function to access the linear exact model -- you have to create the surrounding class. Updated code in the system -- Everything passes unit tests	2012-10-15 07:53:54 -04:00
Mark DePristo	99ad7b2d71	GeneralPloidyExact should use indel max alt alleles	2012-10-15 07:53:54 -04:00
Mark DePristo	bf276baca0	Don't try to compute full exact model for > 100 samples	2012-10-15 07:53:54 -04:00
Mark DePristo	b924e9ebb4	Add OptimizedDiploidExactAF to PerformanceTesting framework	2012-10-15 07:53:54 -04:00
Mark DePristo	f800f3fb88	Optimized diploid exact AF calculation uses maxACs to stop the calculation by maxAC by allele -- Added unit tests to ensure the approximation isn't so far from our reference implementation (DiploidExactAFCalculation)	2012-10-15 07:53:54 -04:00
Mark DePristo	efad215edb	Greedy version of function to compute the max achievable AC for each alt allele -- walks over the genotypes in VC, and computes for each alt allele the maximum AC we need to consider in that alt allele dimension. Does the calculation based on the PLs in each genotype g, choosing to update the max AC for the alt alleles corresponding to that PL. Only takes the first lowest PL, if there are multiple genotype configurations with the same PL value. It takes values in the order of the alt alleles.	2012-10-15 07:53:54 -04:00
Mark DePristo	7666a58773	Function to compute the max achievable AC for each alt allele -- Additional minor cleanup of ExactAFCalculation	2012-10-15 07:53:53 -04:00
Mark DePristo	b3cb33a416	simple script to run nano schedule main[]	2012-10-15 07:52:02 -04:00
Eric Banks	a8efa5451a	Protect against bad bases when users have screwy data (or try to use zipped references)	2012-10-12 15:05:03 -04:00
David Roazen	da1cffbfca	Run performance tests in gsa-engineering queue on gsa4 rather than gsa queue Running the performance tests on the farm wasn't working out very well -- it's been too long since they've run to completion. Switching back to running them on gsa4 for now.	2012-10-12 14:21:27 -04:00
Guillermo del Angel	5971006678	Bug fix when running nondiploid mode in UG with EMIT_ALL_SITES: if site was reference-only, QUAL is produced OK but genotypes were being set to no-call because of unnecessary likelihood normalization. May change integration test md5 which I'll fix later today	2012-10-12 12:45:55 -04:00
Eric Banks	81532a0529	Missing files are user errors.	2012-10-12 09:48:12 -04:00
Eric Banks	fa77a83783	Update the out of space error to include another permutation	2012-10-12 09:38:12 -04:00
Eric Banks	85525d9e6e	Make Geraldine's life easier: from now on we treat problems where a temp file cannot be found when running the GATK with multiple threads as User Errors (since they are 99.9% of the time). This is an extremely large class of errors in Tableau and on the forums. Helpful error message tells users exactly what we tell them on the forums anyways (Geraldine: feel free to edit).	2012-10-12 09:19:50 -04:00
Eric Banks	ad60300bee	Catch malformed BAM files at the source since this is the largest class of errors in Tableau.	2012-10-12 09:07:57 -04:00
Eric Banks	593c8065d9	Fix docs for BadMateFilter	2012-10-12 08:35:45 -04:00
Christopher Hartl	6b9987cf1b	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-10-12 00:48:42 -04:00
Christopher Hartl	c1211ad3a1	Full test suite of LD-corrected GRM calculation. The correctness of this code is now largely verified. Matches GCTA when no correction is used (up to 6 decimal places). Bed reading relies on a particular test directory that is still local. The rest is all generated in unit test fashion.	2012-10-12 00:46:02 -04:00
David Roazen	3861212dab	Fix inefficiency in FilePointer GenomeLoc validation Validation of GenomeLocs in the FilePointer class was extremely inefficient when the GenomeLocs were added one at a time rather than all at once. Appears to mostly fix GSA-604	2012-10-11 19:55:14 -04:00

10761 Commits (27d8d3f51e67699feaafc7aab40a8ed40bc11d4c)