gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	e199562c25	I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever).	2012-11-27 10:26:17 -05:00
Eric Banks	66cbaaee31	Fixed nasty bug in BQSR csv file creation: numbers larger than 999 in the Errors column were printed out with commas (which looks like a separate column). This wasn't caught earlier because there are no integration tests covering the csv. I'll add one into unstable in a sec.	2012-11-09 08:33:55 -05:00
Eric Banks	15b8c08132	Apparently CIGAR elements can have 0 length according to the spec, but 0Ms were causing left alignment of indels to fail. Fixed.	2012-11-06 20:53:33 -08:00
Eric Banks	e1e480a0b9	Bug fix: don't add no-call alleles to the list of ALT alleles being validated.	2012-10-30 14:54:29 -04:00
Eric Banks	c95e893920	Better error message for unused ALT alleles	2012-10-29 21:51:35 -04:00
Eric Banks	be902375ac	'Bug' fix: fix the error message from the vcf validator so people realize that the file fails strict validation but still adheres to the spec.	2012-10-29 16:29:27 -04:00
Mark DePristo	ac5e58a265	Bugfix for GSA-540 / Update metadata maps when adding lines to VCFHeader -- https://jira.broadinstitute.org/browse/GSA-540 -- http://gatkforums.broadinstitute.org/discussion/1433/possible-bug-and-fix-in-java-code-of-vcfheader-org-broadinstitute-sting-utils-codecs-vcf-vcfheader	2012-10-26 16:34:16 -04:00
Mark DePristo	251983b8fb	Add GATK-wide command line argument to control the maximum runtime allowed for the GATK -- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker as though the end of processing had been reached. This means that all output files are closed properly, etc. -- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update -- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling? -- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.	2012-10-26 13:18:34 -04:00
Eric Banks	b06f689d4b	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-26 02:13:26 -04:00
Eric Banks	91f2c847a3	Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test.	2012-10-26 00:57:40 -04:00
David Roazen	422e16c62e	BaseRecalibration: don't cache instances of ReadCovariates across reads Caching and reusing ReadCovariates instances across reads sounds good in theory, but: -it doesn't work unless you zero out the internal arrays before each read -the internal arrays must be sized proportionally to the maximum POSSIBLE recalibrated read length (5000!!!), instead of the ACTUAL read lengths By contrast, creating a new instance per read is basically equivalent to doing an efficient low-level memset-style clear on a much smaller array (since we use the actual rather than the maximum read length to create it). So this should be faster than caching instances and calling clear() but slower than caching instances and not calling clear(). Credit to Ryan for proposing this approach.	2012-10-25 17:02:55 -04:00
David Roazen	884d031e72	NestedIntegerArray: Pre-allocate only the first two dimensions It turns out that pre-allocating the entire tree was too expensive in terms of memory when using large values for the -mcs and -ics parameters. Pre-allocating the first two dimensions prevents us from ever locking the root node during a put(). Contention between threads over lower levels of the tree should be minimal given that puts are rare compared to gets. Also output dimensions and pre-allocation info at startup. If pre-allocation takes longer than usual this gives the user a sense of what is causing the delay.	2012-10-25 15:17:42 -04:00
Mark DePristo	cc8c12b954	Committing a broken version of BaseRecalibration -- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA. -- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails. But that's because of this incorrect calculation -- Updating GATKPerformanceOverTime with the new @ClassType annotation	2012-10-25 14:46:35 -04:00
Eric Banks	df9e0b7045	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-25 02:49:54 -04:00
Eric Banks	c6b57fffda	Added allele biased down-sampling capabilities to the PerReadAlleleLikelihoodMap object, which means that both the UG and HC can use this functionality. Note that it's only available in protected, so GATK-lite users won't be allowed to enable it. Needs more testing.	2012-10-24 22:52:25 -04:00
Eric Banks	9da7bbf689	Refactoring the PerReadAlleleLikelihoodMap in preparation for adding contamination downsampling into protected only.	2012-10-24 15:49:07 -04:00
David Roazen	d9aa9855f8	Better comments in NestedIntegerArray	2012-10-24 15:29:13 -04:00
David Roazen	32a6d7000a	Thread-safe ReadGroupCovariate The ReadGroupCovariate class was not thread-safe. This led to horrible race conditions in multithreaded runs of the BQSR where (for example) the same read group could get inserted into the reverse lookup table twice with different IDs. Should fix the intermittent crash reported in GSA-492.	2012-10-24 15:22:50 -04:00
David Roazen	991658acf4	BQSR: use more granular locking for concurrency control -With this change, BQSR performance scales properly by thread rather than gaining nothing from additional threads. -Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct (NanoScheduler) -Removes high-level locks in the recalibration engines and NestedIntegerArray in favor of maximally-granular locks on and around manipulation of the leaf nodes of the NestedIntegerArray. -NestedIntegerArray now creates all interior nodes upfront rather than on the fly to avoid the need for locking during tree traversals. This uses more memory in the initial part of BQSR runs, but the BQSR would eventually converge to use this memory anyway over the course of a typical run. IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator walker with multiple threads. The BaseRecalibrator walker is not and never will be thread-safe, as it's a LocusWalker that uses read attributes to track state information. ONLY the newer DelocalizedBaseRecalibrator can be made thread-safe (and will hopefully be made so in my subsequent commits). This commit addresses performance, not correctness.	2012-10-24 15:22:50 -04:00
Khalid Shakir	fd59e7d5f6	Better error message when generic types are erased from scala collections.	2012-10-22 16:27:31 -04:00
Ryan Poplin	008df54575	Bug fix in GATKSAMRecord.getSoftEnd() for reads that are entirely clipped.	2012-10-22 14:21:52 -04:00
Mark DePristo	99c9031cb4	Merge AFCalcResultTracker into StateTracker, cleanup -- These two classes were really the same, and now they are actually the same! -- Cleaned up the interfaces, removed duplicate data -- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed) -- Moved goodProbability and goodProbabilityVector utilities to MathUtils. Very useful for contracts!	2012-10-21 12:42:31 -04:00
Eric Banks	d44d5b8275	Fix RawHapMapCodec so that it can build indexes. Minor fixes to VCF codec.	2012-10-21 01:29:59 -04:00
Ryan Poplin	a647f1e076	Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore.	2012-10-20 16:38:18 -04:00
Eric Banks	9c088fe3fe	Actually a better implementation of GATKSAMRecord.getSoftStart(). Last commit was all wrong. Oops.	2012-10-19 12:41:24 -04:00
Eric Banks	f08e5a44da	Better implementation of GATKSAMRecord.getSoftStart()	2012-10-19 12:11:18 -04:00
Eric Banks	deca564aef	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-19 12:01:49 -04:00
Eric Banks	54f698422c	Better implementation for getSoftEnd() in GATKSAMRecord	2012-10-18 09:01:51 -04:00
Mauricio Carneiro	b57df6cac8	Bringing CMI changes into the main GATK repo. Merge remote-tracking branch 'cmi/master'	2012-10-17 15:23:19 -04:00
David Roazen	b30e2a5b7d	BQSR: tool to profile the effects of more-granular locking on scalability by # of threads	2012-10-16 14:43:16 -04:00
Mark DePristo	c74d7061fe	Added AFCalcResultUnitTest -- Ensures that the posteriors remain within reasonable ranges. Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good. Now tests ensure that the normalization process preserves log10 precision where possible -- Updated MathUtils to make this possible	2012-10-16 08:11:06 -04:00
Mark DePristo	d1511e38ad	Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest -- Superseded by IndependentAFCalc -- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it. -- A few misc. bug fixes discovered during running the performance test	2012-10-16 08:11:06 -04:00
kshakir	9fcf71c031	Updated google reflections due to stale slf4j version conflicting with other projects also trying to use Queue as a component. Added targets to build.xml to effectively 'mvn install' packaged GATK/Queue from ant. TODO: Versions during 'mvn install' are hardcoded at 0.0.1 until a better versioning scheme that works with maven dependencies has been identified.	2012-10-16 02:22:30 -04:00
kshakir	213cc00abe	Refactored argument matching to support other plugins in addition to file lists. Added plugin support for sending Queue status messages. Argument parsing can store subclasses of java.io.File, for example RemoteFile.	2012-10-15 15:10:45 -04:00
Ryan Poplin	25be94fbb8	Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10.	2012-10-15 13:24:32 -04:00
Mark DePristo	6b639f51f0	Finalizing new exact model and tests -- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior. -- Tests that theta^n.alt.alleles is being applied correctly -- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java -- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine	2012-10-15 07:53:57 -04:00
Mark DePristo	cb857d1640	AFCalcs must be made by factory method now -- AFCalcFactory is the only way to make AFCalcs now. There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions. The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles. -- AFCalc no longer has UAC constructor -- only AFCalcFactory does. Code cleanup throughout -- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next) -- It's now possible to run the UG / HC with any of the exact models currently in the system. -- Code cleanup throughout the system, reorganizing the unit tests in particular	2012-10-15 07:53:56 -04:00
Mark DePristo	6bbe750e03	Continuing work on IndependentAllelesDiploidExactAFCalc -- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly. A long way towards the right answer now, but still not there -- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan -- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there -- New convenience method in VariantContext that looks up the allele index in the alleles	2012-10-15 07:53:56 -04:00
Mark DePristo	176b74095d	Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation -- Still not working, but I know what's wrong -- Many tests disabled, that need to be re-enabled	2012-10-15 07:53:56 -04:00
Mark DePristo	91aeddeb5a	Steps on the way to a fully described and semantically meaningful AFCalcResult -- AFCalcResult now sports isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold -- Lots of contracts for AFCalcResult -- Slowly killing off AFCalcResultsTracker -- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele -- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system -- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.	2012-10-15 07:53:56 -04:00
Mark DePristo	c82aa01e0e	Generalize testing infrastructure to allow us to run specific n.samples calculation	2012-10-15 07:53:55 -04:00
Eric Banks	a8efa5451a	Protect against bad bases when users have screwy data (or try to use zipped references)	2012-10-12 15:05:03 -04:00
Eric Banks	81532a0529	Missing files are user errors.	2012-10-12 09:48:12 -04:00
Eric Banks	85525d9e6e	Make Geraldine's life easier: from now on we treat problems where a temp file cannot be found when running the GATK with multiple threads as User Errors (since they are 99.9% of the time). This is an extremely large class of errors in Tableau and on the forums. Helpful error message tells users exactly what we tell them on the forums anyways (Geraldine: feel free to edit).	2012-10-12 09:19:50 -04:00
Ryan Poplin	2a9ee89c19	Turning on allele trimming for the haplotype caller.	2012-10-10 10:47:26 -04:00
Eric Banks	82e40340c0	Use StringBuilder over StringBuffer	2012-10-07 00:02:15 -04:00
Eric Banks	e7798ddd2a	Fix for JIRA GSA-598: AD field not handled properly by CombineVariants. It was not handled by SelectVariants either. We now strip the AD field out whenever combining/selecting makes it invalid due to a change in the number of ALT alleles.	2012-10-06 23:02:36 -04:00
Eric Banks	bfc551f612	Fix for GSA-589: SelectVariants with -number gives biased results. The implementation was not good and it's not worth keeping this busted code around given that we have a working implementation of a fractional random sampling already in place, so I removed it.	2012-10-06 22:39:49 -04:00
Mark DePristo	3663fe1555	Framework for evaluating the performance and scaling of the ExactAF models	2012-10-03 19:55:11 -07:00
Mark DePristo	17ca543937	More ExactModel cleanup -- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time. -- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live -- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs	2012-10-03 19:55:11 -07:00

1138 Commits (e199562c2521032abd003bd315cbfad294208e93)