Commit Graph

2956 Commits (251983b8fb829c084dfb29b8d60e9bc4e636491e)

Author SHA1 Message Date
Mark DePristo 251983b8fb Add GATK-wide command line argument to control the maximum runtime allowed for the GATK
-- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded.  By cleanly I mean that the engine simply stops at the next available cycle in the walker as through the end of processing had been reached.  This means that all output files are closed properly, etc.
-- Emits an info message that looks like "INFO  10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s".  Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update
-- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling?
-- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.
2012-10-26 13:18:34 -04:00
Eric Banks ed11b7dab2 Fix UG parallelization test 2012-10-26 12:10:44 -04:00
Eric Banks 7a706ed345 Fix some of the broken integration tests 2012-10-26 11:23:44 -04:00
Eric Banks ebebec7fdb Accidentally left one test disabled 2012-10-26 02:15:32 -04:00
Eric Banks b06f689d4b Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-26 02:13:26 -04:00
Eric Banks a53e03d525 Do not let reduced reads get removed in the contamination down-sampling 2012-10-26 02:13:04 -04:00
Eric Banks bf3d61ce82 The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it. 2012-10-26 01:04:51 -04:00
Eric Banks 91f2c847a3 Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test. 2012-10-26 00:57:40 -04:00
Mark DePristo 6b8b7df651 Queue now understands -nct and requests the appropriate number of cores from LSF, SGE, etc
-- NCT wasn't previously recognized by Queue as needing more processors per machine.  This commit fixes this.  Also a potential cause of poor GATKPerformanceOverTime, in that runs with -nct could flood a node and cause it to have hundreds of cores in contention.
2012-10-25 17:26:58 -04:00
David Roazen 422e16c62e BaseRecalibration: don't cache instances of ReadCovariates across reads
Caching and reusing ReadCovariates instances across reads sounds good in theory, but:

-it doesn't work unless you zero out the internal arrays before each read
-the internal arrays must be sized proportionally to the maximum POSSIBLE
recalibrated read length (5000!!!), instead of the ACTUAL read lengths

By contrast, creating a new instance per read is basically equivalent to doing an
efficient low-level memset-style clear on a much smaller array (since we use the actual
rather than the maximum read length to create it). So this should be faster than caching
instances and calling clear() but slower than caching instances and not calling clear().

Credit to Ryan to proposing this approach.
2012-10-25 17:02:55 -04:00
Ami Levy Moonshine dde3060bb8 add the CEUtrio best practices results (UG + PBT) to the bundle 2012-10-25 15:36:17 -04:00
Ami Levy Moonshine 90b9971033 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-25 15:32:29 -04:00
David Roazen 884d031e72 NestedIntegerArray: Pre-allocate only the first two dimensions
It turns out that pre-allocating the entire tree was too expensive in
terms of memory when using large values for the -mcs and -ics parameters.

Pre-allocating the first two dimensions prevents us from ever locking the
root node during a put(). Contention between threads over lower levels
of the tree should be minimal given that puts are rare compared to gets.

Also output dimensions and pre-allocation info at startup. If pre-allocation
takes longer than usual this gives the user a sense of what is causing the
delay.
2012-10-25 15:17:42 -04:00
Mark DePristo cc8c12b954 Committing a broken version of BaseRecalibration
-- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails.  But that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
2012-10-25 14:46:35 -04:00
Eric Banks e93ff3ea6e Let's go back to having the SB/SLOD NOT computed by default. If you recall, it was only enabled by default because we thought we were going to use it when we made VQSR use random forests. But since we decided not to change VQSR, there's no reason to triple the computation for every variant site anymore. 2012-10-25 12:45:23 -04:00
Eric Banks 6dc7d872ec Fix GenotypeAndValidate to handle SNPs and indels as reported on the forum. Recent changes to the UnifiedArgumentCollection made this stop working. Adding in JIRA to create integration tests for this tool. 2012-10-25 10:06:13 -04:00
Eric Banks c53c55da12 Re-enable tests 2012-10-25 09:37:08 -04:00
Eric Banks e6652f7777 Added integration test for contamination down-sampling 2012-10-25 09:36:05 -04:00
Eric Banks df9e0b7045 Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-25 02:49:54 -04:00
Eric Banks 72714ee43e Minor patches to get the contamination down-sampling working for indels. Adding @Hidden logging output for easy debugging. 2012-10-25 02:47:42 -04:00
Eric Banks c6b57fffda Added allele biased down-sampling capabilities to the PerReadAlleleLikelihoodMap object, which means that both the UG and HC can use this functionality. Note that it's only available in protected, so GATK-lite users won't be allowed to enable it. Needs more testing. 2012-10-24 22:52:25 -04:00
Ami Levy Moonshine bcf3582095 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-24 21:50:41 -04:00
Eric Banks 9da7bbf689 Refactoring the PerReadAlleleLikelihoodMap in preparation for adding contntamination downsampling into protected only. 2012-10-24 15:49:07 -04:00
David Roazen d9aa9855f8 Better comments in NestedIntegerArray 2012-10-24 15:29:13 -04:00
David Roazen 02018ca764 Legacy BaseRecalibrator walker is neither TreeReducible nor NanoSchedulable
The old BaseRecalibrator walker is and never will be thread-safe, since it's a
LocusWalker that uses read attributes to track state.

ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe
at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1
for testing purposes, but wait for further testing to be done before using it
for production purposes in multithreaded mode.
2012-10-24 15:22:50 -04:00
David Roazen 32a6d7000a Thread-safe ReadGroupCovariate
The ReadGroupCovariate class was not thread-safe. This led to horrible race conditions
in multithreaded runs of the BQSR where (for example) the same read group could get
inserted into the reverse lookup table twice with different IDs.

Should fix the intermittent crash reported in GSA-492.
2012-10-24 15:22:50 -04:00
David Roazen 991658acf4 BQSR: use more granular locking for concurrency control
-With this change, BQSR performance scales properly by thread rather
 than gaining nothing from additional threads.
-Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct
 (NanoScheduler)
-Removes high-level locks in the recalibration engines and NestedIntegerArray
 in favor of maximally-granular locks on and around manipulation of the leaf
 nodes of the NestedIntegerArray.
-NestedIntegerArray now creates all interior nodes upfront rather than on
 the fly to avoid the need for locking during tree traversals. This uses
 more memory in the initial part of BQSR runs, but the BQSR would eventually
 converge to use this memory anyway over the course of a typical run.

IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator
walker with multiple threads. The BaseRecalibrator walker is and will never be
thread-safe, as it's a LocusWalker that uses read attributes to track
state information. ONLY the newer DelocalizedBaseRecalibrator can be made
thread-safe (and will hopefully be made so in my subsequent commits). This
commit addresses performance, not correctness.
2012-10-24 15:22:50 -04:00
Eric Banks 5b7b42356b Fix bug in GenotypeAndValidate where it doesn't check vc.hasAttribute() before using vc.getAttribute(). 2012-10-24 14:02:50 -04:00
Mark DePristo 6e421a72d6 Add more exhaustive unit tests for input errors to NanoScheduler
-- Resolves issue GSA-515 / Nanoscheduler GSA-605 / Seems that -nct may deadlock as not reproducible
-- It seems that it's not an input error problem (or at least cannot be provoked with unit tests)
-- I'll keep an eye on this later
2012-10-23 20:11:29 -04:00
Mark DePristo f838815343 Updating MD5s for confidence ref site estimation in IndependentAllelesDiploidExactAFCalc
-- Included logic to only add priors for alleles with sufficient evidence to be called polymorphic.  If no alleles are poly make sure to add priors of first allele
2012-10-23 06:47:53 -04:00
Mark DePristo 15b28e61cd Retiring TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine
-- There's been no report of problems with the nano scheduled version of TraverseLoci and TraverseReads, so I'm removing the old versions since they are no longer needed
-- Removing unnecessary intermediate base classes
-- GSA-515 / Nanoscheduler GSA-549 / https://jira.broadinstitute.org/browse/GSA-549
2012-10-22 16:55:06 -04:00
Khalid Shakir fd59e7d5f6 Better error message when generic types are erased from scala collections. 2012-10-22 16:27:31 -04:00
Ryan Poplin 008df54575 Bug fix in GATKSAMRecord.getSoftEnd() for reads that are entirely clipped. 2012-10-22 14:21:52 -04:00
Mark DePristo 90f59803fd MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles
-- All constructores, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic.  That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele.  The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating.  This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime.  Might be worth it through
2012-10-22 13:47:56 -04:00
Khalid Shakir 97dc3664c9 Fixed yet another NPE related to the ArgumentTypeDescriptor vs. ArgumentMatchValue. Added integration test based on GSA-621. 2012-10-22 12:05:32 -04:00
Ami Levy Moonshine 1da4ad9607 new class to test the quals of reduced bam vs non-reduced bam 2012-10-22 10:45:53 -04:00
Mark DePristo eb6c9a1a79 Disable EfficiencyMonitoringThreadFactoryUnitTest
-- This is no longer a core GATK activity, and the tests need to run for so long (2 min each) that it's just too painful to run them.  Should be re-eabled if we come to care about this capability again, or if we can run these tests all in parallel in the future.
2012-10-21 12:43:46 -04:00
Mark DePristo 5296de8251 Fix UnifiedArgumentCollection constructor logic error
-- The old way of overloading constructors and calling super didn't work (might have been a consequence of merge).  This is the right way to do the copy constructor with the call to super()
2012-10-21 12:43:46 -04:00
Mark DePristo d21e42608a Updating integration tests for minor changes due to switching to EXACT_INDEPENDENT model by default 2012-10-21 12:43:46 -04:00
Mark DePristo 6b6caf8e3a Bugfix for indel DP calculations using reduced reads
-- Adding tests for SNP and indel calling on reduced BAM
2012-10-21 12:42:32 -04:00
Mark DePristo 0fcd358ace Original EXACT model implementation lives, providing another reference (bi-allelic only) EXACT model
-- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case
-- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations
-- Added systematic unit tests covering this implementation, and comparing it to others
-- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0, even after summing up multiple likelihoods, which is just not safe to do and was causing us to lose likelihood in some cases
-- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right
2012-10-21 12:42:31 -04:00
Mark DePristo 0fb8274507 Fix contact on sorting of AFCalcResults
-- The thing that must be sorted is the pre-theta^N list, which is not checked in the routine that applies the theta^N prior.
2012-10-21 12:42:31 -04:00
Mark DePristo eaffb814d3 IndependentExactAFCalc is now the default EXACT model implementation
-- Changed UG / HC to use this one via the StandardCallerArgumentCollection
-- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums
2012-10-21 12:42:31 -04:00
Mark DePristo 326f429270 Bugfixes to make new AFCalc system pass integrationtests
-- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests
-- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel, return a null VC when the only allele we get from our final alleles to use method is the reference base
-- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF == 0) = X for some meaningful value of X.  Added unit test to ensure this behavior is correct
-- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order.  Add contract to ensure this doesn't ever happen again
-- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model
--
2012-10-21 12:42:31 -04:00
Mark DePristo 695cf83675 More docs and contracts for classes in genotyper.afcalc
-- Future protection of the output of GeneralPloidyExactAFCalc, which produces in some cases bad likelihoods (positive values)
2012-10-21 12:42:31 -04:00
Mark DePristo 99c9031cb4 Merge AFCalcResultTracker into StateTracker, cleanup
-- These two classes were really the same, and now they are actually the same!
-- Cleanuped the interfaces, removed duplicate data
-- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed)
-- Moved goodProbability and goodProbabilityVector utilities to MathUtils.  Very useful for contracts!
2012-10-21 12:42:31 -04:00
Mark DePristo 9c63cee9fc Moving pnrm to UnifiedArgumentCollection so it's available with the HaplotypeCaller 2012-10-21 12:42:31 -04:00
Eric Banks d44d5b8275 Fix RawHapMapCodec so that it can build indexes. Minor fixes to VCF codec. 2012-10-21 01:29:59 -04:00
Eric Banks 841a906f21 Adding a hidden (for now) argument to UG (and HC) that tells the caller that the incoming samples are contaminated by N% and to fix it by aggressively down-sampling all alleles. This actually works. Yes, you read that right: given that we know what N is, we can make good calls on bams that have N% contamination. Only hooked up for SNPS right now. No tests added yet. 2012-10-20 23:31:56 -04:00
Eric Banks 2c624f76c8 Refactoring the Unified (and Standard) Argument Collections because it was really ugly that the subclass had to do all the cloning for the super class. The clone() method is really not recommended best practice in Java anyways, so I changed it so that we use standard overloaded constructors. Confirmed that the Haplotype Caller --help docs do not include UG-specific arguments. 2012-10-20 20:35:54 -04:00