gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Ryan Poplin	25be94fbb8	Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10.	2012-10-15 13:24:32 -04:00
Mark DePristo	dcf8af42a8	Finalizing IndependentAllelesDiploidExactAFCalc -- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors -- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef. Not ideal. Finally caught by some tests, but good god it almost made it into the code -- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0 -- Massive new suite of unit tests to ensure that bi-allelic and tri-allele events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly. ID'd some of the bugs below -- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests -- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when no there we no informative GLs.	2012-10-15 08:21:03 -04:00
Christopher Hartl	6b9987cf1b	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-10-12 00:48:42 -04:00
Eric Banks	36a26a7da6	md5s failed because I forgot to add --no_cmdline_in_header so it is different depending on where you run from. Fixed.	2012-10-07 08:35:55 -04:00
Eric Banks	a5aaa14aaa	Fix for GSA-601: Indels dropped during liftover. This was a true bug that was an effect of the switch over to the non-null representation of alleles in the VariantContext. Unfortunately, this tool didn't have integration tests - but it does now.	2012-10-07 01:19:52 -04:00
Eric Banks	5d6aad67e2	Fix for bug reported on forums: VariantsToTable does not handle lists and nested arrays correctly. Added an integration test to cover printing of PLs.	2012-10-07 00:01:27 -04:00
Eric Banks	e7798ddd2a	Fix for JIRA GSA-598: AD field not handled properly by CombineVariants. It was also not handled by SelectVariants either. We now strip the AD field out whenever combining/selecting makes it invalid due to a changing of the number of ALT alleles.	2012-10-06 23:02:36 -04:00
Eric Banks	e8a6460a33	After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too.	2012-10-05 16:37:42 -04:00
Yossi Farjoun	dc4dcb4140	fixed AD annotation for a ReducedReads BAM file. Added an integration test for this case with a new reduced BAM in private/testdata	2012-10-05 14:20:07 -04:00
Mark DePristo	b6e20e083a	Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact -- Will be removed. Only commiting now to fix public -> private dependency	2012-10-03 20:16:38 -07:00
Mark DePristo	f6a2ca6e7f	Fixes / TODOs for meaningful results with AFCalculationResult -- Right now the state of the AFCaclulationResult can be corrupt (ie, log10 likelihoods can be -Infinity). Forced me to disable reasonable contracts. Needs to be thought through -- exactCallsLog should be optional -- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.	2012-10-03 19:55:12 -07:00
Mark DePristo	50e4a832ea	Generalize framework for evaluating the performance and scaling of the ExactAF models to tri-allelic variants -- Wow, big performance problems with multi-allelic exact model!	2012-10-03 19:55:11 -07:00
Mark DePristo	17ca543937	More ExactModel cleanup -- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time. -- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live -- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs	2012-10-03 19:55:11 -07:00
Mark DePristo	f8ef4332de	Count the number of evaluations in AFResult; expand unit tests -- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples -- Added unittests for priors (flat and human) -- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)	2012-10-03 19:55:11 -07:00
Mark DePristo	33c7841c4d	Add tests for non-informative samples in ExactAFCalculationModel	2012-10-03 19:55:11 -07:00
Mark DePristo	de941ddbbe	Cleanup Exact model, better unit tests -- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic). More assert statements to ensure quality of the result. -- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts. Made mutation functions all protected -- No longer need to call reset on your AlleleFrequencyCalculationResult -- it'd done for you in the calculation function. reset is a protected method now, so it's all cleaner and nicer this way -- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef	2012-10-03 19:55:11 -07:00
Mark DePristo	3e01a76590	Clean up AlleleFrequencyCalculation classes -- Added a true base class that only does truly common tasks (like manage call logging) -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation -- All unit tests pass	2012-10-03 19:55:11 -07:00
Christopher Hartl	1be8a88909	Changes: 1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest. 2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested. 3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test. 4) Updates to PlinkBed-handling python scripts and utilities. 5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.	2012-10-03 16:02:42 -04:00
Christopher Hartl	2508b0f5a7	Merged bug fix from Stable into Unstable	2012-09-29 00:57:43 -04:00
Christopher Hartl	365f1d2429	hmk123's error on the forum came from the reference context occasionally lacking bases needed for validating the reference bases in the variant context. (no @Window for VariantsToBinaryPed). This bugfix adresses this and other minor items: 1) ValidateVariants removed in favor of direct validation VariantContexts. Integration test added to test broken contexts. 2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases. 3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.	2012-09-29 00:55:31 -04:00
Christopher Hartl	55cdf4f9b7	Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release.	2012-09-27 00:13:32 -04:00
Eric Banks	74bb4e2739	Fixing the VariantContextUtilsUnitTest	2012-09-22 23:24:55 -04:00
Eric Banks	25e3ea879a	Oops, missed this test before when updating md5s	2012-09-22 22:16:35 -04:00
David Roazen	f6a22e5f50	ExperimentalReadShardBalancerUnitTest was being skipped; fixed TestNG skips tests when an exception occurs in a data provider, which is what was happening here. This was due to an AWFUL AWFUL use of a non-final static for ReadShard.MAX_READS. This is fine if you assume only one instance of SAMDataSource, but with multiple tests creating multiple SAMDataSources, and each one overwriting ReadShard.MAX_READS, you have a recipe for problems. As a result of this the test ran fine individually, but not as part of the unit test suite. Quick fix for now to get the tests running -- this "mutable static" interface should really be refactored away though, when I have time.	2012-09-22 01:56:39 -04:00
David Roazen	133085469f	Experimental, downsampler-friendly read shard balancer -Only used when experimental downsampling is enabled -Persists read iterators across shards, creating a new set only when we've exhausted the current BAM file region(s). This prevents the engine from revisiting regions discarded by the downsamplers / filters, as could happen in the old implementation. -SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip out all related code when the engine fork is collapsed. -Defensive implementation that assumes BAM file regions coming out of the BAM Schedule can overlap; should be able to improve performance if we can prove they cannot possibly overlap. -Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back once confidence in the code is gained	2012-09-21 22:17:58 -04:00
Mark DePristo	5d758bf97f	Better run a shorter test -- should take 3 minutes total	2012-09-20 18:54:14 -04:00
Mark DePristo	b5fa848255	Fix GSA-515 Nanoscheduler GSA-573 -nt and -nct interact badly w.r.t. output -- See https://jira.broadinstitute.org/browse/GSA-573 -- Uses InheritedThreadLocal storage so that children threads created by the NanoScheduler see the parent stubs in the main thread. -- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.	2012-09-20 18:45:16 -04:00
Mark DePristo	90b7df46cf	Add invocation count and shorter timeout to NanoSchedulerUnitTest	2012-09-20 18:45:16 -04:00
Mark DePristo	ba9e95a8fe	Revert "Reorganized NanoScheduler so that main thread does the reduces" Doesn't actually fix the problem, and adds an unnecessary delay in closing down NanoScheduler, so reverting. This reverts commit 66b820bf94ae755a8a0c71ea16f4cae56fd3e852.	2012-09-20 18:45:15 -04:00
Mark DePristo	7425ab9637	Reorganized NanoScheduler so that main thread does the reduces -- Enables us to run -nt 2 -nct 2 and get meaningful output -- Uses a sleep / poll mechanism. Not ideal -- will look into wait / notify instead.	2012-09-20 18:45:15 -04:00
Eric Banks	747694f7c2	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-20 14:14:58 -04:00
Eric Banks	1316b579f0	Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table.	2012-09-20 14:14:34 -04:00
Christopher Hartl	c492185be6	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-09-20 12:56:07 -04:00
Christopher Hartl	d25579deeb	A couple of minor things. 1) Better documentation on the meta data file for VariantsToBinaryPed with examples of each file type 2) MannWhitneyU can now take an argument on creation to turn off dithering. This pertains to JIRA-GSA-571 but does not fix it, as it isn't hooked up to the command line. Next step is to add an argument to the command line where it's accessible to the annotation classes (e.g. from either UG or the VariantAnnotator). 3) Added some dumb python scripts to deal with Plink files, and a script to convert plink binaries to VCF to help sanity check. Basically if you want to do an analysis on genotype data stored in plink binary format, your choices are: 1) Add a new module to Plink [difficulty rating: Impossible -- code obfuscation] 2) Steal plink parsing code from software (Plink/PlinkSeq/GCTA/Emacks/etc) that readds the files [difficulty rating: Oppressive -- code not modularized at all) 3) Write your own dumb stuff [difficutly rating: Annoying] What's been added is the result of 3. It's a library so nobody else has to do this, so long as they're comfortable with python.	2012-09-20 12:48:13 -04:00
Eric Banks	2e6f533996	Adding both unit and integration tests to cover the previous edge case of mismatched PLs	2012-09-20 11:55:28 -04:00
Mark DePristo	2267b722b2	Proper error handling in NanoScheduler -- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker -- ErrorTracker is now used throughout the NanoScheduler. In order to properly handle errors, the work previously done by main thread (submit jobs, block on reduce) is now handled in a separate thread. The main thread simply wakes up peroidically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately. -- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers -- Added NanoSchedulerUnitTest for input errors -- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option. This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results. -- Build.xml no longer copies the unittest results verbosely	2012-09-19 17:03:13 -04:00
Mark DePristo	773af05980	Intermediate commit for proper error handling in the NanoScheduler -- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler -- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled) -- Added unit tests for failing input iterator in NanoScheduler (fails) -- Made ErrorThrowing NanoScheduable	2012-09-19 17:03:13 -04:00
Mark DePristo	33fabb8180	Final V3 version of NanoScheduler -- Fixed basic bugs in tracking of input -> map -> reduce jobs -- Simplified classes -- Expanded unit tests	2012-09-19 17:03:12 -04:00
Mark DePristo	76027d17e6	Add a few more UnitTests for InputProducer -- Cleaned up function calls for clarity	2012-09-19 17:03:12 -04:00
Mark DePristo	7605c6bcc4	Done GSA-515 Nanoscheduler / GSA-557 V3 nanoScheduler algorithm -- V3 + V4 algorithm for NanoScheduler. The newer version uses 1 dedicated input thread and n - 1 map/reduce threads. These MapReduceJobs perform map and a greedy reduce. The main thread's only job is to shuttle inputs from the input producer thread, enqueueing MapReduce jobs for each one. We manage the number of map jobs now via a Semaphore instead of a BlockingQueue of fixed size. -- This new algorithm should consume N00% CPU power for -nct N value. -- Also a cleaner implementation in general -- Vastly expanded unit tests -- Deleted FutureValue and ReduceThread	2012-09-19 17:03:12 -04:00
Mark DePristo	69e418c3f5	Intermediate commit for v3 NanoScheduling algorithm -- This version works but it blocks much more than I'd expect on input. Merging v2 and v3 to make v4 now	2012-09-19 17:03:12 -04:00
Christopher Hartl	546586b70e	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-12 10:09:42 -04:00
Mark DePristo	91f3204534	VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself -- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere. -- Removed addMissingSamples from VariantcontextUtils, and calls to it -- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples -- Added unit tests for this behavior in VariantContextWritersUnitTest	2012-09-12 06:46:26 -04:00
Christopher Hartl	5d19fca649	A couple of bug-fixy changes. 1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:") ones if the user requested a sample that wasn't present in the VCF. The walker now checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning, and does the appropriate subsetting. Added integration tests for this. 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).	2012-09-11 23:01:00 -04:00
David Roazen	6fad0f25bb	Merge Eric's LocusIteratorByStateUnitTest changes into LocusIteratorByStateExperimentalUnitTest	2012-09-11 10:47:09 -04:00
Mark DePristo	e25e617d1a	Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency -- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use. So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.	2012-09-11 07:38:34 -04:00
Mark DePristo	2e94a0a201	Refactor TraversalEngine to extract the progress meter functions -- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance. The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves. -- Because the progress metering code has horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class. I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time. It should be possible to write some nice tests for it now. By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before -- Cleaned up the start up of the progress meter. It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer -- Simplified and made clear the interface for shutting down the traversal engines. There's no a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversing in over. Nano traversals now properly shut down (was subtle bug I undercovered here). The printing of on traversal done metering is now handled by MicroScheduler -- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N). -- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome. Useful for progress meter but also probably for other infrastructure as well -- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful. The generic bean is just a shell interface with nothing in it. -- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.	2012-09-10 20:14:13 -04:00
David Roazen	d2f3d6d22f	Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)" This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.	2012-09-10 15:52:39 -04:00
Menachem Fromer	0b717e2e2e	Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)	2012-09-10 15:32:41 -04:00
Eric Banks	ac8a4dfc2d	The comprehensive LIBS unit test is now truly comprehensive (or it would be if LIBS wasn't busted). The test can handle a read with any arbitrary legal CIGAR and iterates over the elements/bases in time with the real LIBS, failing if there are any differences. I've left the few hard-coded CIGARs in there for now with a note to move to all possible permutations once we move to fix LIBS (otherwise the tests would fail now).	2012-09-10 15:04:06 -04:00

1 2 3 4 5 ...

1032 Commits (25be94fbb8d7f762e1576d75c7c0a76d46bb45ef)