gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mauricio Carneiro	3504f71b6b	Fixing a null pointer exception bug for DEV-10	2012-10-18 13:58:38 -04:00
Mark DePristo	d3fc797cfe	SelectVariants is actually NOT NanoSchedulable	2012-10-18 10:42:20 -04:00
Mark DePristo	f20fa9d082	SelectVariants is actually NanoSchedulable	2012-10-18 10:27:05 -04:00
Mark DePristo	97abb98c0b	Bugfix for bad nt / nct argument detection in MicroScheduler	2012-10-18 10:27:05 -04:00
Eric Banks	54f698422c	Better implementation for getSoftEnd() in GATKSAMRecord	2012-10-18 09:01:51 -04:00
Ami Levy Moonshine	acc0fb2f7a	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-17 22:16:02 -04:00
Mauricio Carneiro	32ee2c7dff	Refactored the compression interface per sample in ReduceReads. The CompressionStash is now responsible for keeping track of all intervals that must be kept uncompressed by all samples. In general this is a list generated by a tumor sample that will enforce all normal samples to abide. - Updated ReduceReads integration tests - Sliding Window is now using the CompressionStash (single sample). DEV-104 #resolve #time 3m	2012-10-17 16:40:40 -04:00
Mauricio Carneiro	b57df6cac8	Bringing CMI changes into the main GATK repo. Merge remote-tracking branch 'cmi/master'	2012-10-17 15:23:19 -04:00
Mark DePristo	8288c30e36	Use buffered output for ExactCallLogger	2012-10-17 14:15:11 -04:00
Mark DePristo	c9e7a947c2	Improve interface of ExactCallLogger, use it to have a more informative AFCalcPerformanceTest	2012-10-17 14:15:11 -04:00
David Roazen	b30e2a5b7d	BQSR: tool to profile the effects of more-granular locking on scalability by # of threads	2012-10-16 14:43:16 -04:00
Mark DePristo	9bcefadd4e	Refactor ExactCallLogger into a separate class -- Update minor integration tests with NanoSchedule due to qual accuracy update	2012-10-16 13:30:09 -04:00
Mark DePristo	c74d7061fe	Added AFCalcResultUnitTest -- Ensures that the posteriors remain within reasonable ranges. Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good. Now tests ensure that the normalization process preserves log10 precision where possible -- Updated MathUtils to make this possible	2012-10-16 08:11:06 -04:00
Mark DePristo	9b0ab4e941	Cleanup IndependentAllelesDiploidExactAFCalc -- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all -- Added lots of contracts and docs, still more to come. -- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't just do the default thing -- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.	2012-10-16 08:11:06 -04:00
Mark DePristo	6bd0ec8de4	Proper likelihoods and posterior probability of the joint allele frequency in IndependentAllelesDiploidExactAFCalc -- Fixed minor numerical stability issue in AFCalcResult -- posterior of joint A/B/C is 1 - (1 - P(D \| AF_b == 0)) x (1 - P(D \| AF_c == 0)), for any number of alleles, obviously. Now computes the joint posterior like this, and then back-calculates likelihoods that generate these posteriors given the priors. It's not pretty but it's the best thing to do	2012-10-16 08:11:06 -04:00
Mark DePristo	d1511e38ad	Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest -- Superseded by IndependentAFCalc -- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it. -- A few misc. bug fixes discovered during running the performance test	2012-10-16 08:11:06 -04:00
kshakir	9fcf71c031	Updated google reflections due to stale slf4j version conflicting with other projects also trying to use Queue as a component. Added targets to build.xml to effectively 'mvn install' packaged GATK/Queue from ant. TODO: Versions during 'mvn install' are hardcoded at 0.0.1 until a better versioning scheme that works with maven dependencies has been identified.	2012-10-16 02:22:30 -04:00
Ryan Poplin	31be807664	Updating missed integration test.	2012-10-15 22:31:52 -04:00
Ryan Poplin	d27ae67bb6	Updating the multi-step UG integration test.	2012-10-15 22:30:01 -04:00
kshakir	213cc00abe	Refactored argument matching to support other plugins in addition to file lists. Added plugin support for sending Queue status messages. Argument parsing can store subclasses of java.io.File, for example RemoteFile.	2012-10-15 15:10:45 -04:00
Mauricio Carneiro	80d92e0c63	Allowing the GATK to have non-required outputs. Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation). This change enables the nWayOut parameter for the IndelRealigner and ReduceReads to operate optionally while maintaining the original single way out. [#DEV-10 transition:31 resolution:1]	2012-10-15 13:49:08 -04:00
Ryan Poplin	25be94fbb8	Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10.	2012-10-15 13:24:32 -04:00
Ami Levy Moonshine	0d93effa4d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-15 11:19:12 -04:00
Mark DePristo	57e231610b	New framework for EXACT calculations, with 3 new implementations -- Before this branch, the EXACT calculation implementation was largely based on historical choices in the UnifiedGenotyper. The code was badly organized, there were no unit tests, and the Diploid EXACT calculation was super slow O(n.samples ^ n.alt.alleles) -- Reorganized code into a single class AFCalc superclass that carries out the calculation and an AFCalcResult object that contains only the information we should expose to code users, and is well-validated. -- Implement a new model for the multi-allelic exact calculation that sweeps for each alt allele B all likelihoods into a bi-allelic model XB where X is all alleles != B, and calls these all separately using the reference bi-allelic model. It produces identical quals for the bi-allelic case but slightly different results for multi-allelics due to a genuine model difference in that this Independent model doesn't penalize fully all genotype configurations as occurs in the Reference multi-allelic implementation. However, it seems after much debate that the reference model is doing the wrong thing, so in fact the Independent model seems correct. This code isn't the default implementation yet, simply because I want to do some cleanup and discuss with the methods group before enabling. -- Constrained search model implemented, but will be deleted in a subsequent code cleanup -- Massive (40K) suite of unit tests for the exact models, which are passing for the reference and the independent alleles exact model. -- Restored -- but isn't 100% hooked up -- the original clean bi-allelic model for Ryan to pass his optimized logless version on. -- The only way to create these AFCalc objects is through an AFCalcFactory, which again validates its arguments. The AFCalcFactory.Calculation enum exposes calculations to the UG / HC as the AFModel. -- Separated AFCalc from UG, into its own package that could in principle be pushed into utils now -- Created a simple main[] function to run performance tests of the EXACT model.	2012-10-15 08:32:32 -04:00
Mark DePristo	dcf8af42a8	Finalizing IndependentAllelesDiploidExactAFCalc -- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors -- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef. Not ideal. Finally caught by some tests, but good god it almost made it into the code -- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0 -- Massive new suite of unit tests to ensure that bi-allelic and tri-allelic events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly. ID'd some of the bugs below -- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests -- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when there were no informative GLs.	2012-10-15 08:21:03 -04:00
Mark DePristo	1ac09ca81e	More bugfixes on the way to a final push with new Exact model framework -- UnifiedGenotyperEngine uses only the alleles used in genotyping, not the original alleles, when considering which alleles to include in output -- AFCalcFactory has a more informative info message when looking for and selecting an exact model to use in genotyping	2012-10-15 07:53:57 -04:00
Mark DePristo	6b639f51f0	Finalizing new exact model and tests -- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior. -- Tests that theta^n.alt.alleles is being applied correctly -- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java -- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine	2012-10-15 07:53:57 -04:00
Mark DePristo	cb857d1640	AFCalcs must be made by factory method now -- AFCalcFactory is the only way to make AFCalcs now. There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions. The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles. -- AFCalc no longer has UAC constructor -- only AFCalcFactory does. Code cleanup throughout -- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next) -- It's now possible to run the UG / HC with any of the exact models currently in the system. -- Code cleanup throughout the system, reorganizing the unit tests in particular	2012-10-15 07:53:56 -04:00
Mark DePristo	6bbe750e03	Continuing work on IndependentAllelesDiploidExactAFCalc -- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly. A long way towards the right answer now, but still not there -- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan -- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there -- New convenience method in VariantContext that looks up the allele index in the alleles	2012-10-15 07:53:56 -04:00
Mark DePristo	176b74095d	Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation -- Still not working, but I know what's wrong -- Many tests disabled, that need to be re-enabled	2012-10-15 07:53:56 -04:00
Mark DePristo	91aeddeb5a	Steps on the way to a fully described and semantically meaningful AFCalcResult -- AFCalcResult now sports a isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold -- Lots of contracts for AFCalcResult -- Slowly killing off AFCalcResultsTracker -- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele -- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system -- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.	2012-10-15 07:53:56 -04:00
Mark DePristo	4f1b1c4228	Intermediate commit II on simplifying AFCalcResult -- All of the code now uses the AFCalc object, not the package protected AFCalcResultTracker. Nearly all unit tests pass (except for a contract failing one that will be dealt with in subsequent commit), due to -Infinity values from normalizeLog10. -- Changed the way that UnifiedGenotyper decides if the best model is non-ref. Previously looked at the MAP AC, but the MAP AC values are no longer provided by AFCalcResult. This is on purpose, because the MAP isn't a meaningful quantity for the exact model (i.e., everything is going to go to MLE AC in some upcoming commit). If you want to understand why come talk to me. Now uses the isPolymorphic function and the EMIT confidence, so that if pNonRef > EMIT then the site is poly, otherwise it's mono.	2012-10-15 07:53:56 -04:00
Mark DePristo	06687bfaf6	Intermediate commit on simplifying AFCalcResult -- Renamed old class AFCalcResultTracker. This object is now allocated by the AFCalc itself, since it is heavy-weight and was badly optimized in the UG with a thread-local variable. Now, since there's already a AFCalc thread-local there, we get that optimization for free. -- Removed the interface to provide the AFCalcResultTracker to getlog10PNonRef. -- Wrote new, clean but unused AFCalcResult object that will soon replace the tracker as the external interface to the AFCalc model results, leaving the tracker as an internal tracker structure. This will allow me to (1) finally test things exhaustively, as the contracts on this class are clear (2) finalize the IndependentAllelesDiploidExactAFCalc class as it can work with a meaningfully defined result across each object	2012-10-15 07:53:56 -04:00
Mark DePristo	c82aa01e0e	Generalize testing infrastructure to allow us to run specific n.samples calculation	2012-10-15 07:53:55 -04:00
Mark DePristo	ec935f76f6	Initial implementation and tests for IndependentAllelesDiploidExactAFCalc -- This model separates each of N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic case, and runs the exact model on each independently to handle the multi-allelic case. This is very fast, scaling at O(n.alt.alleles x n.samples) -- Many outstanding TODOs in order to truly pass unit tests -- Added proper unit tests for the pNonRef calculation, which all of the models pass	2012-10-15 07:53:55 -04:00
Mark DePristo	ee2f12e2ac	Simpler naming convention for AlleleFrequencyCalculation => AFCalc	2012-10-15 07:53:55 -04:00
Mark DePristo	cf3f9d6ee8	Reorganize and cleanup AFCalculations -- Now contained in a package called afcalc -- Extracted standalone classes from private static classes in ExactAF -- Most fields are now private, with accessors -- Overall cleaner organization now	2012-10-15 07:53:55 -04:00
Mark DePristo	13211231c7	Restructure and cleanup ExactAFCalculations -- Now there's no duplication between exact old and constrained models. The behavior is controlled by an overloaded abstract function -- No more static function to access the linear exact model -- you have to create the surrounding class. Updated code in the system -- Everything passes unit tests	2012-10-15 07:53:54 -04:00
Mark DePristo	f800f3fb88	Optimized diploid exact AF calculation uses maxACs to stop the calculation by maxAC by allele -- Added unit tests to ensure the approximation isn't so far from our reference implementation (DiploidExactAFCalculation)	2012-10-15 07:53:54 -04:00
Mark DePristo	efad215edb	Greedy version of function to compute the max achievable AC for each alt allele -- walks over the genotypes in VC, and computes for each alt allele the maximum AC we need to consider in that alt allele dimension. Does the calculation based on the PLs in each genotype g, choosing to update the max AC for the alt alleles corresponding to that PL. Only takes the first lowest PL, if there are multiple genotype configurations with the same PL value. It takes values in the order of the alt alleles.	2012-10-15 07:53:54 -04:00
Mark DePristo	7666a58773	Function to compute the max achievable AC for each alt allele -- Additional minor cleanup of ExactAFCalculation	2012-10-15 07:53:53 -04:00
Eric Banks	a8efa5451a	Protect against bad bases when users have screwy data (or try to use zipped references)	2012-10-12 15:05:03 -04:00
Eric Banks	81532a0529	Missing files are user errors.	2012-10-12 09:48:12 -04:00
Eric Banks	fa77a83783	Update the out of space error to include another permutation	2012-10-12 09:38:12 -04:00
Eric Banks	85525d9e6e	Make Geraldine's life easier: from now on we treat problems where a temp file cannot be found when running the GATK with multiple threads as User Errors (since they are 99.9% of the time). This is an extremely large class of errors in Tableau and on the forums. Helpful error message tells users exactly what we tell them on the forums anyways (Geraldine: feel free to edit).	2012-10-12 09:19:50 -04:00
Eric Banks	ad60300bee	Catch malformed BAM files at the source since this is the largest class of errors in Tableau.	2012-10-12 09:07:57 -04:00
Eric Banks	593c8065d9	Fix docs for BadMateFilter	2012-10-12 08:35:45 -04:00
Christopher Hartl	6b9987cf1b	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-10-12 00:48:42 -04:00
David Roazen	3861212dab	Fix inefficiency in FilePointer GenomeLoc validation Validation of GenomeLocs in the FilePointer class was extremely inefficient when the GenomeLocs were added one at a time rather than all at once. Appears to mostly fix GSA-604	2012-10-11 19:55:14 -04:00
Ami Levy Moonshine	ef3882f439	PhaseByTransmission: small typo. variantCallQC_summaryTablesOnly.R: small changes (more to come). GeneralCallingPipeline.scala: the new pipeline script. It is not as clean as I want it to be, but it works. I am still going to work on it a little bit more. Also, it does not yet include: (1) the RR step (2) need better eval step (3) need to include other targets (currently it works on the CEU Trio)	2012-10-11 14:51:41 -04:00
Ryan Poplin	08b8ce6903	Fixing merge conflicts related to the comment formatting in the BQSR.	2012-10-10 16:03:58 -04:00
Ryan Poplin	45717349dc	Fixing BQSR bug reported on the forum for reads that begin with insertions.	2012-10-10 16:01:37 -04:00
Ryan Poplin	2a9ee89c19	Turning on allele trimming for the haplotype caller.	2012-10-10 10:47:26 -04:00
Ryan Poplin	b543bddbb7	Fixing merge conflicts related to the comment formatting in the BQSR.	2012-10-08 10:23:08 -04:00
Ryan Poplin	b3cc04976f	Fixing BQSR bug reported on the forum for reads that begin with insertions.	2012-10-08 10:18:29 -04:00
Eric Banks	36a26a7da6	md5s failed because I forgot to add --no_cmdline_in_header so it is different depending on where you run from. Fixed.	2012-10-07 08:35:55 -04:00
Eric Banks	a5aaa14aaa	Fix for GSA-601: Indels dropped during liftover. This was a true bug that was an effect of the switch over to the non-null representation of alleles in the VariantContext. Unfortunately, this tool didn't have integration tests - but it does now.	2012-10-07 01:19:52 -04:00
Eric Banks	82e40340c0	Use StringBuilder over StringBuffer	2012-10-07 00:02:15 -04:00
Eric Banks	5d6aad67e2	Fix for bug reported on forums: VariantsToTable does not handle lists and nested arrays correctly. Added an integration test to cover printing of PLs.	2012-10-07 00:01:27 -04:00
Eric Banks	e7798ddd2a	Fix for JIRA GSA-598: AD field not handled properly by CombineVariants. It was also not handled by SelectVariants either. We now strip the AD field out whenever combining/selecting makes it invalid due to a changing of the number of ALT alleles.	2012-10-06 23:02:36 -04:00
Eric Banks	bfc551f612	Fix for GSA-589: SelectVariants with -number gives biased results. The implementation was not good and it's not worth keeping this busted code around given that we have a working implementation of a fractional random sampling already in place, so I removed it.	2012-10-06 22:39:49 -04:00
Eric Banks	e8a6460a33	After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too.	2012-10-05 16:37:42 -04:00
Eric Banks	52326942cf	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-10-05 16:15:07 -04:00
Eric Banks	04853252a0	Possible fix for reduced reads coming from the HaplotypeCaller in the AD	2012-10-05 16:15:04 -04:00
Yossi Farjoun	d419a33ed1	* Added an integration test for AD annotation in the Haplotype caller. * Corrected FS Annotation for UG as for AD. * HC still does not annotate ReducedReads correctly (for FS nor AD)	2012-10-05 15:23:59 -04:00
Yossi Farjoun	dc4dcb4140	fixed AD annotation for a ReducedReads BAM file. Added an integration test for this case with a new reduced BAM in private/testdata	2012-10-05 14:20:07 -04:00
Eric Banks	c66ef17cd0	Add a separate max alt alleles argument for indels that defaults to 2 instead of 3. PLEASE TAKE NOTE.	2012-10-04 13:52:14 -04:00
Mark DePristo	b6e20e083a	Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact -- Will be removed. Only committing now to fix public -> private dependency	2012-10-03 20:16:38 -07:00
Mark DePristo	51cafa73e6	Removing public -> private dependency	2012-10-03 20:05:03 -07:00
Mark DePristo	f6a2ca6e7f	Fixes / TODOs for meaningful results with AFCalculationResult -- Right now the state of the AFCalculationResult can be corrupt (ie, log10 likelihoods can be -Infinity). Forced me to disable reasonable contracts. Needs to be thought through -- exactCallsLog should be optional -- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.	2012-10-03 19:55:12 -07:00
Mark DePristo	50e4a832ea	Generalize framework for evaluating the performance and scaling of the ExactAF models to tri-allelic variants -- Wow, big performance problems with multi-allelic exact model!	2012-10-03 19:55:11 -07:00
Mark DePristo	3663fe1555	Framework for evaluating the performance and scaling of the ExactAF models	2012-10-03 19:55:11 -07:00
Mark DePristo	17ca543937	More ExactModel cleanup -- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time. -- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live -- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs	2012-10-03 19:55:11 -07:00
Mark DePristo	f8ef4332de	Count the number of evaluations in AFResult; expand unit tests -- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples -- Added unittests for priors (flat and human) -- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)	2012-10-03 19:55:11 -07:00
Mark DePristo	33c7841c4d	Add tests for non-informative samples in ExactAFCalculationModel	2012-10-03 19:55:11 -07:00
Mark DePristo	de941ddbbe	Cleanup Exact model, better unit tests -- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic). More assert statements to ensure quality of the result. -- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts. Made mutation functions all protected -- No longer need to call reset on your AlleleFrequencyCalculationResult -- it's done for you in the calculation function. reset is a protected method now, so it's all cleaner and nicer this way -- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef	2012-10-03 19:55:11 -07:00
Mark DePristo	3e01a76590	Clean up AlleleFrequencyCalculation classes -- Added a true base class that only does truly common tasks (like manage call logging) -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation -- All unit tests pass	2012-10-03 19:55:11 -07:00
Mark DePristo	1c52db4cdd	Add exactCallsLog output file to ExactModel and StandardCallerArgumentCollection -- This allows us to log all of the information about the exact model call (alleles, priors, PLs, result, and runtime) to a file for later debugging / optimization	2012-10-03 19:55:11 -07:00
Christopher Hartl	ca31ddf2a5	Allow VCFs without PLs to be converted to a bed file with genotypes other than no-call (by setting the minimum GQ to <=0). Performance enhancements to GRM suite.	2012-10-03 21:36:35 -04:00
Christopher Hartl	1be8a88909	Changes: 1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest. 2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested. 3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test. 4) Updates to PlinkBed-handling python scripts and utilities. 5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.	2012-10-03 16:02:42 -04:00
David Roazen	118e974731	GATK Engine: special-case "monolithic" FilePointers, and allow them to represent multiple contigs Sometimes the GATK engine creates a single monolithic FilePointer representing all regions in all BAM files. In such cases, the monolithic FilePointer is the only FilePointer emitted by the BAMScheduler, and it's safe to allow it to contain regions and intervals from multiple contigs. This fixes support for reading unindexed BAM files (since an unindexed BAM is one case in which the engine creates a monolithic FilePointer).	2012-10-02 15:30:03 -04:00
David Roazen	a96ed385df	ReadShard.getReadsSpan(): handle case where shard contains only unmapped mates Nasty, nasty bug -- if we were extremely unlucky with shard boundaries, we might end up with a shard containing only unmapped mates of mapped reads. In this case, ReadShard.getReadsSpan() would not behave correctly, since the shard as a whole would be marked "mapped" (since it refers to mapped intervals) yet consist only of unmapped mates of mapped reads located within those intervals.	2012-10-02 13:50:00 -04:00
David Roazen	ac87ed47bb	BQSR: allow logging recal table updates to a file For testing/debugging purposes only	2012-10-01 14:18:34 -04:00
Christopher Hartl	2508b0f5a7	Merged bug fix from Stable into Unstable	2012-09-29 00:57:43 -04:00
Christopher Hartl	365f1d2429	hmk123's error on the forum came from the reference context occasionally lacking bases needed for validating the reference bases in the variant context. (no @Window for VariantsToBinaryPed). This bugfix addresses this and other minor items: 1) ValidateVariants removed in favor of direct validation of VariantContexts. Integration test added to test broken contexts. 2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases. 3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.	2012-09-29 00:55:31 -04:00
David Roazen	e740977994	GATK Engine: do not merge FilePointers that span multiple contigs This affects both the non-experimental and experimental engine paths, and so may break tests, but this is a necessary change.	2012-09-27 18:02:25 -04:00
David Roazen	e82946e5c9	ExperimentalReadShardBalancer: create one monolithic FilePointer per contig Merge all FilePointers for each contig into a single, merged, optimized FilePointer representing all regions to visit in all BAM files for a given contig. This helps us in several ways: -It allows us to create a single, persistent set of iterators for each contig, finally and definitively eliminating all Shard/FilePointer boundary issues for the new experimental ReadWalker downsampling -We no longer need to track low-level file positions in the sharding system (which was no longer possible anyway given the new experimental downsampling system) -We no longer revisit BAM file chunks that we've visited in the past -- all BAM file access is purely sequential -We no longer need to constantly recreate our full chain of read iterators There are also potential dangers: -We hold more BAM index data in memory at once. Given that we merge and optimize the index data during the merge, and only hold one contig's worth of data at a time, this does not appear to be a major issue. TODO: confirm this! -With a huge number of samples and intervals, the FilePointer merge operation might become expensive. With the latest implementation, this does not appear to be an issue even with a huge number of intervals (for one sample, at least), but if it turns out to be a problem for > 1 sample there are things we can do. Still TODO: unit tests for the new FilePointer.union() method	2012-09-27 14:47:54 -04:00
Christopher Hartl	55cdf4f9b7	Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release.	2012-09-27 00:13:32 -04:00
Eric Banks	caa431c367	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-24 21:46:36 -04:00
David Roazen	0b488cce66	ExperimentalReadShardBalancer: close() exhausted iterators Fixes a truly awful SAMReaders resource leak reported by Eric -- thanks Eric!	2012-09-24 14:52:59 -04:00
Mark DePristo	9fd30d6f1c	When writing the initial commit for nt + nct I realized this class was really just a ThreadGroupOutputTracker -- The code is cleaner and the logic more obvious now.	2012-09-24 14:15:36 -04:00
Mark DePristo	3e8d992828	Remove bad error test from MicroScheduler, as it's no longer applicable.	2012-09-24 14:15:36 -04:00
Mark DePristo	a6b3497eac	Fixes GSA-515 Nanoscheduler GSA-577 -nt and -nct together appear to not close resources properly -- Fixes monster bug in the way that traversal engines interacted with the NanoScheduler via the output tracker. -- ThreadLocalOutputTracker is now a ThreadBasedOutputTracker that associates via a map from a master thread -> the storage map. Lookups occur by walking through threads in the same thread group, not just the thread itself (TBD -- should have a map from ThreadGroup instead) -- Removed unnecessary debug statement in GenomeLocParser -- nt and nct officially work together now	2012-09-24 14:15:35 -04:00
Mark DePristo	4749fc114f	Temp. disable -nt > 1 and -nct > 1 while bugs are worked out	2012-09-24 14:15:35 -04:00
Mark DePristo	09bbd2c4c3	Include exception in VCFWriter when one is found when rethrowing as ReviewedStingException	2012-09-24 14:15:35 -04:00
Mark DePristo	10a6b57be6	Fix thread name: should be master executor not input	2012-09-24 14:15:35 -04:00
Eric Banks	9464dfdbf2	Don't penalize the reduced reads for spanning deletions (when surrounding base quals are Q2s)	2012-09-24 14:06:07 -04:00
Eric Banks	1509153b4b	Adding my little walker to assess reduced bam coverage against the original bam because it's turning out to be very useful.	2012-09-23 00:47:40 -04:00
Eric Banks	74bb4e2739	Fixing the VariantContextUtilsUnitTest	2012-09-22 23:24:55 -04:00
Eric Banks	25e3ea879a	Oops, missed this test before when updating md5s	2012-09-22 22:16:35 -04:00
David Roazen	f6a22e5f50	ExperimentalReadShardBalancerUnitTest was being skipped; fixed TestNG skips tests when an exception occurs in a data provider, which is what was happening here. This was due to an AWFUL AWFUL use of a non-final static for ReadShard.MAX_READS. This is fine if you assume only one instance of SAMDataSource, but with multiple tests creating multiple SAMDataSources, and each one overwriting ReadShard.MAX_READS, you have a recipe for problems. As a result of this the test ran fine individually, but not as part of the unit test suite. Quick fix for now to get the tests running -- this "mutable static" interface should really be refactored away though, when I have time.	2012-09-22 01:56:39 -04:00
David Roazen	e077347cc2	Re-allow running the GATK with experimental downsampling It's now possible to run with experimental downsampling enabled using the --enable_experimental_downsampling engine argument. This is scheduled to become the GATK-wide default next week after diff engine output for failing tests has been examined.	2012-09-21 23:20:46 -04:00
David Roazen	34eed20aa6	PerSampleDownsamplingReadsIterator: fix for incorrect use of DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL Notify all downsamplers in our pool of the current global genomic position every DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL position changes, not every single positional change after that threshold is first reached.	2012-09-21 22:43:39 -04:00
David Roazen	133085469f	Experimental, downsampler-friendly read shard balancer -Only used when experimental downsampling is enabled -Persists read iterators across shards, creating a new set only when we've exhausted the current BAM file region(s). This prevents the engine from revisiting regions discarded by the downsamplers / filters, as could happen in the old implementation. -SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip out all related code when the engine fork is collapsed. -Defensive implementation that assumes BAM file regions coming out of the BAM Schedule can overlap; should be able to improve performance if we can prove they cannot possibly overlap. -Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back once confidence in the code is gained	2012-09-21 22:17:58 -04:00
Guillermo del Angel	ab8fa8f359	Bug fix: AlleleCount stratification in VariantEval didn't support higher ploidy and was producing bad tables	2012-09-21 20:48:12 -04:00
Mark DePristo	5d758bf97f	Better run a shorter test -- should take 3 minutes total	2012-09-20 18:54:14 -04:00
Mark DePristo	b5fa848255	Fix GSA-515 Nanoscheduler GSA-573 -nt and -nct interact badly w.r.t. output -- See https://jira.broadinstitute.org/browse/GSA-573 -- Uses InheritedThreadLocal storage so that children threads created by the NanoScheduler see the parent stubs in the main thread. -- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.	2012-09-20 18:45:16 -04:00
Mark DePristo	90b7df46cf	Add invocation count and shorter timeout to NanoSchedulerUnitTest	2012-09-20 18:45:16 -04:00
Mark DePristo	ba9e95a8fe	Revert "Reorganized NanoScheduler so that main thread does the reduces" Doesn't actually fix the problem, and adds an unnecessary delay in closing down NanoScheduler, so reverting. This reverts commit 66b820bf94ae755a8a0c71ea16f4cae56fd3e852.	2012-09-20 18:45:15 -04:00
Mark DePristo	7425ab9637	Reorganized NanoScheduler so that main thread does the reduces -- Enables us to run -nt 2 -nct 2 and get meaningful output -- Uses a sleep / poll mechanism. Not ideal -- will look into wait / notify instead.	2012-09-20 18:45:15 -04:00
Eric Banks	747694f7c2	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-20 14:14:58 -04:00
Eric Banks	1316b579f0	Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table.	2012-09-20 14:14:34 -04:00
Christopher Hartl	c492185be6	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-09-20 12:56:07 -04:00
Christopher Hartl	d25579deeb	A couple of minor things. 1) Better documentation on the meta data file for VariantsToBinaryPed with examples of each file type 2) MannWhitneyU can now take an argument on creation to turn off dithering. This pertains to JIRA-GSA-571 but does not fix it, as it isn't hooked up to the command line. Next step is to add an argument to the command line where it's accessible to the annotation classes (e.g. from either UG or the VariantAnnotator). 3) Added some dumb python scripts to deal with Plink files, and a script to convert plink binaries to VCF to help sanity check. Basically if you want to do an analysis on genotype data stored in plink binary format, your choices are: 1) Add a new module to Plink [difficulty rating: Impossible -- code obfuscation] 2) Steal plink parsing code from software (Plink/PlinkSeq/GCTA/Emacks/etc) that reads the files [difficulty rating: Oppressive -- code not modularized at all] 3) Write your own dumb stuff [difficulty rating: Annoying] What's been added is the result of 3. It's a library so nobody else has to do this, so long as they're comfortable with python.	2012-09-20 12:48:13 -04:00
Eric Banks	2e6f533996	Adding both unit and integration tests to cover the previous edge case of mismatched PLs	2012-09-20 11:55:28 -04:00
Eric Banks	4b7edc72d1	Fixing edge case bug in the Exact model (both standard and generalized) where we could abort prematurely in the special case of multiple polymorphic alleles and samples with widely different depths of coverage (e.g. exome and low-pass). In these cases it was possible to call the site bi-allelic when in fact it was multi-allelic (but it wouldn't cause it to create a monomorphic call).	2012-09-20 10:59:42 -04:00
Ryan Poplin	ccb65a03e8	sorry, non-ASCII characters annoy some computers.	2012-09-20 10:14:48 -04:00
Mark DePristo	087247f1f0	Allow longs and doubles in recalibration report to allow some backward compatibility	2012-09-19 19:23:44 -04:00
Mark DePristo	2267b722b2	Proper error handling in NanoScheduler -- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker -- ErrorTracker is now used throughout the NanoScheduler. In order to properly handle errors, the work previously done by main thread (submit jobs, block on reduce) is now handled in a separate thread. The main thread simply wakes up periodically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately. -- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers -- Added NanoSchedulerUnitTest for input errors -- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option. This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results. -- Build.xml no longer copies the unittest results verbosely	2012-09-19 17:03:13 -04:00
Mark DePristo	773af05980	Intermediate commit for proper error handling in the NanoScheduler -- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler -- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled) -- Added unit tests for failing input iterator in NanoScheduler (fails) -- Made ErrorThrowing NanoScheduable	2012-09-19 17:03:13 -04:00
Mark DePristo	d2046b67b1	Remove problematic @Ensures from InputProducer. -- We need to figure out why CoFoJa is broken in the NanoScheduler	2012-09-19 17:03:13 -04:00
Mark DePristo	33fabb8180	Final V3 version of NanoScheduler -- Fixed basic bugs in tracking of input -> map -> reduce jobs -- Simplified classes -- Expanded unit tests	2012-09-19 17:03:12 -04:00
Mark DePristo	5734d756b5	Remove problematic @Invariant from EOFMarkedValue	2012-09-19 17:03:12 -04:00
Mark DePristo	aa9a1e8122	Warn GATK user if the number of requested threads > available processors on the machine	2012-09-19 17:03:12 -04:00
Mark DePristo	76027d17e6	Add a few more UnitTests for InputProducer -- Cleaned up function calls for clarity	2012-09-19 17:03:12 -04:00
Mark DePristo	7605c6bcc4	Done GSA-515 Nanoscheduler / GSA-557 V3 nanoScheduler algorithm -- V3 + V4 algorithm for NanoScheduler. The newer version uses 1 dedicated input thread and n - 1 map/reduce threads. These MapReduceJobs perform map and a greedy reduce. The main thread's only job is to shuttle inputs from the input producer thread, enqueueing MapReduce jobs for each one. We manage the number of map jobs now via a Semaphore instead of a BlockingQueue of fixed size. -- This new algorithm should consume N00% CPU power for -nct N value. -- Also a cleaner implementation in general -- Vastly expanded unit tests -- Deleted FutureValue and ReduceThread	2012-09-19 17:03:12 -04:00
Mark DePristo	69e418c3f5	Intermediate commit for v3 NanoScheduling algorithm -- This version works but it blocks much more than I'd expect on input. Merging v2 and v3 to make v4 now	2012-09-19 17:03:12 -04:00
Ryan Poplin	7a7103a757	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-19 10:39:18 -04:00
Ryan Poplin	0ea543e1fd	Removing testing scaffolding from delocalized BQSR. The output recal table reports the data as doubles instead of integers. This changes the mapping-based BQSR integration tests. Final intermediate push before delocalized BQSR replaces previous BQSR.	2012-09-19 10:39:06 -04:00
Ami Levy Moonshine	ccc3f4ff8d	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-17 09:58:27 -04:00
Ami Levy Moonshine	ee0b17d98f	typo in VE	2012-09-17 09:51:51 -04:00
Eric Banks	86be50f18d	Add note to docs that the --list argument requires full command-line	2012-09-14 10:58:44 -04:00
Eric Banks	0206e09a6a	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-12 15:18:27 -04:00
Eric Banks	d94d0d15c2	Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout.	2012-09-12 15:15:40 -04:00
Eric Banks	4bb7a99f08	Given that all classes implementing output stubs already have getters for the underlying OutputStream and File, it makes sense to unify that functionality into the Stub interface. Now it is possible to have an Engine utility method that iterates over all registered stubs to find the one representing a given OutputStream and return the File associated with it.	2012-09-12 11:51:44 -04:00
Eric Banks	994a4ff387	Track all outputs from BQSR (.table, .csv, and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings).	2012-09-12 11:24:53 -04:00
Christopher Hartl	96be1cbea9	My own integration test isn't passing with a clean checkout. This fix to the walker ought to do it.	2012-09-12 10:11:06 -04:00
Christopher Hartl	546586b70e	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-12 10:09:42 -04:00
Mark DePristo	bfbf1686cd	Fixed nasty bug with defaulting to diploid no-call genotypes -- For the pooled caller we were writing diploid no-calls even when other samples were haploid. Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing. -- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all. -- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)	2012-09-12 07:08:03 -04:00
Mark DePristo	d1ba17df5d	Fixed nasty bug in BCF2 writer for case where all genotypes are missing -- Previous code was looking for a -1 result from maxPloidy() but the result was actually 0, so instead of writing a diploid no call we were actually writing "unavailable" genotypes, and failing the BCF == VCF test in integration tests. Fixed.	2012-09-12 06:46:27 -04:00
Mark DePristo	91f3204534	VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself -- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere. -- Removed addMissingSamples from VariantcontextUtils, and calls to it -- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples -- Added unit tests for this behavior in VariantContextWritersUnitTest	2012-09-12 06:46:26 -04:00
Christopher Hartl	5d19fca649	A couple of bug-fixy changes. 1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:" ones) if the user requested a sample that wasn't present in the VCF. The walker now checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning, and does the appropriate subsetting. Added integration tests for this. 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).	2012-09-11 23:01:00 -04:00
David Roazen	6fad0f25bb	Merge Eric's LocusIteratorByStateUnitTest changes into LocusIteratorByStateExperimentalUnitTest	2012-09-11 10:47:09 -04:00
Mark DePristo	e25e617d1a	Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency -- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use. So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.	2012-09-11 07:38:34 -04:00
Mark DePristo	d6e42d839c	Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly.	2012-09-10 20:14:14 -04:00
Mark DePristo	641c6a361e	Fix nasty memory leak in new data thread x cpu thread parallelism -- Basically you cannot safely use instance specific ThreadLocal variables, as these cannot be safely cleaned up. The old implementation kept pointers to old writers, with huge tribble block indexes, and eventually we crashed out of integration tests -- See http://weblogs.java.net/blog/jjviana/archive/2010/06/10/threadlocal-thread-pool-bad-idea-or-dealing-apparent-glassfish-memor for more information -- New implementation uses a borrow/return schedule with a list of N TraversalEngines managed by the MicroScheduler directly.	2012-09-10 20:14:14 -04:00
Mark DePristo	195cf6df7e	Attempting to fix out of memory errors with new traversal engine creator	2012-09-10 20:14:14 -04:00
Mark DePristo	f713d400e2	Fixed GSA-515 Nanoscheduler GSA-555 / Make NT and NCT work together -- Can now say -nt 4 and -nct 4 to get 16 threads running for you! -- TraversalEngines are now ThreadLocal variables in the MicroScheduler. -- Misc. code cleanup, final variables, some contracts.	2012-09-10 20:14:14 -04:00
Mark DePristo	233f70f8ba	Final cleanup of TraversalProgressMeters, moved to utils.progressmeter -- TraversalProgressMeter now completely generalized, named ProgressMeter in utils.progressmeter. Now just takes "nRecordsProcessed" as an argument to print reads. Completely removes dependence on complex data structures from TraversalProgressMeter. Can be used to measure progress on any task with processing units in genomic locations. -- A fairly simple class with no dependency on GATK engine or other features. -- Currently only used by the TraversalEngine / MicroScheduler but could be used for any purpose now, really.	2012-09-10 20:14:14 -04:00
Mark DePristo	2e94a0a201	Refactor TraversalEngine to extract the progress meter functions -- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance. The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves. -- Because the progress metering code was horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class. I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time. It should be possible to write some nice tests for it now. By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before -- Cleaned up the start up of the progress meter. It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer -- Simplified and made clear the interface for shutting down the traversal engines. There's now a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversing is over. Nano traversals now properly shut down (was subtle bug I uncovered here). The printing of on traversal done metering is now handled by MicroScheduler -- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N). -- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome. Useful for progress meter but also probably for other infrastructure as well -- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful. The generic bean is just a shell interface with nothing in it. -- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.	2012-09-10 20:14:13 -04:00
David Roazen	d2f3d6d22f	Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)" This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.	2012-09-10 15:52:39 -04:00
Menachem Fromer	0b717e2e2e	Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)	2012-09-10 15:32:41 -04:00
Eric Banks	ac8a4dfc2d	The comprehensive LIBS unit test is now truly comprehensive (or it would be if LIBS wasn't busted). The test can handle a read with any arbitrary legal CIGAR and iterates over the elements/bases in time with the real LIBS, failing if there are any differences. I've left the few hard-coded CIGARs in there for now with a note to move to all possible permutations once we move to fix LIBS (otherwise the tests would fail now).	2012-09-10 15:04:06 -04:00
Eric Banks	d7499e0642	Updating the rank sum test documentation	2012-09-09 22:17:36 -04:00
Eric Banks	8ca205f1a9	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-07 14:26:06 -04:00
Eric Banks	b1677fc719	Fixed JIRA GSA-520 for Guillermo: when intervals with zero coverage were present, DiagnoseTargets was trying to merge them with the next interval (even if non-overlapping) which would cause problems later on when it checked to make sure that intervals were strictly overlapping.	2012-09-07 14:25:57 -04:00
Geraldine Van der Auwera	3f2a4379af	Added forum API version stub to base URL for posting GATKDocs. This will prevent bugs from occurring when Vanilla makes changes to the API as described here: http://vanillaforums.com/blog/api#configuration Based on the bug that broke the website Guide section on 9/6/12, the GATKDocs posting system will probably break in the next release if this is not applied as a bug fix.	2012-09-07 11:49:02 -04:00
Eric Banks	ed3d9b050f	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-07 11:45:09 -04:00
Eric Banks	3dc248a49d	Adding another test	2012-09-07 11:41:38 -04:00
Ryan Poplin	81b27f9db2	auto-merging to latest version	2012-09-07 11:36:47 -04:00
Eric Banks	41a8a304a0	Catch masked OutOfMemory errors as User Errors	2012-09-07 11:27:00 -04:00
Mark DePristo	f25bf0f927	EfficiencyMonitoringThreadFactoryUnitTests thing keeps timing out unnecessarily	2012-09-07 11:03:00 -04:00
Mark DePristo	d62eca5d92	Update GATKPerformanceOverTime to measure -nt and -nct	2012-09-07 10:47:29 -04:00
Mark DePristo	bf87de8a25	UnitTests for ReducerThread and InputProducer -- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order	2012-09-07 09:51:32 -04:00
Mark DePristo	8c0e3b1e0c	UnitTests for InputProducer	2012-09-07 09:15:16 -04:00
Mark DePristo	c503884958	GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper -- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry. -- Restructured the NS code for clarity. Docs everywhere. -- This is considered version 1.0	2012-09-07 09:15:16 -04:00
Mark DePristo	9d12935986	Intermediate commit for new hyper parallel NanoScheduler -- There's a logic bug now but I'll go to squash it...	2012-09-07 09:15:16 -04:00
Eric Banks	576c7280d9	Extensions to the ErrorThrowing framework for testing purposes	2012-09-06 22:03:18 -04:00
David Roazen	cb84a6473f	Downsampling: experimental engine integration -Off by default; engine fork isolates new code paths from old code paths, so no integration tests change yet -Experimental implementation is currently BROKEN due to a serious issue involving file spans. No one can/should use the experimental features until I've patched this issue. -There are temporarily two independent versions of LocusIteratorByState. Anyone changing one version should port the change to the other (if possible), and anyone adding unit tests for one version should add the same unit tests for the other (again, if possible). This situation will hopefully be extremely temporary, and last only until the experimental implementation is proven.	2012-09-06 15:03:27 -04:00
Eric Banks	6df6c1abd5	Fix for PBT to stop NPE when there are no likelihoods present	2012-09-06 13:14:18 -04:00
Mark DePristo	5ab5d8dee8	Give EfficiencyMonitoringThreadFactoryUnitTest longer to complete its tests	2012-09-05 22:08:34 -04:00
Mark DePristo	1b064805ed	Renaming -cnt to -nct for consistency	2012-09-05 21:13:19 -04:00
Mark DePristo	228bac75e4	By default do only NT tests in integration tests	2012-09-05 20:57:49 -04:00
Mark DePristo	574a8f710b	Add static boolean controlled output of individual map call timing to nanoSecond resolution	2012-09-05 17:40:02 -04:00
Mark DePristo	e11915aa0a	GSA-515 Nanoscheduler GSA-550 ThreadSafeMapReduce shouldn't be super interface of TreeReducible	2012-09-05 17:37:56 -04:00
Mark DePristo	c5f1ceaa95	All read and loci traversals go through NanoScheduler now -- The NanoScheduler is doing a good job at tracking important information like time spent in map/reduce/input etc. -- Can be disabled with static boolean in MicroScheduler if we have problems -- See GSA-515 Nanoscheduler GSA-549 Retire TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine	2012-09-05 16:38:21 -04:00
Mark DePristo	dddf148a59	Fixed bug in ThreadAllocation getTotalNumberOfThreads -- It isn't data + cpu, it's data * cpu threads.	2012-09-05 16:35:32 -04:00
Mark DePristo	225f3a0ebe	Update integration test system to allow us to differentiate between testing data and cpu parallelism	2012-09-05 16:35:00 -04:00
Mark DePristo	9bf1d138d9	New GATK argument interface for data and cpu threads -- Closes GSA-515 Nanoscheduler GSA-542 Good interface to nanoScheduler -- Old -nt means dataThreads -- New -cnt (--num_cpu_threads_per_data_thread) gives you n cpu threads for each data thread in the system -- Cleanup logic for handling data and cpu threading in HMS, LMS, and MS -- GATKRunReport reports the total number of threads in use by the GATK, not just the nt value -- Removed the io,cpu tags for nt. Stupid system if you ask me. Cleaned up the GenomeAnalysisEngine and ThreadAllocation handling to be totally straightforward now	2012-09-05 15:45:24 -04:00
Mark DePristo	1e55475adc	NanoScheduler uses ExecutorService to run input reader thread	2012-09-05 15:45:24 -04:00
Mark DePristo	71d9ebcb0d	Fix bug (introduced by me) that didn't include contig in progress meter	2012-09-05 15:45:24 -04:00
Mark DePristo	c822b7c760	Fix long-standing NPE in LMS due to inappropriate timing of initialization	2012-09-05 15:45:24 -04:00
Mark DePristo	a997c99806	Initial NanoScheduler with input producer thread	2012-09-05 15:45:24 -04:00
Mark DePristo	03dd470ec1	Test for progressFunction in NanoScheduler; bugfix for single threaded fast path	2012-09-05 15:45:23 -04:00
Mark DePristo	8cdeb51b78	Cleanup printProgress in TraversalEngine -- Separate updating cumulative traversal metrics from printing progress. There's now an updateCumulativeMetrics function and a printProgress() that only takes a current position -- printProgress now relies solely on the time since the last progress to decide if it will print or not. No longer uses the number of cycles, since this isn't reliable in the case of nano scheduling -- GenomeAnalysisEngine now maintains a pointer to the master cumulative metrics. getCumulativeMetrics never returns null, which was handled in some parts of the code but not others. -- Update all of the traversals to use the new updateCumulativeMetrics, printProgress model -- Added progress callback to nano scheduler. Every bufferSize elements this callback is invoked, allowing us to smoothly update the progress meter in the NanoScheduler -- Rename MapFunction to NanoSchedulerMap and the same for reduce.	2012-09-05 15:45:23 -04:00
Mark DePristo	d503ed97ab	Mark I NanoScheduling TraverseLoci -- Refactored TraverseLoci into old linear version and nano scheduling version -- Temp. GATK argument to say how many nano threads to use -- Can efficiently scale to 3 threads before blocking on input	2012-09-05 15:45:23 -04:00
Mark DePristo	757e6a0160	Making Pileup thread-safe -- Old version relied on out printstream magically sorting output, new version puts the print in reduce	2012-09-05 15:45:23 -04:00
Mark DePristo	d7105223fe	More debugging output for NanoScheduler when debugging is enabled	2012-09-05 15:45:23 -04:00
Mark DePristo	9823102c0c	TraverseReadsNano supports walker.filter and walker.done -- Instead of returning directly the result of map(), returns a MapResult object with the value and a reduceMe flag. -- Reduce function respects the reduceMe flag -- Code cleanup and more documentation	2012-09-05 15:45:23 -04:00
Mark DePristo	1a8f5fc374	Trivial cleanup of NanoScheduler	2012-09-05 15:45:23 -04:00
Mark DePristo	6a5a70cdf1	Done GSA-539: SimpleTimer should use System.nanoTime for nanoSecond resolution	2012-09-05 15:45:23 -04:00
Mark DePristo	59109d5eeb	NanoScheduler tracks time outside of its execute call	2012-09-05 15:45:23 -04:00
Mark DePristo	800a27c3a7	NanoScheduler tracks time within input, map, and reduce -- Helpful for understanding where the time goes to each bit of the code. -- Controlled by a local static boolean, to avoid the potential overhead in general	2012-09-05 15:45:23 -04:00
Mark DePristo	7087b22ea3	No debugging output (even conditional) for ReadTransformers in PrintReads	2012-09-05 15:45:23 -04:00
Mark DePristo	e01258b261	NanoScheduler now supports printProgress. Bugfixes to printProgress -- TraverseReadsNano prints progress at the end of each traversal unit -- Fix bugs in TraversalEngine printProgress -- Synchronize the method so we don't get multiple logged outputs when two or more HMSs call printProgress before initialization at the start! -- Fix the logic for mustPrint, which actually had the logic of mustNotPrint. Now we see the done log line that was always supposed to be there -- Fix output formatting, as the done() line was incorrectly shifting over the % complete by 1 char as 100.0% didn't fit in %4.1f -- Add clearer doc on -PF argument so that people know that the performance log can be generated to standard out if one wants	2012-09-05 15:45:23 -04:00
Mark DePristo	6055101df8	NanoScheduler no longer groups inputs, each map() call is interlaced now -- Maximizes the efficiency of the threads -- Simplifies interface (yea!) -- Reduces number of combinatorial tests that need to be performed	2012-09-05 15:45:22 -04:00
Mark DePristo	e3b4cc02aa	Done GSA-282: Unindexed traversals crash if a read goes off the end of a contig -- Already fixed in the codebase. Added unindexed bam and integration tests to ensure this is fine going forward.	2012-09-05 15:45:22 -04:00
Yossi Farjoun	ad5fa449e7	fixed a typo in the string comment	2012-09-05 14:46:10 -04:00
Ryan Poplin	84a83fd3f3	fixing typo	2012-09-05 10:41:03 -04:00
Eric Banks	fc06f39411	Fixed docs for Pileup walker	2012-09-05 09:55:34 -04:00

3045 Commits (4ced2e4ffc7d457cb9a8aad4c4aa2cb3cd3fb705)