Commit Graph

2594 Commits (251983b8fb829c084dfb29b8d60e9bc4e636491e)

Author SHA1 Message Date
Guillermo del Angel 3db38c5a93 Bug fix: inbreeding coeff shouldn't be computed in ref-only sites 2012-10-18 15:42:14 -04:00
Ami Levy Moonshine a0381f15af Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-18 14:49:34 -04:00
Ryan Poplin b4e69239dd In order to be considered an informative read in the PerReadAlleleLikelihoodMap it has to be informative compared to all other alleles not just the worst allele. Also, fixing a bug when there is only one allele in the map. 2012-10-18 14:31:15 -04:00
Mauricio Carneiro 3504f71b6b Fixing a null pointer exception bug for DEV-10 2012-10-18 13:58:38 -04:00
Mark DePristo d3fc797cfe SelectVariants is actually *NOT* NanoSchedulable 2012-10-18 10:42:20 -04:00
Mark DePristo f20fa9d082 SelectVariants is actually NanoSchedulable 2012-10-18 10:27:05 -04:00
Mark DePristo 97abb98c0b Bugfix for bad nt / nct argument detection in MicroScheduler 2012-10-18 10:27:05 -04:00
Eric Banks 54f698422c Better implementation for getSoftEnd() in GATKSAMRecord 2012-10-18 09:01:51 -04:00
Ami Levy Moonshine acc0fb2f7a Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-17 22:16:02 -04:00
Mauricio Carneiro b57df6cac8 Bringing CMI changes into the main GATK repo.
Merge remote-tracking branch 'cmi/master'
2012-10-17 15:23:19 -04:00
Mark DePristo 8288c30e36 Use buffered output for ExactCallLogger 2012-10-17 14:15:11 -04:00
Mark DePristo c9e7a947c2 Improve interface of ExactCallLogger, use it to have a more informative AFCalcPerformanceTest 2012-10-17 14:15:11 -04:00
David Roazen b30e2a5b7d BQSR: tool to profile the effects of more-granular locking on scalability by # of threads 2012-10-16 14:43:16 -04:00
Mark DePristo 9bcefadd4e Refactor ExactCallLogger into a separate class
-- Update minor integration tests with NanoSchedule due to qual accuracy update
2012-10-16 13:30:09 -04:00
Mark DePristo c74d7061fe Added AFCalcResultUnitTest
-- Ensures that the posteriors remain within reasonable ranges.  Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good.  Now tests ensure that the normalization process preserves log10 precision where possible
-- Updated MathUtils to make this possible
2012-10-16 08:11:06 -04:00
Mark DePristo 9b0ab4e941 Cleanup IndependentAllelesDiploidExactAFCalc
-- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all
-- Added lots of contracts and docs, still more to come.
-- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't just do the default thing
-- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.
2012-10-16 08:11:06 -04:00
Mark DePristo 6bd0ec8de4 Proper likelihoods and posterior probability of the joint allele frequency in IndependentAllelesDiploidExactAFCalc
-- Fixed minor numerical stability issue in AFCalcResult
-- posterior of joint A/B/C is 1 - (1 - P(D | AF_b == 0)) x (1 - P(D | AF_c == 0)), for any number of alleles, obviously.  Now computes the joint posterior like this, and then back-calculates likelihoods that generate these posteriors given the priors.  It's not pretty but it's the best thing to do
2012-10-16 08:11:06 -04:00
Mark DePristo d1511e38ad Removing ConstrainedAFCalculationModel; AFCalcPerformanceTest
-- Superceded by IndependentAFCalc
-- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it.
-- A few misc. bug fixes discovered during running the performance test
2012-10-16 08:11:06 -04:00
kshakir 9fcf71c031 Updated google reflections due to stale slf4j version conflicting with other projects also trying to use Queue as a component.
Added targets to build.xml to effectively 'mvn install' packaged GATK/Queue from ant.
TODO: Versions during 'mvn install' are hardcoded at 0.0.1 until a better versioning scheme that works with maven dependencies has been identified.
2012-10-16 02:22:30 -04:00
kshakir 213cc00abe Refactored argument matching to support other plugins in addition to file lists.
Added plugin support for sending Queue status messages.
Argument parsing can store subclasses of java.io.File, for example RemoteFile.
2012-10-15 15:10:45 -04:00
Mauricio Carneiro 80d92e0c63 Allowing the GATK to have non-required outputs
Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation).

This change enables the nWayOut parameter for the IndeRealigner and ReduceReads to operate optionally while maintaining the original single way out.

[#DEV-10 transition:31 resolution:1]
2012-10-15 13:49:08 -04:00
Ryan Poplin 25be94fbb8 Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10. 2012-10-15 13:24:32 -04:00
Ami Levy Moonshine 0d93effa4d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-15 11:19:12 -04:00
Mark DePristo 57e231610b New framework for EXACT calculations, with new 3 new implementations
-- Before this branch, the EXACT calculation implementation was largely based on historical choices in the UnifiedGenotyper.  The code was badly organized, there were no unit tests, and the Diploid EXACT calculation was super slow O(n.samples ^ n.alt.alleles)
-- Reorganized code into a single class AFCalc superclass that carries out the calculation and an AFCalcResult object that contains only the information we should expose to code users, and is well-validated.
-- Implement a new model for the multi-allelic exact calculation that sweeps for each alt allele B all likelihoods into a bi-allelic model XB where X is all alleles != B, and calls these all separately using the reference bi-allelic model.  It produces identical quals for the bi-allelic case but slightly different results for multi-allelics due to a genuine model difference in that this Independent model doesn't penalize fully all genotype configurations as occurs in the Reference multi-allelic implementation.  However, it seems after much debate that the reference model is doing the wrong thing, so in fact the Independent model seems correct.  This code isn't the default implementation yet, simply because I want to do some cleanup and discuss with the methods group before enabling.
-- Constrained search model implemented, but will be deleted in a subsequent code cleanup
-- Massive (40K) suite of unit tests the exact models, which are passing for the reference and the independent alleles exact model.
-- Restored -- but isn't 100% hooked up -- the original clean bi-allelic model for Ryan to pass his optimized logless version on.
-- The only way to create these AFCalc objects is through an AFCalcFactory, which again validates its arguments.  The AFCalcFactory.Calculation enum exposes calculations to the UG / HC as the AFModel.
-- Separated AFCalc from UG, into its own package that could in principle be pushed into utils now
-- Created a simple main[] function to run performance tests of the EXACT model.
2012-10-15 08:32:32 -04:00
Mark DePristo dcf8af42a8 Finalizing IndependentAllelesDiploidExactAFCalc
-- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors
-- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef.  Not ideal.  Finally caught by some tests, but good god it almost made it into the code
-- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0
-- Massive new suite of unit tests to ensure that bi-allelic and tri-allele events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly.  ID'd some of the bugs below
-- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests
-- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when no there we no informative GLs.
2012-10-15 08:21:03 -04:00
Mark DePristo 1ac09ca81e More bugfixes on the way to a final push with new Exact model framework
-- UnifiedGenotyperEngine uses only the alleles used in genotyping, not the original alleles, when considering which alleles to include in output
-- AFCalcFactory has a more informative info message when looking for and selecting an exact model to use in genotyping
2012-10-15 07:53:57 -04:00
Mark DePristo 6b639f51f0 Finalizing new exact model and tests
-- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply correct theta^n.alt.allele prior.
-- Tests that theta^n.alt.alleles is being applied correctly
-- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java
-- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine
2012-10-15 07:53:57 -04:00
Mark DePristo cb857d1640 AFCalcs must be made by factory method now
-- AFCalcFactory is the only way to make AFCalcs now.  There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions.  The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles.
-- AFCalc no longer has UAC constructor -- only AFCalcFactory does.  Code cleanup throughout
-- Enabling more unit tests, all of which almost pass now (except for IndependentAllelesDiploidExactAFCalc which will be fixed next)
-- It's now possible to run the UG / HC with any of the exact models currently in the system.
-- Code cleanup throughout the system, reorganizing the unit tests in particular
2012-10-15 07:53:56 -04:00
Mark DePristo 6bbe750e03 Continuing work on IndependentAllelesDiploidExactAFCalc
-- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly.  A long way towards the right answer now, but still not there
-- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan
-- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there
-- New convenience method in VariantContext that looks up the allele index in the alleles
2012-10-15 07:53:56 -04:00
Mark DePristo 176b74095d Intermediate commit on the path to getting a working IndependentAllelesDiploidExact calculation
-- Still not work, but I know what's wrong
-- Many tests disabled, that need to be reanabled
2012-10-15 07:53:56 -04:00
Mark DePristo 91aeddeb5a Steps on the way to a fully described and semantically meaningful AFCalcResult
-- AFCalcResult now sports a isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic given some confidence threshold
-- Lots of contracts for AFCalcResult
-- Slowly killing off AFCalcResultsTracker
-- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele
-- Change in behavior for normalizeFromLog10 in MathUtils: now sets the log10 for 0 values to -10000, instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system
-- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy (not fixed, related to Eric's summation issue from long ago that was reverted) in that we get different results for diploid and general-ploidy == 2 models for multi-allelics.
2012-10-15 07:53:56 -04:00
Mark DePristo 4f1b1c4228 Intermediate commit II on simplifying AFCalcResult
-- All of the code now uses the AFCalc object, not the not package protected AFCalcResultTracker.  Nearly all unit tests pass (expect for a contract failing one that will be dealt with in subsequent commit), due to -Infinity values from normalizeLog10.
-- Changed the way that UnifiedGenotyper decides if the best model is non-ref.  Previously looked at the MAP AC, but the MAP AC values are no longer provided by AFCalcResult.  This is on purpose, because the MAP isn't a meaningful quantity for the exact model (i.e., everything is going to go to MLE AC in some upcoming commit).  If you want to understand why come talk to me.  Now uses the isPolymorphic function and the EMIT confidence, so that if pNonRef > EMIT then the site is poly, otherwise it's mono.
2012-10-15 07:53:56 -04:00
Mark DePristo 06687bfaf6 Intermediate commit on simplifying AFCalcResult
-- Renamed old class AFCalcResultTracker.  This object is now allocated by the AFCalc itself, since it is heavy-weight and was badly optimized in the UG with a thread-local variable. Now, since there's already a AFCalc thread-local there, we get that optimization for free.
-- Removed the interface to provide the AFCalcResultTracker to getlog10PNonRef.
-- Wrote new, clean but unused AFCalcResult object that will soon replace the tracker as the external interface to the AFCalc model results, leaving the tracker as an internal tracker structure.  This will allow me to (1) finally test things exhaustively, as the contracts on this class are clear (2) finalize the IndependentAllelesDiploidExactAFCalc class as it can work with a meaningfully defined result across each object
2012-10-15 07:53:56 -04:00
Mark DePristo c82aa01e0e Generalize testing infrastructure to allow us to run specific n.samples calculation 2012-10-15 07:53:55 -04:00
Mark DePristo ec935f76f6 Initial implementation and tests for IndependentAllelesDiploidExactAFCalc
-- This model separates each of N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic case, and runs the exact model on each independently to handle the multi-allelic case.  This is very fast, scaling at O(n.alt.alleles x n.samples)
-- Many outstanding TODOs in order to truly pass unit tests
-- Added proper unit tests for the pNonRef calculation, which all of the models pass
2012-10-15 07:53:55 -04:00
Mark DePristo ee2f12e2ac Simpler naming convention for AlleleFrequencyCalculation => AFCalc 2012-10-15 07:53:55 -04:00
Mark DePristo cf3f9d6ee8 Reorganize and cleanup AFCalculations
-- Now contained in a package called afcalc
-- Extracted standard alone classes from private static classes in ExactAF
-- Most fields are now private, with accessors
-- Overall cleaner organization now
2012-10-15 07:53:55 -04:00
Mark DePristo 13211231c7 Restructure and cleanup ExactAFCalculations
-- Now there's no duplication between exact old and constrained models.  The behavior is controlled by an overloaded abstract function
-- No more static function to access the linear exact model -- you have to create the surrounding class.  Updated code in the system
-- Everything passes unit tests
2012-10-15 07:53:54 -04:00
Mark DePristo f800f3fb88 Optimized diploid exact AF calculation uses maxACs to stop the calculation by maxAC by allele
-- Added unit tests to ensure the approximation isn't so far from our reference implementation (DiploidExactAFCalculation)
2012-10-15 07:53:54 -04:00
Mark DePristo efad215edb Greedy version of function to compute the max achievable AC for each alt allele
-- walks over the genotypes in VC, and computes for each alt allele the maximum AC we need to consider in that alt allele dimension.  Does the calculation based on the PLs in each genotype g, choosing to update the max AC for the alt alleles corresponding to that PL.  Only takes the first lowest PL, if there are multiple genotype configurations with the same PL value.  It takes values in the order of the alt alleles.
2012-10-15 07:53:54 -04:00
Mark DePristo 7666a58773 Function to compute the max achievable AC for each alt allele
-- Additional minor cleanup of ExactAFCalculation
2012-10-15 07:53:53 -04:00
Eric Banks a8efa5451a Protect against bad bases users have screwy data (or try to use zipped references) 2012-10-12 15:05:03 -04:00
Eric Banks 81532a0529 Missing file are user errors. 2012-10-12 09:48:12 -04:00
Eric Banks fa77a83783 Update the out of space error to include another permutation 2012-10-12 09:38:12 -04:00
Eric Banks 85525d9e6e Make Geraldine's life easier: from now on we treat problems where a temp file cannot be found when running the GATK with multiple threads as User Errors (since they are 99.9% of the time). This is an extremely large class of errors in Tableau and on the forums. Helpful error message tells users exactly what we tell them on the forums anyways (Geraldine: feel free to edit). 2012-10-12 09:19:50 -04:00
Eric Banks ad60300bee Catch malformed BAM files at the source since this is the largest class of errors in Tableau. 2012-10-12 09:07:57 -04:00
Eric Banks 593c8065d9 Fix docs for BadMateFilter 2012-10-12 08:35:45 -04:00
Christopher Hartl 6b9987cf1b Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2012-10-12 00:48:42 -04:00
David Roazen 3861212dab Fix inefficiency in FilePointer GenomeLoc validation
Validation of GenomeLocs in the FilePointer class was extremely inefficient
when the GenomeLocs were added one at a time rather than all at once.

Appears to mostly fix GSA-604
2012-10-11 19:55:14 -04:00
Ami Levy Moonshine ef3882f439 PhaseByTransmission: small typo /n. variantCallQC_summaryTablesOnly.R: small changes (more to come) /n GeneralCallingPipeline.scala: the new pipeline script. It is not as clean as I want it to be, but it works. I still going to work on it a little bit more. Also, it does not include yet: (1) the RR step (2) need better eval step (3) need to include other targets (currently it eork on the CEU Trio) 2012-10-11 14:51:41 -04:00
Ryan Poplin 08b8ce6903 Fixing merge conflicts related to the comment formatting in the BQSR. 2012-10-10 16:03:58 -04:00
Ryan Poplin 45717349dc Fixing BQSR bug reported on the forum for reads that begin with insertions. 2012-10-10 16:01:37 -04:00
Ryan Poplin 2a9ee89c19 Turning on allele trimming for the haplotype caller. 2012-10-10 10:47:26 -04:00
Ryan Poplin b543bddbb7 Fixing merge conflicts related to the comment formatting in the BQSR. 2012-10-08 10:23:08 -04:00
Ryan Poplin b3cc04976f Fixing BQSR bug reported on the forum for reads that being with insertions. 2012-10-08 10:18:29 -04:00
Eric Banks a5aaa14aaa Fix for GSA-601: Indels dropped during liftover. This was a true bug that was an effect of the switch over to the non-null representation of alleles in the VariantContext. Unfortunately, this tool didn't have integration tests - but it does now. 2012-10-07 01:19:52 -04:00
Eric Banks 82e40340c0 Use StringBuilder over StringBuffer 2012-10-07 00:02:15 -04:00
Eric Banks 5d6aad67e2 Fix for bug reported on forums: VariantsToTable does not handle lists and nested arrays correctly. Added an integration test to cover printing of PLs. 2012-10-07 00:01:27 -04:00
Eric Banks e7798ddd2a Fix for JIRA GSA-598: AD field not handled properly by CombineVariants. It was also not handled by SelectVariants either. We now strip the AD field out whenever combining/selecting makes it invalid due to a changing of the number of ALT alleles. 2012-10-06 23:02:36 -04:00
Eric Banks bfc551f612 Fix for GSA-589: SelectVariants with -number gives biased results. The implementation was not good and it's not worth keeping this busted code around given that we have a working implementation of a fractional random sampling already in place, so I removed it. 2012-10-06 22:39:49 -04:00
Eric Banks e8a6460a33 After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too. 2012-10-05 16:37:42 -04:00
Eric Banks 52326942cf Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-10-05 16:15:07 -04:00
Eric Banks 04853252a0 Possible fix for reduced reads coming from the HaplotypeCaller in the AD 2012-10-05 16:15:04 -04:00
Yossi Farjoun d419a33ed1 * Added an integration test for AD annotation in the Haplotype caller.
* Corrected FS Anotation for UG as for AD.
* HC still does not annotate ReducedReads correctly (for FS nor AD)
2012-10-05 15:23:59 -04:00
Yossi Farjoun dc4dcb4140 fixed AD annotation for a ReducedReads BAM file. Added an integration test for this case with a new reduced BAM in private/testdata 2012-10-05 14:20:07 -04:00
Eric Banks c66ef17cd0 Add a separate max alt alleles argument for indels that defaults to 2 instead of 3. PLEASE TAKE NOTE. 2012-10-04 13:52:14 -04:00
Mark DePristo b6e20e083a Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact
-- Will be removed.  Only commiting now to fix public -> private dependency
2012-10-03 20:16:38 -07:00
Mark DePristo 51cafa73e6 Removing public -> private dependency 2012-10-03 20:05:03 -07:00
Mark DePristo f6a2ca6e7f Fixes / TODOs for meaningful results with AFCalculationResult
-- Right now the state of the AFCaclulationResult can be corrupt (ie, log10 likelihoods can be -Infinity).  Forced me to disable reasonable contracts.  Needs to be thought through
-- exactCallsLog should be optional
-- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.
2012-10-03 19:55:12 -07:00
Mark DePristo 50e4a832ea Generalize framework for evaluating the performance and scaling of the ExactAF models to tri-allelic variants
-- Wow, big performance problems with multi-allelic exact model!
2012-10-03 19:55:11 -07:00
Mark DePristo 3663fe1555 Framework for evaluating the performance and scaling of the ExactAF models 2012-10-03 19:55:11 -07:00
Mark DePristo 17ca543937 More ExactModel cleanup
-- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array.  This is way heavy-weight compared to just making the array each time.
-- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object.  That's the place it should really live
-- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP.  Added TODOs
2012-10-03 19:55:11 -07:00
Mark DePristo f8ef4332de Count the number of evaluations in AFResult; expand unit tests
-- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples
-- Added unittests for priors (flat and human)
-- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)
2012-10-03 19:55:11 -07:00
Mark DePristo de941ddbbe Cleanup Exact model, better unit tests
-- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic).  More assert statements to ensure quality of the result.
-- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts.  Made mutation functions all protected
-- No longer need to call reset on your AlleleFrequencyCalculationResult -- it'd done for you in the calculation function.  reset is a protected method now, so it's all cleaner and nicer this way
-- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef
2012-10-03 19:55:11 -07:00
Mark DePristo 3e01a76590 Clean up AlleleFrequencyCalculation classes
-- Added a true base class that only does truly common tasks (like manage call logging)
   -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract
   -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact
   -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation
   -- All unit tests pass
2012-10-03 19:55:11 -07:00
Mark DePristo 1c52db4cdd Add exactCallsLog output file to ExactModel and StandardCallerArgumentCollection
-- This allows us to log all of the information about the exact model call (alleles, priors, PLs, result, and runtime) to a file for later debugging / optimization
2012-10-03 19:55:11 -07:00
Christopher Hartl ca31ddf2a5 Allow VCFs without PLs to be converted to a bed file with genotypes other than no-call (by setting the minimum GQ to <=0). Performance enhancements to GRM suite. 2012-10-03 21:36:35 -04:00
Christopher Hartl 1be8a88909 Changes:
1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest.
 2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested.
 3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test.
 4) Updates to PlinkBed-handling python scripts and utilities.
 5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.
2012-10-03 16:02:42 -04:00
David Roazen 118e974731 GATK Engine: special-case "monolithic" FilePointers, and allow them to represent multiple contigs
Sometimes the GATK engine creates a single monolithic FilePointer representing all regions
in all BAM files. In such cases, the monolithic FilePointer is the only FilePointer emitted
by the BAMScheduler, and it's safe to allow it to contain regions and intervals from multiple
contigs.

This fixes support for reading unindexed BAM files (since an unindexed BAM is one case
in which the engine creates a monolithic FilePointer).
2012-10-02 15:30:03 -04:00
David Roazen a96ed385df ReadShard.getReadsSpan(): handle case where shard contains only unmapped mates
Nasty, nasty bug -- if we were extremely unlucky with shard boundaries, we might
end up with a shard containing only unmapped mates of mapped reads. In this case,
ReadShard.getReadsSpan() would not behave correctly, since the shard as a whole would
be marked "mapped" (since it refers to mapped intervals) yet consist only of unmapped
mates of mapped reads located within those intervals.
2012-10-02 13:50:00 -04:00
David Roazen ac87ed47bb BQSR: allow logging recal table updates to a file
For testing/debugging purposes only
2012-10-01 14:18:34 -04:00
Christopher Hartl 2508b0f5a7 Merged bug fix from Stable into Unstable 2012-09-29 00:57:43 -04:00
Christopher Hartl 365f1d2429 hmk123's error on the forum came from the reference context occasionally lacking bases needed for validating the reference bases in the variant context. (no @Window for VariantsToBinaryPed). This bugfix adresses this and other minor items:
1) ValidateVariants removed in favor of direct validation VariantContexts. Integration test added to test broken contexts.
 2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases.
 3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.
2012-09-29 00:55:31 -04:00
David Roazen e740977994 GATK Engine: do not merge FilePointers that span multiple contigs
This affects both the non-experimental and experimental engine paths, and so
may break tests, but this is a necessary change.
2012-09-27 18:02:25 -04:00
David Roazen e82946e5c9 ExperimentalReadShardBalancer: create one monolithic FilePointer per contig
Merge all FilePointers for each contig into a single, merged, optimized FilePointer
representing all regions to visit in all BAM files for a given contig.

This helps us in several ways:

-It allows us to create a single, persistent set of iterators for each contig,
 finally and definitively eliminating all Shard/FilePointer boundary issues for
 the new experimental ReadWalker downsampling

-We no longer need to track low-level file positions in the sharding system (which
 was no longer possible anyway given the new experimental downsampling system)

-We no longer revisit BAM file chunks that we've visited in the past -- all BAM
 file access is purely sequential

-We no longer need to constantly recreate our full chain of read iterators

There are also potential dangers:

-We hold more BAM index data in memory at once. Given that we merge and optimize
 the index data during the merge, and only hold one contig's worth of data at a
 time, this does not appear to be a major issue. TODO: confirm this!

-With a huge number of samples and intervals, the FilePointer merge operation
 might become expensive. With the latest implementation, this does not
 appear to be an issue even with a huge number of intervals (for one sample, at least),
 but if it turns out to be a problem for > 1 sample there are things we can do.

Still TODO: unit tests for the new FilePointer.union() method
2012-09-27 14:47:54 -04:00
Christopher Hartl 55cdf4f9b7 Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release. 2012-09-27 00:13:32 -04:00
Eric Banks caa431c367 Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-24 21:46:36 -04:00
David Roazen 0b488cce66 ExperimentalReadShardBalancer: close() exhausted iterators
Fixes a truly awful SAMReaders resource leak reported by Eric -- thanks Eric!
2012-09-24 14:52:59 -04:00
Mark DePristo 9fd30d6f1c When writing the initial commit for nt + nct I realized this class was really just a ThreadGroupOutputTracker
-- The code is cleaner and the logical more obvious now.
2012-09-24 14:15:36 -04:00
Mark DePristo 3e8d992828 Remove bad error test from MicroScheduler, as it's no longer applicable. 2012-09-24 14:15:36 -04:00
Mark DePristo a6b3497eac Fixes GSA-515 Nanoscheduler GSA-577 -nt and -nct together appear to not close resources properly
-- Fixes monster bug in the way that traversal engines interacted with the NanoScheduler via the output tracker.
-- ThreadLocalOutputTracker is now a ThreadBasedOutputTracker that associates via a map from a master thread -> the storage map.  Lookups occur by walking through threads in the same thread group, not just the thread itself (TBD -- should have a map from ThreadGroup instead)
-- Removed unnecessary debug statement in GenomeLocParser
-- nt and nct officially work together now
2012-09-24 14:15:35 -04:00
Mark DePristo 4749fc114f Temp. disable -nt > 1 and -nct > 1 while bugs are worked out 2012-09-24 14:15:35 -04:00
Mark DePristo 09bbd2c4c3 Include exception in VCFWriter when one is found when rethrowing as ReviewedStingException 2012-09-24 14:15:35 -04:00
Mark DePristo 10a6b57be6 Fix thread name: should be master executor not input 2012-09-24 14:15:35 -04:00
Eric Banks 9464dfdbf2 Don't penalize the reduced reads for spanning deletions (when surrounding base quals are Q2s) 2012-09-24 14:06:07 -04:00
Eric Banks 1509153b4b Adding my little walker to assess reduced bam coverage against the original bam because it's turning out to be very useful. 2012-09-23 00:47:40 -04:00
David Roazen f6a22e5f50 ExperimentalReadShardBalancerUnitTest was being skipped; fixed
TestNG skips tests when an exception occurs in a data provider,
which is what was happening here.

This was due to an AWFUL AWFUL use of a non-final static for
ReadShard.MAX_READS. This is fine if you assume only one instance
of SAMDataSource, but with multiple tests creating multiple SAMDataSources,
and each one overwriting ReadShard.MAX_READS, you have a recipe for
problems. As a result of this the test ran fine individually, but not as
part of the unit test suite.

Quick fix for now to get the tests running -- this "mutable static"
interface should really be refactored away though, when I have time.
2012-09-22 01:56:39 -04:00
David Roazen e077347cc2 Re-allow running the GATK with experimental downsampling
It's now possible to run with experimental downsampling enabled
using the --enable_experimental_downsampling engine argument.

This is scheduled to become the GATK-wide default next week after
diff engine output for failing tests has been examined.
2012-09-21 23:20:46 -04:00
David Roazen 34eed20aa6 PerSampleDownsamplingReadsIterator: fix for incorrect use of DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL
Notify all downsamplers in our pool of the current global genomic position every
DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL position changes, not every single
positional change after that threshold is first reached.
2012-09-21 22:43:39 -04:00
David Roazen 133085469f Experimental, downsampler-friendly read shard balancer
-Only used when experimental downsampling is enabled

-Persists read iterators across shards, creating a new set only when we've exhausted
the current BAM file region(s). This prevents the engine from revisiting regions discarded
by the downsamplers / filters, as could happen in the old implementation.

-SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip
out all related code when the engine fork is collapsed.

-Defensive implementation that assumes BAM file regions coming out of the BAM Schedule
can overlap; should be able to improve performance if we can prove they cannot possibly
overlap.

-Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back
once confidence in the code is gained
2012-09-21 22:17:58 -04:00