Commit Graph

10567 Commits (bebd5c14b85561ba361ed168407e36c0f52b9e1d)

Author SHA1 Message Date
Guillermo del Angel bebd5c14b8 Update general ploidy md5's due to bad merge of md5's in previous commit, and new shortened interval definition for EMIT_ALL_CONFIDENT_SITES was buggy 2012-09-18 20:12:15 -04:00
Ami Levy Moonshine ccc3f4ff8d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-17 09:58:27 -04:00
Ami Levy Moonshine ebf609f757 new R script for summmary tables of the pipeline 2012-09-17 09:57:10 -04:00
Ami Levy Moonshine ee0b17d98f typo in VE 2012-09-17 09:51:51 -04:00
Guillermo del Angel ca010160a9 Merge fix 2012-09-14 14:05:21 -04:00
Guillermo del Angel 6b37350bc0 Two hairy bugs in pool caller: a) Site error model wasn't counting errors in insertions correctly - Alleles passed in had padded ref byte, but event base in PileupElement doesn't have it. As a result, mismatch rate was grossly overestimated with insertions and we missed several calls we should have made. Integration test reflects changes. b) Adding a ref GL to the exact model is correct mathematically but AFResult wasn't filled properly. As a result, QUAL was junk in pure ref sites, and in all other sites the last ref GL introduced wasn't properly updating Pr(AF>0). c) Added integration test that covers -out_mode EMIT_ALL_CONFIDENT_SITES. Not fully sure if the math is 100% correct (for both diploid and generalized case) but at least now diploid and non-diploid cases behave similarly. md5 of this new test will fail since it's taking me a long time to run so I'll update from Bamboo output shortly 2012-09-14 13:13:22 -04:00
Eric Banks 86be50f18d Add note to docs that the --list argument requires full command-line 2012-09-14 10:58:44 -04:00
Menachem Fromer 182344ad89 Merge branch 'master' of ssh://gsa3.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 23:56:44 -04:00
Menachem Fromer 3d3578b1de Deal with empty Seq 2012-09-12 23:54:41 -04:00
Eric Banks 0206e09a6a Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 15:18:27 -04:00
Eric Banks d94d0d15c2 Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout. 2012-09-12 15:15:40 -04:00
Ryan Poplin bc1e03a6d8 Adding HC integration test for _structural_ insertions and deletions. 2012-09-12 12:25:39 -04:00
Ryan Poplin faad2972d6 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 12:23:24 -04:00
Ryan Poplin 849a2b8839 Adding HC integration test for _structural_ insertions and deletions. 2012-09-12 12:23:00 -04:00
Eric Banks 4bb7a99f08 Given that all classes implementing output stubs already have getters for the underlying OutputStream and File, it makes sense to unify that functionality into the Stub interface. Now it is possible to have an Engine utility method that iterates over all registered stubs to find the one representing a given OutputStream and return the File associated with it. 2012-09-12 11:51:44 -04:00
Eric Banks 994a4ff387 Track all outputs from BQSR (.table, .csv., and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings). 2012-09-12 11:24:53 -04:00
Christopher Hartl 96be1cbea9 My own integration test isn't passing with a clean checkout. This fix to the walker ought to do it. 2012-09-12 10:11:06 -04:00
Christopher Hartl 546586b70e Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 10:09:42 -04:00
Mark DePristo bfbf1686cd Fixed nasty bug with defaulting to diploid no-call genotypes
-- For the pooled caller we were writing diploid no-calls even when other samples were haploid.  Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing.
-- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all.
-- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)
2012-09-12 07:08:03 -04:00
Mark DePristo d1ba17df5d Fixed nasty bug in BCF2 writer for case where all genotypes are missing
-- Previous code was looking for a -1 result from maxPloidy() but the result as actually 0, so instead of writing a diploid no call we were actually writing "unavailable" genotypes, and failing the BCF == VCF test in integration tests.  Fixed.
2012-09-12 06:46:27 -04:00
Mark DePristo 91f3204534 VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself
-- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere.
-- Removed addMissingSamples from VariantcontextUtils, and calls to it
-- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples
-- Added unit tests for this behavior in VariantContextWritersUnitTest
2012-09-12 06:46:26 -04:00
Menachem Fromer d3bdb9c67e Choose queue based on assumed run time expectation 2012-09-12 03:36:57 -04:00
Menachem Fromer 5764f1037c Added control of memory for matrix merging 2012-09-12 03:01:01 -04:00
Menachem Fromer 625fb25eca Updated import 2012-09-12 02:17:24 -04:00
Menachem Fromer 2ea28499e2 Merge branch 'master' of ssh://gsa3.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 01:58:53 -04:00
Menachem Fromer 5cb08fd17c Added XHMM option to outputTargetsBySamples 2012-09-12 01:58:04 -04:00
Christopher Hartl 5d19fca649 A couple of bug-fixy changes.
1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:") ones if the user requested a sample that wasn't present in the VCF. The walker now
    checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to
    all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning,
    and does the appropriate subsetting. Added integration tests for this.

 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method
    as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and
    added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).
2012-09-11 23:01:00 -04:00
Ryan Poplin c23b794904 I find these per-readgroup plots to be useful. Not sure why there were turned off by default. 2012-09-11 14:31:59 -04:00
Guillermo del Angel 0dd745bb9b Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-11 11:01:41 -04:00
Guillermo del Angel 13831106d5 Fix GSA-535: storing likelihoods in allele map was busted when running HaplotypeCaller, only the last likelihood of a haplotype was being stored, as opposed to the max likelihood of all haplotypes mapping to an allele 2012-09-11 11:01:26 -04:00
David Roazen 6fad0f25bb Merge Eric's LocusIteratorByStateUnitTest changes into LocusIteratorByStateExperimentalUnitTest 2012-09-11 10:47:09 -04:00
Mark DePristo e25e617d1a Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency
-- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use.  So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.
2012-09-11 07:38:34 -04:00
Mark DePristo 64ee0a10fe Fix bad include in package.scala 2012-09-10 20:14:31 -04:00
Mark DePristo d6e42d839c Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly. 2012-09-10 20:14:14 -04:00
Mark DePristo 641c6a361e Fix nasty memory leak in new data thread x cpu thread parallelism
-- Basically you cannot safely use instance specific ThreadLocal variables, as these cannot be safely cleaned up.  The old implementation kept pointers to old writers, with huge tribble block indexes, and eventually we crashed out of integration tests
-- See http://weblogs.java.net/blog/jjviana/archive/2010/06/10/threadlocal-thread-pool-bad-idea-or-dealing-apparent-glassfish-memor for more information
-- New implementation uses a borrow/return schedule with a list of N TraversalEngines managed by the MicroScheduler directly.
2012-09-10 20:14:14 -04:00
Mark DePristo 195cf6df7e Attempting to fix out of memory errors with new traversal engine creator 2012-09-10 20:14:14 -04:00
Mark DePristo f713d400e2 Fixed GSA-515 Nanoscheduler GSA-555 / Make NT and NCT work together
-- Can now say -nt 4 and -nct 4 to get 16 threads running for you!
-- TraversalEngines are now ThreadLocal variables in the MicroScheduler.
-- Misc. code cleanup, final variables, some contracts.
2012-09-10 20:14:14 -04:00
Mark DePristo 233f70f8ba Final cleanup of TraversalProgressMeters, moved to utils.progressmeter
-- TraversalProgressMeter now completely generalized, named ProgressMeter in utils.progressmeter.  Now just takes "nRecordsProcessed" as an argument to print reads.  Completely removes dependence on complex data structures from TraversalProgressMeter.  Can be used to measure progress on any task with processing units in genomic locations.
-- a fairly simple, class with no dependency on GATK engine or other features.
-- Currently only used by the TraversalEngine / MicroScheduler but could be used for any purpose now, really.
2012-09-10 20:14:14 -04:00
Mark DePristo 934bc5eb7a Better printing of log2 values for nt, as the coord_trans in ggplot isn't as nice a log2(value) display 2012-09-10 20:14:14 -04:00
Mark DePristo 2e94a0a201 Refactor TraversalEngine to extract the progress meter functions
-- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance.  The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves.
-- Because the progress metering code has horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class.  I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time.  It should be possible to write some nice tests for it now.  By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before
-- Cleaned up the start up of the progress meter.  It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer
-- Simplified and made clear the interface for shutting down the traversal engines.  There's no a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversing in over.  Nano traversals now properly shut down (was subtle bug I undercovered here).  The printing of on traversal done metering is now handled by MicroScheduler
-- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N).
-- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome.  Useful for progress meter but also probably for other infrastructure as well
-- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful.  The generic bean is just a shell interface with nothing in it.
-- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.
2012-09-10 20:14:13 -04:00
Mark DePristo 4a84ff4fce Fix a nasty bug in reading GATK reports with a single line
-- Old version would break during reading with (as usual) a cryptic error message
-- Fixed by avoiding collapsing into a single vector type from a matrix when you subset to a single row.  I believe this code confirms thats R is truly the worst programming language ever
2012-09-10 20:14:13 -04:00
Menachem Fromer 5bce3a738d Updated import 2012-09-10 17:01:47 -04:00
David Roazen d2f3d6d22f Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)"
This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.
2012-09-10 15:52:39 -04:00
Menachem Fromer 0b717e2e2e Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:32:41 -04:00
Menachem Fromer 449b89bd34 Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:08:44 -04:00
Menachem Fromer 6ab305fa4f Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:06:07 -04:00
Eric Banks ac8a4dfc2d The comprehensive LIBS unit test is now truly comprehensive (or it would be if LIBS wasn't busted). The test can handle a read with any arbitrary legal CIGAR and iterates over the elements/bases in time with the real LIBS, failing if there are any differences. I've left the few hard-coded CIGARs in there for now with a note to move to all possible permutations once we move to fix LIBS (otherwise the tests would fail now). 2012-09-10 15:04:06 -04:00
Guillermo del Angel 10c720cbba Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-10 09:56:47 -04:00
Eric Banks d7499e0642 Updating the rank sum test documentation 2012-09-09 22:17:36 -04:00
Guillermo del Angel 2d4b00833b Bug fix for logging likelihoods in new read allele map: reads which were filtered out were being excluded from map, but they should be included in annotations 2012-09-09 20:35:45 -04:00