Commit Graph

10548 Commits (bfbf1686cd0f71c94dea59c84b6c74c71f0ae1af)

Author SHA1 Message Date
Mark DePristo bfbf1686cd Fixed nasty bug with defaulting to diploid no-call genotypes
-- For the pooled caller we were writing diploid no-calls even when other samples were haploid.  Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing.
-- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all.
-- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)
2012-09-12 07:08:03 -04:00
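The ploidy defaulting described in the commit above can be sketched roughly as follows. This is a Python illustration only, not the GATK Java code: `max_ploidy`, `format_genotypes`, and the dict-of-allele-lists representation are hypothetical names chosen for the example, and the default of 2 is taken from the commit text.

```python
from typing import Dict, List, Optional

DEFAULT_PLOIDY = 2  # from the commit text: use 2 if no other sample has a call


def max_ploidy(genotypes: Dict[str, Optional[List[str]]],
               default_ploidy: int = DEFAULT_PLOIDY) -> int:
    """Largest ploidy among called samples, or default_ploidy (not 0)
    when every sample is missing."""
    ploidies = [len(alleles) for alleles in genotypes.values() if alleles]
    return max(ploidies) if ploidies else default_ploidy


def format_genotypes(samples: List[str],
                     genotypes: Dict[str, Optional[List[str]]]) -> Dict[str, str]:
    """Render each sample's genotype; missing samples get a no-call whose
    ploidy matches the other samples at the site."""
    ploidy = max_ploidy(genotypes)
    rendered = {}
    for sample in samples:
        alleles = genotypes.get(sample)
        if alleles:
            rendered[sample] = "/".join(alleles)
        else:
            rendered[sample] = "/".join(["."] * ploidy)
    return rendered
```

With one triploid call present, a missing sample would be written as `././.` rather than `./.`; with no calls at all, the sketch falls back to the diploid `./.`.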
Mark DePristo d1ba17df5d Fixed nasty bug in BCF2 writer for case where all genotypes are missing
-- Previous code was looking for a -1 result from maxPloidy() but the result was actually 0, so instead of writing a diploid no-call we were actually writing "unavailable" genotypes, and failing the BCF == VCF test in integration tests.  Fixed.
2012-09-12 06:46:27 -04:00
Mark DePristo 91f3204534 VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself
-- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere.
-- Removed addMissingSamples from VariantContextUtils, and calls to it
-- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples
-- Added unit tests for this behavior in VariantContextWritersUnitTest
2012-09-12 06:46:26 -04:00
Menachem Fromer d3bdb9c67e Choose queue based on assumed run time expectation 2012-09-12 03:36:57 -04:00
Menachem Fromer 5764f1037c Added control of memory for matrix merging 2012-09-12 03:01:01 -04:00
Menachem Fromer 625fb25eca Updated import 2012-09-12 02:17:24 -04:00
Menachem Fromer 2ea28499e2 Merge branch 'master' of ssh://gsa3.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 01:58:53 -04:00
Menachem Fromer 5cb08fd17c Added XHMM option to outputTargetsBySamples 2012-09-12 01:58:04 -04:00
Ryan Poplin c23b794904 I find these per-readgroup plots to be useful. Not sure why they were turned off by default. 2012-09-11 14:31:59 -04:00
Guillermo del Angel 0dd745bb9b Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-11 11:01:41 -04:00
Guillermo del Angel 13831106d5 Fix GSA-535: storing likelihoods in allele map was busted when running HaplotypeCaller, only the last likelihood of a haplotype was being stored, as opposed to the max likelihood of all haplotypes mapping to an allele 2012-09-11 11:01:26 -04:00
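The GSA-535 fix above amounts to keeping the maximum likelihood over all haplotypes that map to an allele instead of letting the last one overwrite the rest. A minimal Python sketch of that idea, assuming log-likelihoods and a hypothetical `allele_likelihoods` helper (not the HaplotypeCaller's actual data structures):

```python
from typing import Dict, Iterable, Tuple


def allele_likelihoods(
        haplotype_likelihoods: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Each input item is (allele the haplotype maps to, log-likelihood).

    Returns the best (maximum) log-likelihood seen for each allele.
    """
    best: Dict[str, float] = {}
    for allele, loglik in haplotype_likelihoods:
        # The buggy behaviour would be `best[allele] = loglik`,
        # which keeps only the last haplotype's value.
        if allele not in best or loglik > best[allele]:
            best[allele] = loglik
    return best
```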
David Roazen 6fad0f25bb Merge Eric's LocusIteratorByStateUnitTest changes into LocusIteratorByStateExperimentalUnitTest 2012-09-11 10:47:09 -04:00
Mark DePristo e25e617d1a Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency
-- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use.  So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.
2012-09-11 07:38:34 -04:00
Mark DePristo 64ee0a10fe Fix bad include in package.scala 2012-09-10 20:14:31 -04:00
Mark DePristo d6e42d839c Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly. 2012-09-10 20:14:14 -04:00
Mark DePristo 641c6a361e Fix nasty memory leak in new data thread x cpu thread parallelism
-- Basically you cannot safely use instance specific ThreadLocal variables, as these cannot be safely cleaned up.  The old implementation kept pointers to old writers, with huge tribble block indexes, and eventually we crashed out of integration tests
-- See http://weblogs.java.net/blog/jjviana/archive/2010/06/10/threadlocal-thread-pool-bad-idea-or-dealing-apparent-glassfish-memor for more information
-- New implementation uses a borrow/return schedule with a list of N TraversalEngines managed by the MicroScheduler directly.
2012-09-10 20:14:14 -04:00
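The borrow/return scheme mentioned in the commit above can be illustrated with a small pool that owns a fixed list of N engines. This is a Python sketch under assumptions: `EnginePool` and its methods are hypothetical names, and the engines are opaque objects. The point is that the pool, not a per-thread `ThreadLocal`, holds the only long-lived references, so they can be dropped at shutdown.

```python
import queue
from contextlib import contextmanager


class EnginePool:
    """Hands out one of N pre-built engines and takes it back afterwards."""

    def __init__(self, engines):
        self._all = list(engines)
        self._available = queue.Queue()
        for engine in self._all:
            self._available.put(engine)

    @contextmanager
    def borrow(self):
        engine = self._available.get()  # blocks until an engine is free
        try:
            yield engine
        finally:
            self._available.put(engine)  # always return it, even on error

    def available(self) -> int:
        return self._available.qsize()

    def shutdown(self) -> None:
        # Drop every reference so the engines (and anything large they
        # hold) can be garbage collected.
        self._all.clear()
        while not self._available.empty():
            self._available.get_nowait()
```

A worker thread would wrap its work in `with pool.borrow() as engine:` so the engine goes back to the pool when the block exits.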
Mark DePristo 195cf6df7e Attempting to fix out of memory errors with new traversal engine creator 2012-09-10 20:14:14 -04:00
Mark DePristo f713d400e2 Fixed GSA-515 Nanoscheduler GSA-555 / Make NT and NCT work together
-- Can now say -nt 4 and -nct 4 to get 16 threads running for you!
-- TraversalEngines are now ThreadLocal variables in the MicroScheduler.
-- Misc. code cleanup, final variables, some contracts.
2012-09-10 20:14:14 -04:00
Mark DePristo 233f70f8ba Final cleanup of TraversalProgressMeters, moved to utils.progressmeter
-- TraversalProgressMeter now completely generalized, named ProgressMeter in utils.progressmeter.  Now just takes "nRecordsProcessed" as an argument to print reads.  Completely removes dependence on complex data structures from TraversalProgressMeter.  Can be used to measure progress on any task with processing units in genomic locations.
-- A fairly simple class with no dependency on GATK engine or other features.
-- Currently only used by the TraversalEngine / MicroScheduler but could be used for any purpose now, really.
2012-09-10 20:14:14 -04:00
Mark DePristo 934bc5eb7a Better printing of log2 values for nt, as the coord_trans in ggplot isn't as nice a log2(value) display 2012-09-10 20:14:14 -04:00
Mark DePristo 2e94a0a201 Refactor TraversalEngine to extract the progress meter functions
-- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance.  The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves.
-- Because the progress metering code was horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into a TraversalProgressMeter class.  I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time.  It should be possible to write some nice tests for it now.  By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before.
-- Cleaned up the start up of the progress meter.  It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer
-- Simplified and made clear the interface for shutting down the traversal engines.  There's now a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversal is over.  Nano traversals now properly shut down (this was a subtle bug I uncovered here).  The printing of on-traversal-done metering is now handled by MicroScheduler.
-- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N).
-- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome.  Useful for progress meter but also probably for other infrastructure as well
-- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful.  The generic bean is just a shell interface with nothing in it.
-- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.
2012-09-10 20:14:13 -04:00
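The design in the commit above, one shared progress meter guarded by a single synchronized entry point, could look roughly like this in Python. `ProgressMeter`, `notify`, and `total` are hypothetical names for the sketch; the real class is Java and tracks genomic locations, which are omitted here.

```python
import threading


class ProgressMeter:
    """One master copy shared by all traversal engines."""

    def __init__(self):
        self._lock = threading.Lock()  # the only lock the meter needs
        self._n_records = 0

    def notify(self, n_records_processed: int) -> int:
        """Record progress; safe to call from any thread."""
        with self._lock:
            self._n_records += n_records_processed
            return self._n_records

    @property
    def total(self) -> int:
        with self._lock:
            return self._n_records
```

Because every update goes through `notify`, callers do not need to carry lock objects of their own.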
Mark DePristo 4a84ff4fce Fix a nasty bug in reading GATK reports with a single line
-- Old version would break during reading with (as usual) a cryptic error message
-- Fixed by avoiding collapsing into a single vector type from a matrix when you subset to a single row.  I believe this code confirms that R is truly the worst programming language ever
2012-09-10 20:14:13 -04:00
Menachem Fromer 5bce3a738d Updated import 2012-09-10 17:01:47 -04:00
David Roazen d2f3d6d22f Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)"
This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.
2012-09-10 15:52:39 -04:00
Menachem Fromer 0b717e2e2e Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:32:41 -04:00
Menachem Fromer 449b89bd34 Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:08:44 -04:00
Menachem Fromer 6ab305fa4f Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:06:07 -04:00
Eric Banks ac8a4dfc2d The comprehensive LIBS unit test is now truly comprehensive (or it would be if LIBS wasn't busted). The test can handle a read with any arbitrary legal CIGAR and iterates over the elements/bases in time with the real LIBS, failing if there are any differences. I've left the few hard-coded CIGARs in there for now with a note to move to all possible permutations once we move to fix LIBS (otherwise the tests would fail now). 2012-09-10 15:04:06 -04:00
Guillermo del Angel 10c720cbba Merge branch 'master' of ssh://gsa4/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-10 09:56:47 -04:00
Eric Banks d7499e0642 Updating the rank sum test documentation 2012-09-09 22:17:36 -04:00
Guillermo del Angel 2d4b00833b Bug fix for logging likelihoods in new read allele map: reads which were filtered out were being excluded from map, but they should be included in annotations 2012-09-09 20:35:45 -04:00
Ryan Poplin 3dd0f59765 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-09 14:41:24 -04:00
Ryan Poplin 36913706c0 Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events. 2012-09-09 13:47:54 -04:00
Ryan Poplin 688fc9fb56 Bug fix in HC GenotypingEngine to ensure that all the merged complex events get properly added to the priority list used by VariantContextUtils when combining multiallelic events. 2012-09-09 10:36:09 -04:00
Eric Banks 8ca205f1a9 Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-07 14:26:06 -04:00
Eric Banks b1677fc719 Fixed JIRA GSA-520 for Guillermo: when intervals with zero coverage were present, DiagnoseTargets was trying to merge them with the next interval (even if non-overlapping) which would cause problems later on when it checked to make sure that intervals were strictly overlapping. 2012-09-07 14:25:57 -04:00
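The GSA-520 fix above comes down to merging an interval into its predecessor only when the two actually overlap. A Python sketch of that check, assuming sorted `(start, stop)` pairs with inclusive coordinates on a single contig (the `merge_overlapping` name and this representation are illustrative, and coverage is ignored):

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, stop), inclusive


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Merge neighbours only when they share at least one position."""
    merged: List[Interval] = []
    for start, stop in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            prev_start, prev_stop = merged[-1]
            merged[-1] = (prev_start, max(prev_stop, stop))
        else:
            # Non-overlapping (including merely adjacent): keep separate.
            merged.append((start, stop))
    return merged
```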
Geraldine Van der Auwera 3f2a4379af Added forum API version stub to base URL for posting GATKDocs
This will prevent bugs from occurring when Vanilla makes changes to the API
    as described here: http://vanillaforums.com/blog/api#configuration
    Based on the bug that broke the website Guide section on 9/6/12,
    the GATKDocs posting system will probably break in the next release if
    this is not applied as a bug fix.
2012-09-07 11:49:02 -04:00
Eric Banks ed3d9b050f Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-07 11:45:09 -04:00
Eric Banks 3dc248a49d Adding another test 2012-09-07 11:41:38 -04:00
Ryan Poplin 81b27f9db2 auto-merging to latest version 2012-09-07 11:36:47 -04:00
Eric Banks 41a8a304a0 Catch masked OutOfMemory errors as User Errors 2012-09-07 11:27:00 -04:00
Mark DePristo f25bf0f927 EfficiencyMonitoringThreadFactoryUnitTests thing keeps timing out unnecessarily 2012-09-07 11:03:00 -04:00
Mark DePristo d62eca5d92 Update GATKPerformanceOverTime to measure -nt and -nct 2012-09-07 10:47:29 -04:00
Mark DePristo bcdbc751fe Update GATKPerformanceOverTime to measure -nt and -nct 2012-09-07 10:43:55 -04:00
Mark DePristo bf87de8a25 UnitTests for ReducerThread and InputProducer
-- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order
2012-09-07 09:51:32 -04:00
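Detecting out-of-order jobs in a reducer, as mentioned in the commit above, can be done by tracking the last job id seen. The Python sketch below is an assumption about the general technique, not the ReducerThread implementation; `OrderedReducer` and `OutOfOrderError` are hypothetical names.

```python
class OutOfOrderError(RuntimeError):
    """Raised when a job arrives with an id that is not strictly increasing."""


class OrderedReducer:
    def __init__(self, reduce_fn, initial):
        self._reduce = reduce_fn
        self.value = initial
        self._last_id = -1

    def accept(self, job_id: int, map_result) -> None:
        # Fail loudly instead of silently reducing in the wrong order.
        if job_id <= self._last_id:
            raise OutOfOrderError(
                f"job {job_id} arrived after job {self._last_id}")
        self._last_id = job_id
        self.value = self._reduce(self.value, map_result)
```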
Mark DePristo 8c0e3b1e0c UnitTests for InputProducer 2012-09-07 09:15:16 -04:00
Mark DePristo c503884958 GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper
-- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry.
-- Restructured the NS code for clarity.  Docs everywhere.
-- This is considered version 1.0
2012-09-07 09:15:16 -04:00
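The producer / consumer structure described in the commit above (input -> map, then map -> reduce) can be sketched with two queues. This is a Python illustration under assumptions, not the NanoScheduler code: `nano_schedule`, its parameters, and the sentinel-based shutdown are choices made for the example, and error handling inside `map_fn` is left out.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def nano_schedule(inputs, map_fn, reduce_fn, initial,
                  n_threads=4, buffer_size=100):
    in_q = queue.Queue(maxsize=buffer_size)  # input -> map
    out_q = queue.Queue()                    # map -> reduce

    def produce():
        for job_id, item in enumerate(inputs):
            in_q.put((job_id, item))
        for _ in range(n_threads):
            in_q.put(_DONE)  # one sentinel per mapper

    def mapper():
        while True:
            job = in_q.get()
            if job is _DONE:
                out_q.put(_DONE)
                return
            job_id, item = job
            out_q.put((job_id, map_fn(item)))

    threads = [threading.Thread(target=produce)]
    threads += [threading.Thread(target=mapper) for _ in range(n_threads)]
    for t in threads:
        t.start()

    # Reduce in the calling thread, restoring the original input order.
    pending, next_id, finished, result = {}, 0, 0, initial
    while finished < n_threads:
        msg = out_q.get()
        if msg is _DONE:
            finished += 1
            continue
        pending[msg[0]] = msg[1]
        while next_id in pending:
            result = reduce_fn(result, pending.pop(next_id))
            next_id += 1

    for t in threads:
        t.join()
    return result
```

The bounded input queue keeps the producer from running far ahead of the mappers, and buffering results by job id lets the reduce step see them in input order even though the mappers finish in arbitrary order.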
Mark DePristo 9d12935986 Intermediate commit for new hyper parallel NanoScheduler
-- There's a logic bug now but I'll go to squash it...
2012-09-07 09:15:16 -04:00
Eric Banks 576c7280d9 Extensions to the ErrorThrowing framework for testing purposes 2012-09-06 22:03:18 -04:00
David Roazen cb84a6473f Downsampling: experimental engine integration
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet

-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.

-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
2012-09-06 15:03:27 -04:00