2766 Commits (bebd5c14b85561ba361ed168407e36c0f52b9e1d)

Author SHA1 Message Date
Ami Levy Moonshine ccc3f4ff8d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-17 09:58:27 -04:00
Ami Levy Moonshine ee0b17d98f typo in VE 2012-09-17 09:51:51 -04:00
Eric Banks 86be50f18d Add note to docs that the --list argument requires full command-line 2012-09-14 10:58:44 -04:00
Eric Banks 0206e09a6a Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 15:18:27 -04:00
Eric Banks d94d0d15c2 Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout. 2012-09-12 15:15:40 -04:00
Eric Banks 4bb7a99f08 Given that all classes implementing output stubs already have getters for the underlying OutputStream and File, it makes sense to unify that functionality into the Stub interface. Now it is possible to have an Engine utility method that iterates over all registered stubs to find the one representing a given OutputStream and return the File associated with it. 2012-09-12 11:51:44 -04:00
Eric Banks 994a4ff387 Track all outputs from BQSR (.table, .csv, and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings). 2012-09-12 11:24:53 -04:00
Christopher Hartl 96be1cbea9 My own integration test isn't passing with a clean checkout. This fix to the walker ought to do it. 2012-09-12 10:11:06 -04:00
Christopher Hartl 546586b70e Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-12 10:09:42 -04:00
Mark DePristo bfbf1686cd Fixed nasty bug with defaulting to diploid no-call genotypes
-- For the pooled caller we were writing diploid no-calls even when other samples were haploid.  Changed maxPloidy function to return a defaultPloidy, rather than 0, in the case where all samples are missing.
-- VCF/BCF Writers now create missing genotypes with the ploidy of other samples, or 2 if none are available at all.
-- Updating integration tests for general ploidy, as previously we wrote ./. even when other calls were 0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/0/1/1/1/1/1, but now we write ./././././././././././././././././././././././. (ugly but correct)
2012-09-12 07:08:03 -04:00
Mark DePristo d1ba17df5d Fixed nasty bug in BCF2 writer for case where all genotypes are missing
-- Previous code was looking for a -1 result from maxPloidy() but the result was actually 0, so instead of writing a diploid no call we were actually writing "unavailable" genotypes, and failing the BCF == VCF test in integration tests.  Fixed.
2012-09-12 06:46:27 -04:00
Mark DePristo 91f3204534 VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself
-- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere.
-- Removed addMissingSamples from VariantContextUtils, and calls to it
-- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples
-- Added unit tests for this behavior in VariantContextWritersUnitTest
2012-09-12 06:46:26 -04:00
Christopher Hartl 5d19fca649 A couple of bug-fixy changes.
1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:" ones) if the user requested a sample that wasn't present in the VCF. The walker now
    checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to
    all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning,
    and does the appropriate subsetting. Added integration tests for this.

 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method
    as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and
    added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).
2012-09-11 23:01:00 -04:00
David Roazen 6fad0f25bb Merge Eric's LocusIteratorByStateUnitTest changes into LocusIteratorByStateExperimentalUnitTest 2012-09-11 10:47:09 -04:00
Mark DePristo e25e617d1a Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency
-- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use.  So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.
2012-09-11 07:38:34 -04:00
Mark DePristo d6e42d839c Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly. 2012-09-10 20:14:14 -04:00
Mark DePristo 641c6a361e Fix nasty memory leak in new data thread x cpu thread parallelism
-- Basically you cannot safely use instance specific ThreadLocal variables, as these cannot be safely cleaned up.  The old implementation kept pointers to old writers, with huge tribble block indexes, and eventually we crashed out of integration tests
-- See http://weblogs.java.net/blog/jjviana/archive/2010/06/10/threadlocal-thread-pool-bad-idea-or-dealing-apparent-glassfish-memor for more information
-- New implementation uses a borrow/return schedule with a list of N TraversalEngines managed by the MicroScheduler directly.
2012-09-10 20:14:14 -04:00
Mark DePristo 195cf6df7e Attempting to fix out of memory errors with new traversal engine creator 2012-09-10 20:14:14 -04:00
Mark DePristo f713d400e2 Fixed GSA-515 Nanoscheduler GSA-555 / Make NT and NCT work together
-- Can now say -nt 4 and -nct 4 to get 16 threads running for you!
-- TraversalEngines are now ThreadLocal variables in the MicroScheduler.
-- Misc. code cleanup, final variables, some contracts.
2012-09-10 20:14:14 -04:00
Mark DePristo 233f70f8ba Final cleanup of TraversalProgressMeters, moved to utils.progressmeter
-- TraversalProgressMeter now completely generalized, named ProgressMeter in utils.progressmeter.  Now just takes "nRecordsProcessed" as an argument to print reads.  Completely removes dependence on complex data structures from TraversalProgressMeter.  Can be used to measure progress on any task with processing units in genomic locations.
-- A fairly simple class with no dependency on the GATK engine or other features.
-- Currently only used by the TraversalEngine / MicroScheduler but could be used for any purpose now, really.
2012-09-10 20:14:14 -04:00
Mark DePristo 2e94a0a201 Refactor TraversalEngine to extract the progress meter functions
-- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance.  The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves.
-- Because the progress metering code was horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class.  I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time.  It should be possible to write some nice tests for it now.  By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before.
-- Cleaned up the start up of the progress meter.  It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer
-- Simplified and made clear the interface for shutting down the traversal engines.  There's now a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversal is over.  Nano traversals now properly shut down (this was a subtle bug I uncovered here).  The printing of the traversal-done metering is now handled by MicroScheduler
-- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N).
-- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome.  Useful for progress meter but also probably for other infrastructure as well
-- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful.  The generic bean is just a shell interface with nothing in it.
-- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.
2012-09-10 20:14:13 -04:00
David Roazen d2f3d6d22f Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)"
This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.
2012-09-10 15:52:39 -04:00
Menachem Fromer 0b717e2e2e Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:32:41 -04:00
Eric Banks ac8a4dfc2d The comprehensive LIBS unit test is now truly comprehensive (or it would be if LIBS wasn't busted). The test can handle a read with any arbitrary legal CIGAR and iterates over the elements/bases in time with the real LIBS, failing if there are any differences. I've left the few hard-coded CIGARs in there for now with a note to move to all possible permutations once we move to fix LIBS (otherwise the tests would fail now). 2012-09-10 15:04:06 -04:00
Eric Banks d7499e0642 Updating the rank sum test documentation 2012-09-09 22:17:36 -04:00
Eric Banks 8ca205f1a9 Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-07 14:26:06 -04:00
Eric Banks b1677fc719 Fixed JIRA GSA-520 for Guillermo: when intervals with zero coverage were present, DiagnoseTargets was trying to merge them with the next interval (even if non-overlapping) which would cause problems later on when it checked to make sure that intervals were strictly overlapping. 2012-09-07 14:25:57 -04:00
Geraldine Van der Auwera 3f2a4379af Added forum API version stub to base URL for posting GATKDocs
This will prevent bugs from occurring when Vanilla makes changes to the API
    as described here: http://vanillaforums.com/blog/api#configuration
    Based on the bug that broke the website Guide section on 9/6/12,
    the GATKDocs posting system will probably break in the next release if
    this is not applied as a bug fix.
2012-09-07 11:49:02 -04:00
Eric Banks ed3d9b050f Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-07 11:45:09 -04:00
Eric Banks 3dc248a49d Adding another test 2012-09-07 11:41:38 -04:00
Ryan Poplin 81b27f9db2 auto-merging to latest version 2012-09-07 11:36:47 -04:00
Eric Banks 41a8a304a0 Catch masked OutOfMemory errors as User Errors 2012-09-07 11:27:00 -04:00
Mark DePristo f25bf0f927 EfficiencyMonitoringThreadFactoryUnitTests thing keeps timing out unnecessarily 2012-09-07 11:03:00 -04:00
Mark DePristo d62eca5d92 Update GATKPerformanceOverTime to measure -nt and -nct 2012-09-07 10:47:29 -04:00
Mark DePristo bf87de8a25 UnitTests for ReducerThread and InputProducer
-- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order
2012-09-07 09:51:32 -04:00
Mark DePristo 8c0e3b1e0c UnitTests for InputProducer 2012-09-07 09:15:16 -04:00
Mark DePristo c503884958 GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper
-- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry.
-- Restructured the NS code for clarity.  Docs everywhere.
-- This is considered version 1.0
2012-09-07 09:15:16 -04:00
Mark DePristo 9d12935986 Intermediate commit for new hyper parallel NanoScheduler
-- There's a logic bug now but I'll go to squash it...
2012-09-07 09:15:16 -04:00
Eric Banks 576c7280d9 Extensions to the ErrorThrowing framework for testing purposes 2012-09-06 22:03:18 -04:00
David Roazen cb84a6473f Downsampling: experimental engine integration
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet

-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.

-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
2012-09-06 15:03:27 -04:00
Eric Banks 6df6c1abd5 Fix for PBT to stop NPE when there are no likelihoods present 2012-09-06 13:14:18 -04:00
Mark DePristo 5ab5d8dee8 Give EfficiencyMonitoringThreadFactoryUnitTest longer to complete its tests 2012-09-05 22:08:34 -04:00
Mark DePristo 1b064805ed Renaming -cnt to -nct for consistency 2012-09-05 21:13:19 -04:00
Mark DePristo 228bac75e4 By default do only NT tests in integration tests 2012-09-05 20:57:49 -04:00
Mark DePristo 574a8f710b Add static boolean controlled output of individual map call timing to nanoSecond resolution 2012-09-05 17:40:02 -04:00
Mark DePristo e11915aa0a GSA-515 Nanoscheduler GSA-550 ThreadSafeMapReduce shouldn't be super interface of TreeReducible 2012-09-05 17:37:56 -04:00
Mark DePristo c5f1ceaa95 All read and loci traversals go through NanoScheduler now
-- The NanoScheduler is doing a good job at tracking important information like time spent in map/reduce/input etc.
-- Can be disabled with static boolean in MicroScheduler if we have problems
-- See GSA-515 Nanoscheduler GSA-549 Retire TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine
2012-09-05 16:38:21 -04:00
Mark DePristo dddf148a59 Fixed bug in ThreadAllocation getTotalNumberOfThreads
-- It isn't data + cpu, it's data * cpu threads.
2012-09-05 16:35:32 -04:00
Mark DePristo 225f3a0ebe Update integration test system to allow us to differentiate between testing data and cpu parallelism 2012-09-05 16:35:00 -04:00
Mark DePristo 9bf1d138d9 New GATK argument interface for data and cpu threads
-- Closes GSA-515 Nanoscheduler GSA-542 Good interface to nanoScheduler
-- Old -nt means dataThreads
-- New -cnt (--num_cpu_threads_per_data_thread) gives you n cpu threads for each data thread in the system
-- Cleanup logic for handling data and cpu threading in HMS, LMS, and MS
-- GATKRunReport reports the total number of threads in use by the GATK, not just the nt value
-- Removed the io,cpu tags for nt.  Stupid system if you ask me.  Cleaned up the GenomeAnalysisEngine and ThreadAllocation handling to be totally straightforward now
2012-09-05 15:45:24 -04:00