2625 Commits (9de8077eebe9f1ceef2caa8da8170db35acc6692)

Author SHA1 Message Date
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
Mark DePristo d6e6b30caf Initial implementation of GSA-515: Nanoscheduler
– Write a general NanoScheduler framework in utils.threading. Test by reading via an iterator from a list of integers, where map is int * 2 and reduce is sum. Should use resources efficiently to compute 2 * sum(1..X).

Done!

CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent the dependency graph explicitly. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of subsequently available work units. May be worth measuring the expected cost of each read / map / reduce unit and using it to balance the compute
As input is single threaded, just one thread is needed to populate inputs; it runs as fast as possible in parallel, pushing data to a fixed-size queue. Each push creates a map job and links it to the upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
2012-08-24 14:07:44 -04:00
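The test described in this commit (read integers from an iterator, map each to int * 2, reduce by summing) might be sketched as below. This is only an illustration under stated assumptions, not the NanoScheduler code: the class NanoSketch, its run method, and the groupSize parameter are hypothetical, and a plain ExecutorService stands in for the scheduler's own work queue. Inputs are grouped so that each job covers several map() calls, as the later commit describes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; not the actual NanoScheduler implementation.
final class NanoSketch {
    static long run(Iterator<Integer> input, int nThreads, int groupSize) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<List<Integer>>> pending = new ArrayList<>();
            // Input is single threaded: the calling thread reads and groups it.
            while (input.hasNext()) {
                final List<Integer> group = new ArrayList<>(groupSize);
                while (input.hasNext() && group.size() < groupSize) {
                    group.add(input.next());
                }
                // One job per group rather than one per map() call.
                pending.add(pool.submit(() -> {
                    List<Integer> mapped = new ArrayList<>(group.size());
                    for (int value : group) {
                        mapped.add(value * 2); // map: int * 2
                    }
                    return mapped;
                }));
            }
            // Reduce: sum the mapped values in input order on the calling thread.
            long sum = 0;
            for (Future<List<Integer>> job : pending) {
                for (int mapped : job.get()) {
                    sum += mapped;
                }
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("map job failed", e);
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }
}
```

For inputs 1..X the result should equal 2 * sum(1..X) regardless of nThreads or groupSize, which makes the parallel path easy to check against a serial computation.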
Mark DePristo 1999b95754 Work around for GSA-513: ClassCastException in VariantEval 2012-08-23 18:14:49 -04:00
Christopher Hartl f1166d6d00 Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case. 2012-08-23 11:43:19 -07:00
Mark DePristo 857b11b26f Done with GSA-506: Add nt and efficiency information to GATKRunReport
-- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO
-- See https://jira.broadinstitute.org/browse/GSA-506 for more details
2012-08-23 09:59:53 -04:00
Mark DePristo 0b735884db Cleanup code in VariantContext 2012-08-23 09:59:53 -04:00
Eric Banks e5df91aa23 Looks like the @WalkerName annotation doesn't work with the GATK docs, so I'm renaming the walkers. 2012-08-22 20:17:39 -04:00
Mark DePristo 63af0cbcba Cleanup GATK efficiency monitor classes
-- Invert logic in GATKArgumentCollection to disable monitoring, not enable.  That means monitoring is on by default
-- Fix testing error in unit tests
-- Rename variables in ThreadAllocation to be clearer
2012-08-22 16:48:02 -04:00
Mark DePristo e1293f0ef2 GSA-507: Thread monitoring refactored so it can work without a thread factory
-- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory.
-- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode
-- MicroScheduler now handles management of the efficiency monitor.  Includes master thread in monitor, meaning that reduce is now included for both schedulers
2012-08-22 16:48:01 -04:00
Mark DePristo f876c51277 Separately track time spent doing user and system CPU work
-- Allows us to ID (by proxy) time spent doing IO
-- Refactor StateMonitoringThreadFactory to use its own enum, not Thread.State
-- Reliable unit tests across mac and unix
2012-08-22 16:48:01 -04:00
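One way to separate user from system CPU time in Java is sketched below. The commit does not say which mechanism it uses, so the use of java.lang.management.ThreadMXBean here is an assumption, and CpuTimeSketch and systemNanos are hypothetical names. System time is approximated as total CPU time minus user time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not the GATK monitoring class.
final class CpuTimeSketch {
    /** Approximates system (kernel) time as total CPU time minus user time. */
    static long systemNanos(long cpuNanos, long userNanos) {
        // Clamp at zero in case the two readings are slightly inconsistent.
        return Math.max(0L, cpuNanos - userNanos);
    }

    public static void main(String[] args) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported()) {
            System.out.println("CPU time measurement is not supported on this JVM");
            return;
        }
        long cpu = bean.getCurrentThreadCpuTime();   // user + system, nanoseconds
        long user = bean.getCurrentThreadUserTime(); // user mode only, nanoseconds
        System.out.println("user ns: " + user + ", system ns: " + systemNanos(cpu, user));
    }
}
```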
Mark DePristo 18060f237b Add thread efficiency monitoring to GATK HMS
-- See https://jira.broadinstitute.org/browse/GSA-502
-- New command line argument -mt enables thread monitoring
-- If enabled, HMS uses StateMonitoringThreadFactory to create monitored threads, and prints out an efficiency report when HMS exits, telling the user information like:

for BQSR – known to have inefficient locking
INFO 17:10:33,195 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:10:33,196 StateMonitoringThreadFactory - Total runtime 90.3 m
INFO 17:10:33,196 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.72 ( 64.8 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent running is 0.26 ( 23.7 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.02 ( 112.8 s)
INFO 17:10:33,197 StateMonitoringThreadFactory - Efficiency of multi-threading: 26.19% of time spent doing productive work

for CountLoci
INFO 17:06:12,777 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:06:12,777 StateMonitoringThreadFactory - Total runtime 43.5 m
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.00 ( 4.2 s)
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent running is 1.00 ( 43.3 m)
INFO 17:06:12,779 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.00 ( 6.0 s)
INFO 17:06:12,779 StateMonitoringThreadFactory - Efficiency of multi-threading: 99.61% of time spent doing productive work
2012-08-22 16:48:01 -04:00
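The efficiency figure in these reports appears to be the running time as a fraction of total runtime; that reading is inferred from the sample logs above rather than stated in the commit. A minimal sketch of the arithmetic, with hypothetical names (EfficiencySketch, fraction):

```java
// Hypothetical helper illustrating the arithmetic behind the report lines.
final class EfficiencySketch {
    /** Fraction of the total runtime spent in one state; 0 if total is not positive. */
    static double fraction(double stateMinutes, double totalMinutes) {
        return totalMinutes <= 0.0 ? 0.0 : stateMinutes / totalMinutes;
    }

    public static void main(String[] args) {
        double total = 90.3; // total runtime in minutes, from the BQSR sample log
        System.out.printf("blocked %.2f%n", fraction(64.8, total));
        System.out.printf("running %.2f%n", fraction(23.7, total));
        System.out.printf("waiting %.2f%n", fraction(112.8 / 60.0, total));
    }
}
```

With the rounded values printed in the BQSR log this gives a running fraction of about 0.26, in line with the reported 26.19%; the small difference presumably comes from rounding in the displayed minutes.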
Guillermo del Angel 1aa856e0e3 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 15:53:47 -04:00
Guillermo del Angel e29469eeeb Forgot to update 2 integration test md5's (in this case, the changes are legit because of the code revamp of AD; it's simpler if AD is not output when a site is not variant, as genotype DP conveys the same information) 2012-08-22 15:53:33 -04:00
Ryan Poplin fe3069b278 Merged bug fix from Stable into Unstable 2012-08-22 14:40:34 -04:00
Ryan Poplin e5cfdb4811 Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt() 2012-08-22 14:39:35 -04:00
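The uppercasing fix can be illustrated with a small helper. This is a sketch only: UppercaseSketch and toUpperCase are hypothetical names, and the real code applies the conversion to the bases returned by IndexedFastaSequenceFile.getSubsequenceAt(), which is not reproduced here.

```java
// Hypothetical helper; shows why c/C stops looking like a reference mismatch.
final class UppercaseSketch {
    /** Returns a copy of the bases with ASCII lower-case letters upper-cased. */
    static byte[] toUpperCase(byte[] bases) {
        byte[] out = new byte[bases.length];
        for (int i = 0; i < bases.length; i++) {
            byte b = bases[i];
            // ASCII lower-case letters sit exactly 32 above their upper-case forms.
            out[i] = (b >= 'a' && b <= 'z') ? (byte) (b - 32) : b;
        }
        return out;
    }
}
```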
Ryan Poplin 63213e8eb5 Expanding the HaplotypeCaller integration tests to cover a wider range of data 2012-08-22 14:18:44 -04:00
Eric Banks 944e1c299d Docs for --keepOriginalAC were wrong in SelectVariants 2012-08-22 13:07:13 -04:00
Eric Banks 2409aa9bfd Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 12:54:43 -04:00
Eric Banks 94540ccc27 Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case. 2012-08-22 12:54:29 -04:00
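The pattern described here, sketched with a hypothetical BuilderSketch class rather than the real VCBuilder: the attributes field starts as the shared immutable Collections.emptyMap(), so it is never null and costs no allocation. Copying into a mutable map on the first write is an assumption of this sketch, not something the commit states.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical builder; not the actual VCBuilder code.
final class BuilderSketch {
    // Never null: the shared immutable empty map avoids both NPEs and allocation.
    private Map<String, Object> attributes = Collections.emptyMap();
    private boolean attributesCanBeModified = false;

    BuilderSketch attribute(String key, Object value) {
        if (!attributesCanBeModified) {
            // Assumed copy-on-first-write: swap in a mutable map before changing it.
            attributes = new HashMap<>(attributes);
            attributesCanBeModified = true;
        }
        attributes.put(key, value);
        return this;
    }

    Map<String, Object> getAttributes() {
        return Collections.unmodifiableMap(attributes);
    }
}
```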
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00
Guillermo del Angel 7df0abf49b Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 11:36:41 -04:00
Eric Banks 9e76e8aa0b Just noticed that the efficient conversion to uppercase method is redundant since it's already implemented efficiently in Picard; let's just have a single implementation. 2012-08-22 11:26:08 -04:00
Christopher Hartl 20601f034e Updating the checkType() function to include the new StructuralIndel variant type. Fixes outstanding broken integration test. 2012-08-22 07:33:10 -07:00
Eric Banks c7ce3e1cf5 Merged bug fix from Stable into Unstable 2012-08-22 00:24:40 -04:00
Eric Banks 03017855e4 WTF - why is support for whole-read insertions all messed up in LIBS? I've pushed a temporary patch for now (the right solution should certainly not be implemented in stable; LIBS needs to be better thought out). Added another unit test. 2012-08-22 00:24:01 -04:00
Mark DePristo 6ce8016ae7 GSA-491: Add hidden tag to GATK that propagates to the GATK logs 2012-08-21 14:44:18 -04:00
Guillermo del Angel 6a8cf1c84a Enable and adapt HaplotypeScore and MappingQualityZero as active region annotations now that we have per-read likelihoods passed in to annotations 2012-08-21 14:35:40 -04:00
Guillermo del Angel d0644b3565 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:35:23 -04:00
Ryan Poplin 94e7f677ad Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-21 10:21:47 -04:00
Guillermo del Angel 418ace463a More merge conflict resolution 2012-08-21 10:15:52 -04:00
Ryan Poplin 10961db3ce Another round of FindBugs fixes. Object returns its internal reference to an externally mutable array. Very dangerous. 2012-08-21 09:35:55 -04:00
Ryan Poplin 605acaae9c Another round of FindBugs fixes. Object internally stores a reference to an externally mutable array. Very dangerous. 2012-08-21 09:33:58 -04:00
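Both array-related FindBugs warnings (returning, and storing, a reference to an externally mutable array) are usually addressed with defensive copies. A minimal sketch with a hypothetical QualitiesSketch class; the commits do not say which classes were changed or exactly how.

```java
// Hypothetical class illustrating defensive copies of a mutable array.
final class QualitiesSketch {
    private final byte[] quals;

    QualitiesSketch(byte[] quals) {
        // Copy on the way in, so later changes by the caller do not leak in.
        this.quals = quals.clone();
    }

    byte[] getQuals() {
        // Copy on the way out, so callers cannot modify internal state.
        return quals.clone();
    }
}
```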
Ryan Poplin 55b7949d68 Another round of FindBugs fixes. Comparator doesn't implement Serializable. 2012-08-21 09:20:55 -04:00
Christopher Hartl ba8622ff0d A number of stashed changes are lurking in here. In order of importance:
 - Fix for M_Trieb's error report on the forum, and addition of integration tests to cover the walker.
 - Addition of StructuralIndel as a class of variation within the VariantContext. These are for variants with a full alt allele that's >150bp in length.
 - Adaptation of the MVLikelihoodRatio to work for a set of trios (takes the max over the trios of the MVLR)
 - InsertSizeDistribution changed to use the new gatk report output (it was previously broken)
 - RetrogeneDiscovery changed to be compatible with the new gatk report
 - A maxIndelSize argument added to SelectVariants
 - ByTranscriptEvaluator rewritten for cleanliness
 - VariantRecalibrator modified to not exclude structural indels from recalibration if the mode is INDEL
 - Documentation added to DepthOfCoverageIntegrationTest (no, don't yell at chartl ;_; )

Also sorry for the long commit history behind this that is the result of fixing merge conflicts. Because this *also* fixes a conflict (from git stash apply), for some reason I can't rebase all of them away. I'm pretty sure some of the commit notes say "this note isn't important because I'm going to rebase it anyway".
2012-08-21 07:08:58 -04:00
Eric Banks 3dfe8df262 Merged bug fix from Stable into Unstable 2012-08-20 23:12:58 -04:00
Eric Banks 40d5efc804 Fix for Adam K's reported bug: we weren't handling reads that were entirely insertions properly in LIBS. Specifically, the event bases were off-by-one (which was disastrous in Adam's case with a 1bp read). Added a unit test to cover this case. 2012-08-20 23:12:41 -04:00
Eric Banks 286b658fab Re-enabling parallelism in the BaseRecalibrator now that the release is out. 2012-08-20 21:25:14 -04:00
Guillermo del Angel 7bbd2a7a20 Fixing merge conflicts 2012-08-20 20:38:25 -04:00
Guillermo del Angel 2041cb853c New implementation of AD - now ignore non-informative reads based on per-read likelihoods 2012-08-20 20:31:34 -04:00
Ryan Poplin 77fbaec044 Another round of FindBugs fixes. Class implements its own compareTo() but uses base Object.equals() which can lead to unpredictable behavior. 2012-08-20 16:55:00 -04:00
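The usual remedy for this warning is to override equals() and hashCode() so that they agree with compareTo(). A sketch with a hypothetical IntervalSketch class; the commit does not name the class that was fixed.

```java
// Hypothetical class whose equals()/hashCode() are consistent with compareTo().
final class IntervalSketch implements Comparable<IntervalSketch> {
    private final int start;
    private final int stop;

    IntervalSketch(int start, int stop) {
        this.start = start;
        this.stop = stop;
    }

    @Override
    public int compareTo(IntervalSketch other) {
        int byStart = Integer.compare(start, other.start);
        return byStart != 0 ? byStart : Integer.compare(stop, other.stop);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof IntervalSketch)) return false;
        // Equal exactly when compareTo() returns 0.
        return compareTo((IntervalSketch) obj) == 0;
    }

    @Override
    public int hashCode() {
        // Uses the same fields as equals() and compareTo().
        return 31 * start + stop;
    }
}
```

With this in place, sorted collections such as TreeSet (which rely on compareTo) and hashed collections such as HashSet (which rely on equals and hashCode) treat the same pairs of objects as duplicates.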
Ryan Poplin 5e28bca630 Another round of FindBugs fixes. Should be static inner class. 2012-08-20 16:15:48 -04:00
Ryan Poplin 5db3bd6fd2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 15:28:57 -04:00
Ryan Poplin 464d49509a Pulling out common caller arguments into its own StandardCallerArgumentCollection base class so that every caller isn't exposed to the unused arguments from every other caller. 2012-08-20 15:28:39 -04:00
Eric Banks 4450d66c64 Fixing the docs for DP and AD 2012-08-20 15:10:24 -04:00
Ryan Poplin c67d708c51 Bug fix in HaplotypeCaller for non-regular bases in the reference or reads. Those events don't get created any more. Bug fix for advanced GenotypeFullActiveRegion mode: custom variant annotations created by the HC don't make sense when in this mode so don't try to calculate them. 2012-08-20 13:41:08 -04:00
Guillermo del Angel 5b5fee56cf Next iteration of new VA interface: extend changes to per-genotype annotations as well. This will allow AD to be implemented correctly at last (that change is not done yet) 2012-08-20 12:52:15 -04:00
Eric Banks 154f65e0de Temporarily disabling multi-threaded usage of BaseRecalibrator for performance reasons. 2012-08-20 12:43:17 -04:00
Guillermo del Angel c384677917 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-20 10:27:25 -04:00
Eric Banks 97b191f578 Thanks to Guillermo I was able to isolate an instance where MLEAC > AN. It turns out that this is valid, e.g. when PLs are all 0s for a sample we no-call it but it's allowed to factor into the MLE (since that's the contract with the exact model). Removing the check in UG and instead protecting for it in the AlleleCount stratification. 2012-08-20 01:16:23 -04:00
Guillermo del Angel 963ad03f8b Second step of interface cleanup for variant annotator: several bug fixes, don't hash pileup elements to Maps because the hashCode() for a pileup element is not implemented and strange things can happen. Still several things to do, not done yet 2012-08-19 21:18:18 -04:00