Commit Graph

10523 Commits (d2f3d6d22ff72ef29c98ce0591092469c33ca42e)

Author SHA1 Message Date
Christopher Hartl b59948709f Code improvements re: JIRA GSA-510. Trio class migrated into the Samples package - because the trio structure is so ubiquitously used, it makes sense, I think, to have a class which imposes the structure on the samples. Existing functions which slightly duplicated the getTrios() method look like they have bugs. These functions are now deprecated.
A number of functions in the sampleDB looked to be assuming that samples could not share IDs (i.e. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption.

MVLikelihoodRatio now uses the trio methods from SampleDB.
2012-08-25 08:48:27 -07:00
Mark DePristo 0996bbd548 Comments for Chris on cleanup 2012-08-24 16:04:58 -04:00
Mark DePristo 649b82ce85 Merge branch 'nanoScheduler'
Conflicts:
	private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala
2012-08-24 15:59:36 -04:00
Mark DePristo 801b910b9e GATKPerformanceOverTime is finalized (mark II)
-- Make BQSR run longer
-- Use Dinuc not context covariates for BQSR v1
2012-08-24 15:57:48 -04:00
Mark DePristo 62aa0ac77e GATKPerformanceOverTime is finalized
-- Update BQSR to run v1 and v2.  Use new single read group extracted BAM
-- Bug fixes
2012-08-24 15:57:48 -04:00
Mark DePristo 3bbdccb0ae Refactor and cleanup GATKPerformanceOverTime
-- Use single read group BAM file for BQSR
-- Implement terrible (but clever) hack to support BQSR v1 and v2 in a single Scala class.
2012-08-24 15:57:48 -04:00
Mark DePristo 9f0eff4c4c MySQLdb required to run analyzeRunReports, despite my best efforts 2012-08-24 15:57:48 -04:00
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
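The input grouping described in this commit can be illustrated with a minimal sketch. It assumes a plain `ExecutorService`; the class name `GroupedMapSketch` and the methods `group` and `mapInGroups` are hypothetical and are not taken from the NanoScheduler source. Each submitted task handles a whole group of inputs, so the pool sees one task per group rather than one per map() call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only; not the GATK NanoScheduler implementation.
final class GroupedMapSketch {
    /** Splits inputs into consecutive groups of at most groupSize elements. */
    static <I> List<List<I>> group(List<I> inputs, int groupSize) {
        if (groupSize < 1) throw new IllegalArgumentException("groupSize must be >= 1");
        List<List<I>> groups = new ArrayList<>();
        for (int start = 0; start < inputs.size(); start += groupSize) {
            int end = Math.min(start + groupSize, inputs.size());
            groups.add(new ArrayList<>(inputs.subList(start, end)));
        }
        return groups;
    }

    /** Applies map to every input, submitting one task per group; output keeps input order. */
    static <I, M> List<M> mapInGroups(List<I> inputs, Function<I, M> map, int groupSize, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<List<M>>> futures = new ArrayList<>();
            for (List<I> g : group(inputs, groupSize)) {
                Callable<List<M>> job = () -> {
                    List<M> out = new ArrayList<>(g.size());
                    for (I value : g) out.add(map.apply(value));
                    return out;
                };
                futures.add(pool.submit(job));
            }
            List<M> results = new ArrayList<>(inputs.size());
            for (Future<List<M>> f : futures) results.addAll(f.get());
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a group", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("map job failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> inputs = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        List<Integer> doubled = mapInGroups(inputs, x -> x * 2, 3, 2);
        System.out.println(group(inputs, 3));
        System.out.println(doubled);
    }
}
```

The group size trades scheduling overhead against load balance: larger groups mean fewer task hand-offs, smaller groups spread uneven work more evenly.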
Christopher Hartl 752f44c332 Code cleanup in MVLR and SelectVariants. Should fix JIRA GSA-509 and GSA-510 2012-08-24 12:25:11 -07:00
Mark DePristo d6e6b30caf Initial implementation of GSA-515: Nanoscheduler
– Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should use resources efficiently to compute 2 * sum(1..X).

Done!

CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of subsequently available work units. May be worth measuring expected cost for read / map / reduce unit and use it to balance the compute
As input is single threaded just need one thread to populate inputs, which runs as fast as possible in parallel, pushing data to a fixed-size queue. Each push creates map job and links to upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
2012-08-24 14:07:44 -04:00
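The read / map / reduce design in this commit message can be sketched as follows. This is an assumption-laden illustration, not the NanoScheduler code: the class `NanoSchedulerSketch`, the method `execute`, and the parameter `maxPending` are hypothetical names. A single thread pulls from the input iterator, each pull creates a map job on a thread pool, and a bounded window of pending results is reduced in input order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical illustration only; not the GATK NanoScheduler implementation.
final class NanoSchedulerSketch {
    /** Waits for one map result, converting checked exceptions to unchecked ones. */
    private static <T> T await(Future<T> f) {
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a map job", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("map job failed", e.getCause());
        }
    }

    /**
     * Reads inputs on the calling thread, maps them on nThreads workers, and
     * reduces results in input order. At most maxPending map jobs are
     * outstanding at any time, which bounds memory use.
     */
    static <I, M> M execute(Iterator<I> input, Function<I, M> map, M initial,
                            BinaryOperator<M> reduce, int nThreads, int maxPending) {
        if (maxPending < 1) throw new IllegalArgumentException("maxPending must be >= 1");
        ExecutorService mappers = Executors.newFixedThreadPool(nThreads);
        Deque<Future<M>> pending = new ArrayDeque<>();
        try {
            M result = initial;
            while (input.hasNext()) {
                final I value = input.next();
                Callable<M> job = () -> map.apply(value);
                pending.addLast(mappers.submit(job));
                // Window is full: reduce the oldest result before reading more input.
                if (pending.size() >= maxPending) {
                    result = reduce.apply(result, await(pending.removeFirst()));
                }
            }
            // Input exhausted: drain the remaining results in order.
            while (!pending.isEmpty()) {
                result = reduce.apply(result, await(pending.removeFirst()));
            }
            return result;
        } finally {
            mappers.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Integer> xs = new ArrayList<>();
        for (int i = 1; i <= 10; i++) xs.add(i);
        Integer total = NanoSchedulerSketch.<Integer, Integer>execute(
                xs.iterator(), x -> x * 2, 0, Integer::sum, 4, 8);
        System.out.println(total); // 2 * (1 + ... + 10) = 110
    }
}
```

In this sketch the reading thread also performs the reduce, which keeps the example free of extra synchronization; the commit message describes a richer scheduler that picks work units from an explicit dependency graph.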
Eric Banks 0545664f91 Fix ClassCastException seen in Tableau errors 2012-08-24 13:45:48 -04:00
Mark DePristo b3fd74f0c4 HaplotypeCaller forbids BAQ 2012-08-24 13:25:05 -04:00
Eric Banks 740520c23b Fix BQSR docs 2012-08-24 13:20:10 -04:00
Ryan Poplin 5f8574bd15 Fixing typo in error message. 2012-08-24 10:48:41 -04:00
Mark DePristo c689d6dcac GATKPerformanceOverTime is finalized
-- Update BQSR to run v1 and v2.  Use new single read group extracted BAM
-- Bug fixes
2012-08-24 09:20:32 -04:00
Mark DePristo 8371362f3c Refactor and cleanup GATKPerformanceOverTime
-- Use single read group BAM file for BQSR
-- Implement terrible (but clever) hack to support BQSR v1 and v2 in a single Scala class.
2012-08-23 21:11:15 -04:00
Mark DePristo b6cc615890 MySQLdb required to run analyzeRunReports, despite my best efforts 2012-08-23 21:08:32 -04:00
Mark DePristo 1999b95754 Work around for GSA-513: ClassCastException in VariantEval 2012-08-23 18:14:49 -04:00
Christopher Hartl f1166d6d00 Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case. 2012-08-23 11:43:19 -07:00
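The check described above, applied to either input type, might look like the following sketch. The class `SampleIdCheckSketch` and both method names are hypothetical; the real code path through the metadata and fam readers is not shown in this log, so the sketch only covers the comparison against the VCF header sample IDs.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; names do not come from the GATK source.
final class SampleIdCheckSketch {
    /** Returns the requested sample IDs that are absent from the VCF header, sorted. */
    static Set<String> missingFromHeader(Collection<String> requestedIds,
                                         Collection<String> headerIds) {
        Set<String> missing = new TreeSet<>(requestedIds);
        missing.removeAll(headerIds);
        return missing;
    }

    /** Fails fast regardless of whether the IDs came from a metadata file or a fam file. */
    static void requireAllPresent(Collection<String> requestedIds,
                                  Collection<String> headerIds, String source) {
        Set<String> missing = missingFromHeader(requestedIds, headerIds);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                    "Samples from " + source + " not found in VCF header: " + missing);
        }
    }

    public static void main(String[] args) {
        System.out.println(missingFromHeader(
                Arrays.asList("mother", "father", "child"),
                Arrays.asList("mother", "child")));
    }
}
```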
Mark DePristo 2ae5ec5596 Update for GSA-506: Add nt and efficiency information to GATKRunReport
-- Python log upload now includes efficiency information in GATKLogs DB
2012-08-23 12:53:22 -04:00
Mark DePristo 857b11b26f Done with GSA-506: Add nt and efficiency information to GATKRunReport
-- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO
-- See https://jira.broadinstitute.org/browse/GSA-506 for more details
2012-08-23 09:59:53 -04:00
Mark DePristo 0b735884db Cleanup code in VariantContext 2012-08-23 09:59:53 -04:00
Mark DePristo d973863039 GATKPerformanceOverTime includes longer running tests for select variants and variant eval 2012-08-23 09:59:53 -04:00
Eric Banks e5df91aa23 Looks like the @WalkerName annotation doesn't work with the GATK docs, so I'm renaming the walkers. 2012-08-22 20:17:39 -04:00
Mark DePristo 95a1337285 Merge branch 'threadMonitors'
Conflicts:
	private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala
2012-08-22 16:54:47 -04:00
Mark DePristo 63af0cbcba Cleanup GATK efficiency monitor classes
-- Invert logic in GATKArgumentCollection to disable monitoring, not enable.  That means monitoring is on by default
-- Fix testing error in unit tests
-- Rename variables in ThreadAllocation to be clearer
2012-08-22 16:48:02 -04:00
Mark DePristo 1d47d2b573 Fix GATKPerformanceOverTime for BQSR file path error 2012-08-22 16:48:02 -04:00
Mark DePristo e1293f0ef2 GSA-507: Thread monitoring refactored so it can work without a thread factory
-- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory.
-- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode
-- MicroScheduler now handles management of the efficiency monitor.  Includes master thread in monitor, meaning that reduce is now included for both schedulers
2012-08-22 16:48:01 -04:00
Mark DePristo f876c51277 Separately track time spent doing user and system CPU work
-- Allows us to ID (by proxy) time spent doing IO
-- Refactor StateMonitoringThreadFactory to use its own enum, not Thread.State
-- Reliable unit tests across mac and unix
2012-08-22 16:48:01 -04:00
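One way to separate user from system CPU time on the JVM is the standard `java.lang.management.ThreadMXBean` interface. Whether this commit uses that interface is an assumption; the log does not say. The class `CpuTimeSketch` and its method are hypothetical names. System time is taken as total CPU time minus user time, which is the proxy the message alludes to.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical illustration only; the commit's actual mechanism is not shown in the log.
final class CpuTimeSketch {
    /**
     * Returns {userNanos, systemNanos} for the calling thread, or {-1, -1}
     * when the JVM cannot measure per-thread CPU time.
     */
    static long[] currentThreadUserAndSystemNanos() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (!bean.isCurrentThreadCpuTimeSupported() || !bean.isThreadCpuTimeEnabled()) {
            return new long[] {-1L, -1L};
        }
        long total = bean.getCurrentThreadCpuTime();  // user + system, in nanoseconds
        long user = bean.getCurrentThreadUserTime();  // user mode only, in nanoseconds
        if (total < 0 || user < 0) {
            return new long[] {-1L, -1L};
        }
        // The two reads are not atomic, so clamp the difference at zero.
        return new long[] {user, Math.max(0L, total - user)};
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int i = 0; i < 5_000_000; i++) sink += i % 7; // burn a little user CPU
        long[] times = currentThreadUserAndSystemNanos();
        System.out.println("user ns: " + times[0] + ", system ns: " + times[1] + " (sink " + sink + ")");
    }
}
```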
Mark DePristo 18060f237b Add thread efficiency monitoring to GATK HMS
-- See https://jira.broadinstitute.org/browse/GSA-502
-- New command line argument -mt enables thread monitoring
-- If enabled, HMS uses StateMonitoringThreadFactory to create monitored threads, and prints out an efficiency report when HMS exits, telling the user information like:

for BQSR – known to be inefficient locking
INFO 17:10:33,195 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:10:33,196 StateMonitoringThreadFactory - Total runtime 90.3 m
INFO 17:10:33,196 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.72 ( 64.8 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent running is 0.26 ( 23.7 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.02 ( 112.8 s)
INFO 17:10:33,197 StateMonitoringThreadFactory - Efficiency of multi-threading: 26.19% of time spent doing productive work

for CountLoci
INFO 17:06:12,777 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:06:12,777 StateMonitoringThreadFactory - Total runtime 43.5 m
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.00 ( 4.2 s)
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent running is 1.00 ( 43.3 m)
INFO 17:06:12,779 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.00 ( 6.0 s)
INFO 17:06:12,779 StateMonitoringThreadFactory - Efficiency of multi-threading: 99.61% of time spent doing productive work
2012-08-22 16:48:01 -04:00
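The arithmetic behind a report like the ones above can be sketched as follows, assuming that each fraction is the time in a state divided by the sum over all states and that efficiency is the running fraction expressed as a percentage. That reading is inferred from the log lines, not confirmed by the source; the class `EfficiencyReportSketch` and its methods are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration only; not the StateMonitoringThreadFactory code.
final class EfficiencyReportSketch {
    /** Fraction of total taken by part; 0 when nothing was measured. */
    static double fraction(double part, double total) {
        return total > 0 ? part / total : 0.0;
    }

    /** Formats state fractions and efficiency from per-state durations in any common unit. */
    static String report(double running, double blocked, double waiting) {
        double total = running + blocked + waiting;
        return String.format(Locale.ROOT,
                "blocked %.2f, running %.2f, waiting %.2f, efficiency %.2f%%",
                fraction(blocked, total),
                fraction(running, total),
                fraction(waiting, total),
                100.0 * fraction(running, total));
    }

    public static void main(String[] args) {
        // Rounded BQSR figures from the log above, all in minutes (112.8 s = 1.88 m).
        System.out.println(report(23.7, 64.8, 1.88));
    }
}
```

Fed the rounded BQSR figures, the running fraction comes out at about 0.26, in line with the reported efficiency; the printed log values are themselves rounded, so the last digits need not match.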
Mark DePristo 27842ba448 run_performance_tests use bsub and gsa
-- Confirmed that running on gsa queue is fine with sufficient iterations (3)
2012-08-22 16:48:01 -04:00
Mark DePristo d7a6cd99cd Expand intervals processed for many GATKPerformanceOverTime commands
-- For the high NT tests the total runtime may be too short to really assess nt efficiency vs. start up costs.  Reworked underlying test data and intervals so that most tests run in 10-20 hrs for -nt 1.
2012-08-22 16:48:01 -04:00
Guillermo del Angel 1aa856e0e3 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 15:53:47 -04:00
Guillermo del Angel e29469eeeb Forgot to update 2 integration test md5's (in these cases, changes are legit because of the code revamp of AD; it's simpler if AD is not output when a site is not variant, as genotype DP conveys the same information) 2012-08-22 15:53:33 -04:00
Menachem Fromer b1b9c0b132 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 15:26:39 -04:00
Ryan Poplin fe3069b278 Merged bug fix from Stable into Unstable 2012-08-22 14:40:34 -04:00
Ryan Poplin e5cfdb4811 Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt() 2012-08-22 14:39:35 -04:00
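The normalization described above can be illustrated with a small helper that uppercases ASCII base bytes before they are compared against reads. The class `BaseCaseSketch` and its method are hypothetical and stand in for whatever utility the fix actually calls; no `IndexedFastaSequenceFile` call is made here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the utility used in the actual fix.
final class BaseCaseSketch {
    /** Returns a copy of bases with ASCII lower-case letters converted to upper case. */
    static byte[] toUpperCase(byte[] bases) {
        byte[] out = new byte[bases.length];
        for (int i = 0; i < bases.length; i++) {
            byte b = bases[i];
            out[i] = (b >= 'a' && b <= 'z') ? (byte) (b - ('a' - 'A')) : b;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] softMasked = "acGTn".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(toUpperCase(softMasked), StandardCharsets.US_ASCII));
    }
}
```

With the reference uppercased, a soft-masked `c` no longer differs from a read base `C`, so no spurious c/C event is created.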
Ryan Poplin 63213e8eb5 Expanding the HaplotypeCaller integration tests to cover a wider range of data 2012-08-22 14:18:44 -04:00
Eric Banks 944e1c299d Docs for --keepOriginalAC were wrong in SelectVariants 2012-08-22 13:07:13 -04:00
Eric Banks 2409aa9bfd Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 12:54:43 -04:00
Eric Banks 94540ccc27 Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case. 2012-08-22 12:54:29 -04:00
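The fix pattern described above can be sketched with a toy builder. The class `AttributeBuilderSketch` is hypothetical and is not the real VariantContextBuilder; in particular, the copy into a mutable map on first write is an assumption about how later modification stays safe when the initial value is the immutable `Collections.emptyMap()`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the real VariantContextBuilder.
final class AttributeBuilderSketch {
    // Shared immutable instance: no allocation per builder, and never null.
    private Map<String, Object> attributes = Collections.emptyMap();
    private boolean attributesMutable = false;

    /** Sets one attribute, copying into a private mutable map on first write. */
    AttributeBuilderSketch attribute(String key, Object value) {
        if (!attributesMutable) {
            attributes = new HashMap<>(attributes);
            attributesMutable = true;
        }
        attributes.put(key, value);
        return this;
    }

    /** Read-only view of the current attributes. */
    Map<String, Object> attributes() {
        return Collections.unmodifiableMap(attributes);
    }

    public static void main(String[] args) {
        AttributeBuilderSketch builder = new AttributeBuilderSketch();
        System.out.println(builder.attributes().isEmpty());
        System.out.println(builder.attribute("DP", 10).attributes());
    }
}
```

Because `Collections.emptyMap()` returns a shared instance, builders that never set an attribute pay no allocation cost, which matches the "without a performance hit" remark.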
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00
Guillermo del Angel 7df0abf49b Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 11:36:41 -04:00
Eric Banks 9e76e8aa0b Just noticed that the efficient conversion to uppercase method is redundant since it's already implemented efficiently in Picard; let's just have a single implementation. 2012-08-22 11:26:08 -04:00
Christopher Hartl 20601f034e Updating the checkType() function to include the new StructuralIndel variant type. Fixes outstanding broken integration test. 2012-08-22 07:33:10 -07:00
Eric Banks c7ce3e1cf5 Merged bug fix from Stable into Unstable 2012-08-22 00:24:40 -04:00
Eric Banks 03017855e4 WTF - why is support for whole-read insertions all messed up in LIBS? I've pushed a temporary patch for now (the right solution should certainly not be implemented in stable; LIBS needs to be better thought out). Added another unit test. 2012-08-22 00:24:01 -04:00
Mark DePristo 1acf18aa25 run_performance_tests use bsub and gsa
-- Confirmed that running on gsa queue is fine with sufficient iterations (3)
2012-08-21 16:26:12 -04:00
Mark DePristo cb9ba4f660 Expand intervals processed for many GATKPerformanceOverTime commands
-- For the high NT tests the total runtime may be too short to really assess nt efficiency vs. start up costs.  Reworked underlying test data and intervals so that most tests run in 10-20 hrs for -nt 1.
2012-08-21 16:25:31 -04:00
Mark DePristo 1d707e7b31 Linear and Quadratic fits for GATKPerformanceOverTime.R 2012-08-21 14:44:18 -04:00