Commit Graph

2655 Commits (e12ae65d33b3e6fd009fcd47eef3f90ed4e75a12)

Author SHA1 Message Date
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 6d6ca090c6 RecalDatums now hold doubles so the test for equality needs an epsilon. 2012-08-28 16:00:52 -04:00
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Eric Banks e74c527d47 Register the depricated walkers as depricated starting in v2.2 so that users get a helpful error message 2012-08-28 10:19:18 -04:00
Eric Banks 67d348a31d Retiring the alignment walkers and related integration test since we don't want to support them anymore. 2012-08-28 10:16:49 -04:00
Mark DePristo 0f4acaae1b Update MD5s with new FS score 2012-08-28 08:06:47 -04:00
Mark DePristo 2996693c9f FisherStrand now computed with and without filtering low-qual bases, and least significant pvalue is kept
-- Old way (filtering for Q > 17 bases) resulted in biased FS when the site was good but there was a
systematic shift in the QUAL of REF and ALT between strands of the reads (sometimes happens)
-- New way (taking all bases) was consistent with BaseQualRankSum and other tests, but there can be
a lot of low qual reference bases on one strand in some techs (ION/PROTON/PACBIO) because of the
preference for introducing an indel vs. a mismatch.
-- This implementation allows us to have our cake and eat it to by computing both p-values, and
taking the maximum one (i.e., least significant).
-- No integration tests updated yet -- still exploring the consequences of this change
2012-08-28 08:06:47 -04:00
Eric Banks bedcdbdc5f Fixing merge conflict 2012-08-27 12:16:51 -04:00
Eric Banks 3d476487c6 LIBS is totally busted for deletions. Putting a check in AD for bad pileup event bases so that we don't produce busted alleles. We must fix LIBS ASAP. 2012-08-27 12:13:12 -04:00
Mark DePristo 63a9ae817a Ensure thread-safety of CachingIndexedFastaSequenceFile
-- Cosmetic cleanup of ReadReferenceView
-- TraverseReadsNano provides the reference context, since it's thread-safe
-- Cleanup CachingIndexedFastaSequenceFile.  Add docs, remove unnecessary setters
-- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.
2012-08-27 12:11:54 -04:00
Mark DePristo e5b1f1c7f4 Add simple main function to unit test so we can run the nano scheduler test from the command line 2012-08-27 12:11:54 -04:00
Khalid Shakir 2d1ea7124b One less Queue command line requirement: -tempDir now defaults to .queue/tmp.
Also moved queueScatterGather to .queue/scatterGather.
2012-08-27 12:04:50 -04:00
Mark DePristo 68c5142d2d numThreads > 1 any time you have -nt > 1 silly 2012-08-26 14:36:13 -04:00
Mark DePristo faacacd6c0 Increase runtime of nano scheduler tests to 1 min 2012-08-26 08:42:58 -04:00
Mark DePristo 846e0c11bc Add TimeOuts to new threading tests, in case there's a underlying deadlock 2012-08-26 08:18:43 -04:00
Mark DePristo fde9824765 Optimizations for parallel read walkers
-- TraversalReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone
-- Nicer debugging output in NanoScheduler
-- ReadShard has a getBufferSize() method now
2012-08-25 17:21:12 -04:00
Mark DePristo 5066b14335 Parallel FlagStat 2012-08-25 17:21:12 -04:00
Mark DePristo af540888f1 Limited version of parallel read walkers
-- Currently doesn't support accessing reference or ROD data
-- Parallel versions of PrintReads and CountReads
2012-08-25 17:21:12 -04:00
Mark DePristo e060b148e2 Minor cleanup of TraverseReads 2012-08-25 17:21:11 -04:00
Mark DePristo 275a5e5439 More tests for NanoScheduler
-- Add more contracts
-- Test in the UnitTest that the reduce is being called in the correct order
2012-08-25 17:21:11 -04:00
Christopher Hartl 6db0988898 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-25 15:40:32 -04:00
Christopher Hartl db2e88c7cb Fix for badIndelLength() throwing NPE at non-indel sites. Added integration test. 2012-08-25 12:38:23 -07:00
Mark DePristo 59b5913b54 Merged bug fix from Stable into Unstable 2012-08-25 14:53:22 -04:00
Mark DePristo dcc972a557 Usability cleanup for BQSR
-- I'm seeing a lot of people trying to use BinaryTagCovariate in the community.  They really shouldn't do this, so I moved it to private.
-- Throw an exception if its required bintag argument is missing
-- Check explicitly if user is requesting DinucCovariate and tell them that its been retired in favor of ContextCovariate
-- Show the type (Required, Experimental, Standard) of the covariates when running --list
2012-08-25 14:53:00 -04:00
Christopher Hartl b59948709f Code improvements re: JIRA GSA-510. Trio class migrated into the Samples package - because the trio structure is so ubiquitously used, it makes sense, I think, to have a class which imposes the structure on the samples. Existing functions which slightly duplicated the getTrios() method look like they have bugs. These functions are now deprecated.
A number of functions int he sampleDB looked to be assuming that samples could not share IDs (e.g. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption.

MVLikelihoodRatio now uses the trio methods from SampleDB.
2012-08-25 08:48:27 -07:00
Mark DePristo 0996bbd548 Comments for Chris on cleanup 2012-08-24 16:04:58 -04:00
Mark DePristo 649b82ce85 Merge branch 'nanoScheduler'
Conflicts:
	private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala
2012-08-24 15:59:36 -04:00
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
Christopher Hartl 752f44c332 Code cleanup in MVLR and SelectVariants. Should fix JIRA GSA-509 and GSA-510 2012-08-24 12:25:11 -07:00
Mark DePristo d6e6b30caf Initial implementation of GSA-515: Nanoscheduler
– Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficiency using resources to do sum of 2 * (sum(1 - X)).

Done!

CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of sequent available work units. May be worth measuring expected cost for read read / map / reduce unit and use it to balance the compute
As input is single threaded just need one thread to populate inputs, which runs as fast as possible on parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
2012-08-24 14:07:44 -04:00
Eric Banks 0545664f91 Fix ClassCastException seen in Tableau errors 2012-08-24 13:45:48 -04:00
Eric Banks 740520c23b Fix BQSR docs 2012-08-24 13:20:10 -04:00
Mark DePristo 1999b95754 Work around for GSA-513: ClassCastException in VariantEval 2012-08-23 18:14:49 -04:00
Christopher Hartl f1166d6d00 Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case. 2012-08-23 11:43:19 -07:00
Mark DePristo 857b11b26f Done with GSA-506: Add nt and efficiency information to GATKRunReport
-- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO
-- See https://jira.broadinstitute.org/browse/GSA-506 for more details
2012-08-23 09:59:53 -04:00
Mark DePristo 0b735884db Cleanup code in VariantContext 2012-08-23 09:59:53 -04:00
Eric Banks e5df91aa23 Looks like the @WalkerName annotation doesn't work with the GATK docs, so I'm renaming the walkers. 2012-08-22 20:17:39 -04:00
Mark DePristo 63af0cbcba Cleanup GATK efficiency monitor classes
-- Invert logic in GATKArgumentCollection to disable monitoring, not enable.  That means monitoring is on by default
-- Fix testing error in unit tests
-- Rename variables in ThreadAllocation to be clearer
2012-08-22 16:48:02 -04:00
Mark DePristo e1293f0ef2 GSA-507: Thread monitoring refactored so it can work without a thread factory
-- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory.
-- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode
-- MicroScheduler now handles management of the efficiency monitor.  Includes master thread in monitor, meaning that reduce is now included for both schedulers
2012-08-22 16:48:01 -04:00
Mark DePristo f876c51277 Separately track time spent doing user and system CPU work
-- Allows us to ID (by proxy) time spent doing IO
-- Refactor StateMonitoryingThreadFactory to use it's own enum, not Thread.State
-- Reliable unit tests across mac and unix
2012-08-22 16:48:01 -04:00
Mark DePristo 18060f237b Add thread efficiency monitoring to GATK HMS
-- See https://jira.broadinstitute.org/browse/GSA-502
-- New command line argument -mt enables thread monitoring
-- If enabled, HMS uses StateMonitoringThreadFactory to create monitored threads, and prints out an efficiency report when HMS exits, telling the user information like:

for BQSR – known to be inefficient locking
INFO 17:10:33,195 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:10:33,196 StateMonitoringThreadFactory - Total runtime 90.3 m
INFO 17:10:33,196 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.72 ( 64.8 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent running is 0.26 ( 23.7 m)
INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.02 ( 112.8 s)
INFO 17:10:33,197 StateMonitoringThreadFactory - Efficiency of multi-threading: 26.19% of time spent doing productive work

for CountLoci
INFO 17:06:12,777 StateMonitoringThreadFactory - Number of activeThreads used: 8
INFO 17:06:12,777 StateMonitoringThreadFactory - Total runtime 43.5 m
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.00 ( 4.2 s)
INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent running is 1.00 ( 43.3 m)
INFO 17:06:12,779 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.00 ( 6.0 s)
INFO 17:06:12,779 StateMonitoringThreadFactory - Efficiency of multi-threading: 99.61% of time spent doing productive work
2012-08-22 16:48:01 -04:00
Guillermo del Angel 1aa856e0e3 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 15:53:47 -04:00
Guillermo del Angel e29469eeeb Forgot to update 2 integration test md5's (in this cases, changes are legit because of the code revamp of AD, it's simpler if AD is not output when a site is not variant, as genotype DP conveys the same information) 2012-08-22 15:53:33 -04:00
Ryan Poplin fe3069b278 Merged bug fix from Stable into Unstable 2012-08-22 14:40:34 -04:00
Ryan Poplin e5cfdb4811 Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt() 2012-08-22 14:39:35 -04:00
Ryan Poplin 63213e8eb5 Expanding the HaplotypeCaller integration tests to cover a wider range of data 2012-08-22 14:18:44 -04:00
Eric Banks 944e1c299d Docs for --keepOriginalAC were wrong in SelectVariants 2012-08-22 13:07:13 -04:00
Eric Banks 2409aa9bfd Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 12:54:43 -04:00
Eric Banks 94540ccc27 Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case. 2012-08-22 12:54:29 -04:00
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00