Commit Graph

945 Commits (9de8077eebe9f1ceef2caa8da8170db35acc6692)

Author SHA1 Message Date
Mark DePristo 9de8077eeb Working (efficient?) implementation of NanoScheduler
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
2012-08-24 15:34:23 -04:00
Mark DePristo d6e6b30caf Initial implementation of GSA-515: Nanoscheduler
– Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficiency using resources to do sum of 2 * (sum(1 - X)).

Done!

CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of sequent available work units. May be worth measuring expected cost for read read / map / reduce unit and use it to balance the compute
As input is single threaded just need one thread to populate inputs, which runs as fast as possible on parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
2012-08-24 14:07:44 -04:00
Christopher Hartl f1166d6d00 Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case. 2012-08-23 11:43:19 -07:00
Mark DePristo 63af0cbcba Cleanup GATK efficiency monitor classes
-- Invert logic in GATKArgumentCollection to disable monitoring, not enable.  That means monitoring is on by default
-- Fix testing error in unit tests
-- Rename variables in ThreadAllocation to be clearer
2012-08-22 16:48:02 -04:00
Mark DePristo e1293f0ef2 GSA-507: Thread monitoring refactored so it can work without a thread factory
-- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory.
-- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode
-- MicroScheduler now handles management of the efficiency monitor.  Includes master thread in monitor, meaning that reduce is now included for both schedulers
2012-08-22 16:48:01 -04:00
Mark DePristo f876c51277 Separately track time spent doing user and system CPU work
-- Allows us to ID (by proxy) time spent doing IO
-- Refactor StateMonitoryingThreadFactory to use it's own enum, not Thread.State
-- Reliable unit tests across mac and unix
2012-08-22 16:48:01 -04:00
Guillermo del Angel 1aa856e0e3 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 15:53:47 -04:00
Guillermo del Angel e29469eeeb Forgot to update 2 integration test md5's (in this cases, changes are legit because of the code revamp of AD, it's simpler if AD is not output when a site is not variant, as genotype DP conveys the same information) 2012-08-22 15:53:33 -04:00
Eric Banks 2409aa9bfd Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-22 12:54:43 -04:00
Eric Banks 94540ccc27 Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case. 2012-08-22 12:54:29 -04:00
Guillermo del Angel 901f47d8af Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still 2012-08-22 11:38:51 -04:00
Eric Banks c7ce3e1cf5 Merged bug fix from Stable into Unstable 2012-08-22 00:24:40 -04:00
Eric Banks 03017855e4 WTF - why is support for whole-read insertions all messed up in LIBS? I've pushed a temporary patch for now (the right solution should certainly not be implemented in stable; LIBS needs to be better thought out). Added another unit test. 2012-08-22 00:24:01 -04:00
Christopher Hartl ba8622ff0d number of stashed changes are lurking in here. In order of importance:
- Fix for M_Trieb's error report on the forum, and addition of integration tests to cover the walker.
 - Addition of StructuralIndel as a class of variation within the VariantContext. These are for variants with a full alt allele that's >150bp in length.
 - Adaptation of the MVLikelihoodRatio to work for a set of trios (takes the max over the trios of the MVLR)
 - InsertSizeDistribution changed to use the new gatk report output (it was previously broken)
 - RetrogeneDiscovery changed to be compatible with the new gatk report
 - A maxIndelSize argument added to SelectVariants
 - ByTranscriptEvaluator rewritten for cleanliness
 - VariantRecalibrator modified to not exclude structural indels from recalibration if the mode is INDEL
 - Documentation added to DepthOfCoverageIntegrationTest (no, don't yell at chartl ;_; )

Also sorry for the long commit history behind this that is the result of fixing merge conflicts. Because this *also* fixes a conflict (from git stash apply), for some reason I can't rebase all of them away. I'm pretty sure some of the commit notes say "this note isn't important because I'm going to rebase it anyway".
2012-08-21 07:08:58 -04:00
Eric Banks 40d5efc804 Fix for Adam K's reported bug: we weren't handling reads that were entirely insertions properly in LIBS. Specifically, the event bases were off-by-one (which was disasterous in Adam's case with a 1bp read). Added a unit test to cover this case. 2012-08-20 23:12:41 -04:00
Mark DePristo 9121b98167 CombineVariants outputs the first non-MISSING qual, not the maximum
-- When merging multiple VCF records at a site, the combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value.  The previous behavior was to take the max QUAL, which resulted in sometime strange downstream confusion.
2012-08-19 10:29:38 -04:00
Mauricio Carneiro d16cb68539 Updated and more thorough version of the BadCigar read filter
* No reads with Hard/Soft clips in the middle of the cigar
   * No reads starting with deletions (with or without preceding clips)
   * No reads ending in deletions (with or without follow-up clips)
   * No reads that are fully hard or soft clipped
   * No reads that have consecutive indels in the cigar (II, DD, ID or DI)

 Also added systematic test for good cigars and iterative test for bad cigars.
2012-08-17 17:05:27 -04:00
Eric Banks 611d9b61e2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-16 13:05:36 -04:00
Eric Banks 2df04dc48a Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests. 2012-08-16 13:05:17 -04:00
Mark DePristo 4e42988c66 GSA-485: Remove repairVCFHeader from GATK codebase
-- Removed half-a*ssed attempt to automatically repair VCF files with bad headers, which allowed users to provide a replacement header overwriting the file's actually header on the fly.  Not a good idea, really.  Eric has promised to create a utility that walks through a VCF file and creates a meaningful header field based on the file's contents (if this ever becomes a priority)
2012-08-16 13:03:13 -04:00
Mark DePristo a9a1c499fd Update md5 in VariantRecalibrationWalkers test for BCF2 -- only encoding differences 2012-08-16 13:03:13 -04:00
Mark DePristo c0a31b2e5b CombineVariants parallel integration tests
-- All tests but one (using old bad VCF3 input) run unmodified with parallel code.
-- Disabled UNSAFE_VCF_PROCESSING for all but that test, which changes md5s because the output files have fixed headers
-- Minor optimizations to simpleMerge
2012-08-15 21:13:16 -04:00
Mark DePristo 669c43031a BCF2 optimizations; parallel CombineVariants
-- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header.  Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently)
-- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils.  Fixed bug in edge case that never occurred in practice
-- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping its right)
-- More ways to access the data in VCFHeader
-- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output
-- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary.  We just guess that on average we need ~10 elements for the attribute map
-- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet
-- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement
2012-08-15 21:13:16 -04:00
Mark DePristo dafa7e3885 Temporarily disable StateMonitoringThreadTests while I get them reliably working across platforms 2012-08-15 21:13:16 -04:00
Mark DePristo d70fd18900 Minor increase in tolerance to sum of states in UnitTest for StateMonitoringThreadFactory 2012-08-15 21:13:15 -04:00
Mark DePristo ae4d4482ac Parallel combine variants!
-- CombineVariants is now TreeReducible!
-- Integration tests running in parallel all pass except one (will fix) due to incorrect use of db=0 flag on input from old VCF format
2012-08-15 21:13:15 -04:00
Mark DePristo 9459e6203a Clean, documented implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Expanded unit tests
-- Support for clean logging of results to logger
-- Refactored MyTime into AutoFormattingTime in Utils, out of TraversalEngine, for cleanliness and reuse
-- Added docs and contracts to StateMonitoringThreadFactory
2012-08-15 21:13:15 -04:00
Mark DePristo be3230a1fd Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates
-- Created makeCombinations utility function (very useful!).  Moved template from VariantContextTestProvider
-- UnitTests for basic functionality
2012-08-15 21:13:15 -04:00
Eric Banks 87e41c83c5 In AlleleCount stratification, check to make sure the AC (or MLEAC) is valid (i.e. not higher than number of chromosomes) and throw a User Error if it isn't. Added a test for bad AC. 2012-08-14 15:02:30 -04:00
Eric Banks 8e3774fb0e Fixing behavior of the --regenotype argument in SelectVariants to properly run in GenotypeGivenAlleles mode. Added integration tests to cover recent SV changes. 2012-08-14 14:21:42 -04:00
Eric Banks 34b62fa092 Two changes to SelectVariants: 1) don't add DP INFO annotation if DP wasn't used in the input VCF (it was adding DP=0 previously). 2) If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the VC. 2012-08-14 12:54:31 -04:00
Mark DePristo aab417c94d Fix missing argument in unittest 2012-08-12 13:58:14 -04:00
Ami Levy Moonshine 6fefdaf428 "update integration tests in CombineVariantsIntegrationTest"
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-10 17:00:35 -04:00
Ami Levy Moonshine 4968daf0a5 update integration tests at CombineVariantsIntegrationTest 2012-08-10 16:58:05 -04:00
Eric Banks eca9613356 Adding support of X and = CIGAR operators to the GATK 2012-08-10 14:54:07 -04:00
Mark DePristo 9a0dda71d4 BCF2 optimizations
-- All low-level reads throw IOException instead of catching it directly.  This allows us to not try/catch in readByte, improving performance by 5% or so
-- Optimize encodeTypeDescriptor with final variables.  Avoid using Math.min instead do inline comparison
-- Inlined willOverflow directly in its single use
2012-08-09 16:36:18 -04:00
Mark DePristo cda8d944b7 Bugfixes for BCF with VQSR
-- Old version converted doubles directly from strings.  New version uses VariantContext getAttributeAsDouble() that looks at the values directly to determine how to convert from Object to Double (via Double.valueOf, (Double), or (Double)(Integer)).
-- getAttributeAsDouble() is now smart in converting integers to doubles as needed
-- Removed unnecessary logging info in BCF2Codec
-- Added integration tests to ensure that VQSR works end-to-end with BCF2 using sites version of the file khalid sent to me
-- Added vqsr.bcf_test.snps.unfiltered.bcf file for this integration test
2012-08-07 17:22:39 -04:00
Ryan Poplin 15085bf03e The UnifiedGenotyper now makes use of base insertion and base deletion quality scores if they exist in the reads. 2012-08-07 13:58:22 -04:00
Mark DePristo 00858f16a6 Deleting empty unit test for AdaptiveContexts 2012-08-06 12:58:13 -04:00
Ryan Poplin b8709d8c67 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-08-06 11:41:28 -04:00
Ryan Poplin b7eec2fd0e Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly. 2012-08-05 12:29:10 -04:00
Mark DePristo e1bba91836 Ready for full-scale evaluation adaptive BQSR contexts
-- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees
-- Docs, algorithm descriptions, etc so that it makes sense what's going on
-- VisualizeContextTree should really be simplified when into a single tool that just visualize the trees when / if we decide to make adaptive contexts standard part of BQSR
 -- Misc. cleaning, organization of the code (recalibation tests were in private but corresponding actual files were public)
2012-08-03 16:02:53 -04:00
Ryan Poplin 8817fc70d1 Merged bug fix from Stable into Unstable 2012-08-03 10:45:01 -04:00
Ryan Poplin f40d0a0a28 Updating VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller. Integration tests change because of the MNPs in dbSNP. 2012-08-03 10:44:36 -04:00
Mark DePristo fb5dabce18 Update BCF2 to include a minor version number so we can rev (and report errors) with BCF2
-- We are no likely to fail with an error when reading old BCF files, rather than just giving bad results
-- Added new class BCFVersion that consolidates all of the version management of BCF
2012-08-02 17:30:30 -04:00
Mark DePristo c3c3d18611 Update BCF2 to put PASS as offset 0 not at the end
-- Unfortunately this commit breaks backward compatibility with all existing BCF2 files...
2012-08-01 17:09:22 -04:00
Mark DePristo ccac77d888 Bugfix for incorrect allele counting in IndelSummary
-- Previous version would count all alt alleles as present in a sample, even if only 1 were present, because of the way VariantEval subsetted VCs
-- Updated code for subsetting VCs by sample to be clearer about how it handles rederiving alleles
-- Update a few pieces of code to get previous correct behavior
-- Updated a few MD5s as now ref calls at sites in dbSNP are counted as having a comp sites, and therefore show up in known sites when Novelty strat is on (which I think is correct)
-- Walkers that used old subsetting function with true are now using clearer version that does rederive alleles by default
2012-08-01 15:45:12 -04:00
Joel Thibault 2b25df3d53 Add removeProgramRecords argument
* Add unit test for the removeProgramRecords
2012-08-01 15:33:05 -04:00
Eric Banks ab53d73459 Quick fix to user error catching 2012-07-31 15:50:32 -04:00
Eric Banks 10111450aa Fixed AlignmentUtils bug for handling Ns in the CIGAR string. Added a UG integration test that calls a BAM with such reads (provided by a user on GetSatisfaction). 2012-07-31 15:37:22 -04:00