Commit Graph

9052 Commits (37d979d98d192f7269084300a67a3c600f070af6)

Author SHA1 Message Date
Mark DePristo 37d979d98d GATK performance over time includes GATK 1.5 2012-03-18 19:49:26 -04:00
Ryan Poplin 943b1d34f8 intermediate commit to aid in debugging HC / exact model changes. HC integration tests will still fail 2012-03-18 15:50:27 -04:00
Ryan Poplin c4f4d16490 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-18 14:27:42 -04:00
Eric Banks 9223e451a3 Merged bug fix from Stable into Unstable 2012-03-18 00:54:19 -04:00
Eric Banks 5c5d8e7cd3 Minor: cleaner way of turning off index-on-the-fly checking in case we want to turn it back on. 2012-03-18 00:53:29 -04:00
Eric Banks 344a938a70 When checking to make sure that we have cached enough data in the PL array, use the converted index value since that's what will be used as an index into the array. 2012-03-18 00:36:30 -04:00
Ryan Poplin 4f2f1cbca9 misc optimizations to the HMM code related to allocating and initializing the big state space arrays 2012-03-17 14:07:11 -04:00
Guillermo del Angel a27a9ccba2 Merged bug fix from Stable into Unstable 2012-03-16 21:15:30 -04:00
Guillermo del Angel a05a7f287d TMP: disable checking of whether on the fly index is equal to index after run completed 2012-03-16 21:14:45 -04:00
Eric Banks 539d51f324 Resolving conflicts 2012-03-16 14:36:07 -04:00
Eric Banks be9e48ba29 Merged bug fix from Stable into Unstable 2012-03-16 14:33:53 -04:00
Eric Banks a7578e85e8 Rewriting a few of the indel integration tests for multi-allelics. The old tests were running b37 calls against a b36 reference, so the calls were all ref. The new tests are run against the pilot1 data and then those calls are fed back into the same bam to test genotype given alleles, with a sprinkling of bi- and tri-allelics. 2012-03-16 14:21:27 -04:00
Mauricio Carneiro ec4a870a0f Added @PG tag to ReduceReads
Pulled out the functionality from Indel Realigner and Table Recalibrator into Utils.setupWriter to make everyone else's lives easier if they want to include the PG tag in their walkers.
2012-03-16 14:09:07 -04:00
Mauricio Carneiro e4cbeddf2d adding on-the-fly recalibration test data 2012-03-16 13:18:16 -04:00
Mauricio Carneiro 3bfca0ccfd BitSet implementation of the on-the-fly recalibration using the CSV format file.
Infrastructure:
   * Added a static interface to all of the low-quality tail clipping algorithms
   * Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
   * Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
   * EventType is now an independent enum with added capabilities. All functionality is now centralized.

 BQSR and RecalibrateBases:
   * On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
   * Refactored the object creation to take advantage of the compact key structure
   * Replaced nested hash maps with single hash maps indexed by bitsets
   * Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
   * Excluded contexts with N's from the output file.
   * Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
   * Redefined error for indels to look at the previous base in negative strand reads (using new PE functionality)
   * Added the covariate ID (for optional covariates) to the output for disambiguation purposes
   * Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
   * Reduced memory usage of the BQSR script to 4

 Tests:
   * Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
   * Added tests for keys without optional covariates
   * Added tests for on-the-fly recalibration (but more tests are necessary)
2012-03-16 13:02:15 -04:00
Mauricio Carneiro ca11ab39e7 BitSets keys to lower BQSR's memory footprint
Infrastructure:
	* Generic BitSet implementation with any precision (up to long)
	* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
	* Memoized implementation of the BitSet utils for better performance.
	* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
	* Replace log/sqrt with bitwise logic to get rid of numerical issues

 BQSR:
	* All covariates output BitSets and have the functionality to decode them back into Object values.
	* Covariates are responsible for determining the size of the key they will use (number of bits).
	* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
	* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead

 Tests:
	* Unit tests added to every method of BitSetUtils
	* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
	* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
2012-03-16 13:01:48 -04:00
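
The single-key idea described in the commit above can be illustrated with a minimal sketch. This is Python rather than the project's Java, and every name in it (bits_needed, pack_key, unpack_key, encode_signed, decode_signed) is hypothetical, not taken from BitSetUtils or KeyManager. It assumes each covariate reports a fixed bit width, that values are packed side by side with shifts, and that negative values (such as a negative cycle) are stored in two's complement within their width.

```python
def bits_needed(max_value):
    """Bits required to store 0..max_value, using integer logic only (no log/pow)."""
    return max(1, max_value.bit_length())


def encode_signed(value, width):
    """Two's complement encoding of a signed value into `width` bits."""
    return value & ((1 << width) - 1)


def decode_signed(bits, width):
    """Inverse of encode_signed: recover the signed value from `width` bits."""
    sign_bit = 1 << (width - 1)
    return (bits ^ sign_bit) - sign_bit


def pack_key(values, widths):
    """Combine several non-negative fields into one integer key.

    The first field occupies the lowest bits; each later field is shifted
    past the fields before it.
    """
    key = 0
    shift = 0
    for value, width in zip(values, widths):
        if not 0 <= value < (1 << width):
            raise ValueError("value %d does not fit in %d bits" % (value, width))
        key |= value << shift
        shift += width
    return key


def unpack_key(key, widths):
    """Split a packed key back into its fields, in the order they were packed."""
    values = []
    for width in widths:
        values.append(key & ((1 << width) - 1))
        key >>= width
    return values


# Hypothetical layout: event type (2 bits), cycle (8 bits, signed), context (6 bits)
widths = [2, 8, 6]
key = pack_key([1, encode_signed(-3, 8), 40], widths)
fields = unpack_key(key, widths)
cycle = decode_signed(fields[1], 8)
```

Because the whole key is one integer, it can index a single hash map directly, which is the property the commit message credits for removing the nested maps.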
Eric Banks 7424041a17 Updating integration tests to deal with the new GL framework. Now multi-allelic indel calls are correct. 2012-03-16 12:50:39 -04:00
Eric Banks dce6b91f7d Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appropriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing. 2012-03-16 11:14:37 -04:00
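
A rough sketch of what a PL reordering like the one described above could look like, in Python for illustration only. The VCF ordering places the diploid genotype j/k (with j <= k) at index k*(k+1)/2 + j. The commit does not spell out the deprecated ordering; the sketch assumes it is row-major over the upper triangle (AA, AC, AG, AT, CC, ...), and all function names are hypothetical.

```python
def vcf_pl_index(j, k):
    """Index of diploid genotype j/k in VCF PL ordering: k*(k+1)/2 + j for j <= k."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def row_major_pl_index(j, k, n_alleles):
    """Index of genotype j/k in the ASSUMED old ordering (row-major upper triangle).

    Row j starts after rows 0..j-1, which hold n, n-1, ..., n-j+1 entries.
    """
    if j > k:
        j, k = k, j
    return j * n_alleles - j * (j - 1) // 2 + (k - j)


def row_major_to_vcf(pls, n_alleles):
    """Reorder a list of PLs from the assumed old ordering into VCF ordering."""
    out = [None] * len(pls)
    for j in range(n_alleles):
        for k in range(j, n_alleles):
            out[vcf_pl_index(j, k)] = pls[row_major_pl_index(j, k, n_alleles)]
    return out


# Three alleles, with labels standing in for likelihood values
old = ["0/0", "0/1", "0/2", "1/1", "1/2", "2/2"]
new = row_major_to_vcf(old, 3)
```

With two alleles the two orderings coincide, so a difference only shows up for multi-allelic sites.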
Eric Banks 41068b6985 This commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code. 2012-03-15 16:08:58 -04:00
Ryan Poplin e86ce8f3d6 updating HaplotypeCaller integration tests to reflect all the recent changes. 2012-03-15 14:56:35 -04:00
Ryan Poplin 0c6b34e9df Fixing a bug identified by the ActivityProfile unit tests 2012-03-15 14:24:30 -04:00
Ryan Poplin 252b830aa8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-15 11:56:04 -04:00
Ryan Poplin 0fa5a7af05 Adding contracts and unit tests for HaplotypeCaller GenotypingEngine 2012-03-15 11:55:48 -04:00
Ryan Poplin c1f454fbe6 cleaning up and expanding LikelihoodCalculationEngine unit tests 2012-03-15 08:53:11 -04:00
Mauricio Carneiro c865950923 fixing my typo on the md5. 2012-03-14 22:00:03 -04:00
Ryan Poplin 1212a65140 Adding contracts and unit tests for HaplotypeCaller LikelihoodCalculationEngine 2012-03-14 21:26:01 -04:00
Ryan Poplin 1429ddcf55 Adding contracts and unit tests for HaplotypeCaller LikelihoodCalculationEngine 2012-03-14 21:25:43 -04:00
Mauricio Carneiro c045542442 ReduceReads default downsampling strategy is now NORMAL
Adaptive downsampler had an undesirable behavior in strange regions of the genome. This is a temporary fix; both downsamplers will be made obsolete when the engine's positional downsampler gets generalized to read walkers.
2012-03-14 17:29:47 -04:00
Mark DePristo 7c5cdb51c2 UnitTests for ActivityProfile and minor ART cleanup
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
2012-03-14 17:26:37 -04:00
Mark DePristo e440c9be98 Clean up logic for adding reads to ART cache
-- No longer has duplicate code
2012-03-14 17:26:37 -04:00
Mark DePristo 5bcb5c7433 Preliminary refactoring of ART
-- Refactored ART into clearer, simpler procedures.  Attempted to merge shared code into utility classes.
-- Added some docs
-- Created a new, testable ActivityProfile that represents as a class the probability of a base being active or inactive
-- Separated band-pass filtering from creation of active regions.  Now you can band pass filter a profile to make another profile, and then that is explicitly converted to active regions
-- Misc. utility functions in ActiveRegionWalker such as hasPresetActiveRegions()
-- Many TODOs in ActivityProfile.
2012-03-14 17:26:37 -04:00
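
The two-step flow described in the commit above (filter a profile into another profile, then convert that profile into active regions) might be sketched as follows. This is an illustration in Python, not the ActivityProfile code: a plain moving average stands in for the band-pass filter, and the names smooth and to_regions, the radius, and the threshold are all assumptions.

```python
def smooth(profile, radius):
    """Return a new profile where each value is the mean of its neighbors.

    A simple moving average is used here as a stand-in for the real filter.
    The window is truncated at the ends of the profile.
    """
    n = len(profile)
    out = []
    for i in range(n):
        window = profile[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def to_regions(profile, threshold):
    """Convert a profile into half-open [start, end) runs with values >= threshold."""
    regions = []
    start = None
    for i, p in enumerate(profile):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(profile)))
    return regions


raw = [0.0, 0.0, 3.0, 0.0, 0.0]
filtered = smooth(raw, 1)
regions = to_regions(filtered, 0.5)
```

Keeping the two steps as separate functions means the filtered profile can be inspected or tested on its own before any regions are built from it.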
Mark DePristo e73406b9b5 CountReadsInActiveRegions now emits a detailed GATK report
-- This report details which intervals are coming in and how many reads they contain
-- Added integration test to verify that the intervals aren't changing, before heading into the ART refactor
2012-03-14 17:26:37 -04:00
Mark DePristo 86eed6de07 Updating 1000G summary table to use new CNVs list 2012-03-14 17:26:36 -04:00
Ryan Poplin 66411ea1e9 misc minor cleanup 2012-03-14 16:10:25 -04:00
Ryan Poplin 1da8928407 HC GenotypingEngine marginalizes over haplotypes when outputting events that were found on a subset of the called haplotypes. 2012-03-14 15:22:21 -04:00
Guillermo del Angel eca055ccad Add option in ValidationAmplicons to only output SNPs and INDELs, ignoring complex variants (or SVs, etc.) 2012-03-14 14:26:40 -04:00
Eric Banks f7c2c818fe Exact model memory optimization: instead of having a later matrix column pull in data from earlier ones (requiring us to keep them around until all dependencies are hit), the earlier columns push data into their dependents immediately and then are removed. This does trade off speed a little bit (because we need to call approximateLog10Sum each time we add to a dependent instead of once in an array at the end). Note that this commit would normally not get pushed into stable, but I'm about to make a very disruptive push into stable that would make merging this from unstable a nightmare. 2012-03-14 14:02:36 -04:00
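
The trade-off described above (one log-sum call per contribution instead of one over an array at the end) can be sketched in a few lines. This is a Python illustration under stated assumptions: log10_add computes the sum exactly, whereas the approximateLog10Sum named in the commit is presumably an approximation, and pull_sum / push_sum are hypothetical names for the two strategies.

```python
import math


def log10_add(a, b):
    """Return log10(10**a + 10**b) without leaving log space."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log10(1.0 + 10.0 ** (lo - hi))


def pull_sum(contributions):
    """'Pull' style: keep every contribution alive, combine them once at the end."""
    hi = max(contributions)
    return hi + math.log10(sum(10.0 ** (c - hi) for c in contributions))


def push_sum(contributions):
    """'Push' style: fold each contribution into the dependent's running total
    as soon as it is produced, so the source can be discarded immediately."""
    total = -math.inf
    for c in contributions:
        total = log10_add(total, c)
    return total


values = [-1.0, -2.0, -3.0]
pulled = pull_sum(values)
pushed = push_sum(values)
```

Both strategies give the same total up to floating-point error; the push style calls the log-sum routine once per contribution, which is the small speed cost the commit message mentions in exchange for lower memory use.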
Mark DePristo 8e96969744 Support for exception-class in analyzeRunReports.py 2012-03-14 12:27:11 -04:00
Mark DePristo 6a40ca6bec Merged bug fix from Stable into Unstable 2012-03-14 12:19:33 -04:00
Mark DePristo bb2c10b785 Capture the class of the exception in GATKRunReport
-- As suggested by David.
2012-03-14 12:16:22 -04:00
Ryan Poplin 78a4e7e45e Major restructuring of HaplotypeCaller's LikelihoodCalculationEngine and GenotypingEngine. We no longer create an ugly event dictionary and genotype events found on haplotypes independently by finding the haplotype with the max likelihood. Lots of code has been rewritten to be much cleaner. 2012-03-14 12:05:05 -04:00
Eric Banks 47e5a80d0f Trivial submission script that's useful to have for next time 2012-03-14 10:17:14 -04:00
Mark DePristo 06340a3c48 Code cleanup now that we have Tableau analysis
-- Stop looking at exceptions in daily digest
-- Remove old code analyzeRunReports that wasn't being maintained
2012-03-14 08:06:31 -04:00
Mark DePristo 841d200688 Always use long format 2012-03-13 17:05:29 -04:00
Eric Banks 77243d0df1 Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me 2012-03-13 16:31:51 -04:00
Eric Banks f76da1efd2 Updating md5s because MultiallelicSummary is now standard 2012-03-13 16:31:13 -04:00
Eric Banks ae65d86b81 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-13 16:26:51 -04:00
Eric Banks 568a1362f5 Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me 2012-03-13 16:19:15 -04:00
Ryan Poplin 2d5ca8bcfe Adding my AnalyzeCovariates R script for Mauricio to use 2012-03-13 13:05:10 -04:00
Eric Banks 6e18ecfc9a Adding integration test to cover errors from my previous commit (GENOTYPE_GIVEN_ALLELE bugs reported by Sara Pulit and Chris Hartl) 2012-03-13 12:43:40 -04:00