Commit Graph

636 Commits (097ed4ecc46b09497a7b868db6ceca8c5fafc6e7)

Author SHA1 Message Date
Mark DePristo 097ed4ecc4 Memory usage optimizations and safety improvements to StratNode and StratificationManager
-- Added memory and safety optimizations to StratNode and StratificationManager.  Fresh, immutable Hashmaps are allocated for final data structures, so they exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation.  The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement.  Added integration test to cover incompatible strats and evals
2012-03-30 15:31:55 -04:00
Mark DePristo c8086a79e3 New StratificationManager based VariantEval passes unmodified integration tests
-- Now needs cleanup and optimizations
2012-03-30 15:31:55 -04:00
Mark DePristo 8971b54b21 Phase II of Stratification manager
-- Renamed and reorganized infrastructure
-- StratificationManager now a Map from List<Object> -> V.  All key functions are implemented.  Less commonly used TODO
-- Ready for hookup to VE
2012-03-30 15:31:54 -04:00
Mark DePristo 9f1cd0ff66 Lots of new functionality for StratificationStates manager
-- Really working according to unit tests
-- A nCombination utils
2012-03-30 15:31:54 -04:00
Mark DePristo a3d896d80e Part I of creating a fast state space lookup for VE
-- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates).  This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvalutionContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
2012-03-30 15:31:53 -04:00
Eric Banks 6b49af253b Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there. 2012-03-30 10:33:30 -04:00
Eric Banks 16bef191c6 UG integration tests updated. A handful of sites are lost because there are only 5 indels and one starts at the beginning of the read so it no longer passes our min threshold (now consistent with GGA), but mostly the depth changes ever so slightly once in a while between extended and normal pileups (I think the normal pileups are correct). I have looked thoroughly in IGV at ALL differences and am happy with the new results. As an aside, the AD is now calculated more accurately for indels. 2012-03-30 01:35:49 -04:00
Mauricio Carneiro f80bd4276a fixed estimated Q reported calculation in the gatherer 2012-03-29 12:28:43 -04:00
Guillermo del Angel a0843f125e Forgot to add file itself for new unit test 2012-03-28 21:08:18 -04:00
Roger Zurawicki 63cf7ec7ec Added more primitives to GATK Report Column Type
- The Integer column type now accepts byte and shorts
 - Updated Unit Tests and added a new testParse() test

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00
Guillermo del Angel d2586911a4 Forgot to add tolerance to new MathUtils unit tests 2012-03-28 08:18:36 -04:00
Guillermo del Angel b4a7c0d98d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-27 15:01:03 -04:00
Guillermo del Angel 343a061b1c Fix merge issues when incorporating new AF calculations changes 2012-03-27 15:00:44 -04:00
Mauricio Carneiro 1b75663178 BQSR Gatherer implementation and integration tests
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
   * optmized empirical qual calculation when merging recalibration reports
   * centralized the quality score quantization functionalities
   * unified the creating/loading of all the key manager/hash table structures.
   * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
   * added integration tests for BQSR and on-the-fly recalibration
2012-03-27 13:50:22 -05:00
Eric Banks c07a577ba3 Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously). 2012-03-27 00:27:44 -05:00
Guillermo del Angel e8bb8ade1a Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-26 16:42:03 -04:00
Guillermo del Angel 1a2a4848e8 Added integration test for ValidationSiteSelector, correct MD5's 2012-03-26 16:39:55 -04:00
Mark DePristo 34ea443cdb Better algorithm for choosing which indel alleles are present in samples
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be including in the counting towards 5.  This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc.  If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general.  We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
2012-03-26 16:28:49 -04:00
Guillermo del Angel db54c2625f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-25 09:53:35 -04:00
Guillermo del Angel deb4586559 Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGentotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1 2012-03-24 21:49:43 -04:00
Mauricio Carneiro 9f74969e3a BQSR with GATKReport implementation
* restructured BQSR to report recalibrated tables.
   * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration)
   * linked quality score quantization to the BQSR stage, outputting a quantization histogram
   * included the arguments used in BQSR to the GATK Report
   * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities

On-the-fly recalibration with GATK Report

   * loads all tables from the GATKReport using existing infrastructure (with minor updates)
   * implemented initialiazation of the covariates using BQSR's argument list
   * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key
   * applied quality quantization to the base recalibration
   * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions
2012-03-23 15:42:32 -04:00
Mauricio Carneiro f421062b55 Updated read group covariate to use sample.lane instead of the id
Added Unit test.
2012-03-23 15:24:07 -04:00
Mark DePristo e4ec90cfce Merged bug fix from Stable into Unstable 2012-03-23 11:27:34 -04:00
Mark DePristo ff26f2bf68 HierarchicalMicroScheduler no longer attempts to wrap exceptions
-- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors.  Now HMS simply checks whether the throwable it received on error was a RuntimeException.  If so, it is stored and rethrow without wrapping later.  If it isn't, only in this case is the exception wrapped in a ReviewedStingException.
-- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line
-- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled.
-- I believe this will finally put to rest all of these annoying HMS captures.
2012-03-23 11:27:21 -04:00
Mark DePristo 6df96644d9 Unified, standard IndelSummary metrics for VariantEval
-- Now you always get SNP and indel metrics with VariantEval!
--   Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions
-- Updated VE integration tests as appropriate
2012-03-22 21:24:37 -04:00
Menachem Fromer b9b9219ac7 Added respectPhaseInInput flag to RBP and integration tests 2012-03-22 17:40:21 -04:00
Eric Banks 8c09ff9459 Merged bug fix from Stable into Unstable 2012-03-21 12:44:43 -04:00
Eric Banks 07c3bd32b3 Bug fix: merge NO_VARIATION records with those of another type. The sad part is that this WAS covered by integration tests but someone updated the MD5s without actually paying attention... 2012-03-21 12:42:13 -04:00
Ryan Poplin 9e10779fa7 Caching log calculations cut the non-Map runtime of HaplotypeCaller in half. Moved the qual log cache used in HC and PairHMM into a common place and added unit tests. 2012-03-21 08:45:42 -04:00
Mauricio Carneiro 0e93cf5297 Taking care of bad cigars in the GATK
* fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events.
   * added Unit tests for BadCigarFilter
   * updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar
   * added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)
2012-03-20 14:32:57 -04:00
Mauricio Carneiro 633b5c687d Fixing MD5's (new GATKReport header was missing from old md5's) 2012-03-19 15:28:45 -04:00
Roger Zurawicki 7afb333811 GATK Report code cleanup
- Updated the documentation on the code
 - Made the table.write() method private and updated necessary files.
 - Added a constructor to GATKReport that takes GATKReportTables
 - Optimized my code

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-19 11:53:57 -04:00
Mauricio Carneiro 0d4ea30d6d Updating the BQSR Gatherer to the new file format
This is important for quick turnaround in the analysis cycle of the new covariates. Also added a dummy unit test that doesn't really test anything (disabled), but helps in debugging.
2012-03-19 09:02:27 -04:00
Ryan Poplin 943b1d34f8 intermediate commit to aid in debugging HC / exact model changes. HC integration tests will still fail 2012-03-18 15:50:27 -04:00
Eric Banks 9223e451a3 Merged bug fix from Stable into Unstable 2012-03-18 00:54:19 -04:00
Eric Banks 5c5d8e7cd3 Minor: cleaner way of turning off index-on-the-fly checking in case we want to turn it back on. 2012-03-18 00:53:29 -04:00
Guillermo del Angel a27a9ccba2 Merged bug fix from Stable into Unstable 2012-03-16 21:15:30 -04:00
Guillermo del Angel a05a7f287d TMP: disable checking of whether on the fly index is equal to index after run completed 2012-03-16 21:14:45 -04:00
Eric Banks 539d51f324 Resolving conflicts 2012-03-16 14:36:07 -04:00
Eric Banks be9e48ba29 Merged bug fix from Stable into Unstable 2012-03-16 14:33:53 -04:00
Eric Banks a7578e85e8 Rewriting a few of the indel integration tests for multi-allelics. The old tests were running b37 calls against a b36 reference, so the calls were all ref. The new tests are run against the pilot1 data and then those calls are fed back into the the same bam to test genotype given alleles, with a sprinkling of bi- and tri-allelics. 2012-03-16 14:21:27 -04:00
Mauricio Carneiro 3bfca0ccfd BitSet implementation of the on-the-fly recalibration using the CSV format file.
Infrastructure:
   * Added static interface to all different clipping algorithms of low quality tail clipping
   * Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
   * Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
   * EventType is now an independent enum with added capabilities. All functionality is now centralized.

 BQSR and RecalibrateBases:
   * On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
   * Refactored the object creation to take advantage of the compact key structure
   * Replaced nested hash maps with single hash maps indexed by bitsets
   * Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
   * Excluded contexts with N's from the output file.
   * Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
   * Redfined error for indels to look at the previous base in negative strand reads (using new PE functionality)
   * Added the covariate ID (for optional covariates) to the output for disambiguation purposes
   * Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
   * Reduced memory usage of the BQSR script to 4

 Tests:
   * Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
   * Added tests for keys without optional covariates
   * Added tests for on-the-fly recalibration (but more tests are necessary)
2012-03-16 13:02:15 -04:00
Mauricio Carneiro ca11ab39e7 BitSets keys to lower BQSR's memory footprint
Infrastructure:
	* Generic BitSet implementation with any precision (up to long)
	* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
	* Memoized implementation of the BitSet utils for better performance.
	* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
	* Replace log/sqrt with bitwise logic to get rid of numerical issues

 BQSR:
	* All covariates output BitSets and have the functionality to decode them back into Object values.
	* Covariates are responsible for determining the size of the key they will use (number of bits).
	* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
	* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead

 Tests:
	* Unit tests added to every method of BitSetUtils
	* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
	* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
2012-03-16 13:01:48 -04:00
Eric Banks 7424041a17 Updating integration tests to deal with the new GL framework. Now multi-allelic indel calls are correct. 2012-03-16 12:50:39 -04:00
Eric Banks dce6b91f7d Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appriopriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing. 2012-03-16 11:14:37 -04:00
Eric Banks 41068b6985 The commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code. 2012-03-15 16:08:58 -04:00
Ryan Poplin 0c6b34e9df Fixing a bug identified by the ActivityProfile unit tests 2012-03-15 14:24:30 -04:00
Ryan Poplin 252b830aa8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-15 11:56:04 -04:00
Ryan Poplin 0fa5a7af05 Adding contracts and unit tests for HaplotypeCaller GenotypingEngine 2012-03-15 11:55:48 -04:00
Mark DePristo 7c5cdb51c2 UnitTests for ActivityProfile and minor ART cleanup
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
2012-03-14 17:26:37 -04:00