894 Commits (e00ed8bc5e2e2c9bf7376c4926fe1359de1a738c)

Author SHA1 Message Date
Mark DePristo e00ed8bc5e Cleanup BQSR classes
-- Moved most of the BQSR classes (which are used throughout the codebase) to utils.recalibration.  It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers.  As code becomes embedded throughout GATK, it should be refactored to live in utils
-- Removed unnecessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces.  This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
Guillermo del Angel e6b326c189 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 21:32:19 -04:00
Guillermo del Angel 6c9d3ec155 Remerge after changes to allele construction code. More cleanups/fixes to artificial read pileup provider 2012-07-30 21:32:03 -04:00
Ryan Poplin 13591b169f Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-30 12:13:24 -04:00
Guillermo del Angel 5b9a1af7fe Intermediate fix for pool GL unit test: a) Fix up artificial read pileup provider to give consistent data. b) Increase downsampling in pool integration tests with reference sample, and shorten MT tests so they don't last too long 2012-07-30 09:56:10 -04:00
Eric Banks 7630c929a7 Re-enabling the unit tests for reverse allele clipping 2012-07-29 22:24:56 -04:00
Eric Banks b07bf1950b Adding an integration test for another feature that I snuck in during a previous commit: we now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them (this had been turned off because the previous version used Strings to do the uppercasing whereas we stick with byte operations now). 2012-07-29 22:19:49 -04:00
Eric Banks c4ae9c6cfb With the new Allele representation we can finally handle complex events (because they aren't so complex anymore). One place this manifests itself is with the strict VCF validation (ValidateVariants used to skip these events but doesn't anymore) so I've added a new test with complex events to the VV integration test. 2012-07-29 19:22:02 -04:00
Eric Banks 99b15b2b3a Final checkpoint: all tests pass. Note that there were bugs in the PoolGenotypeLikelihoodsUnitTest that needed fixing and eventually led to my needing to disable one of the tests (with a note for Guillermo to look into it). Also note that while I have moved over the GATK to use the new non-null representation of Alleles, I didn't remove all of the now-superfluous code throughout to do padding checking on merges; we'll need to do this on a subsequent push. 2012-07-29 01:07:59 -04:00
Eric Banks 2b1b00ade5 All integration tests and VC/Allele unit tests are passing 2012-07-27 17:03:49 -04:00
Eric Banks beb7610195 Resolving merge conflicts 2012-07-27 15:52:02 -04:00
Eric Banks 27e7e11ec0 Allele refactoring checkpoint #3: all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this. 2012-07-27 15:48:40 -04:00
Ryan Poplin a0890126a8 ActiveRegionWalker's isActive function returns a results object now instead of just a double. 2012-07-27 11:01:39 -04:00
Eric Banks ef335b6213 Several more walkers have been brought up to use the new Allele representation. 2012-07-27 02:14:25 -04:00
Eric Banks baf3e33730 Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass. 2012-07-26 23:27:11 -04:00
Guillermo del Angel 2ae890155c Improvements to indel calling in pool caller: a) Compute per-read likelihoods in reference sample to determine whether a read is informative or not. b) Fixed bugs in unit tests. c) Fixed padding-related bugs when computing matches/mismatches in ErrorModel. d) Added a couple more integration tests to increase test coverage, including testing odd ploidy 2012-07-26 13:43:00 -04:00
Eric Banks a694d1b5de Merge branch 'master' into allelePadding 2012-07-26 01:53:14 -04:00
Eric Banks 32516a2f60 Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels will fail at this point. 2012-07-26 01:50:39 -04:00
Mark DePristo 8c418a15da Sorting out HMS error handling (fingers crossed)
-- Check if a traversal error occurred in the last shard
-- Catch ExecutionException from the TreeReducer and throw as our HMS exception
-- ShardTraverser just throws the exception as formatted by the HMS, rather than wrapping it as a RuntimeException itself
-- EngineFeaturesIntegrationTests now uses public exampleFASTA (faster), and does 1000x iterations (slower)
2012-07-25 23:13:12 -04:00
Mark DePristo 9242f63a4d On the way to really sorting out HMS error handling
-- Better error message when a traversal error occurs (a real bug)
-- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50 times
-- A bit of cleanup in WalkerTest
2012-07-25 22:11:10 -04:00
Mark DePristo 5671992db3 RMDTrackBuilderUnitTest now uses private/testdata file to avoid filesystem race conditions 2012-07-25 22:05:04 -04:00
Mark DePristo 16947e93f2 Integration test to ensure VariantFiltration makes . -> PASS/FAIL like VQSR
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:39 -04:00
Mark DePristo fcefa61bce Remove reference dependence in BCF2Codec
-- Adding BCF2Codec to VCF.jar and associated unit tests

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Mark DePristo 19a257a5c1 Multiple bugfixes
-- VariantFiltration now properly sets passFilters in VC
-- BCF2 writer now properly decodes lazy BCF genotype data that it uses.  Improper use generated a horrible subtle bug but the good news is that the extra checks I put in (unnecessarily a few days ago) caught the bug!

Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
2012-07-25 08:56:38 -04:00
Guillermo del Angel 39f45127f3 Fix md5's broken by recent changes to FisherStrand calculation 2012-07-21 14:41:38 -04:00
Mauricio Carneiro 65f4b67b86 Fixing walker unit test with the new naming convention 2012-07-20 17:50:29 -04:00
Mauricio Carneiro 116885a450 Removed the "Walker" suffix from all walkers that had it.
* Did not touch archived walkers... those can be named whatever.
* Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...)
* Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.
2012-07-20 17:27:11 -04:00
Mark DePristo 2ca5fc62a2 Support for MISSING BCF2 type
-- Heng wants to use 0x0? to represent any missing type value, which in our implementation was invalid.  Updated our codebase to support this construct.  Heng said he'll update the BCF2 quick reference.
-- Enabled integration test reading Heng's ex2.bcf file
-- GATK now only warns in the case where the END info field isn't the same (or +1 due to padding) as the getEnd() function as determined by the GATK.  Turns out there's a single record in the 1000G SV call set that doesn't have the right length
-- VariantContextTestProvider now tests that X = Y where X -> writing -> reading -> writing -> reading = Y for a variety of variant context inputs X
-- Added integration test reading 1000G SV chr1 calls (from Chris)
2012-07-19 16:14:26 -04:00
Eric Banks 5f5edeca63 Reverting move of BQSR tests to public, as per DR's email 2012-07-19 10:02:05 -04:00
Eric Banks d46ccec04e Adding Unit Tests to cover the exception catching for Picard errors: because we are using String matching, we want to ensure that we know if/when the exception text changes underneath us. 2012-07-18 21:48:58 -04:00
Eric Banks 9c1ab1b0c0 Move BQSR integration test and its dependent files into public; previously there was a protected->private dependency. 2012-07-18 21:11:33 -04:00
Mark DePristo 994c5c31c1 Enabling VariantEval integration tests for ValidationReport 2012-07-18 16:07:47 -04:00
Mark DePristo 74e153ff4a FisherStrand now uses RankSumTest isUsableBase to decide if a read should be included in testing
-- Previously used hardcoded MAPQ > 20 && QUAL > 20 but now uses isUsableBase
-- Updating MD5s as appropriate
2012-07-18 16:07:47 -04:00
Mark DePristo dede3a30e9 Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
-- TODO: actually run integration tests when I have an internet connection
2012-07-18 16:07:47 -04:00
Mark DePristo 559a4826be Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
2012-07-18 16:07:46 -04:00
Laurent Francioli 68d0e4dd6d - Multi-allelic sites are now correctly ignored
- Reporting of Mendelian violations enhanced
- Corrected TP overflow by capping it to Byte.MAX_VALUE
- Updated integration tests to reflect changes in MVF file output

Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-07-17 15:21:10 -04:00
Eric Banks f657b8bda8 Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow. 2012-07-17 00:32:34 -04:00
Eric Banks 52baac1e16 Move BQSRv2 into public and v1 into the archive. 2012-07-16 14:23:38 -04:00
Khalid Shakir 6dfcc486e8 In ApplyRecalibration marking filter as PASS instead of '.' when the site passes by calling .passFilters(). 2012-07-13 15:40:56 -04:00
Guillermo del Angel 279dff9f81 Bug fix when specifying a JEXL expression for a field that doesn't exist: we should treat the whole expression as false, but we were rethrowing the JEXL exception in this case. Added integration test to cover this in SelectVariants 2012-07-10 13:59:00 -04:00
Mark DePristo 5b0ade67c8 Updates to VCF processing for better BCF processing
-- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order].  Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy
-- Updating GATK code to use the appropriate header function (this is why so many files have changed)
-- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this)
-- Bugfix for adding contig lines to BCF2 header dictionary
-- VCFHeader metaData no longer sorted internally.  The system now maintains the data in header order, and only sorts output as requested in API
-- VCFWriter and BCF2Writer now explicitly sort their header lines
-- Don't allow filters to be added that are PASS in the contract
2012-07-08 15:44:33 -07:00
Mauricio Carneiro e93b025b39 Fixing unit test
with the new clipping behavior for weird cigars, we no longer can assert the final number of bases in the unit test, so I'm taking this bit off the unit test.
2012-07-06 12:08:09 -04:00
Mauricio Carneiro 17efbbf8b1 Fixed ReadClipperUnitTest
The behavior of the clipping on weird cigar strings such as 1I1S1H and 9S56H has changed, and the test has to change accordingly.
2012-07-03 16:38:51 -04:00
Eric Banks 22f1afddaa Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 14:55:59 -04:00
Ryan Poplin 9e8e78de15 Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels. 2012-07-03 14:30:37 -04:00
Eric Banks 0b37d44b0d Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup. 2012-07-03 13:05:11 -04:00
Eric Banks 031322ff00 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 00:12:59 -04:00
Eric Banks a4670113bd Refactored/renamed the nested integer array; cleaned up code a bit. 2012-07-03 00:12:33 -04:00
Mark DePristo 1b0a775773 Disabling bcf2 reading from samtools because it's 1 basis; updating select variants integrationtest 2012-07-02 15:55:42 -04:00
Eric Banks cac72bce91 Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit. 2012-07-02 14:33:33 -04:00