Commit Graph

2480 Commits (6f8e7692d4e67d5cfb2d2f9e33549ff4fbf2e5f8)

Author SHA1 Message Date
Guillermo del Angel c16f9f2f15 a) Use new method to check for GATK Like, b) minor improvements to indel pool caller (more to come): brain-dead, quick way to limit number of alt alleles to genotype. We can't process too many alt alleles because of the combinatorial explosion of GL values with high ploidy, and some STR validation targets had up to 12 alt alleles, resulting of GL vectors of > 1e8 elements. Can't use pileup elements since typically not many alleles will be in one pileup, and different alleles will appear in different samples, TBD a nicer solution. c) Commit to posterity scala script for large scale validation calling, still work in progress 2012-07-19 10:24:08 -04:00
Eric Banks 5f5edeca63 Reverting move of BQSR tests to public, as per DR's email 2012-07-19 10:02:05 -04:00
Eric Banks e370030e6c As requested by Mark, I've broken out the code to pull out the protected subclass when available (and otherwise use the public version) into the GATKLiteUtils class. People should use this code instead of reimplementing all of the java reflection on their own. 2012-07-18 22:44:37 -04:00
Eric Banks d46ccec04e Adding Unit Tests to cover the exception catching for Picard errors: because we are using String matching, we want to ensure that we know if/when the exception text changes underneath us. 2012-07-18 21:48:58 -04:00
Eric Banks 9c1ab1b0c0 Move BQSR integration test and its dependent files into public; previously there was a protected->private dependency. 2012-07-18 21:11:33 -04:00
Mark DePristo 994c5c31c1 Enabling VariantEval integration tests for ValidationReport 2012-07-18 16:07:47 -04:00
Mark DePristo 74e153ff4a FisherStrand now uses RankSumTest isUsableBase to decide if a read should be included in testing
-- Previously used hardcoded MAPQ > 20 && QUAL > 20 but now uses isUsableBase
-- Updating MD5s as appropriate
2012-07-18 16:07:47 -04:00
Mark DePristo dede3a30e9 Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
-- TODO: actually run integration tests when I have an internet connection
2012-07-18 16:07:47 -04:00
Mark DePristo 559a4826be Improvements to the validation report of VariantEval
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status.  This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF.  The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
2012-07-18 16:07:46 -04:00
Mark DePristo dc292c0317 FisherStrand now includes all reads and bases, regardless of mapping quality and base quality, just like other annotations
-- This actually proved to be a problem with Ion Torrent data where the base quality can be quite low, and so we need to include Q15 bases for calling effectively.
2012-07-18 16:07:46 -04:00
Eric Banks 2c0f073ab1 Make -qq arg hidden for now since it's still very experimental 2012-07-18 15:43:25 -04:00
Eric Banks b46c85e8b4 More bad BAM file catching 2012-07-18 15:26:31 -04:00
Eric Banks 659eee13a6 Handle NPE generated in UG when non-standard reference bases are present in the fasta 2012-07-18 15:16:27 -04:00
Eric Banks 9af2cfe283 Catch underlying file system problems that get masked as Tribble index errors. There's also a quick patch to the HMS that isn't really the ultimate fix needed; Mark and I will review at a later point. 2012-07-18 15:11:38 -04:00
Eric Banks 4c730542f0 Handle RuntimeExceptions thrown by Picard that are really User Errors. I will add unit tests for these as best I can later. 2012-07-18 13:56:35 -04:00
Eric Banks ae08d35138 Catch 'too many open files' errors that show up when trying to read the bam index. All that needs to be done is to flesh out the original error message (because it will get caught later and rethrown correctly). 2012-07-18 12:57:34 -04:00
Eric Banks f2fe59a9d4 Wow, there are a ton of errors captured having to do with being unable to merge the temp Tribble output. I'm expanding the error message a bit to help see if we can do anything going forward. 2012-07-18 12:31:59 -04:00
Eric Banks e4db8dde91 Enabled a whole other bunch of integration tests for BQSRv2. While I was there I also changed the default context size for indels to 3 (from 8) since that's what works best in the current implementation (as suggested by Ryan). At this point, all of the new core tools (ReduceReads, BQSRv2, HaplotypeCaller, UG extensions) have been moved over to protected and should be stable. Looks like we are pretty much ready for GATK 2.0! 2012-07-17 23:36:43 -04:00
Eric Banks a8d08ea18d As a user pointed out, it is not valid for a GenomeLoc to have a start or stop equal to 0. 2012-07-17 22:18:43 -04:00
Guillermo del Angel 29273abab7 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 16:58:12 -04:00
Guillermo del Angel 731bbba2e6 Bug fixes for integration test, use correct new UG syntax 2012-07-17 16:57:59 -04:00
Eric Banks 33be41ecf5 Cleaning up integration test 2012-07-17 16:06:04 -04:00
Eric Banks 8dbc9cb29c Add the ability to emit the original quals in the OQ tag 2012-07-17 15:52:56 -04:00
Guillermo del Angel 40b8c7172c Pool Caller refactoring in preparation of GATK 2.0: a) PoolCallerUnifiedArgumentCollection disappeared, and arguments moved to UnifiedArgumentCollection. b) PoolCallerWalker is no longer needed and redundant, all functionality subsumed by UG. UG now checks if GATK is lite - if so, don't allow ploidy > 2. c) Moved pool classes from private to protected. d) Changed the way to specify ploidy. Instead of specifying samples per pool and having ploidy = 2*samplesPerPool, have user specify ploidy directly, which is cleaner. Update tests accordingly. We can now call triploid seedless grape genotypes correctly in theory. e) Renamed argument -reference to -reference_sample_calls since the former is ambiguous and it's not clear what it refers to. 2012-07-17 15:27:04 -04:00
Laurent Francioli 68d0e4dd6d - Multi-allelic sites are now correctly ignored - Reporting of mendelian violations enhanced - Corrected TP overflow by caping it to Bye.MAX_VALUE
-Updated integrationtests to reflect changes in MVF file output

Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-07-17 15:21:10 -04:00
Eric Banks b0d99fd10d Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 15:12:28 -04:00
Eric Banks 305db8c0d1 Total rewrite of the isGATKLite() functionality with help of Khalid/David. PluginManager was not working for us. 2012-07-17 15:11:03 -04:00
Ryan Poplin 6efbcd99f1 HaplotypeCaller is now an AnnotatorCompatibleWalker with all the rights and privileges pertaining thereto. Enabling the ClippingRankSumTest after showing it was useful for 1000 Genomes calling. 2012-07-17 14:38:36 -04:00
Eric Banks 110886e8b9 Oops, got the logic wrong. 2012-07-17 13:37:11 -04:00
Eric Banks a963b37424 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 13:15:37 -04:00
Eric Banks 3a64398d07 Cleaned up the isGATKLite check 2012-07-17 12:46:16 -04:00
Eric Banks 62c5228048 1) Revert previous change - indel recalibration is turned on by default and users of the Lite version will need to turn it off to avoid a User Error. 2) Implemented the engine.isGATKLite() method. 2012-07-17 12:23:40 -04:00
Chris Saunders 1913d1bbd0 Put RunReport S3 upload on timeout thread
Move the RunReport S3 upload process onto a separate thread with a timeout allowing the parent to continue.

Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
2012-07-17 12:19:39 -04:00
Eric Banks 40618ac471 A bunch of BQSR changes: 1) by default we do not emit indel quals, but they can be turned on with --enable_indel_quals. 2) We check whether or not we are running in Lite mode (not done yet) and if so and the user is trying to recalibrate indels, we throw a User Error (not supported). 3) Like v1 we now allow the user to set the qual value below which we don't recalibrate (this was the remaining source of differences in the v1 vs. v2 plots). 2012-07-17 10:52:43 -04:00
Eric Banks d5b3a2eabf Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-17 00:32:53 -04:00
Eric Banks f657b8bda8 Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow. 2012-07-17 00:32:34 -04:00
Eric Banks a003148d50 Move AnalyzeCovariates over too. 2012-07-16 16:11:56 -04:00
Eric Banks 0a89adbcdb Add utility decorators so that classes can tell you which package source they come from if they want to (suggested by Khalid). Using those decorators, we can easily pull out the BQSR updateDataForPileupElement() method into a standard RecalibrationEngine and an AdvancedRecalibrationEngine and use the protected one (AdvancedRE) if available (otherwise, the public one). 2012-07-16 15:34:50 -04:00
Eric Banks 52baac1e16 Move BQSRv2 into public and v1 into the archive. 2012-07-16 14:23:38 -04:00
Khalid Shakir 07822d6c0f Fixed input annotations for master/test files on DiffObjectsWalker. 2012-07-16 13:33:11 -04:00
Eric Banks 2a830939df Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-14 23:49:59 -04:00
Eric Banks f29cadd7e2 By default, don't quantize quals in BQSRv2 2012-07-14 23:49:48 -04:00
Eric Banks 75543a3f22 ReadClipper.clipRead's claim that it doesn't modify the original read was false. Ultimately, GATKSAMRecord.clone (as documented) creates a soft copy of the read - so modifying e.g. the bases of the cloned read means that you modify the bases of the original read too. Because of this, when the BQSRv2 Context covariate was writing Ns over the low quality tails of the reads they got propagated out to the output BAM file (very bad). I've updated the ReadClipper docs and cleaned up the code (no reason to use a clone of the read anymore given that we are already modifying the original). For now, the simplest thing is to have the Context covariate store the original bases, overwrite low quality Ns, compute covariates, and rewrite the original bases; we can update later if needed. 2012-07-13 18:50:27 -04:00
Ryan Poplin 443f02ffc2 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-13 16:09:24 -04:00
Khalid Shakir 6dfcc486e8 In ApplyRecalibration marking filter as PASS instead of '.' when the site passes by calling .passFilters(). 2012-07-13 15:40:56 -04:00
Ryan Poplin d70bb59182 HaplotypeCaller now calls insertion events that aren't fully assembled as symbolic alleles. 2012-07-10 14:22:23 -06:00
Guillermo del Angel 279dff9f81 Bug fix when specifying a JEXL expression for a field that doesn't exist: we should treat the whole expression as false, but we were rethrowing the JEXL exception in this case. Added integration test to cover this in SelectVariants 2012-07-10 13:59:00 -04:00
Mauricio Carneiro 7eb45b4038 Fixed BQSR IntegrationTests
* BinaryTag covariate is Experimental, not Standard (this was breaking integration tests)
   * New parameter in the Recalibration report requires new MD5 for one of the integration tests.
2012-07-09 13:55:12 -04:00
Eric Banks dd0c47ab7e Don't cast to a specific walker type since any walker can use the VA engine 2012-07-09 10:25:58 -04:00
Mark DePristo 5b0ade67c8 Updates to VCF processing for better BCF processing
-- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order].  Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy
-- Updating GATK code to use the appropriate header function (this is why so many files have changed)
-- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this)
-- Bugfix for adding contig lines to BCF2 header dictionary
-- VCFHeader metaData no longer sorted internally.  The system now maintains the data in header order, and only sorts output as requested in API
-- VCFWriter and BCF2Writer now explictly sort their header lines
-- Don't allow filters to be added that are PASS in the contract
2012-07-08 15:44:33 -07:00
Mark DePristo 63f5262e45 mergeInfoWithMaxAC is no longer hidden in CombineVariants 2012-07-08 15:44:32 -07:00
Mark DePristo 66aee613e2 Bugfix for set key in mergeInfoWithMaxAC.
-- Previous version was always setting set=source of info with highest AC.  Should actually have been set to the set annotation value itself.
2012-07-08 15:44:32 -07:00
Mark DePristo 91f0ed8059 Fixed nasty Rscript typo in VariantRecalibrator when compactPDF is available 2012-07-08 15:44:32 -07:00
Mark DePristo 87b090c362 Update VariantRecalibator error message to use -resource not old -B syntax 2012-07-08 15:44:31 -07:00
Mauricio Carneiro 125e6c1a47 added BinaryTagCovariate for ancient dna analysis 2012-07-06 15:03:20 -04:00
Mauricio Carneiro e93b025b39 Fixing unit test
with the new clipping behavior for weird cigars, we no longer can assert the final number of bases in the unit test, so I'm taking this bit off the unit test.
2012-07-06 12:08:09 -04:00
Mauricio Carneiro f603d4c48c Fixing PairHMMIndelErrorModel boundary issue
When checking the limits of a read to clip, it wasn't considering reads that may already been clipped before.
2012-07-06 11:48:04 -04:00
Eric Banks dd571d9aa0 Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags. 2012-07-04 01:22:20 -04:00
Eric Banks 33306d2e20 Changing the logic of the -standard argument; the way it stands currently one can never turn off the cycle or context covariates. Now they are on by default and users must opt out of them to turn them off. 2012-07-04 00:21:21 -04:00
Eric Banks 7d30558e6f Only 'pad' the cycle covariate for indels, not substitutions 2012-07-03 23:47:01 -04:00
Mauricio Carneiro 17efbbf8b1 Fixed ReadClipperUnitTest
The behavior of the clipping on weird cigar strings such as 1I1S1H and 9S56H has changed, and the test has to change accordingly.
2012-07-03 16:38:51 -04:00
Eric Banks 22f1afddaa Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 14:55:59 -04:00
Eric Banks 617eebd204 More misc cleanup 2012-07-03 14:55:37 -04:00
Eric Banks 344c3aeb1d Cleanup from previous commit 2012-07-03 14:42:44 -04:00
Ryan Poplin 9e8e78de15 Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels. 2012-07-03 14:30:37 -04:00
Eric Banks 0b37d44b0d Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup. 2012-07-03 13:05:11 -04:00
Eric Banks 031322ff00 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 00:12:59 -04:00
Eric Banks a4670113bd Refactored/renamed the nested integer array; cleaned up code a bit. 2012-07-03 00:12:33 -04:00
Ryan Poplin f92139dd82 Ooops, UG VA path for rank sum tests aren't happy with empty lists. Disabling clipping rank sum test for now. 2012-07-02 21:12:42 -04:00
Ryan Poplin 7e7b4cd1b9 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-02 16:37:54 -04:00
Ryan Poplin b807ff63ef HaplotypeCaller now creates MNP and complex substitutions by using LD information to decide if events segregate together on haplotypes. Added unit test. 2012-07-02 16:37:39 -04:00
Mauricio Carneiro 3cea080aa8 Cache SoftStart() and SoftEnd() in the GATKSAMRecord
these are costly operations when done repeatedly on the same read.
2012-07-02 16:22:00 -04:00
Mauricio Carneiro 88a02fa2cb Fixing but for reads with cigars like 9S54H
When hard-clipping predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.
2012-07-02 16:22:00 -04:00
Mark DePristo 1b0a775773 Disabling bcf2 reading from samtools because it's 1 basis; updating select variants integrationtest 2012-07-02 15:55:42 -04:00
Eric Banks cac72bce91 Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit. 2012-07-02 14:33:33 -04:00
Mark DePristo 602729c09d Moved parallel tests from SelectVariants to separate SelectVariantsParallelIntegrationTest
-- Enabled previous tests -- all now working
-- Added modern test against new VCF as well
2012-07-02 11:40:28 -04:00
Mark DePristo bcd2e13d8b Adding duplicate header line keys is a logger.debug not logger.warn message now 2012-07-02 11:39:34 -04:00
Mark DePristo 01e04992f8 Fixed compatibilities in AbstractVCFCodec that resulted in key=; being parsed as written as key; in VCF output 2012-07-02 11:38:59 -04:00
Eric Banks c94c8a9c09 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-02 08:53:01 -04:00
Mark DePristo 7aff4446d4 Added unit tests for header repairing capabilities in the GATK engine 2012-07-01 15:38:10 -04:00
Mark DePristo 480b32e759 BCF2 is now officially zero-based open-interval, and that's how the GATK does it now 2012-07-01 14:59:27 -04:00
Ryan Poplin b6093ff02c Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-01 10:32:37 -04:00
Mark DePristo 9b87dcda4f Fixing remaining integration test errors. Adding missing ex2.bcf 2012-06-30 16:23:11 -04:00
Mark DePristo 5ad9a98a15 Minor bugfixes / consistency fixes to filter strings of Genotypes and AC/AF annotations
-- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order
-- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values
2012-06-30 11:22:49 -04:00
Mark DePristo 385a3c630f Added check in VariantContext.validate to ensure that getEnd() == END value when present
-- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward
-- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF
2012-06-30 11:22:48 -04:00
Mark DePristo 893630af53 Enabling symbolic alleles in BCF2
-- Bugfix for VCFDiffableReader: don't add null filters to object
-- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles
-- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication.  Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where the there's a biallelic symbolic allele
-- Brand new home for allele clipping / padding routines in VCFAlleleClipper.  Actually documented this code, which results in lots of **** negative comments on the code quality.  Eric has promised that he and Ami are going to rethink this code from scratch.  Fixed many nasty bugs in here, cleaning up unnecessary branches, etc.  Added UnitTests in VCFAlleleClipper that actually test the code full.  In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how this tests need to be fixed.  Even introduced some minor optimizations
-- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason.  Fixed.
-- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles
--
2012-06-30 11:22:48 -04:00
Mark DePristo 16276f81a1 BCF2 with support symbolic alleles
-- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys
-- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest
2012-06-30 11:22:48 -04:00
Mark DePristo 86feea917e Updating MD5s to reflect new FT fixed count of 1 not UNBOUNDED 2012-06-30 11:22:47 -04:00
Mark DePristo 6bea28ae6f Genotype filters are now just Strings, not Set<String> 2012-06-30 11:22:47 -04:00
Guillermo del Angel f631be8d80 UnifiedGenotyperEngine.calculateGenotypes() is not only used in UG but in other walkers - vc attributes shouldn't be inherited by default or it may cause undefined behaviour in those walkers, so only inherit attributes from input vc in case of UG calling this function 2012-06-29 23:51:52 -04:00
Guillermo del Angel 65037b87da Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-29 11:08:44 -04:00
Guillermo del Angel 5a9a37ba01 Pool caller improvements: a) Log ref sample depth at every called site (will add more ref-related annotations later), b) Make -glm POOLBOTH work in case we want to genotype snp's and indels together, c) indel bug fix (pool and non-pool): prevent a bad GenomeLoc to be formed if we're running GGA and incoming alleles are larger than ref window size (typically 400 bb) 2012-06-29 11:08:16 -04:00
Eric Banks 96ea334bf2 Disable caching in BQSR for now since it significantly slows down computation; will look into this in a bit. 2012-06-28 15:27:44 -04:00
Ryan Poplin 05791ebf80 Adding the Clipping rank sum test: If alternate-supporting reads have more hard clipping than reference-supporting reads this is evidence for error. 2012-06-28 13:22:56 -04:00
Ryan Poplin d12ec92a55 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-28 12:57:59 -04:00
Ryan Poplin 5bb0693888 Bug fix for HC GGA mode. Shouldn't try to add an indel into the haplotype if that haplotype already contains the event of interest. Misc minor assembly param changes. Turning off capping of base qualities by base indel qualities until we can evaluate that change. 2012-06-28 12:57:51 -04:00
Khalid Shakir 1ce0b9d519 Throwing UnknownTribbleType exception instead of CommandLineException when an unknown tribble type is specified. 2012-06-28 11:28:04 -04:00
Mark DePristo 734bb5366b Special case the situation where we have ploidy == 0 (no GT values) to implicitly assume we have diploid samples
-- numLikelihoods no longer allows even ploidy == 0 in requires
-- VCFCompoundHeaderLine handles the case where ploidy == 0 => implicit ploidy == 2
2012-06-28 10:06:07 -04:00
Mark DePristo 064cc56335 Update integration tests to reflect new FT header line standard and new DiagnoseTargets field names 2012-06-28 10:06:06 -04:00
Mark DePristo 64d7e93209 Massive bugfixes
-- Previous version was reading the size of the encoded genotypes vector for each genotype.  This only worked because I never wrote out genotype field values with > 15 elements.  Mauricio's killer DiagnoseTargets VCF uncovered the bug.  Unfortunately since symbolic allele clipping is still busted those tests are still diabled
-- GenotypeContext getMaxPloidy was returning -1 in the case where there are no genotypes, but the answer should be 0.
2012-06-28 10:06:06 -04:00
Mark DePristo 7144154f53 VCFWriter and BCFWriter no longer allow missing samples in the VC compared to their header
-- They now throw an error, as its really unsafe to write out ./. as a special case in the VCFWriter as occurred previously.
-- Added convenience method in VariantContextUtils.addMissingSamples(vc, allSamples) that returns a complete VC where samples are given ./. Genotype objects
-- This allows us to properly pass tests of creating / writing / reading VCFs and BCFs, which previously differed because the VC from the VCF would actually be different from its original VC
-- Updated UG, UGEngine, GenotypeAndValidateWalker, CombineVariants, and VariantsToVCF to manage the master list of samples they are writing out and addMissingSamples via the VCU function
2012-06-28 10:06:06 -04:00
Mark DePristo 4811a00891 GENOTYPE_FILTER_KEY is now a VCFStandardHeaderLine 2012-06-28 10:06:05 -04:00
Mark DePristo 93426a44b1 Fixes for DiagnoseTargets to be VCF/BCF2 spec complaint
-- Don't use DP for average interval depth but rather AVG_INTERVAL_DP, which is a float now, not an int
-- Don't add PASS filter value to genotypes, as this is actually considered failing filters in the GATK.  Genotype filters should be empty for PASSing sites
2012-06-28 10:06:05 -04:00
Eric Banks dc7636b923 Refactor the ContextCovariate to significantly reduce runtime 2012-06-28 02:29:35 -04:00
Eric Banks 1fafd9f6c8 NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation. 2012-06-27 16:55:49 -04:00
Khalid Shakir 746a5e95f3 Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding.
Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals.
IntervalUtils return unmodifiable lists so that utilities don't mutate the collections.
Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus.
Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values.
JobRunInfo handles the null done times when jobs crash with strange errors.
2012-06-27 01:15:22 -04:00
Mark DePristo 016b25be87 Update annoying md5s in unit tests, also failing because of header fixing 2012-06-26 17:32:42 -04:00
Mark DePristo cd32b6ae54 CombineVariantsUnitTest was failing because the header repair was fixing the problem it wanted to detect 2012-06-26 17:32:42 -04:00
Mark DePristo 1f45551a15 Bugfixes to G count types in VCF header
-- Previously VCF header lines of count type G assumed that the sample would be diploid.
-- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class
-- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class
-- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations
-- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples
-- Added extensive unit tests that compare A and G type values in genotypes
2012-06-26 15:28:34 -04:00
Mark DePristo 7ef5ce28cc VariantRecalibrator test currently doesn't work with shadowBCF 2012-06-26 15:28:34 -04:00
Mark DePristo 5f5885ec78 Updating many MD5s to reflect correct fixed headers
-- Previous bugfix ensures that header fixing is always on in the GATK by default, even after integration tests that failed and when through the VCFDiffableReader.  Updating md5s to reflect this.
2012-06-26 15:28:34 -04:00
Mark DePristo 39c849aced Bugfix to ensure the DB=1 old files decode properly 2012-06-26 15:28:33 -04:00
Mark DePristo c1ac0e2760 BCF2 cleanup
-- allowMissingVCFHeaders is now part of -U argument.  If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING.  Updated lots of files to use this
-- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup.  This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.
2012-06-26 15:28:33 -04:00
Mark DePristo 0b5980d7b3 Added Heng's nasty ex2.vcf to standard tests 2012-06-26 15:28:33 -04:00
Mark DePristo 11dbfc92a7 Horrible bugfix to decodeLoc() in BCF2Codec
-- Just completely wrong.
-- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem
-- Added samtools ex2.bcf file for decoding to our integrationtests
2012-06-26 15:28:32 -04:00
Mark DePristo fb26c0f054 Update integration tests to reflect header changes 2012-06-26 15:28:32 -04:00
Mark DePristo 7b96263f8b Disable shadowBCF for VariantRecalibrationWalkers tests because it cannot handle symbolic alleles yet 2012-06-26 15:28:32 -04:00
Mark DePristo 7dbba465ee Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf 2012-06-26 15:28:32 -04:00
Mark DePristo 6e9a81aabe Minor bugfix -- now that the testfile is in our testdata regenerate the idx file as needed to pass tests 2012-06-26 15:28:32 -04:00
Roger Zurawicki 7eb3e4da41 Added integration Tests for DiagnoseTargets
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-25 17:02:46 -04:00
Joel Thibault f0c54d99ed Account for a null attributes object
* field attributesCanBeModified - a null attributes object can't be modified in its current state
* method makeAttributesModifiable() - initialize a null attributes object to empty
2012-06-25 12:07:36 -04:00
Joel Thibault d0cf8bcc80 Add unit tests for VariantContextBuilder.rmAttribute() and .attribute()
* These generated NPEs when the attribute object is null
2012-06-25 12:05:04 -04:00
Joel Thibault fd9effbfe2 Fix Exception typo 2012-06-25 12:05:04 -04:00
Ryan Poplin 429ad44421 Bug fix for read pos rank sum test annotation. Shouldn't be using the un-hardclipped start as the alignment start. 2012-06-22 14:53:29 -04:00
Ryan Poplin 735b59d942 Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero. 2012-06-22 12:38:48 -04:00
Ryan Poplin 0650b349d7 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-22 10:42:49 -04:00
Guillermo del Angel eed32df30d a) Sanity check in PoolCaller: if user didn't specify correct -glm or -pnrm models then error out with useful message, b) Have VariantsToTable deal with case where sample namess have spaces: technically they're allowed (or at least not explicitly forbidden) but they'll produce R-incompatible tables. TBD which other tools have issues, or whether there's a generic fix for this 2012-06-21 21:19:55 -04:00
Mark DePristo d17369e0ac A few misc. residual errors in last commit 2012-06-21 16:04:25 -04:00
Mark DePristo 734756d6b2 Final fixes before BCF2 mark III push
-- Added MLEAC and MLEAF format lines to PoolCallerWalker
-- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation
-- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing
2012-06-21 15:17:22 -04:00
Mark DePristo 31ee8aa01a JEXL update
-- Update to 2.1.1 from 2.0
-- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching.  So "AF < 0.5" works even in the presence of multi-allelics now.
--
2012-06-21 15:17:21 -04:00
Mark DePristo 549293b6f7 Bugfixes towards final BCF2 implementation
-- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF
-- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing
-- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one
-- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing
-- Keys whose value is null are put into the VariantContext info attributes now
2012-06-21 15:17:21 -04:00
Mark DePristo 567dba0f76 Cleanup of VCF header lines and constants, BCF2 bugfixes
-- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller
-- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place.  Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers
-- VCF parsers now automatically repair standard VCF header lines when reading the header
-- Updating integration tests to reflect header changes
-- Created private and public testdata directories (public/testdata and private/testdata).  Updated tests to use test
-- SelectHeaders now always updates the header to include the contig lines
-- SelectVariants add UG header lines when in regenotype mode
-- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY
-- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs)
-- Throw error when VCF has unbounded non-flag values that don't have = value bindings
-- By default we no longer allow writing of BCF2 files without contig lines in the header
2012-06-21 15:16:31 -04:00
Mark DePristo fba7dafa0e Finalizing BCF2 mark III commit
-- Moved GENOTYPE_KEY vcf header line to VCFConstants.  This general migration and cleanup is on Eric's plate now
-- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header.  Still doesn't work...
-- Updating integration test files.  Moved many more files into public/testdata.  Updated their headers to all work correctly with new strict VCF header checking.
-- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt
-- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine.  DB = 0 is never seen in the output VCFs now
-- Fixed bug in VCFDiffableReader that didn't differeniate between "." and "PASS" VC filter status
-- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be
-- VariantsToVCF now properly writes out the GT FORMAT field
-- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code.  Eric said he and Ami will clean up this whole piece of instructure
-- Fixed bug in BCF2Codec that wasn't setting the phase field correctly.  UnitTested now
-- PASS string now added at the end of the BCF2 dictionary after discussion with Heng
-- Fixed bug where I was writing out all field values as BigEndian.  Now everything is LittleEndian.
-- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException
-- Cleaned up unused code
-- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding
-- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples
-- We always write the number of genotype samples into the BCF2 nSamples header.  How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names...
-- Removed old filtersWereAppliedToContext code in VCF as properly handle unfiltered, filtered, and PASS records internally
-- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele
-- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values.  Genotype objects are all PASS implicitly, or explicitly filtered.  We only write out the FT values if at least one sample is filtered.  Removed interface functions and cleaned up code
-- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works.  In general, **** NEVER COPY CODE **** if you need to share funcitonality make a function, that's why there were invented!
-- Increased the default number of records to read for DiffObjects to 1M
2012-06-21 15:16:27 -04:00
Mark DePristo 0c8b830db7 Updating MD5s for inclusion of RPA field header 2012-06-21 15:16:26 -04:00
Mark DePristo d015a5738d Bugfixes for VCFWriterUnitTest and TestProvider to deal with stricter VCFWriter behavior 2012-06-21 15:16:26 -04:00
Mark DePristo 9c81f45c9f Phase I commit to get shadowBCFs passing tests
-- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header.  This helps avoid some of the low-level errors I saw in SelectVariants.  This behavior can be disable in the engine with the --allowMissingVCFHeaders argument
-- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA
-- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample
-- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original
-- Performance optimizations in BCF2 codec and writer
    -- using arrays not lists for intermediate data structures
    -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over
-- VCFHeader (which needs a complete rewrite, FYI Eric)
    -- Warn and fix on the way flag values with counts > 0
    -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow.  Duplicates are detected once at header creation.
    -- Explicitly track FILTER fields for efficient lookup in their own hashmap
    -- Automatically add PL field when we see a GL field and no PL field
    -- Added get and has methods for INFO, FILTER, and FORMAT fields
-- No longer add AC and AF values to the INFO field when there's no ALT allele
-- Memory efficient comparison of VCF and BCF files for shadow BCF testing.  Now there's no (memory) constraint on the size of the files we can compare
-- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF
2012-06-21 15:16:26 -04:00
Mauricio Carneiro ab53220635 Refactor on how RR treats soft clips
* Sites with more soft clipped bases than regular will force-trigger a variant region
   * No more unclipping/reclipping, RR machinery now handles soft clips natively.
   * implemented support for base insertion and base deletion quality scores in synthetic and regular reads.
   * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present.

note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!
2012-06-21 14:02:03 -04:00
Ryan Poplin 769e190202 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-20 09:59:55 -04:00
Christopher Hartl fe1d6e3953 Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-19 08:02:00 -04:00
Christopher Hartl 79ef3325bd Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations. 2012-06-19 07:50:13 -04:00
Eric Banks 15ae906f32 Once I was playing with integration tests it was simple to fix the ones I left broken from earlier today. 2012-06-18 21:54:58 -04:00
Eric Banks 62cee2fb5b Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature. 2012-06-18 21:36:27 -04:00
Eric Banks 4393adf9e7 If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it. 2012-06-18 13:36:14 -04:00
Ryan Poplin 707151f0a4 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 12:55:58 -04:00
Eric Banks 82a2c40338 Emit the MLE AC and AF in the INFO field of the UG output 2012-06-18 12:19:36 -04:00
Ryan Poplin 5ec737f008 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-18 08:51:48 -04:00
Ryan Poplin e3147969d9 Smith Waterman parameters have somehow gotten too diverged from what it is used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled. 2012-06-18 08:51:41 -04:00
Eric Banks 677babf546 Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort. 2012-06-15 15:55:03 -04:00
Eric Banks 783b7f6899 Misc cleanup 2012-06-15 10:39:19 -04:00
Eric Banks 0c218e4822 Refactoring mostly for readability (and small performance improvement) 2012-06-15 10:36:41 -04:00