Commit Graph

9876 Commits (bcd2e13d8bb1e1d4c9493030ef06c8775dadaa35)

Author SHA1 Message Date
Mark DePristo bcd2e13d8b Adding duplicate header line keys is a logger.debug not logger.warn message now 2012-07-02 11:39:34 -04:00
Mark DePristo 01e04992f8 Fixed compatibilities in AbstractVCFCodec that resulted in key=; being parsed as written as key; in VCF output 2012-07-02 11:38:59 -04:00
Mark DePristo 337e434c01 Test files for LENIENT_VCF_PROCESSING and repairing VCF headers 2012-07-01 15:52:04 -04:00
Mark DePristo 7aff4446d4 Added unit tests for header repairing capabilities in the GATK engine 2012-07-01 15:38:10 -04:00
Mark DePristo 480b32e759 BCF2 is now officially zero-based open-interval, and that's how the GATK does it now 2012-07-01 14:59:27 -04:00
Mark DePristo 9b87dcda4f Fixing remaining integration test errors. Adding missing ex2.bcf 2012-06-30 16:23:11 -04:00
David Roazen d7d140eb10 Move archived qscripts from private/scala to private/archive/scala 2012-06-30 13:15:57 -04:00
David Roazen 6dc8f7c049 Revert "Moving scala archive files to the actual archive"
This reverts commit 1ea2cbeda707f3aa72c18221ee97c9b1f3560884.
2012-06-30 13:04:40 -04:00
David Roazen 1ff2384c0c Revert "Removing extra directory level"
This reverts commit f83536ecefb7cc2414bbc5e0381c2208ab4eab68.
2012-06-30 13:04:20 -04:00
Mark DePristo 40c03ab2f5 Removing extra directory level 2012-06-30 12:16:09 -04:00
Mark DePristo 955bab0802 Moving scala archive files to the actual archive 2012-06-30 11:37:27 -04:00
Mark DePristo 5ad9a98a15 Minor bugfixes / consistency fixes to filter strings of Genotypes and AC/AF annotations
-- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order
-- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values
2012-06-30 11:22:49 -04:00
Mark DePristo 6ebefcc89f Added contig and info field header lines 2012-06-30 11:22:49 -04:00
Mark DePristo 385a3c630f Added check in VariantContext.validate to ensure that getEnd() == END value when present
-- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward
-- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF
2012-06-30 11:22:48 -04:00
Mark DePristo 893630af53 Enabling symbolic alleles in BCF2
-- Bugfix for VCFDiffableReader: don't add null filters to object
-- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles
-- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication.  Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where the there's a biallelic symbolic allele
-- Brand new home for allele clipping / padding routines in VCFAlleleClipper.  Actually documented this code, which results in lots of **** negative comments on the code quality.  Eric has promised that he and Ami are going to rethink this code from scratch.  Fixed many nasty bugs in here, cleaning up unnecessary branches, etc.  Added UnitTests in VCFAlleleClipper that actually test the code full.  In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how this tests need to be fixed.  Even introduced some minor optimizations
-- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason.  Fixed.
-- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles
--
2012-06-30 11:22:48 -04:00
Mark DePristo 16276f81a1 BCF2 with support symbolic alleles
-- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys
-- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest
2012-06-30 11:22:48 -04:00
Mark DePristo 86feea917e Updating MD5s to reflect new FT fixed count of 1 not UNBOUNDED 2012-06-30 11:22:47 -04:00
Mark DePristo 6bea28ae6f Genotype filters are now just Strings, not Set<String> 2012-06-30 11:22:47 -04:00
Guillermo del Angel f631be8d80 UnifiedGenotyperEngine.calculateGenotypes() is not only used in UG but in other walkers - vc attributes shouldn't be inherited by default or it may cause undefined behaviour in those walkers, so only inherit attributes from input vc in case of UG calling this function 2012-06-29 23:51:52 -04:00
Guillermo del Angel 14274c43d9 Added integration tests for true pools, using a 1 MB extract from the 93-pool large scale validation data. Test -glm BOTH and -glm INDEL 2012-06-29 14:22:33 -04:00
Guillermo del Angel b945fb22a1 Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-29 14:21:46 -04:00
Joel Thibault cbc79443b0 Add intervals file option to Tester scripts 2012-06-29 13:49:47 -04:00
Joel Thibault f95b9848b4 Add new scripts for VCF and BCF testing 2012-06-29 13:49:47 -04:00
Joel Thibault 256704c03d Copy intervals/intersection processing to new MongoDBIntersection file 2012-06-29 13:49:45 -04:00
Guillermo del Angel 65037b87da Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-29 11:08:44 -04:00
Guillermo del Angel 5a9a37ba01 Pool caller improvements: a) Log ref sample depth at every called site (will add more ref-related annotations later), b) Make -glm POOLBOTH work in case we want to genotype snp's and indels together, c) indel bug fix (pool and non-pool): prevent a bad GenomeLoc to be formed if we're running GGA and incoming alleles are larger than ref window size (typically 400 bb) 2012-06-29 11:08:16 -04:00
Joel Thibault e9d750297a Updates for the inline/extended attributes split 2012-06-28 13:38:00 -04:00
Joel Thibault abe74dc32d Navel -> GXDB 2012-06-28 13:38:00 -04:00
Khalid Shakir 1ce0b9d519 Throwing UnknownTribbleType exception instead of CommandLineException when an unknown tribble type is specified. 2012-06-28 11:28:04 -04:00
Mark DePristo 734bb5366b Special case the situation where we have ploidy == 0 (no GT values) to implicitly assume we have diploid samples
-- numLikelihoods no longer allows even ploidy == 0 in requires
-- VCFCompoundHeaderLine handles the case where ploidy == 0 => implicit ploidy == 2
2012-06-28 10:06:07 -04:00
Mark DePristo 064cc56335 Update integration tests to reflect new FT header line standard and new DiagnoseTargets field names 2012-06-28 10:06:06 -04:00
Mark DePristo 64d7e93209 Massive bugfixes
-- Previous version was reading the size of the encoded genotypes vector for each genotype.  This only worked because I never wrote out genotype field values with > 15 elements.  Mauricio's killer DiagnoseTargets VCF uncovered the bug.  Unfortunately since symbolic allele clipping is still busted those tests are still diabled
-- GenotypeContext getMaxPloidy was returning -1 in the case where there are no genotypes, but the answer should be 0.
2012-06-28 10:06:06 -04:00
Mark DePristo 7144154f53 VCFWriter and BCFWriter no longer allow missing samples in the VC compared to their header
-- They now throw an error, as its really unsafe to write out ./. as a special case in the VCFWriter as occurred previously.
-- Added convenience method in VariantContextUtils.addMissingSamples(vc, allSamples) that returns a complete VC where samples are given ./. Genotype objects
-- This allows us to properly pass tests of creating / writing / reading VCFs and BCFs, which previously differed because the VC from the VCF would actually be different from its original VC
-- Updated UG, UGEngine, GenotypeAndValidateWalker, CombineVariants, and VariantsToVCF to manage the master list of samples they are writing out and addMissingSamples via the VCU function
2012-06-28 10:06:06 -04:00
Mark DePristo 4811a00891 GENOTYPE_FILTER_KEY is now a VCFStandardHeaderLine 2012-06-28 10:06:05 -04:00
Mark DePristo 93426a44b1 Fixes for DiagnoseTargets to be VCF/BCF2 spec complaint
-- Don't use DP for average interval depth but rather AVG_INTERVAL_DP, which is a float now, not an int
-- Don't add PASS filter value to genotypes, as this is actually considered failing filters in the GATK.  Genotype filters should be empty for PASSing sites
2012-06-28 10:06:05 -04:00
Mark DePristo e8288c78d7 A brutally complex VCF from DiagnosisTargets for testing 2012-06-28 10:06:04 -04:00
Eric Banks dc7636b923 Refactor the ContextCovariate to significantly reduce runtime 2012-06-28 02:29:35 -04:00
Eric Banks 1fafd9f6c8 NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation. 2012-06-27 16:55:49 -04:00
Christopher Hartl 21d5e8a252 Small modifications to the transcript evaluator 2012-06-27 11:29:59 -04:00
Christopher Hartl a1e72ebe4e InsertSizeDistribution was totally busted, and was killing my retrogene pipeline. I'm not sure GATKReports are any longer meant to be dynamically updated, so I've altered it to calculate the histogram internally and dump it to a molten output table. Also the genotyper wants to read in a flat file, so that'll need to be changed in the near future. 2012-06-27 11:25:21 -04:00
Khalid Shakir 746a5e95f3 Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding.
Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals.
IntervalUtils return unmodifiable lists so that utilities don't mutate the collections.
Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus.
Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values.
JobRunInfo handles the null done times when jobs crash with strange errors.
2012-06-27 01:15:22 -04:00
Mark DePristo a5df8f1277 This file shouldn't be under source control 2012-06-26 17:32:42 -04:00
Mark DePristo 016b25be87 Update annoying md5s in unit tests, also failing because of header fixing 2012-06-26 17:32:42 -04:00
Mark DePristo cd32b6ae54 CombineVariantsUnitTest was failing because the header repair was fixing the problem it wanted to detect 2012-06-26 17:32:42 -04:00
Mauricio Carneiro bbd46690e6 fixing conflicts 2012-06-26 17:12:24 -04:00
Mauricio Carneiro 91f02dfd85 fixing pipeline tests (sorry, my bad) 2012-06-26 17:10:58 -04:00
Mark DePristo 1f45551a15 Bugfixes to G count types in VCF header
-- Previously VCF header lines of count type G assumed that the sample would be diploid.
-- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class
-- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class
-- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations
-- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples
-- Added extensive unit tests that compare A and G type values in genotypes
2012-06-26 15:28:34 -04:00
Mark DePristo 7ef5ce28cc VariantRecalibrator test currently doesn't work with shadowBCF 2012-06-26 15:28:34 -04:00
Mark DePristo 5f5885ec78 Updating many MD5s to reflect correct fixed headers
-- Previous bugfix ensures that header fixing is always on in the GATK by default, even after integration tests that failed and when through the VCFDiffableReader.  Updating md5s to reflect this.
2012-06-26 15:28:34 -04:00
Mark DePristo 39c849aced Bugfix to ensure the DB=1 old files decode properly 2012-06-26 15:28:33 -04:00