Commit Graph

900 Commits (6f8e7692d4e67d5cfb2d2f9e33549ff4fbf2e5f8)

Author SHA1 Message Date
Mauricio Carneiro e93b025b39 Fixing unit test
with the new clipping behavior for weird cigars, we no longer can assert the final number of bases in the unit test, so I'm taking this bit off the unit test.
2012-07-06 12:08:09 -04:00
Mauricio Carneiro 17efbbf8b1 Fixed ReadClipperUnitTest
The behavior of the clipping on weird cigar strings such as 1I1S1H and 9S56H has changed, and the test has to change accordingly.
2012-07-03 16:38:51 -04:00
Eric Banks 22f1afddaa Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 14:55:59 -04:00
Ryan Poplin 9e8e78de15 Adding the model name to the VQSR filter lines so that they don't get clobbered with consecutive VQSR runs for SNPs and then indels. 2012-07-03 14:30:37 -04:00
Eric Banks 0b37d44b0d Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup. 2012-07-03 13:05:11 -04:00
Eric Banks 031322ff00 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-07-03 00:12:59 -04:00
Eric Banks a4670113bd Refactored/renamed the nested integer array; cleaned up code a bit. 2012-07-03 00:12:33 -04:00
Mark DePristo 1b0a775773 Disabling bcf2 reading from samtools because it's 1 basis; updating select variants integrationtest 2012-07-02 15:55:42 -04:00
Eric Banks cac72bce91 Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit. 2012-07-02 14:33:33 -04:00
Mark DePristo 602729c09d Moved parallel tests from SelectVariants to separate SelectVariantsParallelIntegrationTest
-- Enabled previous tests -- all now working
-- Added modern test against new VCF as well
2012-07-02 11:40:28 -04:00
Mark DePristo 7aff4446d4 Added unit tests for header repairing capabilities in the GATK engine 2012-07-01 15:38:10 -04:00
Mark DePristo 9b87dcda4f Fixing remaining integration test errors. Adding missing ex2.bcf 2012-06-30 16:23:11 -04:00
Mark DePristo 385a3c630f Added check in VariantContext.validate to ensure that getEnd() == END value when present
-- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward
-- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF
2012-06-30 11:22:48 -04:00
Mark DePristo 893630af53 Enabling symbolic alleles in BCF2
-- Bugfix for VCFDiffableReader: don't add null filters to object
-- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles
-- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication.  Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where the there's a biallelic symbolic allele
-- Brand new home for allele clipping / padding routines in VCFAlleleClipper.  Actually documented this code, which results in lots of **** negative comments on the code quality.  Eric has promised that he and Ami are going to rethink this code from scratch.  Fixed many nasty bugs in here, cleaning up unnecessary branches, etc.  Added UnitTests in VCFAlleleClipper that actually test the code full.  In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how this tests need to be fixed.  Even introduced some minor optimizations
-- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason.  Fixed.
-- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles
--
2012-06-30 11:22:48 -04:00
Mark DePristo 16276f81a1 BCF2 with support symbolic alleles
-- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys
-- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest
2012-06-30 11:22:48 -04:00
Mark DePristo 86feea917e Updating MD5s to reflect new FT fixed count of 1 not UNBOUNDED 2012-06-30 11:22:47 -04:00
Mark DePristo 6bea28ae6f Genotype filters are now just Strings, not Set<String> 2012-06-30 11:22:47 -04:00
Mark DePristo 064cc56335 Update integration tests to reflect new FT header line standard and new DiagnoseTargets field names 2012-06-28 10:06:06 -04:00
Eric Banks 1fafd9f6c8 NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation. 2012-06-27 16:55:49 -04:00
Khalid Shakir 746a5e95f3 Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding.
Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals.
IntervalUtils return unmodifiable lists so that utilities don't mutate the collections.
Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus.
Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values.
JobRunInfo handles the null done times when jobs crash with strange errors.
2012-06-27 01:15:22 -04:00
Mark DePristo 016b25be87 Update annoying md5s in unit tests, also failing because of header fixing 2012-06-26 17:32:42 -04:00
Mark DePristo cd32b6ae54 CombineVariantsUnitTest was failing because the header repair was fixing the problem it wanted to detect 2012-06-26 17:32:42 -04:00
Mark DePristo 1f45551a15 Bugfixes to G count types in VCF header
-- Previously VCF header lines of count type G assumed that the sample would be diploid.
-- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class
-- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class
-- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations
-- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples
-- Added extensive unit tests that compare A and G type values in genotypes
2012-06-26 15:28:34 -04:00
Mark DePristo 7ef5ce28cc VariantRecalibrator test currently doesn't work with shadowBCF 2012-06-26 15:28:34 -04:00
Mark DePristo 5f5885ec78 Updating many MD5s to reflect correct fixed headers
-- Previous bugfix ensures that header fixing is always on in the GATK by default, even after integration tests that failed and when through the VCFDiffableReader.  Updating md5s to reflect this.
2012-06-26 15:28:34 -04:00
Mark DePristo c1ac0e2760 BCF2 cleanup
-- allowMissingVCFHeaders is now part of -U argument.  If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING.  Updated lots of files to use this
-- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup.  This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.
2012-06-26 15:28:33 -04:00
Mark DePristo 0b5980d7b3 Added Heng's nasty ex2.vcf to standard tests 2012-06-26 15:28:33 -04:00
Mark DePristo 11dbfc92a7 Horrible bugfix to decodeLoc() in BCF2Codec
-- Just completely wrong.
-- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem
-- Added samtools ex2.bcf file for decoding to our integrationtests
2012-06-26 15:28:32 -04:00
Mark DePristo fb26c0f054 Update integration tests to reflect header changes 2012-06-26 15:28:32 -04:00
Mark DePristo 7b96263f8b Disable shadowBCF for VariantRecalibrationWalkers tests because it cannot handle symbolic alleles yet 2012-06-26 15:28:32 -04:00
Mark DePristo 6e9a81aabe Minor bugfix -- now that the testfile is in our testdata regenerate the idx file as needed to pass tests 2012-06-26 15:28:32 -04:00
Roger Zurawicki 7eb3e4da41 Added integration Tests for DiagnoseTargets
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-06-25 17:02:46 -04:00
Joel Thibault d0cf8bcc80 Add unit tests for VariantContextBuilder.rmAttribute() and .attribute()
* These generated NPEs when the attribute object is null
2012-06-25 12:05:04 -04:00
Ryan Poplin 735b59d942 Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero. 2012-06-22 12:38:48 -04:00
Mark DePristo d17369e0ac A few misc. residual errors in last commit 2012-06-21 16:04:25 -04:00
Mark DePristo 734756d6b2 Final fixes before BCF2 mark III push
-- Added MLEAC and MLEAF format lines to PoolCallerWalker
-- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation
-- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing
2012-06-21 15:17:22 -04:00
Mark DePristo 549293b6f7 Bugfixes towards final BCF2 implementation
-- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF
-- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing
-- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one
-- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing
-- Keys whose value is null are put into the VariantContext info attributes now
2012-06-21 15:17:21 -04:00
Mark DePristo 567dba0f76 Cleanup of VCF header lines and constants, BCF2 bugfixes
-- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller
-- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place.  Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers
-- VCF parsers now automatically repair standard VCF header lines when reading the header
-- Updating integration tests to reflect header changes
-- Created private and public testdata directories (public/testdata and private/testdata).  Updated tests to use test
-- SelectHeaders now always updates the header to include the contig lines
-- SelectVariants add UG header lines when in regenotype mode
-- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY
-- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs)
-- Throw error when VCF has unbounded non-flag values that don't have = value bindings
-- By default we no longer allow writing of BCF2 files without contig lines in the header
2012-06-21 15:16:31 -04:00
Mark DePristo fba7dafa0e Finalizing BCF2 mark III commit
-- Moved GENOTYPE_KEY vcf header line to VCFConstants.  This general migration and cleanup is on Eric's plate now
-- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header.  Still doesn't work...
-- Updating integration test files.  Moved many more files into public/testdata.  Updated their headers to all work correctly with new strict VCF header checking.
-- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt
-- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine.  DB = 0 is never seen in the output VCFs now
-- Fixed bug in VCFDiffableReader that didn't differeniate between "." and "PASS" VC filter status
-- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be
-- VariantsToVCF now properly writes out the GT FORMAT field
-- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code.  Eric said he and Ami will clean up this whole piece of instructure
-- Fixed bug in BCF2Codec that wasn't setting the phase field correctly.  UnitTested now
-- PASS string now added at the end of the BCF2 dictionary after discussion with Heng
-- Fixed bug where I was writing out all field values as BigEndian.  Now everything is LittleEndian.
-- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException
-- Cleaned up unused code
-- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding
-- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples
-- We always write the number of genotype samples into the BCF2 nSamples header.  How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names...
-- Removed old filtersWereAppliedToContext code in VCF as properly handle unfiltered, filtered, and PASS records internally
-- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele
-- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values.  Genotype objects are all PASS implicitly, or explicitly filtered.  We only write out the FT values if at least one sample is filtered.  Removed interface functions and cleaned up code
-- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works.  In general, **** NEVER COPY CODE **** if you need to share funcitonality make a function, that's why there were invented!
-- Increased the default number of records to read for DiffObjects to 1M
2012-06-21 15:16:27 -04:00
Mark DePristo 0c8b830db7 Updating MD5s for inclusion of RPA field header 2012-06-21 15:16:26 -04:00
Mark DePristo d015a5738d Bugfixes for VCFWriterUnitTest and TestProvider to deal with stricter VCFWriter behavior 2012-06-21 15:16:26 -04:00
Mark DePristo 9c81f45c9f Phase I commit to get shadowBCFs passing tests
-- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header.  This helps avoid some of the low-level errors I saw in SelectVariants.  This behavior can be disable in the engine with the --allowMissingVCFHeaders argument
-- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA
-- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample
-- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original
-- Performance optimizations in BCF2 codec and writer
    -- using arrays not lists for intermediate data structures
    -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over
-- VCFHeader (which needs a complete rewrite, FYI Eric)
    -- Warn and fix on the way flag values with counts > 0
    -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow.  Duplicates are detected once at header creation.
    -- Explicitly track FILTER fields for efficient lookup in their own hashmap
    -- Automatically add PL field when we see a GL field and no PL field
    -- Added get and has methods for INFO, FILTER, and FORMAT fields
-- No longer add AC and AF values to the INFO field when there's no ALT allele
-- Memory efficient comparison of VCF and BCF files for shadow BCF testing.  Now there's no (memory) constraint on the size of the files we can compare
-- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF
2012-06-21 15:16:26 -04:00
Mauricio Carneiro ab53220635 Refactor on how RR treats soft clips
* Sites with more soft clipped bases than regular will force-trigger a variant region
   * No more unclipping/reclipping, RR machinery now handles soft clips natively.
   * implemented support for base insertion and base deletion quality scores in synthetic and regular reads.
   * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present.

note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!
2012-06-21 14:02:03 -04:00
Eric Banks 15ae906f32 Once I was playing with integration tests it was simple to fix the ones I left broken from earlier today. 2012-06-18 21:54:58 -04:00
Eric Banks 62cee2fb5b Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature. 2012-06-18 21:36:27 -04:00
Eric Banks 4393adf9e7 If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it. 2012-06-18 13:36:14 -04:00
Eric Banks 82a2c40338 Emit the MLE AC and AF in the INFO field of the UG output 2012-06-18 12:19:36 -04:00
Eric Banks 677babf546 Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort. 2012-06-15 15:55:03 -04:00
Eric Banks c54e84e739 Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations. 2012-06-15 09:28:56 -04:00
Eric Banks 61fcbcb190 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-06-15 02:45:57 -04:00
Eric Banks 4895fe2289 No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java) 2012-06-15 02:32:00 -04:00
Mark DePristo 5c23ab0817 Final cleanup of VCFWriterUnitTest 2012-06-14 16:42:39 -04:00
Mark DePristo 0384ce5d34 Simple optimizations for BCF2Encoder
-- Inline encodeString that doesn't go via List<Byte> intermediate
-- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2
-- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation
-- Final UG integration test update
2012-06-14 16:42:39 -04:00
Mark DePristo 68eed7b313 Optimizations for VCF and BCF2
-- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking.  encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version.  Lots of contracts.  User code updated to use specialized versions where possible
-- Misc code refactoring
-- Updated VCF float formating to always include 3 sig digits for values < 1, and 2 for > 1.  Updating MD5s accordingly
-- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations
2012-06-14 16:42:39 -04:00
Mark DePristo fbc45e14d3 Cleanup formatting of VCF floats
-- Final integrationtest update before commit (and fixing new formatting changes)
2012-06-14 16:42:38 -04:00
Mark DePristo 71da76039e Final support for variable length lists of strings in BCF2
-- Updating many MD5s as well.
2012-06-14 16:42:38 -04:00
Mark DePristo bd9d40fb84 Code cleanup and more documentation for BCFFieldWriters
-- Update integration tests where appropriate
2012-06-14 16:42:37 -04:00
Mark DePristo dc07067265 Fix bug in incorrectly reporting relative paths in log 2012-06-14 16:42:37 -04:00
Mark DePristo 856905ee5b Cleanup Genotypes
-- Renamed getAttribute to getExtendedAttribute, as this is really what this function does
-- Added a few more genotype tests
2012-06-14 16:42:36 -04:00
Mark DePristo aa2178cc68 Updating MD5s to latest version to reflect inclusion of contigs in headers 2012-06-14 16:42:36 -04:00
Mark DePristo 31997f8092 Bugfixes on the way to passing integration tests
-- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate
-- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it.  Very expensive but convenient.
-- Fixed nasty subsetting bug in SelectVariants with excluding samples
-- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT
-- Bugfix for dropping old style GL field values
-- Added test to VCFWriter to ensure that we have the sample number of samples in the VC as in the header
-- Bugfix for Allele.getBaseString to properly show NO_CALL alleles
-- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes
2012-06-14 16:42:33 -04:00
Mark DePristo dd6aee347a Genotype encoding uses the BCF2FieldEncoder system 2012-06-14 16:42:33 -04:00
Mark DePristo 7506994d09 Nearing final BCF commit
-- Cleanup some (but not all) VCF3 files.  Turns out there are lots so...
-- Refactored gneotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec.  Now VCF3 properly handles the new GenotypeBuilder interface
-- Misc. bugfixes in GenotypeBuilder
2012-06-14 16:42:32 -04:00
Mark DePristo 8014178f2f Algorithmically faster version of DiffEngine
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see.  This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly.  Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration test for new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
2012-06-14 16:42:30 -04:00
Mark DePristo 2a86b81a3f Initial version of clean, fast formatting routines built dynamically from a VCF header
-- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level.
-- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values
-- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder
-- Allowed us to naturally support encoding of lists of strings
-- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion
-- Fixes for integration test failures
-- Enabling contig updates
-- WalkerTest now prints out relative paths where possible to make cut/paste/run easier
2012-06-14 16:42:30 -04:00
Mark DePristo 51a3b6e25e No more makePrecisionFormatStringFromDenominatorValue
-- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formating.
-- Created a smart float formatter in VCFWriter, with unit tests
-- Removed makePrecisionFormatStringFromDenominatorValue and its uses
-- Fix broken contracted
-- Refactored some code from the encoder to utils in BCF2
-- HaplotypeCaller's GenotypingEngine was using old version of subset to context.  Replaced with a faster call that I think is correct. Ryan, please confirm.
2012-06-14 16:42:30 -04:00
Mark DePristo 43ad890fcc Finalizing BCF2 v2
-- FastGenotypes are the default in the engine.  Use --useSlowGenotypes engine argument to return to old representation
-- Cleanup of BCF2Codec.  Good error handling.  Added contracts and docs.
-- Added a few more contacts and docs to BCF2Decoder
-- Optimized encodePrimitive in BCF2Encoder
-- Removed genotype filter field exceptions
-- Docs and cleanup of BCF2GenotypeFieldDecoders
-- Deleted unused BCF2TestWalker
-- Docs and cleanup of BCF2Types
-- Faster version of decodeInts in VCFCodec
-- BCF2Writer
    -- Support for writing a sites only file
    -- Lots of TODOs for future optimizations
    -- Removed lack of filter field support
    -- No longer uses the alleleMap from VCFWriter, which was a Allele -> String, now uses Allele -> Integer which is faster and more natural
    -- Lots of docs and contracts
-- Docs for GenotypeBuilder.  More filter creation routines (unfiltered, for example)
-- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters.  Better genotype comparisons
2012-06-14 16:42:29 -04:00
Mark DePristo 6cfb2d1393 Restoring SelectVariantsIntegrationTest 2012-06-14 16:42:28 -04:00
Mark DePristo cfd1e50068 Minor updates to test code 2012-06-14 16:42:28 -04:00
Mark DePristo 54817f8d16 VCFHeaderUnitTest needed to be updated to reflect that we are doing VCF4.1 not VCF4.0 2012-06-14 16:42:28 -04:00
Mark DePristo 982192e2e4 MD5DB for integrationtest management now writes out a md5mismatches files for clean analysis
-- This file is in integrationtests/md5mismatches.txt, and looks like:

expected        observed        test
7fd0d0c2d1af3b16378339c181e40611        2339d841d3c3c7233ebba9a6ace895fd        test BeagleOutputToVCF
43865f3f0d975ee2c5912b31393842f8        1b9c4734274edd3142a05033e520beac        testBeagleChangesSitesToRef
daead9bfab1a5df72c5e3a239366118e        27be14f9fc951c4e714b4540b045c2df        testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e

-- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods
2012-06-14 16:42:27 -04:00
Mark DePristo 249d5e5533 Better tests for Genotype parsing 2012-06-14 16:42:27 -04:00
Mark DePristo 4a4d3cde3d UnitTests for decodeIntArray method 2012-06-14 16:42:27 -04:00
Mark DePristo 17fbd103d0 Smarter infrastructure to decode genotypes in BCF
-- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder.  The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2
-- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder.  In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form
-- Reduced the amount of data further to be computed in the DiffEngine.  The DiffEngine algorithm needs to be rethought to be efficient...
2012-06-14 16:42:25 -04:00
Mark DePristo cebd37609c Finalizing new Genotype object and associated routines
-- Builder now provides a depreciated log10pError function to make a new GQ value
-- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions
-- Lots of contracts
-- Bugfixes throughout
2012-06-14 16:42:25 -04:00
Mark DePristo d37a8a0bc8 Efficient Genotype object Intermediate commit
-- Created a new Genotype interface with a more limited set of operations
-- Old genotype object is now SlowGenotype.  New genotype object is FastGenotype.  They can be used interchangable
-- There's no way to create Genotypes directly any longer.  You have to use GenotypeBuilder just like VariantContextBuilder
-- Modified lots and lots of code to use GenotypeBuilder
-- Added a temporary hidden argument to engine to use FastGenotype by default.  Current default is SlowGenotype
-- Lots of bug fixes to BCF2 codec and encoder.
-- Feature additions
  -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes
  -- Cleaned up semantics of subContextFromSamples.  There's one function that either rederives or not the alleles from the subsetted genotypes

-- MASSIVE BUGFIX in SelectVariants.  The code has been decoding genotypes always, even if you were not subsetting down samples.  Fixed!
2012-06-14 16:42:24 -04:00
Mark DePristo a648b5e65e First step towards an efficient Genotype object
-- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness.  Tested utility of this approach by rewritting -- and then commenting out -- a path in BCF2Codec that could use this new code.  Much cleaner interface now, but not yet hooked up to anything
-- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class
-- Code cleanup.  Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS.  Uses in code replaced with constant
2012-06-14 16:42:23 -04:00
Mark DePristo 8fc1a26ac7 Fixed comparison of VCFHeader as the set.equals() isn't working as expected 2012-06-14 16:42:22 -04:00
Mark DePristo 5fda16bea9 Enable shadow BCF2 2012-06-14 16:42:22 -04:00
Eric Banks 0398ae9695 I hate these disabled unit tests, #2 2012-06-14 15:19:27 -04:00
Eric Banks 676a57de7b I hate these disabled unit tests 2012-06-14 14:03:58 -04:00
Eric Banks 29a74908bb The next round of BQSR optimizations: no more Long[] array creation 2012-06-14 00:05:42 -04:00
David Roazen 0550b27799 Make downsampler classes themselves generic (instead of just the Downsampler interface)
This is in response to a request from Mauricio to make it easier
to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords)
without having to do any cumbersome typecasting. Sadly, Java
language limitations make this sort of solution the best choice.

Thanks to Khalid for his feedback on this issue.

Also:

-added a unit test to verify GATKSAMRecord support with no typecasting required

-added some unit tests for the FractionalDownsampler that Mauricio will/might be using

-moved classes from private to public to better sync up with my local development
branch for engine integration
2012-06-13 16:43:39 -04:00
Eric Banks bb77aa88c3 Drat, forgot the unit tests again 2012-06-12 19:00:47 -04:00
Eric Banks 37f56ce8fd A couple of minor updates to BQSR 2012-06-12 16:12:13 -04:00
Eric Banks 0f79adb2aa Changing more Java Lists to native arrays in BQSR for performance optimization. 2012-06-12 15:41:01 -04:00
Eric Banks a96c5da884 Oops, forgot to push the unit tests 2012-06-12 11:38:30 -04:00
Eric Banks 891ce51908 Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements. 2012-06-12 09:19:36 -04:00
Ryan Poplin 0a37e19998 Bug fix in VQSR so that the VCF index will be created for the recalFile. 2012-06-08 11:51:28 -04:00
Eric Banks 8405156ae1 Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities. 2012-06-04 14:28:32 -04:00
Ryan Poplin 5dd811f84a Adding genotype given alleles mode to the HaplotypeCaller. 2012-05-30 15:07:01 -04:00
Khalid Shakir c3c7f17d90 Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000.
Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the non-existant dir, now checking to see if they network directory is available and throwing a SkipException to bypass the test when it cannot be run.
TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors popup later when the networked fastas aren't found.
2012-05-29 11:18:22 -04:00
Roger Zurawicki b8b139841d DiagnoseTargets with working Q1,Median,Q3
- Merged Roger's metrics with Mauricio's optimizations
 - Added Stats for DiagnoseTargets
     - now has functions to find the median depth, and upper/lower quartile
     - the REF_N callable status is implemented
 - The walker now runs efficiently
 - Diagnose Targets accepts overlapping intervals
 - Diagnose Targets now checks for bad mates
 - The read mates are checked in a memory efficient manner
 - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker.
 - Fixed some bugs
 - Removed rod binding

Added more Unit tests

 - Test callable statuses on the locus level
 - Test bad mates

 - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-05-29 10:16:45 -04:00
Mark DePristo 08de4dfd96 Missed one integration test 2012-05-29 07:23:24 -04:00
Mark DePristo 454c8e63e6 Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s
-- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec.  As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)
2012-05-28 20:20:05 -04:00
Mark DePristo 06b02e1b9b Update MD5s to reflect new limited output of DiffObjectsWalkers
-- Also updated GQ change in VCFIntegrationTest
2012-05-27 11:20:47 -04:00
Mark DePristo 5894d045cb Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests
-- This version of BCF should actually work properly for most files, assuming headers are properly defined.
-- Lots of bug fixes to BCF2 codec
-- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL.  NOTE THIS SEMANTICS change
-- Equals() method for GenotypeLikelihoods, using PLs.
-- VCFCodec now longer adds empty bindings to missing input field values.  NOTE THIS CHANGE
-- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work.  The BCF2 codec now makes VCs marked as fully decoded
-- stringToBytes returns empty list for null or "" string in BCF2Encoder
-- Proper handling of genotype ordering in BCF2 reader / writer
-- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily
-- Many failing MD5s now due to double -> int change in GQ, will update later
2012-05-27 11:17:17 -04:00
Mark DePristo 86e5a066fc Even more conservative limit on number of differences to summarize at 1000 2012-05-27 11:17:13 -04:00
Mark DePristo 31f4e5b52e Stop unlimited runtimes in DiffEngine when you have lots of differences
-- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000
2012-05-27 11:17:13 -04:00
Guillermo del Angel a6ee4f98b5 Yet More missing md5's 2012-05-25 17:21:47 -04:00