Commit Graph

791 Commits (7506994d09c4dfb4a66fb23b76f8d8a1aba0714b)

Author SHA1 Message Date
Mark DePristo 7506994d09 Nearing final BCF commit
-- Cleanup some (but not all) VCF3 files.  Turns out there are lots so...
-- Refactored gneotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec.  Now VCF3 properly handles the new GenotypeBuilder interface
-- Misc. bugfixes in GenotypeBuilder
2012-06-14 16:42:32 -04:00
Mark DePristo 8014178f2f Algorithmically faster version of DiffEngine
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see.  This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly.  Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration test for new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
2012-06-14 16:42:30 -04:00
Mark DePristo 2a86b81a3f Initial version of clean, fast formatting routines built dynamically from a VCF header
-- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level.
-- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values
-- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder
-- Allowed us to naturally support encoding of lists of strings
-- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion
-- Fixes for integration test failures
-- Enabling contig updates
-- WalkerTest now prints out relative paths where possible to make cut/paste/run easier
2012-06-14 16:42:30 -04:00
Mark DePristo 51a3b6e25e No more makePrecisionFormatStringFromDenominatorValue
-- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formating.
-- Created a smart float formatter in VCFWriter, with unit tests
-- Removed makePrecisionFormatStringFromDenominatorValue and its uses
-- Fix broken contracted
-- Refactored some code from the encoder to utils in BCF2
-- HaplotypeCaller's GenotypingEngine was using old version of subset to context.  Replaced with a faster call that I think is correct. Ryan, please confirm.
2012-06-14 16:42:30 -04:00
Mark DePristo 43ad890fcc Finalizing BCF2 v2
-- FastGenotypes are the default in the engine.  Use --useSlowGenotypes engine argument to return to old representation
-- Cleanup of BCF2Codec.  Good error handling.  Added contracts and docs.
-- Added a few more contacts and docs to BCF2Decoder
-- Optimized encodePrimitive in BCF2Encoder
-- Removed genotype filter field exceptions
-- Docs and cleanup of BCF2GenotypeFieldDecoders
-- Deleted unused BCF2TestWalker
-- Docs and cleanup of BCF2Types
-- Faster version of decodeInts in VCFCodec
-- BCF2Writer
    -- Support for writing a sites only file
    -- Lots of TODOs for future optimizations
    -- Removed lack of filter field support
    -- No longer uses the alleleMap from VCFWriter, which was a Allele -> String, now uses Allele -> Integer which is faster and more natural
    -- Lots of docs and contracts
-- Docs for GenotypeBuilder.  More filter creation routines (unfiltered, for example)
-- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters.  Better genotype comparisons
2012-06-14 16:42:29 -04:00
Mark DePristo 6cfb2d1393 Restoring SelectVariantsIntegrationTest 2012-06-14 16:42:28 -04:00
Mark DePristo cfd1e50068 Minor updates to test code 2012-06-14 16:42:28 -04:00
Mark DePristo 54817f8d16 VCFHeaderUnitTest needed to be updated to reflect that we are doing VCF4.1 not VCF4.0 2012-06-14 16:42:28 -04:00
Mark DePristo 982192e2e4 MD5DB for integrationtest management now writes out a md5mismatches files for clean analysis
-- This file is in integrationtests/md5mismatches.txt, and looks like:

expected        observed        test
7fd0d0c2d1af3b16378339c181e40611        2339d841d3c3c7233ebba9a6ace895fd        test BeagleOutputToVCF
43865f3f0d975ee2c5912b31393842f8        1b9c4734274edd3142a05033e520beac        testBeagleChangesSitesToRef
daead9bfab1a5df72c5e3a239366118e        27be14f9fc951c4e714b4540b045c2df        testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e

-- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods
2012-06-14 16:42:27 -04:00
Mark DePristo 249d5e5533 Better tests for Genotype parsing 2012-06-14 16:42:27 -04:00
Mark DePristo 4a4d3cde3d UnitTests for decodeIntArray method 2012-06-14 16:42:27 -04:00
Mark DePristo 17fbd103d0 Smarter infrastructure to decode genotypes in BCF
-- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder.  The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2
-- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder.  In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form
-- Reduced the amount of data further to be computed in the DiffEngine.  The DiffEngine algorithm needs to be rethought to be efficient...
2012-06-14 16:42:25 -04:00
Mark DePristo cebd37609c Finalizing new Genotype object and associated routines
-- Builder now provides a depreciated log10pError function to make a new GQ value
-- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions
-- Lots of contracts
-- Bugfixes throughout
2012-06-14 16:42:25 -04:00
Mark DePristo d37a8a0bc8 Efficient Genotype object Intermediate commit
-- Created a new Genotype interface with a more limited set of operations
-- Old genotype object is now SlowGenotype.  New genotype object is FastGenotype.  They can be used interchangable
-- There's no way to create Genotypes directly any longer.  You have to use GenotypeBuilder just like VariantContextBuilder
-- Modified lots and lots of code to use GenotypeBuilder
-- Added a temporary hidden argument to engine to use FastGenotype by default.  Current default is SlowGenotype
-- Lots of bug fixes to BCF2 codec and encoder.
-- Feature additions
  -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes
  -- Cleaned up semantics of subContextFromSamples.  There's one function that either rederives or not the alleles from the subsetted genotypes

-- MASSIVE BUGFIX in SelectVariants.  The code has been decoding genotypes always, even if you were not subsetting down samples.  Fixed!
2012-06-14 16:42:24 -04:00
Mark DePristo a648b5e65e First step towards an efficient Genotype object
-- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness.  Tested utility of this approach by rewritting -- and then commenting out -- a path in BCF2Codec that could use this new code.  Much cleaner interface now, but not yet hooked up to anything
-- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class
-- Code cleanup.  Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS.  Uses in code replaced with constant
2012-06-14 16:42:23 -04:00
Mark DePristo 8fc1a26ac7 Fixed comparison of VCFHeader as the set.equals() isn't working as expected 2012-06-14 16:42:22 -04:00
Mark DePristo 5fda16bea9 Enable shadow BCF2 2012-06-14 16:42:22 -04:00
Eric Banks 0398ae9695 I hate these disabled unit tests, #2 2012-06-14 15:19:27 -04:00
Eric Banks 676a57de7b I hate these disabled unit tests 2012-06-14 14:03:58 -04:00
Eric Banks 29a74908bb The next round of BQSR optimizations: no more Long[] array creation 2012-06-14 00:05:42 -04:00
David Roazen 0550b27799 Make downsampler classes themselves generic (instead of just the Downsampler interface)
This is in response to a request from Mauricio to make it easier
to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords)
without having to do any cumbersome typecasting. Sadly, Java
language limitations make this sort of solution the best choice.

Thanks to Khalid for his feedback on this issue.

Also:

-added a unit test to verify GATKSAMRecord support with no typecasting required

-added some unit tests for the FractionalDownsampler that Mauricio will/might be using

-moved classes from private to public to better sync up with my local development
branch for engine integration
2012-06-13 16:43:39 -04:00
Eric Banks bb77aa88c3 Drat, forgot the unit tests again 2012-06-12 19:00:47 -04:00
Eric Banks 37f56ce8fd A couple of minor updates to BQSR 2012-06-12 16:12:13 -04:00
Eric Banks 0f79adb2aa Changing more Java Lists to native arrays in BQSR for performance optimization. 2012-06-12 15:41:01 -04:00
Eric Banks a96c5da884 Oops, forgot to push the unit tests 2012-06-12 11:38:30 -04:00
Eric Banks 891ce51908 Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements. 2012-06-12 09:19:36 -04:00
Ryan Poplin 0a37e19998 Bug fix in VQSR so that the VCF index will be created for the recalFile. 2012-06-08 11:51:28 -04:00
Eric Banks 8405156ae1 Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities. 2012-06-04 14:28:32 -04:00
Ryan Poplin 5dd811f84a Adding genotype given alleles mode to the HaplotypeCaller. 2012-05-30 15:07:01 -04:00
Khalid Shakir c3c7f17d90 Updated hard limit MathUtils.MAXN number of samples from 11,000 to 50,000.
Instead of creating a supposed network temporary directory locally which then fails when remote nodes try to access the non-existant dir, now checking to see if they network directory is available and throwing a SkipException to bypass the test when it cannot be run.
TODO: Throw similar SkipExceptions when fastas are not available. Right now instead of skipping the test or failing fast the REQUIRE_NETWORK_CONNECTION=false means that the errors popup later when the networked fastas aren't found.
2012-05-29 11:18:22 -04:00
Roger Zurawicki b8b139841d DiagnoseTargets with working Q1,Median,Q3
- Merged Roger's metrics with Mauricio's optimizations
 - Added Stats for DiagnoseTargets
     - now has functions to find the median depth, and upper/lower quartile
     - the REF_N callable status is implemented
 - The walker now runs efficiently
 - Diagnose Targets accepts overlapping intervals
 - Diagnose Targets now checks for bad mates
 - The read mates are checked in a memory efficient manner
 - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker.
 - Fixed some bugs
 - Removed rod binding

Added more Unit tests

 - Test callable statuses on the locus level
 - Test bad mates

 - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-05-29 10:16:45 -04:00
Mark DePristo 08de4dfd96 Missed one integration test 2012-05-29 07:23:24 -04:00
Mark DePristo 454c8e63e6 Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s
-- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec.  As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)
2012-05-28 20:20:05 -04:00
Mark DePristo 06b02e1b9b Update MD5s to reflect new limited output of DiffObjectsWalkers
-- Also updated GQ change in VCFIntegrationTest
2012-05-27 11:20:47 -04:00
Mark DePristo 5894d045cb Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests
-- This version of BCF should actually work properly for most files, assuming headers are properly defined.
-- Lots of bug fixes to BCF2 codec
-- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL.  NOTE THIS SEMANTICS change
-- Equals() method for GenotypeLikelihoods, using PLs.
-- VCFCodec now longer adds empty bindings to missing input field values.  NOTE THIS CHANGE
-- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work.  The BCF2 codec now makes VCs marked as fully decoded
-- stringToBytes returns empty list for null or "" string in BCF2Encoder
-- Proper handling of genotype ordering in BCF2 reader / writer
-- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily
-- Many failing MD5s now due to double -> int change in GQ, will update later
2012-05-27 11:17:17 -04:00
Mark DePristo 86e5a066fc Even more conservative limit on number of differences to summarize at 1000 2012-05-27 11:17:13 -04:00
Mark DePristo 31f4e5b52e Stop unlimited runtimes in DiffEngine when you have lots of differences
-- Added a new parameter to control the maximum number of pairwise differences to generate, which previously could expand to a very large number when there were lots of differences among genotypes, resulting in a n^2 algorithm running with n > 1,000,000
2012-05-27 11:17:13 -04:00
Guillermo del Angel a6ee4f98b5 Yet More missing md5's 2012-05-25 17:21:47 -04:00
Guillermo del Angel 175bb35e70 Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superceded by former) 2012-05-25 12:56:23 -04:00
Mark DePristo d6df817174 Oops, don't enable shadow BCF tests 2012-05-24 13:31:13 -04:00
Mark DePristo 0a86564669 Updated test files didn't make it into last push 2012-05-24 13:29:44 -04:00
Mark DePristo 7280cdf937 Bugfixes and testdata cleanup
-- Cut down the size of a few large files in public/testdata that were only used in part
-- Refactor vcf Filename => shadow BCF filename to BCF2Utils.  Fix bug in WalkerTest due to the way this was handled previously
2012-05-24 13:26:05 -04:00
Mark DePristo e9c22b9aad Final updates to integration tests for BCF2
-- Fully working version
-- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf
-- Moved MedianUnitTest to its proper home in Utils
-- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests.  From this website it's easy to see md5 diffs, etc.  This is a vastly better way to manage unit and integration test output
2012-05-24 10:58:59 -04:00
Mark DePristo 6ca71fe3b4 GATK tests use public/testdata not /humgen/ as much as possible 2012-05-24 10:58:58 -04:00
Mark DePristo 69ee4d0454 Moved getMetaDataForField to VariantContextUtils 2012-05-24 10:57:09 -04:00
Mark DePristo cb13f16e90 WalkerTest infrastructure to generate and test shadowBCF file for every generated VCF file
-- Currently disabled
2012-05-24 10:57:09 -04:00
Mark DePristo f77d2e6965 Renamed NO_HEADER to the more accurate no_cmdline_in_header
-- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well
2012-05-24 10:57:08 -04:00
Mark DePristo 6f469305ab Don't try to share BCF2 yet 2012-05-24 10:57:06 -04:00
Mark DePristo c8ed0bfc4c Edge case fixes for BCF2
--handle entirely missing GT in a sample in decodeGenotypeAlleles
--Create MAX_ALLELES_IN_GENOTYPES constant in BCF2Utils, and extracted its use inline from the code
-- Generalized genotype writing code to handle ploidy != 2 and variable ploidy among samples
-- Remove special case inline treatment of case where all samples have no GT field values, and moved this into calcVCFGenotypeKeys
-- Removed restriction on getPloidy requiring ploidy > 1.  It's logically find to return 0 for a no called sample
-- getMaxPloidy() in VC that does what it says
-- Support for padding / depadding of generic genotype fields
2012-05-24 10:57:06 -04:00
Mark DePristo 64d4238e2f 99% working version of BCF2 encoder / decoder
-- fixed final bugs with PL encoding / decoding
-- Ready for testing by other members of the group
-- Current performance numbers aren't so great, but they will improve in the next phase of BCF2 optimizations
-- Fixed a nasty bug in the filter field
-- Not that some (many?) GATK tools won't work with BCF because they internally assume values are Strings not their true types

Read 1500 genotypes file in VCF -> VCF : 11 seconds
Read 1500 genotypes file in VCF -> BCF : 9.5 seconds

VariantEval 1500 genotypes file in VCF : 3 seconds
VariantEval 1500 genotypes file in BCF : 3 seconds
2012-05-24 10:57:05 -04:00