gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	fba7dafa0e	Finalizing BCF2 mark III commit -- Moved GENOTYPE_KEY vcf header line to VCFConstants. This general migration and cleanup is on Eric's plate now -- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header. Still doesn't work... -- Updating integration test files. Moved many more files into public/testdata. Updated their headers to all work correctly with new strict VCF header checking. -- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt -- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine. DB = 0 is never seen in the output VCFs now -- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status -- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be -- VariantsToVCF now properly writes out the GT FORMAT field -- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code. Eric said he and Ami will clean up this whole piece of infrastructure -- Fixed bug in BCF2Codec that wasn't setting the phase field correctly. UnitTested now -- PASS string now added at the end of the BCF2 dictionary after discussion with Heng -- Fixed bug where I was writing out all field values as BigEndian. Now everything is LittleEndian. -- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException -- Cleaned up unused code -- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding -- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples -- We always write the number of genotype samples into the BCF2 nSamples header. How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names... -- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally -- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele -- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values. Genotype objects are all PASS implicitly, or explicitly filtered. We only write out the FT values if at least one sample is filtered. Removed interface functions and cleaned up code -- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works. In general, ** NEVER COPY CODE ** if you need to share functionality make a function, that's why they were invented! -- Increased the default number of records to read for DiffObjects to 1M	2012-06-21 15:16:27 -04:00
Mark DePristo	0c8b830db7	Updating MD5s for inclusion of RPA field header	2012-06-21 15:16:26 -04:00
Mark DePristo	9c81f45c9f	Phase I commit to get shadowBCFs passing tests -- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header. This helps avoid some of the low-level errors I saw in SelectVariants. This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument -- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA -- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample -- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original -- Performance optimizations in BCF2 codec and writer -- using arrays not lists for intermediate data structures -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over -- VCFHeader (which needs a complete rewrite, FYI Eric) -- Warn and fix on the way flag values with counts > 0 -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow. Duplicates are detected once at header creation. -- Explicitly track FILTER fields for efficient lookup in their own hashmap -- Automatically add PL field when we see a GL field and no PL field -- Added get and has methods for INFO, FILTER, and FORMAT fields -- No longer add AC and AF values to the INFO field when there's no ALT allele -- Memory efficient comparison of VCF and BCF files for shadow BCF testing. Now there's no (memory) constraint on the size of the files we can compare -- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF	2012-06-21 15:16:26 -04:00
Eric Banks	15ae906f32	Once I was playing with integration tests it was simple to fix the ones I left broken from earlier today.	2012-06-18 21:54:58 -04:00
Eric Banks	4393adf9e7	If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it.	2012-06-18 13:36:14 -04:00
Eric Banks	82a2c40338	Emit the MLE AC and AF in the INFO field of the UG output	2012-06-18 12:19:36 -04:00
Eric Banks	677babf546	Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort.	2012-06-15 15:55:03 -04:00
Eric Banks	c54e84e739	Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations.	2012-06-15 09:28:56 -04:00
Eric Banks	61fcbcb190	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-15 02:45:57 -04:00
Eric Banks	4895fe2289	No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java)	2012-06-15 02:32:00 -04:00
Mark DePristo	0384ce5d34	Simple optimizations for BCF2Encoder -- Inline encodeString that doesn't go via List<Byte> intermediate -- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2 -- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation -- Final UG integration test update	2012-06-14 16:42:39 -04:00
Mark DePristo	68eed7b313	Optimizations for VCF and BCF2 -- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking. encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version. Lots of contracts. User code updated to use specialized versions where possible -- Misc code refactoring -- Updated VCF float formatting to always include 3 sig digits for values < 1, and 2 for > 1. Updating MD5s accordingly -- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations	2012-06-14 16:42:39 -04:00
Mark DePristo	fbc45e14d3	Cleanup formatting of VCF floats -- Final integrationtest update before commit (and fixing new formatting changes)	2012-06-14 16:42:38 -04:00
Mark DePristo	71da76039e	Final support for variable length lists of strings in BCF2 -- Updating many MD5s as well.	2012-06-14 16:42:38 -04:00
Mark DePristo	bd9d40fb84	Code cleanup and more documentation for BCFFieldWriters -- Update integration tests where appropriate	2012-06-14 16:42:37 -04:00
Mark DePristo	aa2178cc68	Updating MD5s to latest version to reflect inclusion of contigs in headers	2012-06-14 16:42:36 -04:00
Mark DePristo	31997f8092	Bugfixes on the way to passing integration tests -- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate -- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it. Very expensive but convenient. -- Fixed nasty subsetting bug in SelectVariants with excluding samples -- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT -- Bugfix for dropping old style GL field values -- Added test to VCFWriter to ensure that we have the same number of samples in the VC as in the header -- Bugfix for Allele.getBaseString to properly show NO_CALL alleles -- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes	2012-06-14 16:42:33 -04:00
Mark DePristo	dd6aee347a	Genotype encoding uses the BCF2FieldEncoder system	2012-06-14 16:42:33 -04:00
Mark DePristo	7506994d09	Nearing final BCF commit -- Cleanup some (but not all) VCF3 files. Turns out there are lots so... -- Refactored genotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec. Now VCF3 properly handles the new GenotypeBuilder interface -- Misc. bugfixes in GenotypeBuilder	2012-06-14 16:42:32 -04:00
Mark DePristo	8014178f2f	Algorithmically faster version of DiffEngine -- Now only includes leaf nodes in the summary, i.e., summaries of the form ".....*.X", which are really the most valuable to see. This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm -- Now computes the max number of elements to read correctly. Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size. -- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well. -- Added integration test for new leaf and old pairwise calculations -- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields	2012-06-14 16:42:30 -04:00
Mark DePristo	51a3b6e25e	No more makePrecisionFormatStringFromDenominatorValue -- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formatting. -- Created a smart float formatter in VCFWriter, with unit tests -- Removed makePrecisionFormatStringFromDenominatorValue and its uses -- Fix broken contracted -- Refactored some code from the encoder to utils in BCF2 -- HaplotypeCaller's GenotypingEngine was using old version of subset to context. Replaced with a faster call that I think is correct. Ryan, please confirm.	2012-06-14 16:42:30 -04:00
Mark DePristo	6cfb2d1393	Restoring SelectVariantsIntegrationTest	2012-06-14 16:42:28 -04:00
Mark DePristo	982192e2e4	MD5DB for integrationtest management now writes out a md5mismatches files for clean analysis -- This file is in integrationtests/md5mismatches.txt, and looks like: expected observed test 7fd0d0c2d1af3b16378339c181e40611 2339d841d3c3c7233ebba9a6ace895fd test BeagleOutputToVCF 43865f3f0d975ee2c5912b31393842f8 1b9c4734274edd3142a05033e520beac testBeagleChangesSitesToRef daead9bfab1a5df72c5e3a239366118e 27be14f9fc951c4e714b4540b045c2df testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e -- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods	2012-06-14 16:42:27 -04:00
Mark DePristo	d37a8a0bc8	Efficient Genotype object Intermediate commit -- Created a new Genotype interface with a more limited set of operations -- Old genotype object is now SlowGenotype. New genotype object is FastGenotype. They can be used interchangeably -- There's no way to create Genotypes directly any longer. You have to use GenotypeBuilder just like VariantContextBuilder -- Modified lots and lots of code to use GenotypeBuilder -- Added a temporary hidden argument to engine to use FastGenotype by default. Current default is SlowGenotype -- Lots of bug fixes to BCF2 codec and encoder. -- Feature additions -- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes -- Cleaned up semantics of subContextFromSamples. There's one function that either rederives or not the alleles from the subsetted genotypes -- MASSIVE BUGFIX in SelectVariants. The code has been decoding genotypes always, even if you were not subsetting down samples. Fixed!	2012-06-14 16:42:24 -04:00
Eric Banks	0398ae9695	I hate these disabled unit tests, #2	2012-06-14 15:19:27 -04:00
Eric Banks	676a57de7b	I hate these disabled unit tests	2012-06-14 14:03:58 -04:00
Eric Banks	29a74908bb	The next round of BQSR optimizations: no more Long[] array creation	2012-06-14 00:05:42 -04:00
David Roazen	0550b27799	Make downsampler classes themselves generic (instead of just the Downsampler interface) This is in response to a request from Mauricio to make it easier to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords) without having to do any cumbersome typecasting. Sadly, Java language limitations make this sort of solution the best choice. Thanks to Khalid for his feedback on this issue. Also: -added a unit test to verify GATKSAMRecord support with no typecasting required -added some unit tests for the FractionalDownsampler that Mauricio will/might be using -moved classes from private to public to better sync up with my local development branch for engine integration	2012-06-13 16:43:39 -04:00
Eric Banks	bb77aa88c3	Drat, forgot the unit tests again	2012-06-12 19:00:47 -04:00
Eric Banks	0f79adb2aa	Changing more Java Lists to native arrays in BQSR for performance optimization.	2012-06-12 15:41:01 -04:00
Eric Banks	a96c5da884	Oops, forgot to push the unit tests	2012-06-12 11:38:30 -04:00
Eric Banks	891ce51908	Refactoring of BQSRv2 to use longs (and standard bit fiddling techniques) instead of Java BitSets for performance improvements.	2012-06-12 09:19:36 -04:00
Ryan Poplin	0a37e19998	Bug fix in VQSR so that the VCF index will be created for the recalFile.	2012-06-08 11:51:28 -04:00
Eric Banks	8405156ae1	Refactored VariantsToTable so that 1) genotype-level fields can be specified (stabilized and supported code) and 2) the --moltenize argument could be supported to produce molten output of the data. Added tests that cover these capabilities.	2012-06-04 14:28:32 -04:00
Roger Zurawicki	b8b139841d	DiagnoseTargets with working Q1,Median,Q3 - Merged Roger's metrics with Mauricio's optimizations - Added Stats for DiagnoseTargets - now has functions to find the median depth, and upper/lower quartile - the REF_N callable status is implemented - The walker now runs efficiently - Diagnose Targets accepts overlapping intervals - Diagnose Targets now checks for bad mates - The read mates are checked in a memory efficient manner - The statistics thresholds have been consolidated and moved outside of the statistics classes and into the walker. - Fixed some bugs - Removed rod binding Added more Unit tests - Test callable statuses on the locus level - Test bad mates - Changed NO_COVERAGE -> COVERAGE_GAPS to avoid confusion Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-05-29 10:16:45 -04:00
Mark DePristo	08de4dfd96	Missed one integration test	2012-05-29 07:23:24 -04:00
Mark DePristo	454c8e63e6	Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s -- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec. As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)	2012-05-28 20:20:05 -04:00
Mark DePristo	06b02e1b9b	Update MD5s to reflect new limited output of DiffObjectsWalkers -- Also updated GQ change in VCFIntegrationTest	2012-05-27 11:20:47 -04:00
Guillermo del Angel	a6ee4f98b5	Yet More missing md5's	2012-05-25 17:21:47 -04:00
Guillermo del Angel	175bb35e70	Made TandemRepeatAnnotator standard annotation. HRun no longer standard (superseded by former)	2012-05-25 12:56:23 -04:00
Mark DePristo	7280cdf937	Bugfixes and testdata cleanup -- Cut down the size of a few large files in public/testdata that were only used in part -- Refactor vcf Filename => shadow BCF filename to BCF2Utils. Fix bug in WalkerTest due to the way this was handled previously	2012-05-24 13:26:05 -04:00
Mark DePristo	e9c22b9aad	Final updates to integration tests for BCF2 -- Fully working version -- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf -- Moved MedianUnitTest to its proper home in Utils -- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests. From this website it's easy to see md5 diffs, etc. This is a vastly better way to manage unit and integration test output	2012-05-24 10:58:59 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Mark DePristo	f77d2e6965	Renamed NO_HEADER to the more accurate no_cmdline_in_header -- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well	2012-05-24 10:57:08 -04:00
Khalid Shakir	e57cd78bba	Killed two more resource leakers that ignored requests to close wrapped file pointers, and added Unit Tests for each. This bug will happen in all adapter/wrapper classes that are passed a resource, and then in their close method they ignore requests to close the wrapped resource, causing a leak when the adapter is the only one left with a reference to the resource. Ex: public Wrapper getNewWrapper(File path) { FileStream myStream = new FileStream(path); // This stream must be eventually closed. return new Wrapper(myStream); } public void close(Wrapper wrapper) { wrapper.close(); // If wrapper.close() does nothing, NO ONE else has a reference to close myStream. }	2012-05-21 15:41:56 -04:00
Eric Banks	26968ae8eb	Forgot that the VCFStreamingIntegrationTest uses VE	2012-05-18 02:51:53 -04:00
Eric Banks	52c206d5db	Has anyone else ever noticed that the DiffEngine outputs were always doubled for some reason? That no longer happens with the new reports.	2012-05-18 02:32:20 -04:00
Eric Banks	03d40272c8	Removed old GATKReport code and moved the new stuff in its place.	2012-05-18 01:44:31 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Guillermo del Angel	ae26f0fe14	a) Fully functional and working multiallelic exact model for pools. Needs cleanup/more testing. b) Better unit test for pool genotype likelihoods - it now optionally generates actual noisy pileups that can be used for assessing GL accuracy, c) Totally experimental, hidden option in VariantsToTable to output genotype fields. Specifying -GF will output columns of form Sample.FieldName - needs also more testing	2012-05-14 10:55:35 -04:00

525 Commits (fba7dafa0efdfc2be83bd39b797aecf1793a854c)