gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	9242f63a4d	On the way to really sorting out HMS error handling -- Better error message when a traversal error occurs (a real bug) -- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50 times -- A bit of cleanup in WalkerTest	2012-07-25 22:11:10 -04:00
Mark DePristo	7b96263f8b	Disable shadowBCF for VariantRecalibrationWalkers tests because it cannot handle symbolic alleles yet	2012-06-26 15:28:32 -04:00
Mark DePristo	567dba0f76	Cleanup of VCF header lines and constants, BCF2 bugfixes -- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller -- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers -- VCF parsers now automatically repair standard VCF header lines when reading the header -- Updating integration tests to reflect header changes -- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test -- SelectHeaders now always updates the header to include the contig lines -- SelectVariants add UG header lines when in regenotype mode -- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY -- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs) -- Throw error when VCF has unbounded non-flag values that don't have = value bindings -- By default we no longer allow writing of BCF2 files without contig lines in the header	2012-06-21 15:16:31 -04:00
Mark DePristo	9c81f45c9f	Phase I commit to get shadowBCFs passing tests -- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header. This helps avoid some of the low-level errors I saw in SelectVariants. This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument -- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA -- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample -- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original -- Performance optimizations in BCF2 codec and writer -- using arrays not lists for intermediate data structures -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over -- VCFHeader (which needs a complete rewrite, FYI Eric) -- Warn and fix on the way flag values with counts > 0 -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow. Duplicates are detected once at header creation. -- Explicitly track FILTER fields for efficient lookup in their own hashmap -- Automatically add PL field when we see a GL field and no PL field -- Added get and has methods for INFO, FILTER, and FORMAT fields -- No longer add AC and AF values to the INFO field when there's no ALT allele -- Memory efficient comparison of VCF and BCF files for shadow BCF testing. Now there's no (memory) constraint on the size of the files we can compare -- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF	2012-06-21 15:16:26 -04:00
Mark DePristo	2a86b81a3f	Initial version of clean, fast formatting routines built dynamically from a VCF header -- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level. -- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values -- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder -- Allowed us to naturally support encoding of lists of strings -- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion -- Fixes for integration test failures -- Enabling contig updates -- WalkerTest now prints out relative paths where possible to make cut/paste/run easier	2012-06-14 16:42:30 -04:00
Mark DePristo	982192e2e4	MD5DB for integrationtest management now writes out an md5mismatches file for clean analysis -- This file is in integrationtests/md5mismatches.txt, and looks like: expected observed test 7fd0d0c2d1af3b16378339c181e40611 2339d841d3c3c7233ebba9a6ace895fd test BeagleOutputToVCF 43865f3f0d975ee2c5912b31393842f8 1b9c4734274edd3142a05033e520beac testBeagleChangesSitesToRef daead9bfab1a5df72c5e3a239366118e 27be14f9fc951c4e714b4540b045c2df testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e -- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods	2012-06-14 16:42:27 -04:00
Mark DePristo	a648b5e65e	First step towards an efficient Genotype object -- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness. Tested utility of this approach by rewriting -- and then commenting out -- a path in BCF2Codec that could use this new code. Much cleaner interface now, but not yet hooked up to anything -- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class -- Code cleanup. Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS. Uses in code replaced with constant	2012-06-14 16:42:23 -04:00
Mark DePristo	8fc1a26ac7	Fixed comparison of VCFHeader as the set.equals() isn't working as expected	2012-06-14 16:42:22 -04:00
Mark DePristo	5fda16bea9	Enable shadow BCF2	2012-06-14 16:42:22 -04:00
Mark DePristo	454c8e63e6	Made GQ an int, not a float. Updated VC code and lots of corresponding MD5s -- VCFWriter / codec now passes the same rigorous UnitTest as the BCF2 writer / codec. As part of this we now can only test doubles for equivalence in VCFs to 1e-2 (not exactly impressive)	2012-05-28 20:20:05 -04:00
Mark DePristo	5894d045cb	Bugfixes and code cleanup throughout so BCF2 passes VC -> BCF -> VC tests -- This version of BCF should actually work properly for most files, assuming headers are properly defined. -- Lots of bug fixes to BCF2 codec -- Genotype getPhredScaledQual is now an int, returning -1 if there's no QUAL. NOTE THIS SEMANTICS change -- Equals() method for GenotypeLikelihoods, using PLs. -- VCFCodec no longer adds empty bindings to missing input field values. NOTE THIS CHANGE -- VCs can be marked as fully decoded, so that when fullyDecode() is called it returns itself, instead of doing the decoding work. The BCF2 codec now makes VCs marked as fully decoded -- stringToBytes returns empty list for null or "" string in BCF2Encoder -- Proper handling of genotype ordering in BCF2 reader / writer -- Removed the crazy slow noDups and sameSamples tests that were slowing down unit and integration tests totally unnecessarily -- Many failing MD5s now due to double -> int change in GQ, will update later	2012-05-27 11:17:17 -04:00
Mark DePristo	d6df817174	Oops, don't enable shadow BCF tests	2012-05-24 13:31:13 -04:00
Mark DePristo	0a86564669	Updated test files didn't make it into last push	2012-05-24 13:29:44 -04:00
Mark DePristo	7280cdf937	Bugfixes and testdata cleanup -- Cut down the size of a few large files in public/testdata that were only used in part -- Refactor vcf Filename => shadow BCF filename to BCF2Utils. Fix bug in WalkerTest due to the way this was handled previously	2012-05-24 13:26:05 -04:00
Mark DePristo	e9c22b9aad	Final updates to integration tests for BCF2 -- Fully working version -- Use -generateShadowBCF to write out foo.bcf as well as foo.vcf anywhere you use -o foo.vcf -- Moved MedianUnitTest to its proper home in Utils -- Added reportng to ivy and testng, so build/report/X/html/ is a nicely formatted output for Unit and Integration tests. From this website it's easy to see md5 diffs, etc. This is a vastly better way to manage unit and integration test output	2012-05-24 10:58:59 -04:00
Mark DePristo	6ca71fe3b4	GATK tests use public/testdata not /humgen/ as much as possible	2012-05-24 10:58:58 -04:00
Mark DePristo	69ee4d0454	Moved getMetaDataForField to VariantContextUtils	2012-05-24 10:57:09 -04:00
Mark DePristo	cb13f16e90	WalkerTest infrastructure to generate and test shadowBCF file for every generated VCF file -- Currently disabled	2012-05-24 10:57:09 -04:00
Mark DePristo	58c470a6c5	Rev'ing Tribble from 53 to 94 -- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code -- Integrationtests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase	2012-05-03 07:31:47 -04:00
Mark DePristo	27e7e17dc7	New way to handle exceptions in multi-threaded GATK -- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now. -- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer -- Better printing of stack traces in WalkerTest	2012-04-13 09:23:33 -04:00
Mark DePristo	e85e9a8cf5	More extensive testing of type of error thrown in multi-threaded walker test -- Unfortunately the result of the multi-threaded test is non-deterministic so run the test 10 times to see if the right exception is always thrown -- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs	2012-04-13 09:23:33 -04:00
Eric Banks	5c5d8e7cd3	Minor: cleaner way of turning off index-on-the-fly checking in case we want to turn it back on.	2012-03-18 00:53:29 -04:00
Guillermo del Angel	a05a7f287d	TMP: disable checking of whether on the fly index is equal to index after run completed	2012-03-16 21:14:45 -04:00
David Roazen	0702ee1587	Public-key authorization scheme to restrict use of NO_ET -Running the GATK with the -et NO_ET or -et STDOUT options now requires a key issued by us. Our reasons for doing this, and the procedure for our users to request keys, are documented here: http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home -A GATK user key is an email address plus a cryptographic signature signed using our private key, all wrapped in a GZIP container. User keys are validated using the public key we now distribute with the GATK. Our private key is kept in a secure location. -Keys are cryptographically secure in that valid keys definitely came from us and keys cannot be fabricated, however keys are not "copy-protected" in any way. -Includes private, standalone utilities to create a new GATK user key (GenerateGATKUserKey) and to create a new master public/private key pair (GenerateKeyPair). Usage of these tools will be documented on the internal wiki shortly. -Comprehensive unit/integration tests, including tests to ensure the continued integrity of the GATK master public/private key pair. -Generation of new user keys and the new unit/integration tests both require access to the GATK private key, which can only be read by members of the group "gsagit".	2012-03-06 00:09:43 -05:00
Mark DePristo	463eab7604	All MD5 mismatches for test are shown -- Now for tests like DoC, with 20 output md5s, you see all of the differences before failing.	2011-10-04 15:53:52 -07:00
Mark DePristo	b7511c5ff3	Fixed long-standing bug in tribble index creation -- Previously, on the fly indices didn't have dictionary set on the fly, so the GATK would read, add dictionary, and rewrite the index. This is now fixed, so that the on the fly index contains the reference dictionary when first written, avoiding the unnecessary read and write -- Added a GenomeAnalysisEngine and Walker function called getMasterSequenceDictionary() that fetches the reference sequence dictionary. This can be used conveniently everywhere, and is what's written into the Tribble index -- Refactored tribble index utilities from RMDTrackBuilder into IndexDictionaryUtils -- VCFWriter now requires the master sequence dictionary -- Updated walkers that create VCFWriters to provide the master sequence dictionary	2011-09-20 10:53:18 -04:00
Mark DePristo	d6e2e89f99	Walker test system refactoring. All MD5DB related functions are now in MD5DB.java. System has the concept of a local and a global MD5 db. The local one is like it operated previously. The global one lives in /humgen/gsa-hpprojects/GATK/data/integrationtests. If the system can find this directory then MD5s will also be read / written to this location. This means that gsabamboo will print differences as appropriate. And all users will in effect have access to a complete history of MD5 file results. A few minor code reshuffles changed VariantRecalibration and VCFHeader test files.	2011-07-18 10:46:01 -04:00
Mark DePristo	4db2b13e9e	Rev tribble. Just added more documentation for diffEngine and pointer to new wiki: http://www.broadinstitute.org/gsa/wiki/index.php/DiffEngine	2011-07-17 13:05:04 -04:00
Mark DePristo	eacf205f40	Tests needed to be updated to reflect the code reorg of tribble.	2011-07-16 09:22:34 -04:00
Mark DePristo	c0bbeb23ba	Now providing more information when the index on the fly isn't equal to the one created by reading the file from disk.	2011-07-14 15:12:28 -04:00
David Roazen	3c9497788e	Reorganized the codebase beneath top-level public and private directories, removing the playground and oneoffprojects directories in the process. Updated build.xml accordingly.	2011-06-28 06:55:19 -04:00

31 Commits (9242f63a4d1185ef0a146b7fb354d2450752390b)