gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	016b25be87	Update annoying md5s in unit tests, also failing because of header fixing	2012-06-26 17:32:42 -04:00
Mark DePristo	cd32b6ae54	CombineVariantsUnitTest was failing because the header repair was fixing the problem it wanted to detect	2012-06-26 17:32:42 -04:00
Mark DePristo	1f45551a15	Bugfixes to G count types in VCF header -- Previously VCF header lines of count type G assumed that the sample would be diploid. -- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class -- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class -- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations -- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples -- Added extensive unit tests that compare A and G type values in genotypes	2012-06-26 15:28:34 -04:00
Mark DePristo	7ef5ce28cc	VariantRecalibrator test currently doesn't work with shadowBCF	2012-06-26 15:28:34 -04:00
Mark DePristo	5f5885ec78	Updating many MD5s to reflect correct fixed headers -- Previous bugfix ensures that header fixing is always on in the GATK by default, even after integration tests that failed and went through the VCFDiffableReader. Updating md5s to reflect this.	2012-06-26 15:28:34 -04:00
Mark DePristo	39c849aced	Bugfix to ensure the DB=1 old files decode properly	2012-06-26 15:28:33 -04:00
Mark DePristo	c1ac0e2760	BCF2 cleanup -- allowMissingVCFHeaders is now part of -U argument. If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING. Updated lots of files to use this -- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup. This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.	2012-06-26 15:28:33 -04:00
Mark DePristo	0b5980d7b3	Added Heng's nasty ex2.vcf to standard tests	2012-06-26 15:28:33 -04:00
Mark DePristo	11dbfc92a7	Horrible bugfix to decodeLoc() in BCF2Codec -- Just completely wrong. -- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem -- Added samtools ex2.bcf file for decoding to our integrationtests	2012-06-26 15:28:32 -04:00
Mark DePristo	fb26c0f054	Update integration tests to reflect header changes	2012-06-26 15:28:32 -04:00
Mark DePristo	7b96263f8b	Disable shadowBCF for VariantRecalibrationWalkers tests because it cannot handle symbolic alleles yet	2012-06-26 15:28:32 -04:00
Mark DePristo	7dbba465ee	Bugfix for shadow BCFs to not attempt to write to /dev/null.bcf	2012-06-26 15:28:32 -04:00
Mark DePristo	6e9a81aabe	Minor bugfix -- now that the testfile is in our testdata regenerate the idx file as needed to pass tests	2012-06-26 15:28:32 -04:00
Roger Zurawicki	7eb3e4da41	Added integration Tests for DiagnoseTargets Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-06-25 17:02:46 -04:00
Joel Thibault	f0c54d99ed	Account for a null attributes object * field attributesCanBeModified - a null attributes object can't be modified in its current state * method makeAttributesModifiable() - initialize a null attributes object to empty	2012-06-25 12:07:36 -04:00
Joel Thibault	d0cf8bcc80	Add unit tests for VariantContextBuilder.rmAttribute() and .attribute() * These generated NPEs when the attribute object is null	2012-06-25 12:05:04 -04:00
Joel Thibault	fd9effbfe2	Fix Exception typo	2012-06-25 12:05:04 -04:00
Ryan Poplin	429ad44421	Bug fix for read pos rank sum test annotation. Shouldn't be using the un-hardclipped start as the alignment start.	2012-06-22 14:53:29 -04:00
Ryan Poplin	735b59d942	Bug fix in MLEAC calculation for when the exact model says the greedy AC of the alternate allele is zero.	2012-06-22 12:38:48 -04:00
Ryan Poplin	0650b349d7	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-22 10:42:49 -04:00
Guillermo del Angel	eed32df30d	a) Sanity check in PoolCaller: if user didn't specify correct -glm or -pnrm models then error out with useful message, b) Have VariantsToTable deal with case where sample names have spaces: technically they're allowed (or at least not explicitly forbidden) but they'll produce R-incompatible tables. TBD which other tools have issues, or whether there's a generic fix for this	2012-06-21 21:19:55 -04:00
Mark DePristo	d17369e0ac	A few misc. residual errors in last commit	2012-06-21 16:04:25 -04:00
Mark DePristo	734756d6b2	Final fixes before BCF2 mark III push -- Added MLEAC and MLEAF format lines to PoolCallerWalker -- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation -- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing	2012-06-21 15:17:22 -04:00
Mark DePristo	31ee8aa01a	JEXL update -- Update to 2.1.1 from 2.0 -- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching. So "AF < 0.5" works even in the presence of multi-allelics now. --	2012-06-21 15:17:21 -04:00
Mark DePristo	549293b6f7	Bugfixes towards final BCF2 implementation -- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF -- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing -- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one -- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing -- Keys whose value is null are put into the VariantContext info attributes now	2012-06-21 15:17:21 -04:00
Mark DePristo	567dba0f76	Cleanup of VCF header lines and constants, BCF2 bugfixes -- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller -- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers -- VCF parsers now automatically repair standard VCF header lines when reading the header -- Updating integration tests to reflect header changes -- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test -- SelectHeaders now always updates the header to include the contig lines -- SelectVariants add UG header lines when in regenotype mode -- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY -- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs) -- Throw error when VCF has unbounded non-flag values that don't have = value bindings -- By default we no longer allow writing of BCF2 files without contig lines in the header	2012-06-21 15:16:31 -04:00
Mark DePristo	fba7dafa0e	Finalizing BCF2 mark III commit -- Moved GENOTYPE_KEY vcf header line to VCFConstants. This general migration and cleanup is on Eric's plate now -- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header. Still doesn't work... -- Updating integration test files. Moved many more files into public/testdata. Updated their headers to all work correctly with new strict VCF header checking. -- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt -- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine. DB = 0 is never seen in the output VCFs now -- Fixed bug in VCFDiffableReader that didn't differentiate between "." and "PASS" VC filter status -- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be -- VariantsToVCF now properly writes out the GT FORMAT field -- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code. Eric said he and Ami will clean up this whole piece of infrastructure -- Fixed bug in BCF2Codec that wasn't setting the phase field correctly. UnitTested now -- PASS string now added at the end of the BCF2 dictionary after discussion with Heng -- Fixed bug where I was writing out all field values as BigEndian. Now everything is LittleEndian. -- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException -- Cleaned up unused code -- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding -- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples -- We always write the number of genotype samples into the BCF2 nSamples header. How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names... -- Removed old filtersWereAppliedToContext code in VCF as we properly handle unfiltered, filtered, and PASS records internally -- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele -- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values. Genotype objects are all PASS implicitly, or explicitly filtered. We only write out the FT values if at least one sample is filtered. Removed interface functions and cleaned up code -- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works. In general, ** NEVER COPY CODE ** if you need to share functionality make a function, that's why they were invented! -- Increased the default number of records to read for DiffObjects to 1M	2012-06-21 15:16:27 -04:00
Mark DePristo	0c8b830db7	Updating MD5s for inclusion of RPA field header	2012-06-21 15:16:26 -04:00
Mark DePristo	d015a5738d	Bugfixes for VCFWriterUnitTest and TestProvider to deal with stricter VCFWriter behavior	2012-06-21 15:16:26 -04:00
Mark DePristo	9c81f45c9f	Phase I commit to get shadowBCFs passing tests -- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header. This helps avoid some of the low-level errors I saw in SelectVariants. This behavior can be disabled in the engine with the --allowMissingVCFHeaders argument -- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA -- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample -- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original -- Performance optimizations in BCF2 codec and writer -- using arrays not lists for intermediate data structures -- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over -- VCFHeader (which needs a complete rewrite, FYI Eric) -- Warn and fix on the way flag values with counts > 0 -- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow. Duplicates are detected once at header creation. -- Explicitly track FILTER fields for efficient lookup in their own hashmap -- Automatically add PL field when we see a GL field and no PL field -- Added get and has methods for INFO, FILTER, and FORMAT fields -- No longer add AC and AF values to the INFO field when there's no ALT allele -- Memory efficient comparison of VCF and BCF files for shadow BCF testing. Now there's no (memory) constraint on the size of the files we can compare -- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF	2012-06-21 15:16:26 -04:00
Mauricio Carneiro	ab53220635	Refactor on how RR treats soft clips * Sites with more soft clipped bases than regular will force-trigger a variant region * No more unclipping/reclipping, RR machinery now handles soft clips natively. * implemented support for base insertion and base deletion quality scores in synthetic and regular reads. * GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present. note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!	2012-06-21 14:02:03 -04:00
Ryan Poplin	769e190202	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-20 09:59:55 -04:00
Christopher Hartl	fe1d6e3953	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-19 08:02:00 -04:00
Christopher Hartl	79ef3325bd	Fix a NullPointerException that could occur in DoC if the user requested an interval summary but never provided a -L argument. This situation is now checked for and a UserError thrown instead. Also (after a great struggle) pushing some old VR3 code into the central repository which had been improperly pushed (e.g. with rsync rather than git push) into my repository on the server, and never migrated to unstable. In addition, minor convenience function added to the GATKReport that allows an entire row to be added, and a walker that parses out annotations from a tool called VariantEffectPredictor and summarizes annotations across transcripts, and consensus annotations.	2012-06-19 07:50:13 -04:00
Eric Banks	15ae906f32	Once I was playing with integration tests it was simple to fix the ones I left broken from earlier today.	2012-06-18 21:54:58 -04:00
Eric Banks	62cee2fb5b	Feature request from Tim that could be useful to all: there's now an --interval_padding argument that specifies how many basepairs to add to each of the intervals provided with -L (on both ends). This is particularly useful when trying to run over the exome plus flanks and don't want to have to pre-compute the flanks (just use e.g. --interval_padding 50). Added integration test to cover this feature.	2012-06-18 21:36:27 -04:00
Eric Banks	4393adf9e7	If present, VE's AlleleCount stratification uses the MLE AC by default (and otherwise drops down to use the greedy AC). Added integration test to cover it.	2012-06-18 13:36:14 -04:00
Ryan Poplin	707151f0a4	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 12:55:58 -04:00
Eric Banks	82a2c40338	Emit the MLE AC and AF in the INFO field of the UG output	2012-06-18 12:19:36 -04:00
Ryan Poplin	5ec737f008	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-18 08:51:48 -04:00
Ryan Poplin	e3147969d9	Smith Waterman parameters have somehow gotten too diverged from what it is used in the indel realigner. Results are very dependent on these params. Changes to the assembly to not create long haplotypes out of only small pieces that were properly assembled.	2012-06-18 08:51:41 -04:00
Eric Banks	677babf546	Officially removing all code associated with extended events. Note that I still have a longer term project on my plate to refactor the ReadBackedPileup, but that's a much larger effort.	2012-06-15 15:55:03 -04:00
Eric Banks	783b7f6899	Misc cleanup	2012-06-15 10:39:19 -04:00
Eric Banks	0c218e4822	Refactoring mostly for readability (and small performance improvement)	2012-06-15 10:36:41 -04:00
Eric Banks	c54e84e739	Ryan confirmed that we don't need separate arguments to control the context size for insertions and deletions, which allows us to cut down the expensive context calculations.	2012-06-15 09:28:56 -04:00
Eric Banks	61fcbcb190	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-06-15 02:45:57 -04:00
Eric Banks	4895fe2289	No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java)	2012-06-15 02:32:00 -04:00
Mark DePristo	5c23ab0817	Final cleanup of VCFWriterUnitTest	2012-06-14 16:42:39 -04:00
Mark DePristo	0384ce5d34	Simple optimizations for BCF2Encoder -- Inline encodeString that doesn't go via List<Byte> intermediate -- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2 -- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation -- Final UG integration test update	2012-06-14 16:42:39 -04:00
Mark DePristo	68eed7b313	Optimizations for VCF and BCF2 -- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking. encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version. Lots of contracts. User code updated to use specialized versions where possible -- Misc code refactoring -- Updated VCF float formating to always include 3 sig digits for values < 1, and 2 for > 1. Updating MD5s accordingly -- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations	2012-06-14 16:42:39 -04:00

2274 Commits (a5df8f1277d7dc1bc75dfb837f0598a0e0220c34)