gatk-3.8

Commit Graph

Author	SHA1	Message	Date
depristo	6a49e8df34	Significant change to the way subsetting by sample works with monomorphic sites. Now keeps the alt allele, even if a record is AC=0 after the subset. Previously, the system dropped the alt allele, which I don't think is the right behavior. If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting. See detailed information below. Right now, if you select a multi-sample VCF file down (or one with filters I see) down to a smaller set of samples, and the site isn't polymorphic in that subgroup, then the alt allele is lost. For example, when selecting down NA12878 from the OMNI, I previously received the following VCF: 1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942 Where the first two records lost the ALT allele, because NA12878 is hom-ref at this site. My change results in a VCF that looks like: 1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205 1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491 1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471 1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942 The genotype remains unchanged, but the ALT allele is now preserved. I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant(). isVariant => is there an ALT allele? isPolymorphic => is some sample non-ref in the samples? In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this. Wiki docs updated. Does anyone have critical infrastructure that depends on the previous convention? Let me know so we can coordinate the change. There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-21 13:59:16 +00:00
ebanks	dfdef2d29b	PLEASE READ ME! In order to prepare for the upcoming changes to VCF4, we felt it was best to split up the vcf3 and vcf4 codecs (vcf4 is not backwards compatible to vcf3 and certain changes are too complex to handle in both codecs). Using the 'VCF' rod type in the GATK will now throw a UserException for vcf3.2 or vcf3.3 files telling you to use the 'VCF3' type instead (and vice versa). Integration/unit tests have been updated. For programmers: note that there is currently a lot of code duplication in the two codecs (although I pulled out the easy stuff to a VCFCodecUtils class); however WE ARE FREEZING THE VCF3 CODEC AND WILL NO LONGER MAKE CHANGES TO IT. All updates/improvements will be targeted to the vcf4 codec only as vcf3 is there only to be able to read legacy files. People should really be using vcf4 files only. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5787 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-11 12:07:44 +00:00
ebanks	ab9ffb1a74	Epic failure on the lazy loading of genotypes: if the input VCF had its samples unsorted and we used a walker that didn't require genotypes, then we would sort the samples but not load genotypes (and therefore the genotypes wouldn't match the samples anymore). Added simple integration test to cover this case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5711 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-29 16:03:45 +00:00
depristo	8ed9c0f518	VariantsToTable now blows up by default if you ask for a field that isn't present in a record. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5636 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-14 14:42:43 +00:00
kshakir	dc33fbed7c	Switched the CVUnitTest broken info from an Integer to a String since as of r5383 Integers are no longer broken when converted to Floats. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5390 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-07 16:33:14 +00:00
depristo	af71576a07	CalculateChromosomeCounts() now only calculates AC, AF, and AN when there are genotypes. Can now combine variants with headers that differ only in whether a field is an integer or a float. Updated CombineVariants integrationtest, as incorrect AC values were being calculated in the previous GS outputs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5383 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-06 19:25:52 +00:00
ebanks	698096dc5a	Moving VariantsToVCF to the proper directory; removing the oneoffs CG indel converter in preparation for a legitimate CG variant Feature class in the works. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5207 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-07 05:21:01 +00:00
kiran	35c688ac67	Updated md5 for testVCFStreamingChain to reflect latest changes to VariantEval. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5206 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-06 21:22:05 +00:00
hanna	06b63d8336	Pulled out CpG stratification in test results at Kiran's suggestion. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5165 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-01 18:36:09 +00:00
hanna	91297c138b	Update VCFStreamingIntegrationTest to use new variant eval command-line arguments, output format. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5162 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-01 15:40:43 +00:00
hanna	aea121a9d5	<key>=<value> tagging support for command-line arguments. Unfortunately, still very hard to validate and still very hard to use (requires core hacking to support additional tags). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5038 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-21 00:22:42 +00:00
hanna	8831ec3dce	Some refactoring and cleanup around the area of my sleep-deprived integration test typo, which Khalid already fixed for me. Sorry, Khalid! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5035 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-20 15:03:14 +00:00
kshakir	3022f4dfa0	Fixed missing space character in testSimpleVCFStreaming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5034 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-20 14:49:38 +00:00
hanna	7087c2f422	Very simple integration tests for basic VCF streaming functionality. Rather than try to fork the integration test process to get a pipe source and sink, creates a new named pipe by Runtime.exec()ing the 'mkfifo' shell command. We'll see whether this proves to be a reliable method for testing streaming. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5028 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-20 04:38:54 +00:00
chartl	5a27d231fa	Rename it so that nobody else falls into the trap laid out (the test is VariantToTable, the walker is Variant[s]ToTable) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4844 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-15 11:43:00 +00:00
chartl	5e27e9162f	Huh? I thought we parsed out comma-separated command line arguments into list automatically...just change the syntax of the integration test, no need to update the md5 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4843 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-15 11:40:27 +00:00
ebanks	f1f01610f8	Remove the extra trailing tab at the end of the VCF ## header line. Unfortunately, this meant updating every freaking integration test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4806 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-08 17:22:29 +00:00
ebanks	e3e6d176df	Looking over the daily error log email made me realize that there were 2 implementations of vc.modifyLocation() - the correct one in VC that didn't require lazy loading the genotype data and the bad one in VCUtils that did. Removing the implementation in VCUtils and updating the code accordingly. Also, removing createPotentiallyInvalidGenomeLoc() since no one uses it anymore. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4736 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-26 18:40:34 +00:00
ebanks	6934f83cc7	Two changes to CombineVariants. 1. Fix: VCs were padded before the merge, but they were never unpadded afterwards. This leaves us with a VC that doesn't meet our spec. 2. Update: instead of running the merged VC through every standard annotation (which seems really wrong, since this isn't the annotator tool), just update the chromosome count annotations (AC,AF,AN) through VCUtils. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4734 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-25 04:52:12 +00:00
depristo	721e8cb679	VariantsToTable now supports wildcard captures. -F PREFIX* now captures all fields that begin with PREFIX, output as a comma-separated list of unique values. Added integration test for VariantsToTable since I find it so useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4706 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-18 18:54:59 +00:00
ebanks	b51762c279	When you commit code late at night you tend to make careless mistakes... like forgetting to update integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4658 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 14:41:10 +00:00
depristo	5ef4b234d8	Updates for broken integration tests. Counting annotations (AC, AF) now work correctly for AC = 0 sites git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4640 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-09 19:43:43 +00:00
hanna	861ee3e37a	Changing testing framework from junit -> testng, for its enhanced configurability. Initial test to see how Bamboo will respond. More detailed email to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4609 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-01 21:31:44 +00:00
ebanks	7a291a8ff3	First pass at a VCF validator. Will test more tonight. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4524 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-19 19:55:49 +00:00
chartl	2bc5971ca1	Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it. Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. IMPORTANT I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC of being one class, but recalculating it casts to another class, and I'm not sure what to do. I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 03:18:01 +00:00
ebanks	7aa030a9a4	Hmm. Apparently variants can get lifted over to different chromosomes. Who knew? Reverting changes from a couple of days ago. The only way to do this correctly (without requiring lots of memory) is to turn off on-the-fly indexing for this walker. Integration tests cover this now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4510 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 02:54:12 +00:00
ebanks	954dd84f51	Adding an integration test (against hg18 this time) that requires on-the-fly sorting in order to work properly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4500 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 07:45:21 +00:00
ebanks	9f54170dff	Hooking up the liftover tool to the new on-the-fly sorting VCF writer so that records can now get emitted in order. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4499 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 07:27:01 +00:00
kiran	f348ca2976	Now processes VCF files with repeated loci without crashing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4481 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-12 04:36:07 +00:00
hanna	0c99c97685	The engine now automatically adds the command-line arguments to the header of every VCF, unless -NO_HEADER is specified. Changed integration tests, adding the -NO_HEADER argument, for walkers that previously did not include the command-line arg headers. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4326 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-22 15:27:58 +00:00
aaron	2586f0a1ca	fix for the build I broke - the original file got corrupted, which I replaced with a version that didn't have the header stripped off. Other integration tests passed, but this test relied on the header being stripped off. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4320 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-21 15:35:25 +00:00
rpoplin	7e58d8ed61	CombineVariants now outputs the command line in the VCF header. Added a new hidden argument to VR walkers called --NoByHapMapValidationStatus to turn off the by-hapmap dbsnp rod behavior. Very useful for experimenting with which sets to use as training data. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4307 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-18 16:06:50 +00:00
hanna	bf0b6bd486	Update integration tests to use the new ROD syntax. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4112 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-25 18:13:30 +00:00
kiran	121b4f23b6	Simple change to allow a list of samples or regular expressions to be provided in a text file (one line per sample). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4074 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-21 00:01:48 +00:00
ebanks	3ff6e3404e	Alleles are now returned in a consistent order, so we can deal with tri-allelic sites git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4002 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-10 15:21:10 +00:00
ebanks	419a36f74c	Starting the clean up of the sting.utils.genotype code which is all either moving to Tribble, moving to sting.utils.vcf, or being removed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3994 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-10 02:16:05 +00:00
kiran	e242a8f143	Put single quotes around the regex. This isn't strictly necessary through the integration test machinery, but is necessary at the console, and it's convenient to be able to cut and paste this. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3977 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-08 05:56:57 +00:00
kiran	13f29660bb	Integration test for SelectVariants. Tests a complex case with an explicit sample selection, sample selection by regex, exclusion of non-variant and filtered loci, and JEXL selection on low allele-frequency variants git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3976 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-08 05:49:47 +00:00
depristo	f275041b1c	-minimalVCF for CombineVariants. Work around for broken locking code. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3960 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-06 16:10:59 +00:00
ebanks	341e752c6c	1) AlleleBalance is no longer a standard annotation, but the Allelic Depth (AD) is for each sample. 2) Small fixes in the VCFWriter: a) Trailing missing values weren't being removed if their count was > 1 (e.g. ".,.") b) We were handling key values that were Lists, but not Arrays. We now handle both. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3956 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-06 12:05:14 +00:00
aaron	72ae81c6de	VariantContext has now moved over to Tribble, and the VCF4 parser is now the only VCF parser in town. Other changes include: - Tribble is included directly in the GATK repo; those who have access to commit to Tribble can now directly commit from the GATK directory from Intellij; command line users can commit from inside the tribble directory. - Hapmap ROD now in Tribble; all mentions have been switched over. - VariantContext does not know about GenomeLoc; use VariantContextUtils.getLocation(VariantContext vc) to get a genome loc. - VariantContext.getSNPSubstitutionType is now in VariantContextUtils. - This does not include the checked-in project files for Intellij; still running into issues with changes to the iml files being marked as changes by SVN I'll send out an email to GSAMembers with some more details. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3954 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-05 18:47:53 +00:00
ebanks	340bd0e2c1	Removed hard-coded pointers to references git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3934 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-04 17:59:37 +00:00
depristo	19ad44d332	Minor improvements to CombineVariants to handle the complex case from Chris. IntegrationTest of complex case. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3876 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-25 13:46:11 +00:00
depristo	e21376219d	Updates to CombineVariants for Tim. -setKey can be null. Integrationtests for -setKey foo and -setKey null. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3870 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 22:35:52 +00:00
delangel	5eef15cfdf	a) Bad bug fix to CombineVariants: when indels were being merged, the reference base provided was wrong - ref.getBases()[0] was being used, but this returns base at start of window. Instead, the reference at current locus should be used. b) Cosmetic change to Beagle annotation description. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3861 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 15:13:47 +00:00
depristo	536399eaa0	Improvements to variant combine. Now calculates AC/AN/AF correctly by calling into the VariantAnnotator engine. Automatically removes annotations that are inconsistent across incoming VCs (in simpleMerge). TODO bug fix for Guillermo/Eric. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3858 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-23 13:33:11 +00:00
delangel	473ec91633	a) Bug fix in VCFHeader parsing - Info fields were not being parsed properly, with the result that the Count field was not being properly displayed in records (e.g. if Count=0 for a particular field, the INFO tag was still being displayed as ...;Field=x;... instead of ...;Field;...). b) Bug fixes and update to how we represent indels and other complex events in a VariantContext object. Convention is now that all events are left aligned, with the first variant context location marking the common base before an event occurs. However, alleles in a VC don't have the common base in all VC's. Two new functions are now part of VariantContextUtils: CreateVariantContextWithPaddedAlleles and CreateVariantContextWithTrimmedAlleles. Both take a VC as an input and create a VC as an output. Main flow is that a VCF reader would create a VC with trimmed alleles, all walkers would ideally work with these trimmed alleles, and then the VCF writer would pad back the alleles before writing. However, there are special cases where we need to pad alleles like for example when merging/combining VC's. Pending issues: - PED and DBSNP RODs have to be updated to create VC's for indels following the convention above. Changes will go in after Tribble location is moved and things are tested. - Need to verify Indel genotyper and other modules that create VC's with indels. - Wiki page describing convention above and how walkers should interpret indel VC's still needs updating/detailing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3850 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-22 02:36:45 +00:00
ebanks	ff6748d1cd	oops - missed one git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3841 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 18:55:19 +00:00
ebanks	c6ad26e04f	1) When quals/GQs are really integers (x.00), strip off the floating points. 2) Keep track of whether vcf records are unfiltered vs. pass filters in the variant context so we can regenerate the records on output. 3) No more "ID" hard-coded all over the code to set the VariantContext ID. Use a static variable instead. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3840 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-20 18:01:45 +00:00
depristo	70b07206a2	CombineVariants tests for Guillermo and Eric to explore the correctness of the in/out reader, writer behavior of the system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3834 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-19 22:41:48 +00:00

56 Commits (d77f4ebe31a8f9e48165bd7ccfd3cd39f2ee25e1)