gatk-3.8

Commit Graph

Author	SHA1	Message	Date
depristo	7c42e6994f	FindBugs fixes throughout the code base git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3823 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-18 16:29:59 +00:00
delangel	55b756f1cc	First step in major cleanup/redo of VCF functionality. Specifically, now: a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read header and parse automatically the version. b) Old VCF codec is deprecated. Reader goes now direct from parsing VCF lines into producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will be eventually deleted. c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format. d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved. e) Several VCF 4.0 writer bugs are now solved. f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data mostly uses now VCF4.0. g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension. Pending issues: a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes. b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-16 22:49:16 +00:00
hanna	dfddf8fd75	- Bring the PaperGenotyper up to code. - Remove some old debugging cruft regarding handling of threaded engine exceptions. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3796 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-14 22:31:21 +00:00
ebanks	af23762778	Removing more references to VCFRecord git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3789 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-14 11:54:23 +00:00
ebanks	460283f6d2	No more manually converting VariantContexts to VCFRecords. You should be utilizing VCs and not VCFRecords. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3787 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-14 05:21:28 +00:00
ebanks	6b5c88d4d6	The GATK no longer writes vcf3.3; welcome to the world of vcf4.0. Needed to fix a few output bugs to get this to work, but it's looking great. Much more still to come. Guillermo: hopefully this doesn't break your local build too badly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3786 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-14 04:56:58 +00:00
ebanks	9a05e8143d	Move to 4.0 and away from VCFRecord. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3780 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-13 15:54:54 +00:00
ebanks	7e7da75d27	Moving over to 4.0 and away from VCFRecord git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3778 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-13 14:07:10 +00:00
delangel	297f15a60c	Protect ProduceBeagleInputWalker against evil users who feed to it VCF's with indels, no variation sites or other interesting markers: Write to Beagle input only in biallelic SNP sites since that's the only thing Beagle can do. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3772 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 20:54:42 +00:00
delangel	5992b79159	a) Simplify normalization code in ProduceBeagleInputWalker, as to always normalize, and use MathUtils.normalizeFromLog10 to do this. b) Several improvements to BeagleOutputToVCFWalker: 1. If a Hapmap input track is provided (e.g. -B comp,VCF,file), Hapmap sites will be annotated with Hapmap Allele count and allele frequency (key ACH, AFH). 2. If probability of correct genotype is lower than ncthr (optional argument provided by user, default = 0.0), walker will keep original calls instead of using Beagle calls. 3. Instead of annotating just whether Beagle had modified a site, annotate instead HOW MANY genotypes in a site were actually changed by Beagle. All three improvements are mostly for debugging and analysis only. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3769 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 19:54:58 +00:00
ebanks	e50627a49e	1. Updated tests and added integration test for liftover code. 2. Updated liftover code (and scripts) to emit vcf 4.0 and no longer depend on VCFRecord. 3. Beagle walker now also emits vcf 4.0. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3767 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 17:58:18 +00:00
ebanks	8086ab1f75	Pulled sample/header merging routines out of CombineVariants and into util classes. Added more generalized methods for retrieving samples. Updated the Beagle walkers to use these methods. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3764 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 16:51:54 +00:00
ebanks	0c4a32843c	No longer uses VCFRecord git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3763 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 13:57:39 +00:00
ebanks	f130d29318	No longer uses VCFRecord. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3762 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-12 13:34:10 +00:00
ebanks	fb717fe128	First pass needed to remove old VCF code: moving all VCF-related constants into a single unified class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3759 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-11 07:19:16 +00:00
delangel	be75b087ec	a) Add input argument (-ncrate) to BeagleOutputToVCFWalker. If the genotype posterior error probability is higher than this threshold, we declare No-call at this genotype. b) Add "OG" annotation to genotypes. If Beagle changes genotypes, this annotation gets the original genotype call, to ease performance comparisons. If not, this annotation gets an empty value. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3723 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-06 18:33:28 +00:00
aaron	3347d1ca7c	part one of combining format and info header lines code into a single abstract class for Mark; plus some 'm' removals from access methods for Eric. Adding fixes for CombineVariants next. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3719 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-05 05:57:58 +00:00
hanna	4995950d04	IndexedFastaSequenceFile is now in Picard; transitioning to that implementation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3701 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-01 04:40:31 +00:00
delangel	ed71e53dd4	1) Initial complete version of VCF4 writer. There are still issues (see below) but at least this version is fully functional. It incorporates getting rid of intermediate VCFRecord so we now operate from VariantContext objects directly to VCF 4.0 output. See VCF4WriterTestWalker for usage example: it just amounts to adding vcfWriter.add(vc,ref.getBases()) in walker. add() method in VCFWriter is polymorphic and can also take a VCFRecord, although eventually this should be obsolete. addRecord is still supported so all backward compatibility is maintained. Resulting VCF4.0 are still not perfect, so additional changes are in progress. Specifically: a) INFO codes of length 0 (e.g. HM, DB) are not emitted correctly (they should emit just "HM" but now they emit "HM=1"). b) Genotype values that are specified as Integer in header are ignored in type and are printed out as Doubles. Both issues should be corrected with better header parsing. 2) Check in ability of Beagle to mask an additional percentage of genotype likelihoods (0 by default), for testing purposes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3664 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-28 23:54:38 +00:00
weisburd	147ba68441	Fixed bug with mrnaCoord field - made it count exon positions only, rather than introns & exons git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3642 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-25 19:53:32 +00:00
aaron	682f9b46c6	Two fixes together: 1) Some improvements to the VCF4 parsing, including disabling validation. 2) Reimplemented RefSeq in the new Tribble-style rod system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3630 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-24 22:17:03 +00:00
ebanks	824c2bbac0	Finishing previous checkin git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3608 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-22 17:21:38 +00:00
ebanks	aa1852575e	Add -noVerbose flag to stop output of INFO data. Cuts runtime by 30% and output from 65Mb to 1Kb. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3591 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 18:53:35 +00:00
rpoplin	724affc3cc	Major bug fixes for the Variant Recalibrator. Covariance matrix values are now allowed to be negative. When probabilities are multiplied together the calculation is done in log space, normalized, then converted back to real valued probabilities. Clustering weights have been changed to only use HapMap and by-1000genomes sites. The -nI argument was removed and now clustering simply runs until convergence. Test cases seem to work best when using just two annotations (QD and SB). More changes are in the works and are being evaluated. Misc fixes to walkers that use RScript due to CentOS changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3590 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-18 17:37:11 +00:00
delangel	b694ca9633	Cleanup: Don't require likelihood ROD in Beagle parameters when generating output VCF. Likelihoods file is only an input to Beagle but the Walker that generates a VCF doesn't need it, so it's silly to ask for it and it's error-prone. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3579 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 17:45:48 +00:00
aaron	3d049204ed	some refactoring for the variant eval output system git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3576 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-17 05:34:31 +00:00
delangel	8cb16a1d45	a) Cleanup, remove -input argument from BeagleOutputToVCFWalker since it's not needed. b) Added back old Beagle ROD to maintain backward compatibility (does anyone even use this???) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3563 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 02:13:08 +00:00
delangel	d319a28be7	Complete rewrite of the Beagle functionality to read from Beagle output files and produce VCF with modified genotypes. Now, a new ROD system using Tribble is in place. Beagle inputs are set using -B beagleType,Beagle,pathToBeagleFile, where beagleType can be either beagleR2, beagleLike, beaglePhased or beagleR2 (BeagleOutputToVCFWalker requires all of the above). Only pending items: -input argument is now unused and can be removed, will be cleaned later. Wiki will be updated with new usage shortly. We can now run with a reduced memory footprint, and output VCF is exactly identical to previous version. Drawback is increased runtime because Tribble has to create an index for all the Beagle files when starting if the idx files are missing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3562 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-16 02:01:35 +00:00
sjia	b99a5e06f3	Added option to only consider alleles of > specific allele frequency. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3557 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-15 02:09:35 +00:00
sjia	8defb30796	Documentation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3555 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 21:31:01 +00:00
weisburd	c1046653a2	Fixed handling of records where gene-names are identical (eg. as in refseq NR_030638 in chr20) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3554 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 20:00:49 +00:00
sjia	b3c3023c3c	Allows callers to handle HLA reference files as input (rather than hard-coded paths) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3552 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 18:56:08 +00:00
sjia	abdc8521ea	Added debug options for FindClosestHLAWalker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3549 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 17:52:03 +00:00
sjia	c38390eabb	Added option for min number of matches between reads and alleles required to consider reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3548 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 16:08:49 +00:00
sjia	d8c963c91c	Remove PhaselikelihoodsWalker.java git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3544 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:21:43 +00:00
sjia	5704294f9d	HLA caller updated - now searches all (common and rare) alleles, more efficient read filtering and allele comparison runs. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3543 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-14 15:14:40 +00:00
weisburd	06fc5eecf8	Implemented TreeReducible - if num threads > 1, the output will be accumulated in memory and written to a vcf file at the end - in onTraversalDone(..). If num threads == 1, things will work as before - where vcf records are written to disk as soon as they are computed with map(..). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3530 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-10 20:57:23 +00:00
weisburd	fdded73861	Improved error reporting git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3520 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:52:48 +00:00
weisburd	c1b7bcc786	Fixed handling of mitochondrial genes - added special cases such as ATT being a start codon in mitochondria. Added warning if a gene doesn't start with Met or end in a stop codon git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3517 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:15:47 +00:00
weisburd	4f1181974b	Added toString() method git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3516 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:12:57 +00:00
ebanks	9b2fcc4711	Refactoring of the annotation system: 1. VA is now a ROD walker so it no longer requires reads (needs a little more testing) 2. Annotations can now represent multiple INFO fields (i.e. sets of key/value pairs) 3. The chromosome count annotations have been pulled out of UG and the VCF writer code and into VA where they belong. Fixed the headers too. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3513 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-09 17:05:51 +00:00
delangel	de134c226d	Removed ability of users to specify annotations to recompute, cleanups. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3501 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 19:17:59 +00:00
ebanks	4d1a6b3d99	quick changes for G git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3500 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 16:33:27 +00:00
delangel	907931c902	a) Update annotations when creating new vcf with Beagle's imputed data. Since genotypes may (will) change based on imputation, several annotations need to be updated. By default, AC, AF, AN and AB will be updated. User can force extra annotations to be updated with -A <annotation> argument. b) Several cleanups and beautifications. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3499 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-08 15:12:04 +00:00
depristo	6eeb1693ca	JEXL2 upgrade. Improvements to JEXL processing including dynamically resolving variable -> value bindings instead of up front adding them to a map. Performance improvements and code cleanup throughout. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3494 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-07 00:33:02 +00:00
delangel	c503f01dcf	More cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3492 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-06 17:41:38 +00:00
delangel	d4c66d6191	a) Small cleanup b) Fix major issue with Beagle likelihood converter: if likelihood triplets from UG end up being too low, then Beagle input file will be produced with 0.00,0.00,0.00 triplet. If all samples at a marker have this issue, Beagle will effectively produce junk. To fix, likelihoods are renormalized before converting to linear space. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3491 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-06 17:31:59 +00:00
depristo	cfa18f6743	Fixing missed update with new Allele in it git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3490 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 23:56:34 +00:00
delangel	ef47a69c50	a) First fully functional (sort of) version of walker that parses Beagle imputation output files and produce a vcf with imputed genotypes. More doc/info to follow shortly. Issues still to be solved: a) Walker changes all genotypes based on Beagle data, but annotations on the original VCF are unchanged. They should in theory be recomputed based on new genotypes. b) Current implementation is ugly, dirty unwieldy and will necessitate a refactoring soon so I can keep my pride. Most aesthetically affronting issue right now is that we read the full Beagle files at initialization and keep them in memory, but a more delicate implementation would just read from files on a marker by marker basis. Issue that currently prevents this is that BufferedReader() instances don't seem to play nice when called from the map() function. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3488 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-04 20:37:25 +00:00
weisburd	3ab936181c	Supports the join feature of GenomicAnnotator git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3478 348d0f76-0448-11de-a6fe-93d51630548a	2010-06-02 16:29:57 +00:00


1148 Commits (7c42e6994f0ff14acaf969982325ae0144b499f6)
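
Note on commit d4c66d6191: the message describes likelihood triplets that become 0.00,0.00,0.00 when converted to linear space, and fixes this by renormalizing first. The sketch below illustrates that general idea in Python. It is not the GATK code and not the implementation of MathUtils.normalizeFromLog10; the function name `normalize_from_log10` and the sample values are assumptions made for illustration only.

```python
def normalize_from_log10(log10_values):
    """Turn log10 likelihoods into linear-space values that sum to 1.

    Hypothetical helper for illustration; not the GATK implementation.
    """
    if not log10_values:
        raise ValueError("need at least one log10 value")
    # Shift by the maximum so the largest term becomes 10**0 == 1.0.
    # This keeps the exponentiation from underflowing to all zeros.
    max_log = max(log10_values)
    shifted = [10.0 ** (v - max_log) for v in log10_values]
    total = sum(shifted)
    return [s / total for s in shifted]


# Made-up triplet of very low log10 likelihoods.
triplet = [-400.0, -401.0, -402.0]

# Naive conversion underflows: every entry becomes 0.0.
naive = [10.0 ** v for v in triplet]

# Renormalizing in log space first preserves the relative weights.
normalized = normalize_from_log10(triplet)
```

With the naive conversion all three entries are zero, which matches the failure the commit message describes. Shifting by the maximum before exponentiating keeps the ratios between the three values (here 1 : 0.1 : 0.01) intact.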