gatk-3.8

Commit Graph

Author	SHA1	Message	Date
ebanks	21256909bb	Not supported. I'm checking this in for Ryan only. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4756 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-30 16:59:18 +00:00
ebanks	d89e17ec8c	Fare thee well, UGv1. Here come the days UGv2. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4747 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-29 21:51:19 +00:00
delangel	9cdc341be5	Trivial update for data processing paper: change syntax of output argument for Beagle by depth walker to update to new GATK format. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4724 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-24 01:45:44 +00:00
ebanks	b9a59ea54f	Adding Het/Hom ratio to the temp per sample metrics. Because I'm in a generous mood tonight, I'm going ahead and fixing the paths for the classes I'm touching... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4713 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-21 04:24:42 +00:00
bthomas	374c0deba2	Updating the core LocusWalker tools to include the Sample infrastructure that I added last month. This commit touches a lot of files, but only significantly changes a few: LocusIteratorByState and ReadBackedPileup and associated classes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4711 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-19 19:59:05 +00:00
hanna	90711d445c	Change the interface for RMDTrackBuilder, therefore always mandating the specification of a sequence dictionary and related info. This will hopefully eliminate the cases in which the refseq track depends a sequence dictionary / contig parser that hasn't been specified. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4700 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-17 19:00:17 +00:00
kshakir	673fa841a4	Updated PluginManager so that during testing Queue can dynamically compile and load separately multiple class directories into the same class loader. Removed obsolete usages of PackageUtils with updated PluginManager. Ported Queue interval utilities written in scala over to Sting's java IntervalUtils. Added a very basic integration test to ensure that the fullCallingPipeline.q compiles. Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test). While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1". Upgraded to scala 2.8.1 and updated calls to deprecated functions. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-12 20:14:28 +00:00
hanna	8e36a07bea	Convert GenomeLocParser into an instance variable. This change is required for anything that needs to be simultaneously aware of multiple references, eg Queue's interval sharding code, liftover support, distributed GATK etc. GenomeLocParser instances must now be used to create/parse GenomeLocs. GenomeLocParser instances are available in walkers by calling either -getToolkit().getGenomeLocParser() or -refContext.getGenomeLocParser() This is an intermediate change; GenomeLocParser will eventually be merged with the reference, but we're not clear exactly how to do that yet. This will become clearer when contig aliasing is implemented. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-10 17:59:50 +00:00
depristo	4759fdd2ac	V1 of read and variant simulator and assessor. SimulateReadsForVariants generates BAM and VCF with given combinations of variant and read properties. AssessSimulatedPerformance produces a table suitable for analysis in R git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4637 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-08 21:01:33 +00:00
chartl	42e9987e69	Bug fix to GenotypeConcordance. AC metrics get instantiated based on number of eval samples; if Comp has more samples, we can see AC indices outside the bounds of the array. Bug fix to LiftoverVariants - no barfing at reference sites. AlleleFrequencyComparison - local changes added to make sure parsing works properly Added HammingDistance annotation. Mostly useless. But only mostly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4622 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-03 19:23:03 +00:00
ebanks	ffc0ed2b32	Renamed getName() to getSource() in VariantContext to be more accurate git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4579 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-27 02:21:41 +00:00
ebanks	cedceb33cd	My only experience with getting external groups (GAP,dbSNP) to use VCF has been painful at best, so I'm not holding my breath to get indels for CG in VCF. To that extent, here's a oneoffs walker to convert from CG format to VCF for all 'del' & 'ins' types (but not 'sub' types, since they're too complex to code up in VCF and I don't care about them for now). rs ids are included. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4572 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-26 17:53:14 +00:00
depristo	84b6d2926b	Useful walker that creates a new interval list with only the interval overlapping input sites list. Really a one-off walker git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4559 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-22 19:55:04 +00:00
chartl	341e93ee12	The reference fixer seems to have munged the OMNI rather than making it better. Looks like some sites need to only have the ref and alt bases swapped, and others need to have the genotypes swapped as well? E.g. some subset need A C 1/1 --> C A 0/0 while another subset need A C 1/1 --> C A 1/1 it's unclear how big these subsets are (or even if one is empty). What I do know is, doing the first one totally screws up concordance metrics for the 421-sample chip. So either something else needs to be done, or there's a bug in this walker. Until I know for sure, I've added an initialize exception to disable this thing... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4523 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-19 12:50:24 +00:00
chartl	5889138f4a	facepalm forgot to add the samples to the header. How could the VCFWriter let me get away with something so boneheaded?! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4513 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 05:36:29 +00:00
chartl	2bc5971ca1	Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it. Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. IMPORTANT I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC of being one class, but recalculating it casts to another class, and I'm not sure what to do. I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-17 03:18:01 +00:00
chartl	8b2d387643	Added in an eval module that calculates the dispersion histograms between eval and comp (e.g. M_{i,j} = # of times eval observed to have AC i, comp AC j -- for af it's i/100 vs j/100 ) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4507 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-15 19:07:43 +00:00
depristo	da29fcdb68	No longer writes the index to disk twice. But fixes for closing VCFWriters throughout the codebase git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4488 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-13 14:26:06 +00:00
aaron	ff0df1a2da	A fix for an integration test that was broken by on-the-fly indexing. Also, better reporting of Tribble exceptions in GATK integration tests. Trying to get the tests back up and running... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4483 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-12 18:39:56 +00:00
asivache	39e373af6e	deleting accidentally committed junk git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4464 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-08 15:13:01 +00:00
delangel	ece694d0af	Next iteration on new UG framework: - Brought over exact AF estimation from branch (which is now dead). Exact model is default in UnifiedGenotyperV2. - Implemented completely new genotyping algorithm given best AF estimate using dynamic programming, which in theory should be better than both greedy search and any HWE-based genotyper. - Integrated and added new Dindel likelihood estimation model. - Corrected annotators that would call readBasePileup: since we can be annotating extended events, best way is to interrogate context for kind of pileup and either readBasePileup or readExtendedEventPileup. All changes above except last one are still in playground since they require more testing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4396 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-30 21:33:59 +00:00
hanna	497bcbcbb7	Recent changes to the build system make the build system complain loudly about pieces of core that depend on playground. Most of these have been eliminated by (temporarily) promoting Aaron's report system to core in this checkin. I'll follow up with other changes in separately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-24 22:09:12 +00:00
delangel	a10cfe213b	Small bug fix in simple indel genotyper: Likelihood of case where best haplotype pair was (REF,REF) was not computed correctly. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4314 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-20 17:04:39 +00:00
delangel	f64b6fddc1	Major changes/improvements to indel genotyper: a) Redid way to compute path metrics in indel error model. Paper formulation where we have an anchor point in the alignment between read and haplotype won't work in practice except in nice data sets that are perfectly indel-realigned and that are well mapped by aligner. New formulation doesn't assume this, and it's actually simpler and uses less code. It now resembles more a classic SW dynamic programming formulation but it still preserves the HMM probabilistic formulation. b) Added a programmable call threshold, set by command line. c) Use now sample name from BAM file, remove -sampleName argument. d) Simplify loop to compute read-haplotype likelihoods. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4311 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-19 23:47:31 +00:00
ebanks	a10b2a00a5	Moving the util VariantContext 'modifying' routines into VC itself (as opposed to VCUtils) so that we can pass the genotype data directly into it and are no longer forced to decode the genotypes for no reason. This means that any walker that takes in a VCF and modifies the records without touching the genotypes never have to decode them. I've hooked this into the other two Variant Recalibrator walkers for Ryan. One side effect, though, is that we no longer can sort the sample names in the VCF (i.e. if the input VCF doesn't have samples in alphabetical order, then we used to sort them when writing a new VCF but no longer do that), because if we don't decode then we can't re-order the genotypes. I don't think this is a big concern given that the Unified Genotyper does emit sorted samples and that's the main source for most of the VCFs we use. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4300 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-17 07:09:58 +00:00
aaron	782e0018e4	removal of most of the old GATK ROD system; also a fix for -Dsingle so we can again run just a single unit or integration test (single tests in tribble can be run with the -DsingleTest option now). More to come. * Three integration tests had to change: * RecalibarationWalkersIntegrationTest: One of the tests was using the interval as the snp track, and wasn't supplying a DbSNP track (for CountCovariates) SequenomValidationConverterIntegrationTest: relies on Plink ROD which we've removed. PileupWalkerIntegrationTest: we no longer have implicit interval tracks, so there isn't a rod name over the specified region. Otherwise the same result. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4292 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-15 22:54:49 +00:00
delangel	c604ed9440	Several improvements to new indel genotyper (more to come soon): a) Turns out previous change of centering haplotype around indel was a bad idea. Context to the left of indel is important but not as important as right one, because by definition all alleles start at the same location, so haplotype is the same to the left of indel regardless of allele. So, go back to having a constant size window to the left of event. b) Expand reference context so we can test larger haplotypes. c) Optimize computation of read likelihoods by doing them in linear array instead of in a matrix - no difference in biallelic sites but could be significantly faster in multiallelic sites. d) Bug fix: read alignment wasn't being computed correctly if, a) we were at an insertion, b) read started right at the insertion, c) read CIGAR didn't include insertion - more of these corner conditions are lurking, so a revamped computation of how reads align to candidate haplotypes is in the works. e) Add debug option not to use prior haplotype likelihoods. f) Don't hard-code NA12878 for genotyping, now sample name is a required input argument. g) Bug fix: if there are no reads covering a candidate indel event, just output NO_CALL (didn't notice this in HiSeq, but in P1 data it happens all the time). I need to add a confidence threshold for calling later on. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4291 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-15 21:53:08 +00:00
ebanks	4e83ba411f	We now do lazy loading for the genotype data in VCF. Practically, almost all walkers end up loading the genotype data because we need to be smarter about transferring the unparsed genotype string when modifying VariantContexts; however, this does solve the problem for VR's piece to generate clusters (shaved off 75% of runtime for Ryan's large case). That further optimization will happen later. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4279 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-15 00:18:17 +00:00
depristo	7880863eb7	Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 15:07:38 +00:00
depristo	7ad8fbdd5a	Moved GATKException to exceptions git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:47:19 +00:00
depristo	595907e98e	Moving StingException git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4262 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:34:15 +00:00
depristo	40e6179911	Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:02:43 +00:00
delangel	da2e879bbc	Miscellaneous improvements to indel genotyper: - Add a simple calculation model for Pr(R|H) that doesn't rely on Dindel's HMM model. MUCH faster, at a cost of slightly worse performance since we're more sensitive to bad reads coming from sequencing artifacts (add -simple to command line to activate). - Add debug option to calculation model so that we can optionally output useful info on current read being evaluated. (add -debugout to commandline). - Small performance improvement: instead of evaluating haplotype to the right of indel (just with a 5 base addition to the left), it seems better to center the indel and to add context evenly around event. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4257 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 13:50:28 +00:00
depristo	8f1a32acae	All exceptions thrown by the GATK have been reviewed and UserErrors replaced where appropriate. Shazam. Another check-in will remove the GATKException and restore the StingException. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4252 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-10 15:25:30 +00:00
delangel	ef7454a241	Minor improvements to indel genotyper: a) Ability to specify haplotype size from command line b) Expand reference context window so we can form haplotypes for longer indel events. c) small bug fix in temp output writer (to be removed once I can emit vcfs) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4212 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-03 22:52:08 +00:00
delangel	8a7f5aba4b	First more or less sort of functional framework for statistical Indel error caller. Current implementation computes Pr(read|haplotype) based on Dindel's error model. A simple walker that takes an existing vcf, generates haplotypes around calls and computes genotype likelihoods is used to test this as first example. No attempt yet to use prior information on indel AF, nor to use multi-sample caller abilities. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4197 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-03 00:25:34 +00:00
kiran	acd6bd2430	Experimental tool to annotate indels that are provided in a VCF file based on RefGene. Specifies gene, transcript, strand, type (Non-frameshift, frameshift, 5'-UTR, 3'-UTR, SpliceSiteDisruption, Intron, or Unknown). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4191 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-01 23:30:28 +00:00
chartl	63c7cbd89b	Forgot to commit this long ago, change so the tables are correctly propagated git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4174 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-31 19:06:52 +00:00
delangel	fe19539188	Small bug fix: if a read falls at the edge of an indel event (but is not part of it), don't count it towards consistency computation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4152 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-27 20:37:27 +00:00
ebanks	cba5f05538	Small fixes for consistency in the numbers. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4103 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-24 20:48:25 +00:00
ebanks	9fb151f417	Minor update git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4094 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-24 05:17:10 +00:00
ebanks	ef795825fd	Yet more argument consistency updates git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4089 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-23 20:52:30 +00:00
ebanks	55a8306a0d	Update the @RMD tags to look for VariantContext.class instead of ReferenceOrderedDatum.class. Since the test for rod type is broken this won't affect anything right now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4084 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-23 17:49:37 +00:00
aaron	35b9883dd6	vcfwriter is in tribble now git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4083 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-23 17:01:04 +00:00
kiran	dec713a184	Simple test code from Steve Schaffner to compute R^2 and D'. This is just for educational purposes. Don't use this code for anything, ever! git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4080 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-23 05:06:16 +00:00
hanna	b80cf7d1d9	Modifications to the output system for better interaction with @Output. Multiplexed arguments. More details in the Monday meeting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4077 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-22 14:27:05 +00:00
hanna	cb144734c0	Getting rid of GenotypeWriter interface. Of note: - GATKVCFWriter deleted, to be replaced if absolutely necessary when VCF writing goes into Tribble. - VCFWriter is now an interface, for easier redirection. - VCFWriterImpl fleshes out the VCFWriter interface. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4026 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-13 16:33:22 +00:00
chartl	49a3db9dfe	A brief implementation of a QD calculation that is not quite so bimodal for known variants (multiplicatively penalizes QD by (n variant samples)/(n variant alleles) ). Not sure how helpful this will be (which is why it is in oneoffs). Seems nice on MCKD1, but I'm still playing with the optimization. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4024 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-13 15:42:37 +00:00
chartl	c6a8fba922	Occasionally if a JEXL expression results in no variants being captured (like "QD > 20.0" on filtered variants) the per-sample mapping from samples to eval objects can be empty. This semi-hacky fix prevents null pointer exceptions in setting up the resulting empty table (by jumping straight to it in this case) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4023 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-13 15:37:45 +00:00
chartl	8c08f47923	1) Make sure that the table size is set correctly in finalize() 2) Make sure variants are biallelic before asking for isTransversion() git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4016 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-11 20:32:22 +00:00


239 Commits (21256909bb99b2a14f98054828f0c444fc33a07c)