gatk-3.8

Author	SHA1	Message	Date
rpoplin	562db45fa5	Sites that were marked NO_DINUC no longer get dinuc-corrected but are still recalibrated using the other available covariates. Solid cycle is now the same as Illumina cycle pending an analysis that looks at the effect of PrimerRoundCovariate. Solid color space methods cleaned up to reduce number of calls to read.getAttribute(). Polished NHashMap sort method in preparation for move to core/utils. Added additional plots in AnalyzeCovariates to look at reported quality as a function of the covariate. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2451 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 20:19:37 +00:00
asivache	2a704e83df	Reads now have new traversal flag: generateExtendedEvents(). Support added to GenomeAnalysisEngine and Walker. This is a silent and transparent framework change that no existing code is going to see. The actual code that makes use of the new flag (which is false by default) will be committed separately... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2450 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 19:52:44 +00:00
ebanks	c8d0e6e004	Optimization to pooled calculation model: stop calculating P(D|AF) if we are beyond the max likelihood such that subsequent likelihoods won't factor into the confidence score. Also, use new Pileup interface. Pooled calling now takes less than half the time it used to. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2449 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 18:39:55 +00:00
ebanks	b1ac4b81d5	Optimization: look up diploid genotypes from a static matrix instead of creating them on the fly (with String.format); bases no longer need to be ordered appropriately git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2448 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 17:28:51 +00:00
andrewk	57516582c2	Converter from HapMap chip genotype data to VCF added; HapMapGenotypeROD adjusted to not convert from Hg18 to b36 formatting of contigs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2447 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-28 01:36:08 +00:00
ebanks	d2770f380c	Writing calls to standard out now works again (it got broken when we introduced parallelization) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2446 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-27 04:36:45 +00:00
ebanks	12990c5e7a	Added qual-by-depth annotation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2445 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-25 02:30:30 +00:00
ebanks	0571d9dcb9	Point MAX_QUAL_SCORE to SAMUtils.MAX_PHRED_SCORE. Also, array size for caches should be max score + 1. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2444 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-24 20:47:32 +00:00
ebanks	438d21842a	The new recalibrator had been mimicking the behavior of the old one in that if there was no dinuc available (following a no-call base or at either end of a read), it didn't try to recalibrate. Now that Ryan has modularized the system, we no longer need to skip the base completely (we just need to skip the dinuc value)... which is good because the Picard people complained after realizing that cycle #1 never got recalibrated. The major effects of this commit are as follows: 1. We no longer skip any good bases (of course, this change alone breaks every single integration test). 2. The dinuc covariate returns a "no dinuc" value for the first base of a read (but not for the last base anymore, since there is a valid dinuc) or if the previous base is a bad base (e.g. 'N'). I've done a bunch of testing on real data and everything looks right; however, let's wait until the recalibrator guru gets back from vacation next week and can double-check everything before shipping this out in another early access release. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2443 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-24 20:41:29 +00:00
ebanks	aaf674d9db	Cleaned up this annotation. Still experimental. As of now, it's not useful. More analysis is needed to determine how to handle cases where UG is unsure whether a sample is het or hom. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2442 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-24 03:06:46 +00:00
ebanks	6df40876a3	Un-reverted Matt's previous changes and fixed integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2441 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-24 02:47:00 +00:00
hanna	2bd0b1bbf7	After further review, it's unclear that my patch in RecalDataManager was the right choice. Reverting. Also updating other IntervalCleanerIntegrationTest failures that were masked by my first patch. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2440 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-24 00:32:33 +00:00
hanna	98c268483e	Fixed issues with the integration tests: 1) sam-jdk apparently no longer supports custom tags with type int[] values. 2) BAM output for indel cleaner integration test changed in a way that's so subtle it can't be seen after converting the output to .sam. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2439 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 23:12:22 +00:00
aaron	b134e0052f	added changes to the code to allow different types of interval merging, 1: all overlapping and abutting intervals merged (ALL), 2: just overlapping, not abutting intervals (OVERLAPPING_ONLY), 3: no merging (NONE). This option is not currently allowed, it will throw an exception. Once we're more certain that unmerged lists are going to work in all cases in the GATK, we'll enable that. The command line option is --interval_merging or -im git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2437 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:59:14 +00:00
alecw	159778416c	In TableRecalibrationWalker, update UQ tag if it was present in the original SAMRecord. This required a new sam.jar, which caused some other files to need to be changed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2435 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:42:36 +00:00
hanna	87ff2b15d4	First step in introducing a patch to Picard: create our ideal interface into the BAM file for sharding. This commit can iterate over the BAM file, pulling out information about the blocks in the file without actually loading or decompressing the reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2434 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:35:08 +00:00
ebanks	770093a40e	Oops - forgot to check this one in. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2433 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 19:53:28 +00:00
ebanks	dc96879861	2 separate changes which both affect lots of UG integration md5s, so I'm committing them together: 1. allele balance annotation is now weighted by genotype quality (so we don't get misled by borderline het calls) 2. Updates to the Unified Genotyper for parallelization: a. verbose writing now works again; arg was moved from UAC to UG b. UG checks for commands that don't work with parallelization c. some cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2432 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 19:03:56 +00:00
ebanks	872a9d1c7b	I'm making this change now (as opposed to waiting until Monday) to honor Tim's request. The cycle covariate is now first/second of pair aware. I'm taking it on faith from both Chris Hartl (waiting on slides from him) and Tim that this is the right thing to do. We'll have Ryan confirm it all next week. The only change is that if a read is the second of a pair, we multiply the cycle by -1 (a simple way of separating its index from that of its mate). Of course, this broke all integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2431 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 16:26:43 +00:00
hanna	e29e8e52b9	Multithreading support for the unified genotyper. Tests on a 10Mbase region on pilot 1 show a 6.8x improvement when running 8 ways parallel. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2430 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 00:48:06 +00:00
kiran	164a94a3d0	Modified the walker documentation so that the stray punctuation wouldn't cause the GATK to stop parsing the help documentation early (aka I changed one word). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2429 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-22 20:50:01 +00:00
kiran	4ee6a478e3	Creates a table of reference allele percentage and alternate allele percentage at Hapmap-chip sites in a BAM file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2428 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-22 20:43:44 +00:00
ebanks	03bf75e335	Now implements TreeReducible git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2427 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-22 17:52:51 +00:00
hanna	0d890e1bf0	Rework Eric's output management code given that the behavior of the UG changes drastically depending on its output format. Current implementation is probably a bit overkill-ish and we can whittle this down to what's absolutely necessary. Writing VCFs to the 'out' protected printstream may not work at this moment. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2425 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-22 00:33:43 +00:00
ebanks	f448a263e9	The cleaner now cleans duplicate reads (instead of ignoring them), although it doesn't include them for scoring ref or alt consensuses git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2424 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 21:01:55 +00:00
ebanks	cf303810d3	VCF reader now creates the correct type of header line for each header type git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2423 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 20:39:06 +00:00
ebanks	e06dfe44c4	Check for null platform (even when the read group isn't null) and assign it the default platform if it is git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2420 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 07:01:41 +00:00
ebanks	87e5a41964	Fixed a bug that accounted for a bunch of my remaining mis-cleaned indels. Also, slightly optimized the cleaner by using readBases (instead of readString) and caching cigar element lengths. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2419 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 05:46:16 +00:00
hanna	b780ffb34a	Add a getFormat() method to get the output format from the writer. The need for this call suggests that I may be thinking about the typing of the GenotypeWriter object the wrong way. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2418 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 01:46:26 +00:00
hanna	11cbfcec9c	Get rid of backlink from ArgumentDefinitions to ArgumentSources. This will help in the future with multiple source -> single definition mapping sets. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2417 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-21 00:39:36 +00:00
hanna	9e53c06328	First revision of command-line argument support for GenotypeWriter. Also, fixed the damn build. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2416 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-20 19:19:23 +00:00
ebanks	4ff61097cf	Trivial change: < -> <= git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2415 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-20 03:35:27 +00:00
ebanks	566b556b50	Give user ability to turn off max allowed interval size git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2414 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-20 03:20:22 +00:00
ebanks	a5f75cbfd4	The previous commit broke the build, so this is a temporary patch to get it to compile. ConcordanceTruthTable should use enums (esp. now that all of the concordance variables need to be public), but VariantEval will need to be rewritten soon anyways so I'll just push it off until then. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2413 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-20 02:34:41 +00:00
depristo	ee8bcdc61d	PooledConcordance calculations have been reformatted and bugs fixed. Now properly handles monomorphic sites. Also works with -G option now, correctly git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2412 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-19 23:22:36 +00:00
depristo	9bf2d12c64	Misc. improvements to the LMW code. Support for emitting all sites, regardless of genotype. Min and max quality scores. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2411 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-19 23:20:57 +00:00
aaron	7e0f69dab5	Changed the GLF record to store its contig name and position in each record instead of in the Reader. Integration tests all stay the same. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2410 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 22:54:56 +00:00
hanna	80b3eb85fa	Fixed curiously epic failure in read-backed pileup: size() mismatched the numReads-numDeletions at that locus in the case where includeReadsWithDeletionsAtLoci == false, causing failures including bad output from pileup walker. Also fixed up ValidatingPileup to run with the new ReadBackedPileup instead of just compiling successfully. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2409 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 22:52:44 +00:00
rpoplin	fdf542c214	The CycleCovariate for 454 data is now the TACG flow cycle. That is, each flow grabs all the T's, A's, C's, and G's in order in a single cycle. This is changed from incrementing the cycle whenever there is a discontinuous nucleotide along the direction of the read. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2408 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 22:39:51 +00:00
aaron	c39675d2c1	VCFTool.java got left off of the last commit git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2407 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 21:33:53 +00:00
ebanks	4ea31fd949	Pushed header initialization out of the GenotypeWriter constructors and into a writeHeader method, in preparation for parallelization. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2406 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 19:16:41 +00:00
ebanks	eeddf0d08e	Adding sample utils for convenience methods to pull out samples from e.g. SAMFileHeader or Genotype objects git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2405 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 18:51:21 +00:00
chartl	79b997f43d	Minor fix to getValue (thanks Ryan!) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2404 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 15:45:51 +00:00
aaron	9971a8da9a	adding a check to the RodVCF to ensure that records are in-order in the underlying VCF file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2403 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 15:24:45 +00:00
chartl	38563bbc2d	The values used to be integers (-1 for unpaired, 0 for unmapped, 1 for first, 2 for second), but I switched to strings before commit so it was clearer. Forgot to update the OTHER getValue method. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2402 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 15:05:14 +00:00
chartl	7b5e332ff3	Added - PairedQualityScoreCountsWalker: counts quality scores (e.g. as a histogram) on first reads of a pair and second reads of a pair. Turns out there's a consistent difference in quality scores; even after recalibrating without the pair ordering as a covariate (there's a bit of averaging -- but not as much as I initially thought). Added - A paired read order covariate to use with recalibration. Currently experimental: for instance, what's a proper pair versus just a pair? Nobody should use this one... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2401 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 15:01:01 +00:00
ebanks	4f59bfd513	Updates to the various GenotypeWriters to make them do simple things like write records (plus allow GLFReader to close). Adding first pass of stub and storage classes for the GenotypeWriters so that UG can be parallelizable. Not hooked up yet, so UG is unchanged. The mergeInto() code in the storage class is ugly, but it's all Tribble's fault. We can clean it up later if this whole thing works. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2400 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 07:20:23 +00:00
ebanks	1cde4161b7	Fixed another test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2399 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 05:05:03 +00:00
ebanks	94f5edb68a	1. Fixed VCFGenotypeRecord bug (it needs to emit fields in the order specified by the GenotypeFormatString) 2. isNoCall() added to Genotype interface so that we can distinguish between ref and no calls (all we had before was isVariant()) 3. Added Hardy-Weinberg annotation; still experimental - not working yet so don't use it. 4. Move 'output type' argument out of the UnifiedArgumentCollection and into the UnifiedGenotyper, in preparation for parallelization. 5. Improved some of the UG integration tests. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2398 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 04:14:14 +00:00
jmaguire	98839193b7	compatibility with VCF lib's switch to GenomeLoc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2397 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-18 00:52:48 +00:00
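
Note on commit b134e0052f (interval merging): the three modes named in that message (ALL, OVERLAPPING_ONLY, NONE) can be illustrated with a small sketch. This is not the GATK implementation, which is Java; the names `MergeRule` and `merge_intervals` and the inclusive `(start, stop)` tuples on a single contig are assumptions made for illustration only. The sketch mirrors the message's statement that NONE is rejected with an exception for now.

```python
from enum import Enum
from typing import List, Tuple

# Hypothetical representation: inclusive (start, stop) positions on one contig.
Interval = Tuple[int, int]


class MergeRule(Enum):
    ALL = "ALL"                            # merge overlapping and abutting intervals
    OVERLAPPING_ONLY = "OVERLAPPING_ONLY"  # merge overlapping intervals only
    NONE = "NONE"                          # no merging (rejected in this sketch)


def merge_intervals(intervals: List[Interval], rule: MergeRule) -> List[Interval]:
    """Merge intervals according to the chosen rule (illustrative sketch)."""
    if rule is MergeRule.NONE:
        # The commit message says this option is not allowed yet and throws.
        raise ValueError("unmerged interval lists are not supported")

    # With inclusive coordinates, abutting means next.start == current.stop + 1,
    # so ALL tolerates a gap of one position and OVERLAPPING_ONLY tolerates none.
    slack = 1 if rule is MergeRule.ALL else 0

    merged: List[Interval] = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1] + slack:
            # Extend the previous interval to cover this one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged
```

For example, under these assumptions the list `[(1, 5), (6, 10), (8, 12)]` collapses to a single interval with `MergeRule.ALL`, while `MergeRule.OVERLAPPING_ONLY` keeps `(1, 5)` separate because it only abuts its neighbour.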

2043 Commits (562db45fa500bafa7050bb6e31674821de7c4fba)