gatk-3.8

Commit Graph

Author	SHA1	Message	Date
ebanks	0a2304eff8	- Rename minConfidenceScore in VariantEval to minPhredConfidenceScore - Moved validation walkers to new qc dir - Killed unused test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2218 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 17:59:19 +00:00
ebanks	7055a3ea2d	- All annotations are now required to return their VCF INFO keys and descriptions - Renamed keys to fit with the standard naming - FisherStrand is no longer standard - Integration tests no longer test experimental annotations since they're not stable git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2216 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 17:24:06 +00:00
depristo	6231637615	fixes for VariantAnnotations and second bases. Misc. removal of failing (and unstable) integration tests that require rereview git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2213 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 15:41:35 +00:00
rpoplin	3180fffd43	Eliminated unnecessary boxing of longs in RecalDatum. Changes to RecalDatum in preparation for new AnalyzeCovariates script. Updated TableRecalibrationWalker to make use of these changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2199 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 16:49:05 +00:00
chartl	21a9a717e4	Some minor changes and test: - DepthOfCoverage is now by reference (so locus-by-locus output correctly reports zero-coverage bases) - VariantsToVCF now lets you bind variants with any string except intervals and dbsnp (not just NA######) - A PileupWalker integration test on a particularly nasty FHS site - Two second-base annotation related integration tests on that same site + outputs were all hand-validated in matlab; within a certain tolerance for the annotations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2197 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 15:15:54 +00:00
rpoplin	d8146ab23d	Changed the format of the recalibration csv file slightly so that it is easier to load the file into something like R and look at the values of the covariates. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2183 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-30 17:55:23 +00:00
depristo	75b61a3663	Updated, optimized ReadBackedPileup. Updated test that was breaking the build -- it created a pileup from reads without bases... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2169 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 23:30:39 +00:00
ebanks	d0f673f0c0	Use Math.abs so we don't get (inconsistent) -0's git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2160 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 19:08:34 +00:00
rpoplin	6ff8526592	Added arguments to the recalibration walkers so the user can specify the default read group id and platform to use when a read has no read group. There are also options to force every read group and every platform to be the specified values. Added integration tests that use a bam file with no read groups. Added comments to all the covariates to explain what each of the methods in the Covariate interface are used for. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2157 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 15:41:12 +00:00
ebanks	e1e5b35b19	Don't have the spanning deletions argument be a hard cutoff, but instead be a percentage of the reads in the pileup. Default is now 5% of reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2155 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 04:54:44 +00:00
depristo	03342c1fdd	Restructuring and interface change to ReadBackedPileup. We now lower support the Pileup interface, the BasicPileup static methods, and the ReadBackedPileup class. Now everything is a ReadBackedPileup and all methods to manipulate pileups are off of it. Also provides the recommended iterable() interface of pileup elements so you can use the syntax for (PileupElement p : pileup) and access directly from p.getBase() and p.getQual() and p.getSecondBase(). Only a few straggler walkers use the old style interface -- but those walkers will be retired soon. Documentation coming in the AM. Please everyone use the new syntax, it's safer, and will be more efficient as soon as the LocusIteratorByState directly emits the ReadBackedPileup for the Alignment context, as opposed to the current interface. In the process of the change over, discovered several bugs in the second-best base code due to things getting out of sync, but these changes were resolved manually. All other integrationtests passed without modification. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2154 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 03:51:41 +00:00
rpoplin	c9ff5f209c	Added a CountCovariates integration test that uses a vcf file as the list of variant sites to skip over instead of the usual dbSNP rod. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2152 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 21:51:38 +00:00
ebanks	3484f652e7	1. Variation is now passed to VariantAnnotator along with the List of Genotypes so non-genotype calls have access to all relevant info. 2. Killed OnOffGenotype 3. SpanningDeletions is now SpanningDeletionFraction git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2151 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 21:47:20 +00:00
rpoplin	dffa46b380	BAM files created by TableRecalibration now have the version number and list of covariates used appended to their header with a new 'PG' tag. Eventually the entire list of command line args will be put in there as well. Big thanks to Matt and Aaron. The integration test uses the --no_pg_tag so that the md5 doesn't change every time the version number changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2148 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 20:53:57 +00:00
ebanks	b3f561710f	Optimizations: 1. Only do calculations in UG for alternate allele with highest sum of quality scores (note that this also constitutes a bug fix for a precision problem we were having). 2. Avoid using Strings in DiploidGenotype when we can (it was taking 1.5% of my compute according to JProfiler) UG now runs in half the time for JOINT_ESTIMATE model. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2141 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 16:27:39 +00:00
ebanks	36d493e645	All standard annotations now inherit from StandardVariantAnnotation. Users can specify whether they want all annotations, just the standard annotations, or specific annotations. When calling in from another walker, the default is just the standard ones. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2134 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 03:55:12 +00:00
ebanks	ee5093d2c6	-Added VariantFiltration integration tests -Added integration test for GLFs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2133 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 02:36:27 +00:00
chartl	6a52ca3db6	Update to the UG integration test. Why I had to rm -rf my entire sting directory to get it to correctly fail we may never know. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2128 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 21:23:00 +00:00
chartl	23983b2fd8	New annotation: ResidualQuality Computes a metric for how much error is left that isn't explained by ref or snp bases. This is the sum of Q scores, weighted by the proportion of non-ref non-snp bases to non-snp bases. Reported in Log space. Update to the integration test so bamboo doesn't look as though someone murdered it with a spork git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2124 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:04:01 +00:00
ebanks	70059a0fc9	Refactored joint estimation model to allow subclasses to overload PofD calculation over all frequencies. Pooled model now takes only 20% of time that it used to. Added integration test for pooled model and updated other joint estimation tests to be more comprehensive now that they are faster. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2123 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:03:38 +00:00
rpoplin	7f947f6b60	Updated recalibrator integration tests to use all three platforms as well as a bam with multi-platform reads intermingled. CountCovariates v2.0.1: Once again uses a read filter to filter out zero mapping quality reads. Added --sorted_output option to output the table recalibration file in sorted order git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2122 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 19:51:36 +00:00
ebanks	c299ca5f49	It would help if I copied the MD5s from the right integration test... I hate Mondays. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2121 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 17:21:36 +00:00
ebanks	ff4797acbb	Forgot to check in integration test update git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2120 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 17:13:51 +00:00
rpoplin	1d46de6d34	The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 14:58:33 +00:00
ebanks	bf935a6ab1	1. Fixed bug in PrimaryBaseSecondaryBaseSymmetry code (not checking for null before trying to access object's methods) which was causing Integration Tests to fail. 2. Retired allele frequency range from UG, which wasn't very useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2113 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 01:31:48 +00:00
ebanks	d84444200b	The Unified Genotyper now sorts the sample names in the vcf that it outputs. [There was no reason to enforce that every VCF being output from the GATK should have the samples sorted, since someone might want them ordered non-alphabetically] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2102 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 16:13:18 +00:00
rpoplin	22aaf8c5e0	Added the old recalibrator integration tests to the refactored recalibrator sitting in playground. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2096 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 22:43:28 +00:00
aaron	6ba1f3321d	Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord. Updated the integration tests that were failing due to different ordering of genotyping entries in VCF; I'll check in the VCF diff tool I wrote when I get a cycle or two. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 18:17:47 +00:00
chartl	90212c643b	more effective & efficient test for SecondBaseSkew git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2075 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 20:53:32 +00:00
ebanks	0a35c8e0ba	1. The joint estimation model now constrains genotypes to be AA, AB, or BB only (i.e. to use a single alternate allele). Note that this doesn't work for the old models (point estimate or SSG) because calculations aren't divided by alternate allele. 2. Allele frequency spectrum is not emitted for single samples (since it doesn't make sense). 3. If in pooled mode, throw an exception if pool size isn't set appropriately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2072 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:43:15 +00:00
depristo	6fe1c337ff	Pileup cleanup; pooled caller v1 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:03:48 +00:00
chartl	539f6f15e5	Added -- Second base skew annotations and integration tests. Nothing need be given except -A SecondBaseSkew; the statistic it annotates calls with is a chi-square statistic given by the deviation of the observed proportion of reference second-best-bases from the expected 1/3. Future additions may be to ask that the deviation be instead from a given transition table. A big note for all users: All IllegalStateExceptions from the variation ROD (e.g. the RodGeliText) are dealt with SILENTLY. I understand this isn't optimal, but I'd rather simply not annotate a non-bi-allelic site than fail completely (there are quite a few such sites even on the regions over which the integration test has been written). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2064 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 00:11:13 +00:00
ebanks	4d9c826766	Integration tests actually run on real data now. <tries to hide sheepish grin> git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2061 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 21:04:14 +00:00
ebanks	5e126875ea	temporarily disable (tests are broken) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2060 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 20:45:52 +00:00
ebanks	a048f5cdf1	-Refactored JointEstimation code so that pooled calling will work -Use phred-scale for fisher strand test -Use only 2N allele frequency estimation points git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2059 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 20:21:15 +00:00
ebanks	4558375575	Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated. VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper. UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it). This is a fairly all-encompassing check in. It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout. All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing. Stage 2 of the process will happen later this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 02:41:20 +00:00
ebanks	bf451873ff	1. Bug fix: check that AF=0 doesn't contain more probability than 1-fraction 2. Fix for Kiran: allow UG to call SNPs at deletion sites; we'll add an annotation to the VariantAnnotator for deletions at the locus (next week). 3. Added integration tests for joint estimation model git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2038 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 18:02:18 +00:00
hanna	7c386fa428	Another case of reordering of read groups blowing up checksums. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2030 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 00:07:35 +00:00
hanna	8145ed4672	Take 2, updating picard with bug fix for bam files containing no reads. Just stomped on the existing md5s because that's what Eric told me to do. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2029 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 22:52:08 +00:00
aaron	c3c001e02e	cleanup of the traversal output code git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2026 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 06:18:10 +00:00
ebanks	6a37090529	Output changes for VCF and UG: 1. Don't cap q-scores at 99 2. Scale SLOD to allow more resolution in the output 3. UG outputs weighted allele balance (AB) and on-off genotype (OO) info fields for het genotype calls (works for joint estimation model and SSG) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2011 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 16:31:31 +00:00
depristo	d316cbad4c	VariantFiltration now accepts a VCF rod in addition to an input geli. It will then annotate this VCF file with filtering information in the INFO field too. --OnlyAnnotate will not write in filtering output git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2008 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 13:24:58 +00:00
aaron	2ed423ed56	print the current location in read walkers (in addition to the number of reads processed), along with some refactoring to support the change. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2006 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 05:57:01 +00:00
ebanks	c9c3cf477a	Based on feedback from Kiran, we now uniquify sample names as sample.rodName (instead of sample.1, sample.2, ...) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2005 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 02:41:37 +00:00
ebanks	2fa2ae43ec	Enough people have found this useful, so... Moving Callset Concordance tool to core and adding integration test. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2003 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-09 20:59:18 +00:00
ebanks	6fdfc97db6	Added optional field DP to VCF output for Mark. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1981 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-06 20:03:22 +00:00
aaron	aacd72854f	a fix for a bug Andrey discovered: in read-based interval traversals we're duplicating reads in rare cases. The problem was that to accommodate a bug in SAM JDK indexing, we were forced to add one to the stop of our QueryOverlapping() calls to ensure we always got all of the overlapping reads. Added a PlusOneFixIterator that wraps other iterators, and eliminates reads that start outside of our intended interval (interval stop - 1). Updated and checked BamToFastqIntegrationTest MD5 sums. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1976 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-05 05:26:33 +00:00
ebanks	11d950abe0	No longer allow the lod_threshold argument - use confidence instead. Have UG output qscores in all cases. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1968 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-03 16:18:51 +00:00
ebanks	2b96b2e4e7	better multi-sample integration test git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1933 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 13:51:51 +00:00
ebanks	3091443dc7	Sweeping changes to the genotype output system, as per several discussions with Matt & Aaron. Some things still need to be changed, but it will entail some more design decisions first (which means I get to bug M&A again tomorrow!). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1930 348d0f76-0448-11de-a6fe-93d51630548a	2009-10-29 03:46:41 +00:00

185 Commits (46f3d3e39b3fb3a3faee7a9dbdf9bd55e0c819f2)