gatk-3.8

Commit Graph

Author	SHA1	Message	Date
depristo	6231637615	fixes for VariantAnnotations and second bases. Misc. removal of failing (and unstable) integration tests that require rereview git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2213 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 15:41:35 +00:00
chartl	886c44303a	-Removing BTTJ integration test -- this broke a few revisions ago (2169) and it is unclear whether the resulting change was a correction to something that had previously been incorrect, or a true build-breaker. I'm currently investigating which case this is, but since Bamboo is back up I'm removing this _temporarily_ so that other testing can occur, and will make whatever changes to the test necessary to reflect the truth, then replace the test itself. Additional (and related) pileup tests are upcoming as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2210 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 05:37:15 +00:00
ebanks	ba8a8febc6	Thanks to Steve Hershman for finding this bug: getNegLog10PError() does not equal the confidence score (you need to multiply by 10 as confidence is traditionally phred scaled). Probably we should change the method to be getNeg10Log10PError(). Anyone have strong feelings on this? git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2207 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-02 01:59:03 +00:00
rpoplin	3180fffd43	Eliminated unnecessary boxing of longs in RecalDatum. Changes to RecalDatum in preparation for new AnalyzeCovariates script. Updated TableRecalibrationWalker to make use of these changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2199 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 16:49:05 +00:00
chartl	21a9a717e4	Some minor changes and test: - DepthOfCoverage is now by reference (so locus-by-locus output correctly reports zero-coverage bases) - VariantsToVCF now lets you bind variants with any string except intervals and dbsnp (not just NA######) - A PileupWalker integration test on a particularly nasty FHS site - Two second-base annotation related integration tests on that same site + outputs were all hand-validated in matlab; within a certain tolerance for the annotations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2197 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-01 15:15:54 +00:00
rpoplin	d8146ab23d	Changed the format of the recalibration csv file slightly so that it is easier to load the file into something like R and look at the values of the covariates. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2183 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-30 17:55:23 +00:00
depristo	af22ca1b47	Bug fixes for VariantEval. dbCoverage now reports dbSNP rate, not some weird eval_snps_in_db as before. We now separate non-indel and non-snp db sites in dbcoverage. Some dbSNP records don't fit into these two categories. Also fixed a consistency issue where novel / known sites were being determined solely by whether dbSNP had a record there, rather than the stricter dbcoverage screen for isSNP(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2180 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-30 01:39:01 +00:00
depristo	75b61a3663	Updated, optimized ReadBackedPileup. Updated test that was breaking the build -- it created a pileup from reads without bases... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2169 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 23:30:39 +00:00
ebanks	d0f673f0c0	Use Math.abs so we don't get (inconsistent) -0's git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2160 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 19:08:34 +00:00
rpoplin	6ff8526592	Added arguments to the recalibration walkers so the user can specify the default read group id and platform to use when a read has no read group. There are also options to force every read group and every platform to be the specified values. Added integration tests that use a bam file with no read groups. Added comments to all the covariates to explain what each of the methods in the Covariate interface are used for. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2157 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 15:41:12 +00:00
ebanks	e1e5b35b19	Don't have the spanning deletions argument be a hard cutoff, but instead be a percentage of the reads in the pileup. Default is now 5% of reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2155 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 04:54:44 +00:00
depristo	03342c1fdd	Restructuring and interface change to ReadBackedPileup. We no longer support the Pileup interface, the BasicPileup static methods, and the ReadBackedPileup class. Now everything is a ReadBackedPileup and all methods to manipulate pileups are off of it. Also provides the recommended iterable() interface of pileup elements so you can use the syntax for (PileupElement p : pileup) and access directly from p.getBase() and p.getQual() and p.getSecondBase(). Only a few straggler walkers use the old style interface -- but those walkers will be retired soon. Documentation coming in the AM. Please everyone use the new syntax, it's safer, and will be more efficient as soon as the LocusIteratorByState directly emits the ReadBackedPileup for the Alignment context, as opposed to the current interface. In the process of the change over, discovered several bugs in the second-best base code due to things getting out of sync, but these changes were resolved manually. All other integration tests passed without modification. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2154 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-25 03:51:41 +00:00
rpoplin	c9ff5f209c	Added a CountCovariates integration test that uses a vcf file as the list of variant sites to skip over instead of the usual dbSNP rod. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2152 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 21:51:38 +00:00
ebanks	3484f652e7	1. Variation is now passed to VariantAnnotator along with the List of Genotypes so non-genotype calls have access to all relevant info. 2. Killed OnOffGenotype 3. SpanningDeletions is now SpanningDeletionFraction git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2151 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 21:47:20 +00:00
rpoplin	dffa46b380	BAM files created by TableRecalibration now have the version number and list of covariates used appended to their header with a new 'PG' tag. Eventually the entire list of command line args will be put in there as well. Big thanks to Matt and Aaron. The integration test uses the --no_pg_tag so that the md5 doesn't change every time the version number changes. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2148 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 20:53:57 +00:00
aaron	8fbc0c8473	fix for bug GSA-234: fasta index files couldn't handle anything but letters, numbers, or spaces in the contig name git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2147 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 19:19:47 +00:00
ebanks	b3f561710f	Optimizations: 1. Only do calculations in UG for alternate allele with highest sum of quality scores (note that this also constitutes a bug fix for a precision problem we were having). 2. Avoid using Strings in DiploidGenotype when we can (it was taking 1.5% of my compute according to JProfiler) UG now runs in half the time for JOINT_ESTIMATE model. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2141 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 16:27:39 +00:00
ebanks	36d493e645	All standard annotations now inherit from StandardVariantAnnotation. Users can specify whether they want all annotations, just the standard annotations, or specific annotations. When calling in from another walker, the default is just the standard ones. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2134 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 03:55:12 +00:00
ebanks	ee5093d2c6	-Added VariantFiltration integration tests -Added integration test for GLFs git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2133 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 02:36:27 +00:00
ebanks	be6a549e7b	Added the capability to allow expressions in an integration test command (i.e. -filter 'foo') by escaping them in the command. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2132 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-24 02:34:48 +00:00
hanna	903342745d	Basic integration test for the aligner. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2131 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 23:08:05 +00:00
chartl	6a52ca3db6	Update to the UG integration test. Why I had to rm -rf my entire sting directory to get it to correctly fail we may never know. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2128 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 21:23:00 +00:00
chartl	23983b2fd8	New annotation: ResidualQuality Computes a metric for how much error is left that isn't explained by ref or snp bases. This is the sum of Q scores, weighted by the proportion of non-ref non-snp bases to non-snp bases. Reported in Log space. Update to the integration test so bamboo doesn't look as though someone murdered it with a spork git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2124 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:04:01 +00:00
ebanks	70059a0fc9	Refactored joint estimation model to allow subclasses to overload PofD calculation over all frequencies. Pooled model now takes only 20% of time that it used to. Added integration test for pooled model and updated other joint estimation tests to be more comprehensive now that they are faster. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2123 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 20:03:38 +00:00
rpoplin	7f947f6b60	Updated recalibrator integration tests to use all three platforms as well as a bam with multi-platform reads intermingled. CountCovariates v2.0.1: Once again uses a read filter to filter out zero mapping quality reads. Added --sorted_output option to output the table recalibration file in sorted order git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2122 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 19:51:36 +00:00
ebanks	c299ca5f49	It would help if I copied the MD5s from the right integration test... I hate Mondays. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2121 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 17:21:36 +00:00
ebanks	ff4797acbb	Forgot to check in integration test update git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2120 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 17:13:51 +00:00
rpoplin	1d46de6d34	The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 14:58:33 +00:00
ebanks	bf935a6ab1	1. Fixed bug in PrimaryBaseSecondaryBaseSymmetry code (not checking for null before trying to access object's methods) which was causing Integration Tests to fail. 2. Retired allele frequency range from UG, which wasn't very useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2113 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-23 01:31:48 +00:00
ebanks	d84444200b	The Unified Genotyper now sorts the sample names in the vcf that it outputs. [There was no reason to enforce that every VCF being output from the GATK should have the samples sorted, since someone might want them ordered non-alphabetically] git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2102 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-20 16:13:18 +00:00
rpoplin	22aaf8c5e0	Added the old recalibrator integration tests to the refactored recalibrator sitting in playground. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2096 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 22:43:28 +00:00
chartl	306f4624c6	oops forgot to update the md5s git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2093 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 18:22:29 +00:00
aaron	6ba1f3321d	Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord. Updated the integration tests that were failing due to different ordering of genotyping entries in VCF. I'll check in the VCF diff tool I wrote when I get a cycle or two. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 18:17:47 +00:00
chartl	b4babb82eb	adding an extra bit of data to come out of CTT (number of chips with actual data) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2091 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 17:46:10 +00:00
chartl	b3872386c9	Test to ensure that ConcordanceTruthTable and those walkers which rely on it for tabulating pooled truth information from truth information of the individuals within the pool is doing that calculation correctly. Tests single het, single hom (with/without reference), together, together without reference, and a mix of everything. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2082 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-19 15:26:32 +00:00
chartl	90212c643b	more effective & efficient test for SecondBaseSkew git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2075 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 20:53:32 +00:00
ebanks	0a35c8e0ba	1. The joint estimation model now constrains genotypes to be AA, AB, or BB only (i.e. to use a single alternate allele). Note that this doesn't work for the old models (point estimate or SSG) because calculations aren't divided by alternate allele. 2. Allele frequency spectrum is not emitted for single samples (since it doesn't make sense). 3. If in pooled mode, throw an exception if pool size isn't set appropriately. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2072 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:43:15 +00:00
depristo	6fe1c337ff	Pileup cleanup; pooled caller v1 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 17:03:48 +00:00
chartl	be31d7f4cc	Added - a walker that outputs relevant information about false negatives given a bunch of hapmap individuals and corresponding integration tests for it. This will output for hapmap variant sites: chromosome, position, ref allele, variant allele, number of variant alleles of the individuals, depth of coverage, power to detect singletons at lod 3, number of variant bases seen, whether or not variant was called git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2068 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 15:47:52 +00:00
chartl	539f6f15e5	Added -- Second base skew annotations and integration tests. Nothing need be given except -A SecondBaseSkew; the statistic it annotates calls with is a chi-square statistic given by the deviation of the observed proportion of reference second-best-bases from the expected 1/3. Future additions may be to ask that the deviation be instead from a given transition table. A big note for all users: All IllegalStateExceptions from the variation ROD (e.g. the RodGeliText) are dealt with SILENTLY. I understand this isn't optimal, but I'd rather simply not annotate a non-bi-allelic site than fail completely (there are quite a few such sites even on the regions over which the integration test has been written). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2064 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-18 00:11:13 +00:00
ebanks	4d9c826766	Integration tests actually run on real data now. <tries to hide sheepish grin> git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2061 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 21:04:14 +00:00
ebanks	5e126875ea	temporarily disable (tests are broken) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2060 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 20:45:52 +00:00
ebanks	a048f5cdf1	-Refactored JointEstimation code so that pooled calling will work -Use phred-scale for fisher strand test -Use only 2N allele frequency estimation points git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2059 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 20:21:15 +00:00
aaron	aece7fa4c7	a convenience method to join a map into a single string, which I need for some VCF work. Added some documentation to the join method as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2057 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-17 16:50:01 +00:00
ebanks	4558375575	Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated. VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper. UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it). This is a fairly all-encompassing check in. It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout. All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing. Stage 2 of the process will happen later this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 02:41:20 +00:00
ebanks	bf451873ff	1. Bug fix: check that AF=0 doesn't contain more probability than 1-fraction 2. Fix for Kiran: allow UG to call SNPs at deletion sites; we'll add an annotation to the VariantAnnotator for deletions at the locus (next week). 3. Added integration tests for joint estimation model git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2038 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 18:02:18 +00:00
hanna	7c386fa428	Another case of reordering of read groups blowing up checksums. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2030 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-13 00:07:35 +00:00
hanna	8145ed4672	Take 2, updating picard with bug fix for bam files containing no reads. Just stomped on the existing md5s because that's what Eric told me to do. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2029 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 22:52:08 +00:00
aaron	c3c001e02e	cleanup of the traversal output code git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2026 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-12 06:18:10 +00:00
ebanks	6a37090529	Output changes for VCF and UG: 1. Don't cap q-scores at 99 2. Scale SLOD to allow more resolution in the output 3. UG outputs weighted allele balance (AB) and on-off genotype (OO) info fields for het genotype calls (works for joint estimation model and SSG) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2011 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-10 16:31:31 +00:00
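
Commit ba8a8febc6 above notes that getNegLog10PError() is not itself the confidence score: it has to be multiplied by 10 to be Phred scaled. A minimal sketch of that relationship follows, written in Python rather than the project's Java. The function names are hypothetical and are not GATK APIs. The abs() call is an illustrative guard against a negative zero when the error probability is exactly 1, similar in spirit to the Math.abs change in d0f673f0c0, and is an assumption rather than a copy of that commit.

```python
import math


def neg_log10_p_error(p_error: float) -> float:
    """Return -log10 of an error probability (hypothetical helper, not a GATK API)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in the interval (0, 1]")
    # -log10(1.0) evaluates to -0.0 in floating point; abs() normalizes it to 0.0.
    return abs(-math.log10(p_error))


def phred_confidence(neg_log10: float) -> float:
    """Convert a -log10 error value to a Phred-scaled confidence (multiply by 10)."""
    return 10.0 * neg_log10


if __name__ == "__main__":
    # An error probability of 1 in 1000 gives -log10 of about 3, so a Phred score of about 30.
    value = neg_log10_p_error(0.001)
    print(value, phred_confidence(value))
```

Keeping the two quantities in separate functions mirrors the distinction the commit message draws: the -log10 value and the Phred score differ only by the factor of 10, so mixing them up silently produces confidences that are ten times too small.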


293 Commits (6231637615657e5fe87e8b91cd278db519d966f1)