1781 Commits (23983b2fd81fdbd6c68f5bec620d1e3a3208453f)

Author SHA1 Message Date
chartl 23983b2fd8 New annotation: ResidualQuality
Computes a metric for how much error is left that isn't explained by ref or snp bases. This is the sum of Q scores, weighted by the proportion of non-ref non-snp bases to non-snp bases. Reported in Log space.

Update to the integration test so bamboo doesn't look as though someone murdered it with a spork




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2124 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:04:01 +00:00
ebanks 70059a0fc9 Refactored joint estimation model to allow subclasses to overload PofD calculation over all frequencies. Pooled model now takes only 20% of the time that it used to.
Added integration test for pooled model and updated other joint estimation tests to be more comprehensive now that they are faster.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2123 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:03:38 +00:00
rpoplin 7f947f6b60 Updated recalibrator integration tests to use all three platforms as well as a bam with multi-platform reads intermingled. CountCovariates v2.0.1: Once again uses a read filter to filter out zero mapping quality reads. Added --sorted_output option to output the table recalibration file in sorted order
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2122 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 19:51:36 +00:00
ebanks c299ca5f49 It would help if I copied the MD5s from the right integration test...
I hate Mondays.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2121 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:21:36 +00:00
ebanks ff4797acbb Forgot to check in integration test update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2120 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:13:51 +00:00
ebanks 14bf6ce83c 1. Newest version of the joint estimation model. Faster than previous version and now qscores can get to be > 39.8 for hets.
2. More sanity checks in annotations


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2119 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:05:50 +00:00
hanna ee2abd30c4 Count the best alignments and emit them to a file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2118 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 16:37:59 +00:00
rpoplin 1d46de6d34 The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 14:58:33 +00:00
ebanks dfe7d69471 1. VCF: don't print slod if it's never set
2. UG: don't print slod if lods are infinite (todo: figure out a good guess instead)
3. UG: if probF is 0 for both of 2 alt alleles (because of precision), use log values to discriminate



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2116 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 02:55:43 +00:00
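Note on item 3 of the commit above: the precision problem it mentions is floating-point underflow. The sketch below is only an illustration and is not taken from the UnifiedGenotyper code; the function name and the log10 values are hypothetical. It shows two very small probabilities that both collapse to 0.0 in linear space while their log values still order them.

```python
def pick_more_likely(log10_a: float, log10_b: float) -> str:
    """Return 'a' or 'b' for whichever allele has the larger log10 probability.

    Hypothetical helper: ties go to 'a'.
    """
    return "a" if log10_a >= log10_b else "b"


# Hypothetical log10 probabilities for two alternate alleles.
log10_a, log10_b = -400.0, -450.0

# In linear space both values underflow to exactly 0.0,
# so comparing them cannot tell the alleles apart.
p_a = 10.0 ** log10_a
p_b = 10.0 ** log10_b

# Comparing in log space keeps the ordering.
choice = pick_more_likely(log10_a, log10_b)
```

Here p_a and p_b compare equal, but the log-space comparison still prefers allele 'a'.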
ebanks 753cb100a3 Add checks for weird situations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2115 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 02:14:25 +00:00
ebanks 04d6ac940c Always print out VCF header - not just when there is genotype data present.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2114 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 01:44:10 +00:00
ebanks bf935a6ab1 1. Fixed bug in PrimaryBaseSecondaryBaseSymmetry code (not checking for null before trying to access object's methods) which was causing Integration Tests to fail.
2. Retired allele frequency range from UG, which wasn't very useful.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2113 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 01:31:48 +00:00
rpoplin b24240664f Reduced the number of calls to new ArrayList() in TableRecalibration. This results in a speed up of perhaps up to 6 percent (timed trials are hard).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2112 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 17:24:31 +00:00
hanna c9c4999354 BWA: odds and ends. Get rid of some spurious debug code that was accidentally
checked in.  Add a better way to write out unmapped reads (thanks Kiran!)  Add 
a pre-built version of the shared library to the repository for early adoption.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2111 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 15:26:07 +00:00
depristo 9c206abb97 removing unnecessary printing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2110 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 12:41:48 +00:00
chartl 59416ae06a This is an annotation adapted from one that Mark Daly suggested some time ago. Right now it calculates:
- For all reference bases, the proportion of their second best bases that support the SNP

- the proportion of non-reference bases that support the SNP

and reports the difference between the two. Initially I was taking depth into account as well, but that did not appear to work as nicely as I'd like (even at 20,000x depth, if 95% of the non-reference bases are C, and 98% of the reference second-best-bases are C, then we would want to be suspicious of it; but perhaps slightly less so than if the depth were only 20...)

Anyway it's now available. I'm not sure how useful it will be, but I spawned the FHS annotation jobs again, so we'll see.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2109 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 00:47:49 +00:00
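Note on the commit above: the two proportions it describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the annotation's actual code: the pileup is represented as hypothetical (primary base, second-best base) pairs, the sign of the difference (non-reference proportion minus reference second-best proportion) is assumed, and depth is ignored, as in the commit message.

```python
from typing import List, Optional, Tuple


def second_base_support_difference(
    pileup: List[Tuple[str, str]], ref: str, snp: str
) -> Optional[float]:
    """Difference between two proportions of SNP-supporting bases.

    pileup: (primary_base, second_best_base) pairs, a hypothetical layout.
    Returns None when either proportion has an empty denominator.
    """
    # Second-best bases of the reads whose primary base is the reference.
    ref_second = [second for primary, second in pileup if primary == ref]
    # Primary bases of the reads that do not show the reference.
    non_ref = [primary for primary, _ in pileup if primary != ref]
    if not ref_second or not non_ref:
        return None
    prop_ref_second = sum(b == snp for b in ref_second) / len(ref_second)
    prop_non_ref = sum(b == snp for b in non_ref) / len(non_ref)
    # Assumed sign convention: non-reference minus reference second-best.
    return prop_non_ref - prop_ref_second


# Made-up pileup at a site with reference A and candidate SNP C.
example = [
    ("A", "C"), ("A", "C"), ("A", "G"), ("A", "T"),
    ("C", "A"), ("C", "A"), ("G", "A"), ("C", "T"),
]
difference = second_base_support_difference(example, ref="A", snp="C")
```

In this made-up pileup, 2 of 4 reference second-best bases and 3 of 4 non-reference bases are C, so the two proportions are 0.5 and 0.75.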
rpoplin 98f921fe24 The refactored CountCovariates now hashes the read object into a HashMap which holds all the properties the covariates pull out of the read over and over again such as read group string, bases string and its complement string, quality scores, etc. This results in a big speed up. CountCovariatesRefactored is now just slightly slower than CountCovariates (perhaps 1.07x according to my latest time trial). Thanks to Alec for suggesting IdentityHashMap. CycleCovariate now warns the user that it is defaulting to the Solexa definition of cycle when the platform string pulled out of the read is unrecognized instead of halting with an Exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2108 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-21 20:38:17 +00:00
depristo 27122f7f97 Performance improvements for pooled caller. Now possible to actually run on real data in a finite amount of time. Minor changes to GL interface (making strandIndex public) to support cached calculations in pooled caller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2107 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-21 15:07:40 +00:00
ebanks 797bb83209 New VariantFiltration.
Wiki docs are updated.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2105 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 19:50:26 +00:00
hanna a78bc60c0f Minor tweak to improve ease-of-use of iterator system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2104 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 18:24:19 +00:00
hanna 4fbb6d05d0 Refactoring. Push the revisions to the common aligner interface down into
the aligner base classes.  Hack the managed implementation to support the
new interface.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2103 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 17:08:09 +00:00
ebanks d84444200b The Unified Genotyper now sorts the sample names in the vcf that it outputs.
[There was no reason to enforce that every VCF being output from the GATK should have the samples sorted, since someone might want them ordered non-alphabetically]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2102 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 16:13:18 +00:00
hanna 38a030f2ba Finishing off data transfer conduits for single alignment generator.
Misc bug fixes elsewhere.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2101 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 15:21:59 +00:00
ebanks 2a5349d886 VariantAnnotator now adds dbsnp id if a dbsnp rod is supplied and it's not already set for a record
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2100 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 03:26:09 +00:00
ebanks b434c1c240 Check for null entries before adding
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2099 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 03:12:20 +00:00
depristo 82fd824c4d Continuing improvements to unified genotyper
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2098 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 01:39:29 +00:00
aaron 33dcfc858d updates to the paper genotyper based on Mark's comments. There's still more work to do, including more testing.
Also a 250% improvement in the getBases() and getQuals() of BasicPileup, which was nearly all of the runtime for the genotyper (using primitives instead of objects when possible).

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2097 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 23:06:49 +00:00
rpoplin 22aaf8c5e0 Added the old recalibrator integration tests to the refactored recalibrator sitting in playground.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2096 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 22:43:28 +00:00
hanna a95302fe98 Single alignment generator, another checkpoint. Does generate single alignments, but some of the data still
needs to be plumbed through and it may leak memory.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2095 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 21:20:03 +00:00
hanna a972b2769f Checkpoint. Add first phase of single alignment interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2094 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 19:03:43 +00:00
chartl 306f4624c6 oops forgot to update the md5s
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2093 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 18:22:29 +00:00
aaron 6ba1f3321d Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord.
Updated the integration tests that were failing due to different ordering of genotyping entries in VCF. I'll check in the VCF diff tool I wrote when I get a cycle or two.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 18:17:47 +00:00
chartl b4babb82eb adding an extra bit of data to come out of CTT (number of chips with actual data)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2091 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 17:46:10 +00:00
alecw 7623b39927 Add rodPicardDbSNP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2088 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 17:27:46 +00:00
alecw b2b4ff7eca Cache SAMReadGroup rather than get it twice
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2087 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 17:27:18 +00:00
chartl b3872386c9 Test to ensure that ConcordanceTruthTable and those walkers which rely on it for tabulating pooled truth information from truth information of the individuals within the pool are doing that calculation correctly. Tests single het, single hom (with/without reference), together, together without reference, and a mix of everything.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2082 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 15:26:32 +00:00
depristo eeb3a3fffb comments for Aaron
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2081 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 12:56:04 +00:00
aaron 7997455f38 first go of the genotyper for the GATK paper. More testing and review tomorrow to call it done.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2080 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 07:55:24 +00:00
ebanks 7b957d3e2e Make the whining from Khalid's office stop already
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2079 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 03:04:48 +00:00
hanna 85bc9d3e91 (Hopefully) temporary hack: load contig information by contig name rather than contig id to avoid
off-by-one errors.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2078 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 23:33:27 +00:00
rpoplin 0fbd81766b CountCovariates now uses any rod of type VariationRod with the name dbsnp as the source of known variant sites to skip over. It also grabs the platform string out of the read group when deciding which algorithm to use to calculate machine cycle. In this way it can now handle multi-platform bams. I added a new covariate: PositionCovariate. This is simply the offset regardless of which platform the read came from. This will be useful for comparing between the two covariates. Finally, this message serves as a warning that I will be killing the old recalibrator tomorrow after I've updated and verified new integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2077 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 23:03:47 +00:00
ebanks f667bed7fc -Don't annotate allele balance or on-off genotype if there's no genotype data
-If qscore is infinity (because of precision) make a best guess instead


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2076 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 22:01:32 +00:00
chartl 90212c643b more effective & efficient test for SecondBaseSkew
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2075 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 20:53:32 +00:00
ebanks 087e01a439 minor changes for --noSLOD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2074 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 18:48:01 +00:00
ebanks a70cf2b763 A bunch of changes needed to make outputting pooled calls possible
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2073 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 18:42:57 +00:00
ebanks 0a35c8e0ba 1. The joint estimation model now constrains genotypes to be AA, AB, or BB only (i.e. to use a single alternate allele). Note that this doesn't work for the old models (point estimate or SSG) because calculations aren't divided by alternate allele.
2. Allele frequency spectrum is not emitted for single samples (since it doesn't make sense).
3. If in pooled mode, throw an exception if pool size isn't set appropriately.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2072 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:43:15 +00:00
chartl 405c6bf2c1 VariantEval genotype concordance for pools! Integration test coming soon
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2071 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:24:54 +00:00
depristo 6fe1c337ff Pileup cleanup; pooled caller v1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:03:48 +00:00
rpoplin f0a234ab29 TableRecalibration is now much smarter about hashing calculations, taking advantage of the sequential recalibration formulation. Instead of hashing RecalDatums it hashes the empirical quality score itself. This cuts the runtime by 20 percent. TableRecalibration also now skips over reads with zero mapping quality (outputs them to the new bam but doesn't touch their base quality scores).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2069 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 16:47:44 +00:00
chartl be31d7f4cc Added a walker that outputs relevant information about false negatives given a bunch of hapmap individuals, and corresponding integration tests for it.
This will output for hapmap variant sites:

chromosome  position  ref allele   variant allele   number of variant alleles of the individuals   depth of coverage   power to detect singletons at lod 3   number of variant bases seen   whether or not variant was called




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2068 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 15:47:52 +00:00