Commit Graph

917 Commits (a12933a26d70c0812f155a4ab15cff319fb5c0ec)

Author SHA1 Message Date
hanna 0da2105e3c Moving DuplicateQualsWalker to oneoffprojects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2332 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 19:22:32 +00:00
hanna f97ac939fa Punch up the help documentation for CombineDuplicates.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2325 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 17:09:35 +00:00
aaron 86dc98bfb5 update the documentation for CombineDuplicates for the new help system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2324 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-11 17:01:42 +00:00
depristo 8f7554d44f A few improvements to pooled concordance calculations. Now will show you FN with the -V option. BasicGenotype now prints out a reasonable representation with toString
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2320 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 23:09:10 +00:00
aaron f64a4c66ac some tweaks for the GATK paper genotyper to better work with shared memory parallelization, added documentation changes for Matt's new help system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2319 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 22:33:51 +00:00
andrewk a7cd172628 Added 8x coverage field and minimum base quality command line option in order to be able to compare to U. Wash. exome metrics.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2318 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 22:14:44 +00:00
ebanks 0fae798b3a 1. Discoverable base calculations don't care about Genotypes (use Variation's PError regardless of whether the call is ref or var - it's the correct value even for ref calls).
2. Call a base genotypable if any of the Genotypes is above the threshold (you can't assume there's a single Genotype associated with the Variation).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2306 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 04:26:06 +00:00
ebanks 78d5ac9bc2 Don't check het count when there are multiple Genotypes per Variation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2304 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-10 04:07:47 +00:00
ebanks 8d67d9ade3 -Minor fix in UG for all-bases mode
-Make minConfidenceScore in VariantEval a double so non-integer values can be used (requested by Steve H).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2290 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-09 03:49:10 +00:00
ebanks e8822a3fb4 Stage 3 of Variation refactoring:
We are now VCF3.3 compliant.
(Only a few more stages left.  Sigh.)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2287 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-08 21:43:28 +00:00
depristo 8f461d3c40 Critical bug fix for VariantEval dbSNP calculations. Moved the system over to the new improved ROD iterators, resulting in dbSNP rates jumping 5% or so, due to masking of true SNPs by preceding indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2274 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-07 03:36:38 +00:00
hanna 8089aa3c50 Adding support to override the help text.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2273 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-07 00:16:26 +00:00
ebanks b6f8e33f4c Stage 2 of Variation refactoring:
VCFRecord now implements Variation, VCFGenotypeRecord now implements Genotype.

Because of this change, RodVCF is now just a wrapper around the VCFRecord and does nothing else.  Also, one can call toVariation on the VCFGenotypeRecord and it returns the VCFRecord.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2271 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-06 06:48:03 +00:00
hanna 3b440e0dbc Add a taglet to allow users to override the display name in command-line help.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2270 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-06 04:12:10 +00:00
ebanks 08f2214f14 Stage 1 of massive Variation/Genotype refactoring.
This stage consists only of the code originating in the Genotyper and flowing through to the genotype writers.  I haven't finished refactoring the writers and haven't even touched the readers at all.

The major changes here are that
1. Variations which are BackedByGenotypes are now correctly associated with those Genotypes
2. Genotypes which have an associated Variation can actually be associated with it (and then return it when toVariation() is called).

The only integration tests which need to be updated are MSG-related (because the refactoring now made it easy for me to prevent MSG from emitting tri-allelic sites).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2269 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-06 03:12:41 +00:00
ebanks aef4be5610 Moved CoarseCoverageWalker to core and packaged both coverage walkers in coverage/
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2249 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:53:36 +00:00
ebanks df4e001a07 Renamed to more accurately describe its function.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2248 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:34:49 +00:00
ebanks c2017cc91b PrintCoverageWalker functionality moved to DepthOfCoverageWalker. Added integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2247 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:23:59 +00:00
ebanks 01cf5cc741 1. Merged CoverageHistogram into DepthOfCoverageWalker
2. Fixed bug in histogram calculation for small intervals
3. Better output in DoCWalker
4. Comments added to code



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2245 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:01:53 +00:00
ebanks 44b9f60735 PercentOfBasesCovered functionality moved to DepthOfCoverageWalker. Added integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2244 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 16:11:09 +00:00
ebanks 126d1eca35 Move to core (qc/)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2243 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 15:45:58 +00:00
ebanks 9da5cc25ad More archiving (with permission from Andrey) plus a move to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2242 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 15:40:27 +00:00
ebanks d7e4cd4c82 Moving some useful and stable walkers to core:
- ClipReads
- PrintRODs (generalized to print all RODs that are Variations)
- FixBAMSortOrderTag (added documentation to walker so that people know what it does and why)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2238 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 03:00:45 +00:00
depristo c776f9fb90 Simple utilities for dealing with Complete Genomics data
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2230 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 22:51:41 +00:00
ebanks a09fee2b5e Moved some more walkers to oneoffprojects and killed an old indel-related walker that isn't being used.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2228 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 20:28:07 +00:00
ebanks a3343c75db Move and rename a hybrid-selection-specific coverage calculation to hybridselection/
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2225 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 20:11:22 +00:00
ebanks 2c83f2f2bc Move MSG - plus now obsolete classes which it depends on -- to oneoffprojects (with permission from Jared).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2224 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 20:04:22 +00:00
jmaguire c180a76b05 Added option "append": if set, and the specified discovery output already exists, don't re-call anything that's already present in that file. Append new calls to it.
Great for resuming long jobs that died partway through.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2219 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 18:56:19 +00:00
ebanks 0a2304eff8 - Rename minConfidenceScore in VariantEval to minPhredConfidenceScore
- Moved validation walkers to new qc dir
- Killed unused test



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2218 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 17:59:19 +00:00
aaron d487428468 remove incorrect parentheses
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2211 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 06:46:32 +00:00
ebanks b979bd2ced - Optimized implementation of -byReadGroup in DoCWalker
- Added implementation of -bySample in DoCWalker
- Removed CoverageBySample and added a watered down version to the examples directory



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2209 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 03:39:24 +00:00
ebanks ba8a8febc6 Thanks to Steve Hershman for finding this bug:
getNegLog10PError() does not equal the confidence score (you need to multiply by 10 as confidence is traditionally phred scaled).  Probably we should change the method to be getNeg10Log10PError().  Anyone have strong feelings on this?



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2207 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 01:59:03 +00:00
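The scaling issue described in ba8a8febc6 can be illustrated with a small Java sketch. The class and method names below are hypothetical stand-ins, not the GATK API; the only point is that a phred-scaled confidence is ten times the negative log10 of the error probability.

```java
// Hypothetical helper for illustration only; not part of the GATK code base.
final class PhredConfidenceSketch {
    private PhredConfidenceSketch() {}

    /** Returns -log10(P(error)), the unscaled quantity the commit message refers to. */
    static double negLog10PError(double pError) {
        return -Math.log10(pError);
    }

    /** Returns the phred-scaled confidence, -10 * log10(P(error)). */
    static double phredConfidence(double pError) {
        return 10.0 * negLog10PError(pError);
    }
}
```

Under this definition an error probability of 0.001 gives a negLog10PError of about 3 and a phred-scaled confidence of about 30, which is the factor of 10 the fix accounts for.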
ebanks 3303808a8f Yet more walkers moved to oneoffprojects.
Made hybridselection subdir in playground.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2205 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 21:29:12 +00:00
ebanks 05923f7fba Started transition to oneoffprojects.
Moved/killed a few other walkers (with permission).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2204 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 21:19:02 +00:00
jmaguire 74f6526e09 VCFHomogenizer: A class that extends InputStream and dynamically re-writes pilot1 VCF's to be on-spec.
VCFTool: A command-line tool with various useful VCF functions (validate, grep, concordance).




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2202 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 17:55:42 +00:00
ebanks e581cceab6 Got Kris's permission to delete these walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2200 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 16:57:28 +00:00
chartl 21a9a717e4 Some minor changes and test:
- DepthOfCoverage is now by reference (so locus-by-locus output correctly reports zero-coverage bases)
- VariantsToVCF now lets you bind variants with any string except intervals and dbsnp (not just NA######)
- A PileupWalker integration test on a particularly nasty FHS site
- Two second-base annotation related integration tests on that same site
  + outputs were all hand-validated in matlab; within a certain tolerance for the annotations




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2197 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 15:15:54 +00:00
ebanks 084337087e Removing deprecated code and walkers for which I had the green light from repository.
Moved piecemealannotator and secondarybases to archive.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2195 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 05:58:20 +00:00
ebanks 2c16c18a04 Move Andrey's old indel code (plus MSG accuracy test, which depends on it) to archive.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2194 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 05:29:00 +00:00
depristo af22ca1b47 Bug fixes for VariantEval. dbCoverage now reports dbSNP rate, not some weird eval_snps_in_db as before. We now separate non-indel and non-snp db sites in dbcoverage. Some dbSNP records don't fit into these two categories. Also fixed a consistency issue where novel / known sites were being determined solely by whether dbSNP had a record there, rather than the stricter dbcoverage screen for isSNP().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2180 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 01:39:01 +00:00
chartl 27651d8dc2 Oops. numReads is now called size
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2175 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-29 06:59:17 +00:00
chartl 21744e024b Quick walker that determines % of bases covered at (user-defined depth)x. I've been maintaining it in my directories alone, but now that I've accidentally deleted it twice, into playground it goes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2174 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-29 06:51:19 +00:00
depristo db40e28e54 ReadBackedPileup in all its glory. Documented, aligned with the output of LocusIteratorByState, and caching common outputs for performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2165 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 20:54:44 +00:00
aaron cfbd9332b0 small cleanups for the GATK paper genotyper; switched to the managed output system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2156 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 08:04:13 +00:00
depristo 03342c1fdd Restructuring and interface change to ReadBackedPileup. We no longer support the Pileup interface, the BasicPileup static methods, and the ReadBackedPileup class. Now everything is a ReadBackedPileup and all methods to manipulate pileups are off of it. Also provides the recommended iterable() interface of pileup elements so you can use the syntax for (PileupElement p : pileup) and access directly from p.getBase() and p.getQual() and p.getSecondBase(). Only a few straggler walkers use the old style interface, but those walkers will be retired soon. Documentation coming in the AM. Please everyone use the new syntax, it's safer, and will be more efficient as soon as the LocusIteratorByState directly emits the ReadBackedPileup for the Alignment context, as opposed to the current interface. In the process of the changeover, discovered several bugs in the second-best base code due to things getting out of sync, but these changes were resolved manually. All other integration tests passed without modification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2154 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 03:51:41 +00:00
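The iteration style recommended in 03342c1fdd, for (PileupElement p : pileup) with p.getBase() and p.getQual(), can be sketched as follows. SketchPileup and SketchPileupElement are hypothetical stand-ins written for this example; they only mirror the accessor names mentioned in the commit message and are not the real ReadBackedPileup or PileupElement classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a pileup element: one base and its quality.
final class SketchPileupElement {
    private final byte base;
    private final byte qual;

    SketchPileupElement(byte base, byte qual) {
        this.base = base;
        this.qual = qual;
    }

    byte getBase() { return base; }
    byte getQual() { return qual; }
}

// Hypothetical stand-in for a pileup that exposes its elements via Iterable.
final class SketchPileup implements Iterable<SketchPileupElement> {
    private final List<SketchPileupElement> elements = new ArrayList<>();

    void add(char base, int qual) {
        elements.add(new SketchPileupElement((byte) base, (byte) qual));
    }

    @Override
    public Iterator<SketchPileupElement> iterator() {
        return elements.iterator();
    }

    /** Counts elements with the given base whose quality is at least minQual. */
    int countBasesAtOrAboveQual(char base, int minQual) {
        int count = 0;
        for (SketchPileupElement p : this) {
            if (p.getBase() == (byte) base && p.getQual() >= minQual) {
                count++;
            }
        }
        return count;
    }
}
```

Keeping the base and quality on one element object is what lets a caller read both from p without indexing into parallel arrays, which is presumably the sense in which the commit calls the new syntax safer.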
andrewk 3fca23cd16 Added a stub treeReduce function for debugging multi-threaded execution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2146 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:51:19 +00:00
andrewk e4546f802c Accumulates coverage across hybrid selection bait intervals to assess effect of bait adjacency. Requires input bait intervals that have an overhang beyond the actual bait interval to capture coverage data at these points. Outputs R parseable file that has all data in lists and then does some basic plotting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2144 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:12:34 +00:00
andrewk e5106c9924 Hybrid selection performance statistics now include counts of the number of adjacent baits (0,1,2) using OverlapDetector and optionally include assayed bait quantities input via interval lists.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2143 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:07:23 +00:00
ebanks c90bea39a1 read.getReadString().charAt(offset) --> read.getReadBases()[offset]
[As a courtesy I fixed all instances once I was updating GenotypeLikelihoods]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2136 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 04:25:19 +00:00
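The substitution in c90bea39a1, read.getReadString().charAt(offset) replaced by read.getReadBases()[offset], can be illustrated with a stand-in read class. SketchRead is hypothetical; it assumes, for the sake of the example, that getReadString() builds a new String from the underlying bytes on every call while getReadBases() returns the stored array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a read; only the two accessors named in the commit are modelled.
final class SketchRead {
    private final byte[] bases;

    SketchRead(String sequence) {
        this.bases = sequence.getBytes(StandardCharsets.US_ASCII);
    }

    /** Allocates a new String each time it is called (an assumption of this sketch). */
    String getReadString() {
        return new String(bases, StandardCharsets.US_ASCII);
    }

    /** Returns the stored byte array directly, with no allocation. */
    byte[] getReadBases() {
        return bases;
    }

    /** Old style: go through a String to read one base. */
    static char baseViaString(SketchRead read, int offset) {
        return read.getReadString().charAt(offset);
    }

    /** New style: index the byte array directly. */
    static byte baseViaBytes(SketchRead read, int offset) {
        return read.getReadBases()[offset];
    }
}
```

Both accessors return the same base for ASCII sequences; the byte-array form simply avoids creating an intermediate String per lookup.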
rpoplin 1d46de6d34 The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 14:58:33 +00:00
rpoplin b24240664f Reduced the number of calls to new ArrayList() in TableRecalibration. This results in a speed up of perhaps up to 6 percent (timed trials are hard).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2112 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 17:24:31 +00:00
rpoplin 98f921fe24 The refactored CountCovariates now hashes the read object into a HashMap which holds all the properties the covariates pull out of the read over and over again such as read group string, bases string and its complement string, quality scores, etc. This results in a big speed up. CountCovariatesRefactored is now just slightly slower than CountCovariates (perhaps 1.07x according to my latest time trial). Thanks to Alec for suggesting IdentityHashMap. CycleCovariate now warns the user that it is defaulting to the Solexa definition of cycle when the platform string pulled out of the read is unrecognized instead of halting with an Exception.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2108 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-21 20:38:17 +00:00
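The per-read caching idea in 98f921fe24 can be sketched with java.util.IdentityHashMap, which keys entries by object reference rather than by equals(). PerReadCacheSketch and its extractor function are hypothetical names for this example and do not reflect how CountCovariates is actually structured.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical cache: computes the derived properties of a read once per read object.
final class PerReadCacheSketch<R, P> {
    private final Map<R, P> cache = new IdentityHashMap<>();
    private final Function<R, P> extractor;
    private int extractions = 0;

    /** The extractor must return a non-null value for caching to take effect. */
    PerReadCacheSketch(Function<R, P> extractor) {
        this.extractor = extractor;
    }

    /** Returns cached properties for this exact read object, extracting them on first use. */
    P get(R read) {
        P properties = cache.get(read);
        if (properties == null) {
            properties = extractor.apply(read);
            extractions++;
            cache.put(read, properties);
        }
        return properties;
    }

    /** Number of times the extractor actually ran. */
    int extractionCount() {
        return extractions;
    }
}
```

Because lookups compare references, the map never has to hash or compare the read's contents, which is a plausible reason an identity map would help here. A real implementation would also need to evict entries once a read is finished, which this sketch omits.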
ebanks b434c1c240 Check for null entries before adding
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2099 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 03:12:20 +00:00
aaron 33dcfc858d updates to the paper genotyper based on Mark's comments. There's still more work to do, including more testing.
Also a 250% improvement in the getBases() and getQuals() of BasicPileup, which was nearly all of the runtime for the genotyper (using primitives instead of objects when possible).

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2097 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 23:06:49 +00:00
rpoplin 22aaf8c5e0 Added the old recalibrator integration tests to the refactored recalibrator sitting in playground.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2096 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 22:43:28 +00:00
aaron 6ba1f3321d Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord.
Updated the integration tests that were failing due to different ordering of genotyping entries in VCF; I'll check in the VCF diff tool I wrote when I get a cycle or two.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 18:17:47 +00:00
chartl b4babb82eb adding an extra bit of data to come out of CTT (number of chips with actual data)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2091 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 17:46:10 +00:00
alecw b2b4ff7eca Cache SAMReadGroup rather than get it twice
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2087 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 17:27:18 +00:00
depristo eeb3a3fffb comments for Aaron
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2081 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 12:56:04 +00:00
aaron 7997455f38 first go of the genotyper for the GATK paper. More testing and review tomorrow to call it done.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2080 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 07:55:24 +00:00
rpoplin 0fbd81766b CountCovariates now uses any rod of type VariationRod with the name dbsnp as the source of known variant sites to skip over. It also grabs the platform string out of the read group when deciding which algorithm to use to calculate machine cycle. In this way it can now handle multi-platform bams. I added a new covariate: PositionCovariate. This is simply the offset regardless of which platform the read came from. This will be useful for comparing between the two covariates. Finally, this message serves as a warning that I will be killing the old recalibrator tomorrow after I've updated and verified new integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2077 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 23:03:47 +00:00
chartl 405c6bf2c1 VariantEval genotype concordance for pools! Integration test coming soon
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2071 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:24:54 +00:00
depristo 6fe1c337ff Pileup cleanup; pooled caller v1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:03:48 +00:00
rpoplin f0a234ab29 TableRecalibration is now much smarter about hashing calculations, taking advantage of the sequential recalibration formulation. Instead of hashing RecalDatums it hashes the empirical quality score itself. This cuts the runtime by 20 percent. TableRecalibration also now skips over reads with zero mapping quality (outputs them to the new bam but doesn't touch their base quality scores).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2069 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 16:47:44 +00:00
chartl be31d7f4cc Added - a walker that outputs relevant information about false negatives given a bunch of hapmap individuals and corresponding integration tests for it.
This will output for hapmap variant sites:

chromosome  position  ref allele   variant allele   number of variant alleles of the individuals   depth of coverage   power to detect singletons at lod 3   number of variant bases seen   whether or not variant was called




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2068 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 15:47:52 +00:00
rpoplin ec1a870905 Working with byte arrays is faster than working with Strings so the Covariates now take in byte arrays. None of the Covariates themselves used the reference base so I removed it. DinucCovariate now returns a Dinuc object which implements Comparable instead of returning a String because it was too slow. CountCovariates now uses a read filter to filter out unmapped reads and allows the user to specify -cov all which will use all of the available covariates, of which there are 7 now. If no covariates are specified it defaults to ReadGroup and QualityScore, the two required covariates. Initial code in place to leave SOLID bases alone if they have bad color space quality. TableRecalibration uses @Requires to tell the GATK to not give the reference bases since they weren't being used for anything.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2062 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-17 21:50:52 +00:00
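The change in ec1a870905 from String-valued dinucleotides to a Comparable object can be sketched as below. SketchDinuc is a hypothetical class written for this example; the actual Dinuc class in the code base may differ in fields and ordering.

```java
// Hypothetical two-base key that avoids building a String per observation.
final class SketchDinuc implements Comparable<SketchDinuc> {
    private final byte first;
    private final byte second;

    SketchDinuc(byte first, byte second) {
        this.first = first;
        this.second = second;
    }

    /** Orders by the first base, then by the second base. */
    @Override
    public int compareTo(SketchDinuc other) {
        int byFirst = Byte.compare(first, other.first);
        return byFirst != 0 ? byFirst : Byte.compare(second, other.second);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SketchDinuc)) {
            return false;
        }
        SketchDinuc other = (SketchDinuc) o;
        return first == other.first && second == other.second;
    }

    @Override
    public int hashCode() {
        return 31 * first + second;
    }
}
```

Comparing two bytes is cheaper than comparing or hashing a two-character String, which matches the motivation given in the commit message.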
rpoplin eb07c7f7f8 CountCovariates now warns the user if they didn't supply a dbSNP rod file. Thanks Kiran for the use case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2054 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 18:44:54 +00:00
kiran 97ed945797 Example code for a bug in the VCF implementation. See JIRA entry at http://jira.broadinstitute.org:8008/browse/GSA-225
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2050 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-15 09:27:12 +00:00
rpoplin 88fd762436 The -rf argument is now being used for read filter and is colliding with my walkers. Changed mine to -recalFile
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2048 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-14 19:37:46 +00:00
rpoplin b05119987c Clarified some of the comments in the individual covariates now that things have been moved around to speed up the code. In general most error checking and adjustments to the data are done per read instead of per base. This means that functionality was moved out of the covariate modules and into CovariateCounterWalker and TableRecalibrationWalker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2047 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-14 18:44:05 +00:00
rpoplin 672472789e Added some documentation to the helper classes. Fixed an error case in TableRecalibrationWalker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2046 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-14 18:13:43 +00:00
rpoplin d1b525b428 Default window size for NQS covariate is 3
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2040 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 19:24:27 +00:00
rpoplin 394c839974 Implemented NQS covariate. Extended Cycle covariate to handle 454 and SOLID reads. Added a Primer Round covariate for SOLID reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2039 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 19:22:21 +00:00
rpoplin b1376e4216 structure refactored throughout for performance improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2036 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 15:41:09 +00:00
mmelgar 72825c4848 A walker that generates a table of secondary base counts in a bam file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2031 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 02:11:23 +00:00
ebanks 61b5fb82ce 2 major changes:
1. Add dbsnp RS ID to VCF output from genotyper; to do this I needed to fix the dbsnp rod which did not correctly return this value.

2. Remove AlleleBalanceBacked and instead generalize the arbitrary info fields backing VCFs (and potentially others) in preparation for refactoring VariantFiltration next week.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2028 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 22:51:49 +00:00
ebanks 578dcc54a4 Don't create a record if ref=N
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2018 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 04:32:17 +00:00
rpoplin a13cbe1df0 The refactored recalibrator now passes the integration tests as well as my own validation tests. I'm ready to have other people start jamming on the files. I'll make an updated wiki page soon. The refactored recalibrator is currently a bit slower than the old one but there were a lot of great, easy ideas today for how to improve it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2013 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 22:20:06 +00:00
rpoplin 1e7ddd2d9f Added a validateOldRecalibrator option to CovariateCounterWalker which reorders the output to match the old recalibrator exactly. This facilitates direct comparison of output. Changed the -cov argument slightly to require the user to specify both ReadGroupCovariate and QualityScoreCovariate to make it more clear to the user which covariates are being used. Some speed up improvements throughout.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2010 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 15:55:56 +00:00
ebanks 2fa2ae43ec Enough people have found this useful, so...
Moving Callset Concordance tool to core and adding integration test.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2003 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 20:59:18 +00:00
ebanks 3793519bd4 -Added convenience method to VCF record to tell if it's a no call and have rodVCF use it before querying for info fields
-Don't restrict info fields to 2-letter keys
[about to move these to core]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2002 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 20:52:51 +00:00
rpoplin 740a5484c4 Added some documentation to the code, most especially to CovariateCounterWalker but various comments added throughout. Also changed the HashMap data structure to accept an estimated initial capacity. This had a very modest improvement to the speed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2001 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 20:13:56 +00:00
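The HashMap sizing change mentioned in 740a5484c4 can be sketched as follows. The helper names are hypothetical; the sketch relies only on the documented java.util.HashMap default load factor of 0.75 and the constructor that takes an initial capacity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that pre-sizes a HashMap for an expected number of entries.
final class PresizedMapSketch {
    private PresizedMapSketch() {}

    /** Capacity large enough that expectedEntries fit without a resize at load factor 0.75. */
    static int initialCapacityFor(int expectedEntries) {
        return (int) (expectedEntries / 0.75f) + 1;
    }

    /** Creates an empty map sized for the expected number of entries. */
    static <K, V> Map<K, V> newMapFor(int expectedEntries) {
        return new HashMap<>(initialCapacityFor(expectedEntries));
    }
}
```

Pre-sizing avoids the repeated rehashing that happens as a default-sized map grows, so a modest speed-up like the one reported is what one would expect.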
ebanks 74751a8ed3 -Some minor fixes to get accurate vcf record merging done
-Improvement to snp genotype concordance test

And with that, it looks like I get revision #2000.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2000 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 06:40:55 +00:00
ebanks ab705565cf Completely refactored the Callset Concordance code. Now, it takes in VCF rods and emits a single VCF file which has merged calls from all inputs and is annotated (in the INFO fields) with the appropriate concordance test(s).
Still needs a bit of polish...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1999 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 05:03:13 +00:00
kiran 7fde6c0bf4 One more output tweak.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1996 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 04:42:55 +00:00
kiran 00a7113d7a Tweaks to formatting of output table.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1995 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 04:33:36 +00:00
kiran 95d381efe2 Optionally computes the error rate using the best base and a random base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1991 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 16:47:34 +00:00
kiran a679bdde18 FindContaminatingReadGroupsWalker lists read groups in a single-sample BAM file that appear to be contaminants by searching for evidence of systematic underperformance at likely homozygous-variant sites.
Procedure:
1. Sites that are likely homozygous-variant but are called as heterozygous are identified.
2. For each site and read group, we compute the proportion of bases in the pileup supporting an alternate allele.
3. A one-sample, left-tailed t-test is performed with the null hypothesis being that the alternate allele distribution has a mean of 0.95 and the alternate hypothesis being that the true mean is statistically significantly less than expected (pValue < 1e-9).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1989 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 16:36:39 +00:00
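Step 3 of the procedure in a679bdde18, a one-sample, left-tailed t-test against a null mean of 0.95, can be sketched as below. The class name is hypothetical, and the sketch computes only the t statistic; turning it into a p-value requires the Student t distribution with n - 1 degrees of freedom, which is not in the Java standard library and is left out here.

```java
// Hypothetical helper computing the one-sample t statistic for a left-tailed test.
final class OneSampleTSketch {
    private OneSampleTSketch() {}

    /**
     * Returns t = (mean - mu0) / (s / sqrt(n)), where s is the sample standard
     * deviation with n - 1 in the denominator. Requires at least two values.
     * If all values are identical, s is zero and the result is infinite or NaN.
     */
    static double tStatistic(double[] values, double mu0) {
        int n = values.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two observations");
        }
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / n;

        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += (v - mean) * (v - mean);
        }
        double sd = Math.sqrt(sumSquares / (n - 1));
        return (mean - mu0) / (sd / Math.sqrt(n));
    }
}
```

For a left-tailed test, a strongly negative t indicates that the observed alternate-allele proportions for a read group sit well below the expected 0.95; the commit applies a threshold of pValue < 1e-9 on the corresponding tail probability.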
kiran 2225d8176e A convenience class for maintaining a dynamically growing table of values with access to the elements by named row and column identifiers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1988 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 16:34:35 +00:00
rpoplin 84ba604611 Sequential quality score calculation is now in place in the refactored recalibrator and matches the quality scores calculated by the old recalibrator exactly, at least on the small sets of data used so far. Validation, documentation, and optimization work is ongoing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1985 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-07 15:55:16 +00:00
depristo bf1bc94060 Fixes for PooledConcordance bugs and lack of safety checking
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1984 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-07 01:54:10 +00:00
rpoplin 66d4a995e6 Initial check in of refactored Recalibrator. The new walkers are called CountCovariatesRefactored and TableRecalibrationRefactored. More work is needed to finish up the sequential calculation and to document the code sufficiently. These files are not ready to be used by other people quite yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1982 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-06 22:33:55 +00:00
ebanks 0a55fa5bb1 Completely refactored the Genotype Concordance module(s).
Now PooledConcordance and GenotypeConcordance inherit from the same super class (and can therefore share data structures and functionality).  Also, they now use ConcordanceTruthTable to keep track of necessary info.
GenotypeConcordance passes integration tests.
PooledConcordance needs to be finished by Chris.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1979 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-06 16:27:16 +00:00
ebanks d549347f25 Refactored GenotypeLikelihoods to use an underlying 4-base model.
It needs to be modified a bit and then hooked up to a pooled model, but that is now possible.
At this point, there is no difference to the Unified Genotyper.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1978 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-05 21:59:25 +00:00
jmaguire 4d3871c655 don't flush anymore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1977 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-05 19:11:51 +00:00
depristo 5d5dc989e7 improvements to VCF and variant eval support of VCF -- now listens to the filter field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1963 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 12:09:30 +00:00
ebanks 3a33401822 2nd stage of the genotyper output refactoring is complete.
Now, all output is generalized and all of the intelligence lies where it is supposed to.
Next stage is syncing up old and new models and making sure we're outputting exactly what we should.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1960 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-02 22:43:08 +00:00
ebanks af6d0003f8 -Generalized the GenotypeConcordance module to deal with any number of individuals (although it will default to its old behavior if the -samples argument is left out).
-Make rods return the appropriate type of Genotype calls from getGenotype().



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1954 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-01 05:35:47 +00:00
depristo 7d0ac7c6f2 Fix for long-term VariantEval bug plus new integration test to catch it
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1951 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-31 00:00:33 +00:00
ebanks 51fffc7f69 Comments for Ryan (which also apply to ReadQualityScoreWalker).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1944 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-30 14:44:04 +00:00