Commit Graph

1054 Commits (79c4cc1db7ac3da56db8e039a76c75b6836d61be)

Author SHA1 Message Date
depristo 07b88621c5 Improved RankSum calculations and RankSum annotation. Much more meaningful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2266 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 22:16:40 +00:00
hanna 4c147329a9 Turn javadoc comments for packages and classes into key/value pairs in a properties file. Embed the properties file
in GenomeAnalysisTK.jar.  Still no support for actually displaying the archived javadoc.  Also change the approach 
to providing package javadocs: retired the deprecated package.html file in favor of Java1.5-style package-info.java.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2263 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 20:08:41 +00:00
ebanks 1e8dcc30da -dbSNP rod should not implement VariantBackedByGenotype since dbsnp records have no genotype data
-added code to cache the allele list so it didn't need to get recomputed each time it was requested.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2260 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 14:56:48 +00:00
ebanks 58937bf9ba You can now use the -exp flag to tell the Genotyper to include experimental annotations when it calls out to VariantAnnotator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2256 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 04:45:05 +00:00
ebanks b05e73a914 Finished implementation of the Wilcoxon Rank Sum Test thanks to Tim Fennell (calculating the normal approximation) and Nick Patterson (dithering to break tie bands).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2255 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 04:04:39 +00:00
ebanks 861221d046 - Moved various header line printing into a single method
- Fixed output for coverage above min depth



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2254 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-04 02:15:43 +00:00
ebanks aef4be5610 Moved CoarseCoverageWalker to core and packaged both coverage walkers in coverage/
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2249 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:53:36 +00:00
ebanks c2017cc91b PrintCoverageWalker functionality moved to DepthOfCoverageWalker. Added integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2247 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:23:59 +00:00
ebanks 01cf5cc741 1. Merged CoverageHistogram into DepthOfCoverageWalker
2. Fixed bug in histogram calculation for small intervals
3. Better output in DoCWalker
4. Comments added to code



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2245 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 17:01:53 +00:00
ebanks 44b9f60735 PercentOfBasesCovered functionality moved to DepthOfCoverageWalker. Added integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2244 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 16:11:09 +00:00
ebanks 126d1eca35 Move to core (qc/)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2243 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 15:45:58 +00:00
ebanks 9da5cc25ad More archiving (with permission from Andrey) plus a move to core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2242 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 15:40:27 +00:00
ebanks a88202c3f6 Refactored DoCWalker to output in a more helpful and usable style. It now outputs in tabular format with 2 different sections: per locus and then per interval.
I am now at a point where I can merge the functionality from other coverage walkers into this one.
Thanks to Andrew for input.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2239 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 05:28:21 +00:00
ebanks d7e4cd4c82 Moving some useful and stable walkers to core:
- ClipReads
- PrintRODs (generalized to print all RODs that are Variations)
- FixBAMSortOrderTag (added documentation to walker so that people know what it does and why)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2238 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-03 03:00:45 +00:00
rpoplin 46f3d3e39b Added comments to AnalyzeCovariates and R scripts. R script prevents residuals from going off the edge of the plot. Added skeleton code to the recalibration walkers showing how we plan to handle SOLID reference inserting behavior.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2233 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 23:15:52 +00:00
depristo dec0a781c2 Un-reinventing the wheel. --sleep argument removed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2227 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 20:19:28 +00:00
chartl 6a9e7bea05 Removing experimental annotations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2220 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 19:03:55 +00:00
ebanks 0a2304eff8 - Rename minConfidenceScore in VariantEval to minPhredConfidenceScore
- Moved validation walkers to new qc dir
- Killed unused test



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2218 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 17:59:19 +00:00
ebanks a5dfc9107d - Cleaned up annotation code some more
- Use QualityUtils when phred-scaling now



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2217 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 17:45:29 +00:00
ebanks 7055a3ea2d - All annotations are now required to return their VCF INFO keys and descriptions
- Renamed keys to fit with the standard naming
- FisherStrand is no longer standard
- Integration tests no longer test experimental annotations since they're not stable



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2216 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 17:24:06 +00:00
rpoplin 67179e2412 Initial checkin of AnalyzeCovariates.java which replaces analyzeRecalQuals_1KG.py and is updated to use the new Covariates system. It creates similar plots of residual error for each covariate that was used in the calculation. There is also an option to filter out base qualities below a given threshold.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2215 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 16:47:35 +00:00
ebanks 2838629724 -VCF writer now checks whether the allele frequency has been set before trying to write it out.
-Renamed methods to be more consistent.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2214 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 16:25:32 +00:00
depristo 6231637615 fixes for VariantAnnotations and second bases. Misc. removal of failing (and unstable) integration tests that require rereview
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2213 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 15:41:35 +00:00
ebanks b979bd2ced - Optimized implementation of -byReadGroup in DoCWalker
- Added implementation of -bySample in DoCWalker
- Removed CoverageBySample and added a watered down version to the examples directory



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2209 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 03:39:24 +00:00
ebanks 7c73496e72 Moved DoC walker over to new pileup system so it no longer moves like it's stuck in molasses.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2208 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-02 02:46:39 +00:00
ebanks 05923f7fba Started transition to oneoffprojects.
Moved/killed a few other walkers (with permission).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2204 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 21:19:02 +00:00
ebanks c36069355e Trivial change to verbose
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2203 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 20:48:10 +00:00
rpoplin 3180fffd43 Eliminated unnecessary boxing of longs in RecalDatum. Changes to RecalDatum in preparation for new AnalyzeCovariates script. Updated TableRecalibrationWalker to make use of these changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2199 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 16:49:05 +00:00
chartl 21a9a717e4 Some minor changes and tests:
- DepthOfCoverage is now by reference (so locus-by-locus output correctly reports zero-coverage bases)
- VariantsToVCF now lets you bind variants with any string except intervals and dbsnp (not just NA######)
- A PileupWalker integration test on a particularly nasty FHS site
- Two second-base annotation related integration tests on that same site
  + outputs were all hand-validated in matlab; within a certain tolerance for the annotations




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2197 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 15:15:54 +00:00
ebanks 7c6c490652 An unfinished implementation of the Wilcoxon rank sum test and a variant annotation that uses it. I need to merge and update this code with Tim's implementation somehow - but that won't happen until later this week, so I'm committing this before I accidentally blow it away.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2193 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 04:56:17 +00:00
ebanks 00f15ea909 Improved performance of deletion-free pileup and added mapping-quality-zero-free pileup convenience method.
Finished converting genotyper and annotator code to new ReadBackedPileup system.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2192 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-01 04:50:47 +00:00
rpoplin 6bb864da2a More misc cleanup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2191 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 22:29:07 +00:00
rpoplin b89b9adb2c misc code cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2190 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 21:16:00 +00:00
rpoplin 4969cb1957 CountCovariates uses new optimized ReadBackedPileup. It is also smarter about redoing calculations for the dbsnp variation rate sanity check.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2188 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 20:35:40 +00:00
ebanks add2fa7ab4 more use of new ReadBackedPileup optimizations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2187 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 20:04:01 +00:00
rpoplin 817e2cb8c5 Recalibrator makes use of the new GATKSAMRecord wrapper and now no longer has to hash the SAMRecord. Covariate's getValue method signature has changed to take the SAMRecord instead of the ReadHashDatum. ReadHashDatum removed completely.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2185 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 19:59:17 +00:00
ebanks e9a8156cfb Use new optimized ReadBackedPileup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2184 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 18:17:18 +00:00
rpoplin d8146ab23d Changed the format of the recalibration csv file slightly so that it is easier to load the file into something like R and look at the values of the covariates.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2183 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 17:55:23 +00:00
ebanks a184d28ce9 Completing the optimization started by Matt: we now wrap SAMRecords and SAMReadGroupRecords with our own versions which cache oft-used variables (e.g. platform, readString, strand flag). All walkers automagically get this speedup since the wrapping occurs in the engine.
I note that all integration/unit tests pass except for BaseTransitionTableCalculatorJava, which is already broken.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2182 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-30 17:39:29 +00:00
hanna 3300ca906a An iterator for Eric to use when injecting his new wrapping reads -- a stopgap solution for getting additional caching
functionality into a SAMRecord.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2173 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 22:25:52 +00:00
rpoplin 26db15be5c Added SingleReadGroupFilter to only use reads from a specific read group, filtering out all others.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2172 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 20:33:59 +00:00
rpoplin 91f5672a32 misc cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2171 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 19:56:20 +00:00
rpoplin d1298dda13 Encapsulated the sections of code that were shared by the two Recalibration walkers. This includes both the shared command line arguments and the section of code in the map methods which pull out data from the SAMRecord and stuff it into the ReadHashDatum. Command line arguments are now passed to the Covariates using a new initialize method that all Covariates must implement. Updated the dbsnp sanity check warning message to be less cryptic.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2170 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-27 19:54:10 +00:00
depristo 75b61a3663 Updated, optimized ReadBackedPileup. Updated test that was breaking the build -- it created a pileup from reads without bases...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2169 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 23:30:39 +00:00
alecw ac1b289d55 Add tile to ReadHashDatum, and implement TileCovariate
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2166 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 21:41:42 +00:00
depristo db40e28e54 ReadBackedPileup in all its glory. Documented, aligned with the output of LocusIteratorByState, and caching common outputs for performance
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2165 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 20:54:44 +00:00
rpoplin b44363d20a Removed silly casts from Integer to int.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2164 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 19:59:21 +00:00
ebanks d0f673f0c0 Use Math.abs so we don't get (inconsistent) -0's
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2160 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 19:08:34 +00:00
rpoplin 6ff8526592 Added arguments to the recalibration walkers so the user can specify the default read group id and platform to use when a read has no read group. There are also options to force every read group and every platform to be the specified values. Added integration tests that use a bam file with no read groups. Added comments to all the covariates to explain what each of the methods in the Covariate interface is used for.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2157 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 15:41:12 +00:00
ebanks e1e5b35b19 Don't have the spanning deletions argument be a hard cutoff, but instead be a percentage of the reads in the pileup. Default is now 5% of reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2155 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 04:54:44 +00:00
depristo 03342c1fdd Restructuring and interface change to ReadBackedPileup. We no longer support the Pileup interface, the BasicPileup static methods, and the ReadBackedPileup class. Now everything is a ReadBackedPileup and all methods to manipulate pileups are off of it. Also provides the recommended iterable() interface of pileup elements so you can use the syntax for (PileupElement p : pileup) and access directly from p.getBase() and p.getQual() and p.getSecondBase(). Only a few straggler walkers use the old style interface -- but those walkers will be retired soon. Documentation coming in the AM. Please everyone use the new syntax, it's safer, and will be more efficient as soon as the LocusIteratorByState directly emits the ReadBackedPileup for the Alignment context, as opposed to the current interface. In the process of the changeover, discovered several bugs in the second-best base code due to things getting out of sync, but these changes were resolved manually. All other integration tests passed without modification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2154 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-25 03:51:41 +00:00
ebanks 2cb3e53b0b Verbose mode shouldn't be printing out 'NaN's and 'Infinity's
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2153 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 22:01:00 +00:00
rpoplin c9ff5f209c Added a CountCovariates integration test that uses a vcf file as the list of variant sites to skip over instead of the usual dbSNP rod.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2152 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:51:38 +00:00
ebanks 3484f652e7 1. Variation is now passed to VariantAnnotator along with the List of Genotypes so non-genotype calls have access to all relevant info.
2. Killed OnOffGenotype
3. SpanningDeletions is now SpanningDeletionFraction



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2151 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:47:20 +00:00
ebanks e05cb346f3 GenotypeLocusData now extends Variation.
Also, Variations should be INSERTIONs or DELETIONs (and not just INDELs).
Technically, VCF records can be indels now.
More changes coming


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2150 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 21:07:55 +00:00
rpoplin 8b30279edc style update
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2149 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 20:56:31 +00:00
rpoplin dffa46b380 BAM files created by TableRecalibration now have the version number and list of covariates used appended to their header with a new 'PG' tag. Eventually the entire list of command line args will be put in there as well. Big thanks to Matt and Aaron. The integration test uses the --no_pg_tag so that the md5 doesn't change every time the version number changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2148 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 20:53:57 +00:00
rpoplin 277e6d6b32 Further optimizations of TableRecalibration. This completes my goal of having the only math done in the map function be addition, subtraction and rounding the quality score to an integer. Everything else has been moved to the initialize method and only done once.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2145 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 18:21:57 +00:00
ebanks 87c1860398 I'm not sure I believe it, but JProfiler claims that calling FourBaseProbs.isVerbose() was taking 5% of my runtime...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2142 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 17:00:32 +00:00
ebanks b3f561710f Optimizations:
1. Only do calculations in UG for alternate allele with highest sum of quality scores (note that this also constitutes a bug fix for a precision problem we were having).
2. Avoid using Strings in DiploidGenotype when we can (it was taking 1.5% of my compute according to JProfiler)

UG now runs in half the time for JOINT_ESTIMATE model.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2141 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 16:27:39 +00:00
rpoplin a59e5b5e1a Added dbSNP sanity check to CountCovariates. If the mismatch rate is too low at dbSNP sites it warns the user that the dbSNP file is suspicious. Added option in CountCovariates and TableRecalibration to ignore read group IDs and collapse them together. Also, if the read group is null the walkers no longer crash with a NullPointerException but instead warn the user that the read group and platform are defaulting to some values. Default window size in MinimumNQSCovariate is 5 (two bases in either direction) based on rereading of Chris's analysis.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2140 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 16:16:44 +00:00
alecw e5e6d515c3 Fix misunderstanding of GenomeLoc interval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2138 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 15:12:49 +00:00
ebanks cb6d6f2686 Very minor performance improvements
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2137 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 05:21:07 +00:00
ebanks c90bea39a1 read.getReadString().charAt(offset) --> read.getReadBases()[offset]
[As a courtesy I fixed all instances once I was updating GenotypeLikelihoods]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2136 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 04:25:19 +00:00
ebanks ec321abd7b Added ability to filter on the QUAL field
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2135 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 04:08:22 +00:00
ebanks 36d493e645 All standard annotations now inherit from StandardVariantAnnotation. Users can specify whether they want all annotations, just the standard annotations, or specific annotations. When calling in from another walker, the default is just the standard ones.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2134 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 03:55:12 +00:00
ebanks ee5093d2c6 -Added VariantFiltration integration tests
-Added integration test for GLFs



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2133 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-24 02:36:27 +00:00
rpoplin 9e4eadc37c CountCovariates v2.0.2: Added a --process_nth_locus <int> argument to only use every Nth covered locus when creating the recalibration table.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2129 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 22:07:38 +00:00
ebanks ed4cf3de57 Check that we're biallelic before calling isSNP()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2127 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:20:48 +00:00
rpoplin 5744a1d968 The covariates don't care about SAMRecords anymore - Cleaning up the import statements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2126 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:10:12 +00:00
chartl 23983b2fd8 New annotation: ResidualQuality
Computes a metric for how much error is left that isn't explained by ref or snp bases. This is the sum of Q scores, weighted by the proportion of non-ref non-snp bases to non-snp bases. Reported in Log space.

Update to the integration test so bamboo doesn't look as though someone murdered it with a spork




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2124 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:04:01 +00:00
ebanks 70059a0fc9 Refactored joint estimation model to allow subclasses to overload PofD calculation over all frequencies. Pooled model now takes only 20% of time that it used to.
Added integration test for pooled model and updated other joint estimation tests to be more comprehensive now that they are faster.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2123 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 20:03:38 +00:00
rpoplin 7f947f6b60 Updated recalibrator integration tests to use all three platforms as well as a bam with multi-platform reads intermingled. CountCovariates v2.0.1: Once again uses a read filter to filter out zero mapping quality reads. Added --sorted_output option to output the table recalibration file in sorted order
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2122 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 19:51:36 +00:00
ebanks 14bf6ce83c 1. Newest version of the joint estimation model. Faster than previous version and now qscores can get to be > 39.8 for hets.
2. More sanity checks in annotations


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2119 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 17:05:50 +00:00
rpoplin 1d46de6d34 The old recalibrator is replaced with the refactored recalibrator. Added a version message to the logger output. These walkers start at version 2.0.0
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2117 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 14:58:33 +00:00
ebanks dfe7d69471 1. VCF: don't print slod if it's never set
2. UG: don't print slod if lods are infinite (todo: figure out a good guess instead)
3. UG: if probF for 2 alt alleles are both 0 (because of precision), use log values to discriminate



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2116 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 02:55:43 +00:00
ebanks 753cb100a3 Add checks for weird situations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2115 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 02:14:25 +00:00
ebanks bf935a6ab1 1. Fixed bug in PrimaryBaseSecondaryBaseSymmetry code (not checking for null before trying to access object's methods) which was causing Integration Tests to fail.
2. Retired allele frequency range from UG, which wasn't very useful.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2113 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-23 01:31:48 +00:00
depristo 9c206abb97 removing unnecessary printing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2110 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 12:41:48 +00:00
chartl 59416ae06a This is an annotation adapted from one that Mark Daly suggested some time ago. Right now it calculates:
- For all reference bases, the proportion of their second best bases that support the SNP

- the proportion of non-reference bases that support the SNP

and reports the difference between the two. Initially I was taking depth into account as well, but that did not appear to work as nicely as I'd like (even at 20,000x depth, if 95% of the non-reference bases are C, and 98% of the reference second-best-bases are C, then we would want to be suspicious of it; but perhaps slightly less so than if the depth were only 20...)

Anyway it's now available. I'm not sure how useful it will be, but I spawned the FHS annotation jobs again, so we'll see.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2109 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-22 00:47:49 +00:00
depristo 27122f7f97 Performance improvements for pooled caller. Now possible to actually run on real data in a finite amount of time. Minor changes to GL interface (making strandIndex public) to support cached calculations in pooled caller.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2107 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-21 15:07:40 +00:00
ebanks 797bb83209 New VariantFiltration.
Wiki docs are updated.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2105 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 19:50:26 +00:00
ebanks d84444200b The Unified Genotyper now sorts the sample names in the vcf that it outputs.
[There was no reason to enforce that every VCF being output from the GATK should have the samples sorted, since someone might want them ordered non-alphabetically]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2102 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 16:13:18 +00:00
ebanks 2a5349d886 VariantAnnotator now adds dbsnp id if a dbsnp rod is supplied and it's not already set for a record
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2100 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 03:26:09 +00:00
depristo 82fd824c4d Continuing improvements to unified genotyper
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2098 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-20 01:39:29 +00:00
rpoplin 22aaf8c5e0 Added the old recalibrator integration tests to the refactored recalibrator sitting in playground.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2096 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 22:43:28 +00:00
aaron 6ba1f3321d Fixed the sample mix-up bug Kiran discovered, and added a unit test in the VCF reader class (Thanks for the good example files Kiran). Also renamed the toStringRepresentation function to toStringEncoding, and added a matching method in VCFGenotypeRecord.
Updated the integration tests that were failing due to different ordering of genotyping entries in VCF; I'll check in the VCF diff tool I wrote when I get a cycle or two.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2092 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 18:17:47 +00:00
alecw 7623b39927 Add rodPicardDbSNP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2088 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 17:27:46 +00:00
ebanks 7b957d3e2e Make the whining from Khalid's office stop already
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2079 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-19 03:04:48 +00:00
hanna 85bc9d3e91 (Hopefully) temporary hack: load contig information by contig name rather than contig id to avoid
off-by-one errors.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2078 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 23:33:27 +00:00
ebanks f667bed7fc -Don't annotate allele balance or on-off genotype if there's no genotype data
-If qscore is infinity (because of precision) make a best guess instead


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2076 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 22:01:32 +00:00
ebanks 087e01a439 minor changes for --noSLOD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2074 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 18:48:01 +00:00
ebanks a70cf2b763 A bunch of changes needed to make outputting pooled calls possible
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2073 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 18:42:57 +00:00
ebanks 0a35c8e0ba 1. The joint estimation model now constrains genotypes to be AA,AB,or BB only (i.e. to use a single alternate allele). Note that this doesn't work for the old models (point estimate or SSG) because calculations aren't divided by alternate allele.
2. Allele frequency spectrum is not emitted for single samples (since it doesn't make sense).
3. If in pooled mode, throw an exception if pool size isn't set appropriately.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2072 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:43:15 +00:00
depristo 6fe1c337ff Pileup cleanup; pooled caller v1
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2070 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 17:03:48 +00:00
chartl b68d6e06b7 Rollback of the previous "fix" and implementation of the real fix.
We totally *do* want to annotate the call if called by another walker. Totally boneheaded misinterpretation of what the code was doing -- Eric, please forgive me for being an idiot.

Instead, change the StingException to what it really should be -- an IllegalStateException, which is not coincidentally already handled by the calling function. 



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2067 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 06:09:24 +00:00
chartl 95f1be94c0 Fix for the broken build:
do **not** attempt to annotate if UnifiedGenotyper is called from another walker! Why this didn't break the build earlier I have no idea.

Ultimately, there should be a better way of interfacing UG with another walker -- what if some other walker wants the annotations from UG? But since we're calling map directly -- and the annotations don't get returned directly from map -- this needs to be handled differently, while the map function should ultimately return the LOD score or quality under the GCM alone.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2066 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 05:56:31 +00:00
ebanks 9fb50e9bd9 Further refactoring so that pooled calling will work.
Okay, Mark, you should be all set.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2065 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 00:18:13 +00:00
chartl 539f6f15e5 Added --
Second base skew annotations and integration tests. Nothing need be given except -A SecondBaseSkew; the statistic it annotates calls with is a chi-square statistic given by the deviation of the observed proportion of reference second-best-bases from the expected 1/3. Future additions may be to ask that the deviation be instead from a given transition table.

A big note for all users: All IllegalStateExceptions from the variation ROD (e.g. the RodGeliText) are dealt with SILENTLY. I understand this isn't optimal, but I'd rather simply not annotate a non-bi-allelic site than fail completely (there are quite a few such sites even on the regions over which the integration test has been written).




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2064 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-18 00:11:13 +00:00
depristo 42a0bbaf46 Minor reformatting for pooled calling
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2063 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-17 22:06:11 +00:00
ebanks 4d9c826766 Integration tests actually run on real data now.
<tries to hide sheepish grin>


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2061 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-17 21:04:14 +00:00
ebanks a048f5cdf1 -Refactored JointEstimation code so that pooled calling will work
-Use phred-scale for fisher strand test
-Use only 2N allele frequency estimation points



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2059 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-17 20:21:15 +00:00
asivache 21729d9311 Do not print debug message when debug mode is not requested!!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2056 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 20:28:41 +00:00
rpoplin 967215066d The old CountCovariates now warns the user if they didn't supply a dbSNP rod file. Thanks Kiran for the use case.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2055 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 19:16:46 +00:00
ebanks 4558375575 Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated.
VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper.  UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it).

This is a fairly all-encompassing check in.  It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout.  All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing.

Stage 2 of the process will happen later this week.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 02:41:20 +00:00
kiran 103763fc84 An accessor for the VCF header
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2051 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-15 09:28:25 +00:00
ebanks bf451873ff 1. Bug fix: check that AF=0 doesn't contain more probability than 1-fraction
2. Fix for Kiran: allow UG to call SNPs at deletion sites; we'll add an annotation to the VariantAnnotator for deletions at the locus (next week).
3. Added integration tests for joint estimation model



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2038 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 18:02:18 +00:00
asivache 1be36ca959 Bug fix: when cleanedReadIterator is initialized, it gets immediately set to the contig of the first cleaned read; when the first uncleaned read coming in is on a lower contig, this would trigger 'readNextContig' with that lower contig as an argument. As a result, the whole cleaned reads file would be read through to the end and no cleaned reads would ever be seen by the code afterwards. Now we do not call readNextContig if the (uncleaned) read's contig is lower than the current contig already loaded into cleanedReadIterator. The 'readNextContig' method now also throws an exception if the requested contig is less than the currently loaded one.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2037 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 15:41:26 +00:00
depristo cff31f2d06 comments for eric
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2035 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 14:19:31 +00:00
aaron 234bb71747 changed the toVariation() method to take a reference base, instead of using the reference base loaded from the underlying data source (if it was reference aware). Also changed some isVariant() methods which weren't using the passed in ref base.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2034 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 06:54:38 +00:00
ebanks 902cf84448 Bug fix: if the most likely allele frequency is 0, don't make a variant call (even if the Qscore for AF=1/n > threshold)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2033 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 04:10:32 +00:00
ebanks 555fb975de 1. Print out allele frequency range (from joint estimation model only).
2. Don't print verbose output from SLOD calculation (it's just a repeat of previous output).



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2032 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-13 03:59:13 +00:00
hanna 8145ed4672 Take 2, updating picard with bug fix for bam files containing no reads.
Just stomped on the existing md5s because that's what Eric told me to do.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2029 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 22:52:08 +00:00
ebanks 61b5fb82ce 2 major changes:
1. Add dbsnp RS ID to VCF output from genotyper; to do this I needed to fix the dbsnp rod which did not correctly return this value.

2. Remove AlleleBalanceBacked and instead generalize the arbitrary info fields backing VCFs (and potentially others) in preparation for refactoring VariantFiltration next week.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2028 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 22:51:49 +00:00
aaron c3c001e02e cleanup of the traversal output code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2026 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 06:18:10 +00:00
ebanks 0922400ca9 Don't try to calculate ratios when DoC is zero (which happens when calls are made by an LD-aware genotyper)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2025 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 02:51:44 +00:00
hanna 2ea85fb62b Fix some problematic command-line argument naming and descriptions.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2023 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-12 02:12:26 +00:00
depristo 6c9f86bb4d Removed unnecessary output and added debugging print() routine
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2020 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 18:37:36 +00:00
hanna 8406325247 New Picard is breaking one of the integration tests.
Revert until we find out whether the cause is legit.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2017 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 03:59:32 +00:00
hanna 499e7d1d75 Push forward some more delicate merging routines.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2016 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 03:07:34 +00:00
hanna bae4d3f7ea Updated Picard with fix for Doug Voet. Thanks Alec.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2015 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-11 02:01:08 +00:00
hanna 2e4782f202 Command-line arguments for SamReadFilters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2014 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 23:36:17 +00:00
hanna 2cf9670d1e Allow users to directly specify filters from the command-line, applicable to
any walker.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2012 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 18:40:16 +00:00
ebanks 6a37090529 Output changes for VCF and UG:
1. Don't cap q-scores at 99
2. Scale SLOD to allow more resolution in the output
3. UG outputs weighted allele balance (AB) and on-off genotype (OO) info fields for het genotype calls (works for joint estimation model and SSG)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2011 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 16:31:31 +00:00
depristo d316cbad4c VariantFiltration now accepts a VCF rod in addition to an input geli. It will then annotate this VCF file with filtering information in the INFO field too. --OnlyAnnotate will not write in filtering output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2008 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 13:24:58 +00:00
aaron f9819d5f13 a little clean-up
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2007 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 06:18:34 +00:00
aaron 2ed423ed56 print the current location in read walkers (in addition to the number of reads processed), along with some refactoring to support the change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2006 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-10 05:57:01 +00:00
ebanks 2fa2ae43ec Enough people have found this useful, so...
Moving Callset Concordance tool to core and adding integration test.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2003 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 20:59:18 +00:00
ebanks 3793519bd4 -Added convenience method to VCF record to tell if it's a no call and have rodVCF use it before querying for info fields
-Don't restrict info fields to 2-letter keys
[about to move these to core]


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2002 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 20:52:51 +00:00
ebanks 74751a8ed3 -Some minor fixes to get accurate vcf record merging done
-Improvement to snp genotype concordance test

And with that, it looks like I get revision #2000.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2000 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 06:40:55 +00:00
ebanks 7ce0df76f8 Added accessors to the rod data sources so that walkers can access the name/file/type triplets for input rods. This is necessary if e.g. you want to create a vcf writer based on all of the samples being input.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1994 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 04:25:39 +00:00
ebanks d07f3bb6f6 Added methods to get strand bias and to test if record has allele freq or bias fields set.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1993 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-09 04:20:35 +00:00
kiran 3313b0ddb4 Fixed a minor bug where the lodThreshold wasn't being printed in the header.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1992 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 16:51:36 +00:00
kiran 567f5758d2 Optionally lists read depths by read group.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1990 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 16:39:19 +00:00
hanna 21c5f543fa Fix sharding bug -- loci to which >100,000 (= 1 shard) reads are assigned an
alignment start will confuse the sharding system and cause it to return duplicate reads.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1987 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-08 14:27:26 +00:00
ebanks d549347f25 Refactored GenotypeLikelihoods to use an underlying 4-base model.
It needs to be modified a bit and then hooked up to a pooled model, but that is now possible.
At this point, there is no difference to the Unified Genotyper.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1978 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-05 21:59:25 +00:00
aaron aacd72854f a fix for a bug Andrey discovered: in read-based interval traversals we're duplicating reads in rare cases. The problem was that to accommodate a bug in SAM JDK indexing, we were forced to add one to the stop of our QueryOverlapping() calls to ensure we always got all of the overlapping reads.
Added a PlusOneFixIterator that wraps other iterators, and eliminates reads that start outside of our intended interval (interval stop - 1).  Updated and checked BamToFastqIntegrationTest MD5 sums.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1976 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-05 05:26:33 +00:00
ebanks a545859c62 Joint Estimation model now emits a reasonable slod
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1969 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 21:12:42 +00:00
ebanks 11d950abe0 No longer allow the lod_threshold argument - use confidence instead.
Have UG output qscores in all cases.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1968 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 16:18:51 +00:00
asivache 2fb45dbd73 Make window size a command line argument
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1967 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 16:13:35 +00:00
asivache 55f61b1f88 Bug fix in adjustment of the shift position.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1966 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-03 16:08:11 +00:00
ebanks 3a33401822 2nd stage of the genotyper output refactoring is complete.
Now, all output is generalized and all of the intelligence lies where it is supposed to.
Next stage is syncing up old and new models and making sure we're outputting exactly what we should.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1960 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-02 22:43:08 +00:00
aaron ba67c7f02b added a warning for those using bed files; we properly convert bed to the internal representation but the user needs to be aware that any output will be one-based closed intervals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1959 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-02 21:09:18 +00:00
aaron b71b66bd88 the underlying parameter is a float so we need to use Float.valueOf() instead; Noticed by external user Hou Huabin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1958 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-02 20:22:25 +00:00
ebanks af6d0003f8 -Generalized the GenotypeConcordance module to deal with any number of individuals (although it will default to its old behavior if the -samples argument is left out).
-Make rods return the appropriate type of Genotype calls from getGenotype().



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1954 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-01 05:35:47 +00:00
asivache 4b0796ba58 After fixing a few glitches and bugs, this version finally works as intended
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1952 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-31 04:59:58 +00:00
asivache ea8d5c7077 Some internal refactoring. Now "safely" ignores duplicate records (NOT duplicate reads but rather malformed bam files!) resulting from the bug/feature in CleanedReadInjector.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1949 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-30 17:50:51 +00:00
ebanks 4ee1d6f733 -Have the calculation models determine whether a call passes the lod/confidence thresholds (as opposed to returning everything and letting the UG decide); this way, walkers which call map() will get only the good calls.
-Do the right thing in all models for all-base-mode (for Kiran).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1940 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-30 02:35:51 +00:00
asivache e3b4d4cbed Genotyper reimplemented. Does the same thing, at least for now, but internal data structures redesign enables collecting various statistics for indel-containing/reference-matching reads. The statistics are not yet used by the caller itself to make a better judgement w.r.t. the validity of the calls it makes, but they are now printed into the output stream (--verbose). The statistics (for both normal and tumor) include: indel observation count/total coverage, av. number of mismatches per indel-containing and per ref-matching read, av. mapping quality, av. mismatch rate and av. base quality within an NQS window around the indel, numbers of indel and ref observations per strand.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1936 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 19:09:16 +00:00
ebanks 5cdbdd9e5b now that the design is stable, pull the setReference and setLocation methods back out of Genotype and stick them into constructors of implementing classes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1931 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 13:27:37 +00:00
ebanks 3091443dc7 Sweeping changes to the genotype output system, as per several discussions with Matt & Aaron.
Some things still need to be changed, but it will entail some more design decisions first (which means I get to bug M&A again tomorrow!).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1930 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 03:46:41 +00:00
depristo 86573177d1 Reverting rod walkers to use underlying refwalker implementation while we work on ROD2 and reenable the system. Added some serious sparse file parsing to variant eval tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1929 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-29 01:04:37 +00:00
aaron 5a3bd50537 adding error log reporting to the GATK, and a stream based output method for the argument collection
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1926 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 19:56:05 +00:00
aaron 04e9a494e9 removed the GenotypesBacked interface, which is currently unused. Also cleaned up some documentation lines
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1924 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-28 18:08:14 +00:00
depristo 186a8dd698 Trivial protection for null value
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1918 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 21:52:52 +00:00
depristo 726378be8b Almost ready to stop doing eager decoding; waiting on Eric
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1914 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-27 15:28:05 +00:00
aaron 3fb3773098 a fix for traverse duplicates bug: GSA-202. Also removed some debugging output from FastaAltRef walker
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1912 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-26 20:18:55 +00:00
hanna a1e8a532ad Support for initialize() and onTraversalDone() output from parallelized walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1911 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-26 20:18:31 +00:00
ebanks 75ad6bbef7 Check that map isn't being called passing in null arguments.
(This seems wrong; see JIRA entry GSA-211)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1907 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-25 02:30:36 +00:00
hanna 65b98470f3 Temporary fix: have RodLocusView manage and close its RODs. Really the
relationship between these two classes needs to be rethought; see JIRA
GSA-207.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1904 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 16:00:12 +00:00
aaron ad1fc511b1 intermediate commit for some changes in the Variation system, so Eric can go ahead with his changes. Everything is pretty set, but the Variation interface could use a convenience method that joins all the alternate alleles.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1903 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 06:31:15 +00:00
ebanks 6c338eccb8 Joint Estimation model now emits calls in all formats.
The whole GenotypeCall framework needs to be changed, but this will work for the time being.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1902 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-23 03:07:28 +00:00
ebanks 54c61c663c -Cleanup of the Joint Estimation code
-Don't print verbose/debugging output to logger, but instead specify a file in the argument collection (and then we only need to print conditionally)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1899 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 15:25:29 +00:00
asivache 2cab4c68d4 Added method: isCodingExon(). Returns true if position is simultaneously within an exon AND within coding interval of any single transcript from the list. The old method of detecting coding positions as isExon() && isCoding() is buggy, as the position could be in the UTR part of one transcript (isExon() is true), and within coding region bounds (but not in the exon) of another transcript (isCoding() is true). As a result UTR positions would be erroneously annotated as coding.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1898 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-22 14:55:07 +00:00
ebanks 55fa1cfa06 -Renamed new calculation model and worked out some significant changes with Mark
-Allow walkers calling the UG to pass in their own argument collections


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1896 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 20:49:36 +00:00
ebanks 9b9744109c Mark's new unified calculation model is now officially implemented.
Because it doesn't actually use EM, it's no longer a subclass of the EM model.

Note that you can't use it just yet because it doesn't actually emit calls (just prints to logger).  I need to deal with general UG output tomorrow.  Hold off until then, Mark, and then you can go wild.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1891 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-21 02:39:23 +00:00
depristo caa3187af8 Enabling correct high-performance ROD walker and moved VariantEval over to it. Performance improvements in variantEval in general. See wiki for full description
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1890 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 23:31:13 +00:00
depristo 449a6ba75a Deleting lots of code as part of my cleanup. More classes tagged for removal. Many more walkers have their days numbered.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1885 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 12:23:36 +00:00
ebanks b8ab77c91c Don't filter out reads without proper read groups. Instead, allow the user (or another walker calling UG) to specify an assumed sample to use (but then we assume single-sample mode).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1883 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 01:30:53 +00:00
ebanks c29924e7cf Reverting previous change.
Aaron, it's all yours...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1881 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:55:24 +00:00
aaron d21b582b18 memory leak, where the Resource Pool was releasing based on the value and not the key, resulting in the resourceAssignments map growing with each additional shard
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1880 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:39:42 +00:00
ebanks 761a730758 assertBiAllelic -> assertMultiAllelic.
Chris, if this breaks an integration test, you get it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1879 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-20 00:09:46 +00:00
aaron cfa86d52c2 ensure that in the indel case we don't allow identification as both an insertion and deletion at the same location in the VCF ROD
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1875 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-19 18:21:00 +00:00
ebanks 51f9ec0a5c subtract largest posterior value from all values; this hopefully solves any precision issues
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1870 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-18 05:20:15 +00:00
ebanks b9e8867287 -push allele frequency and genotype likelihood variable definitions down into the subclasses so that they can use different data structures
-use slightly more stringent stability metric
-better integration test



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1869 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-18 04:22:17 +00:00
chartl ad777a9c14 @BasicPileup - made the counts public so they can be used
@PoolUtils - split reads by indel/simple base

@BaseTransitionTable - complete refactoring, nicer now

@UnifiedArgumentCollection - added PoolSize as an argument

@UnifiedGenotyper - checks to ensure pooled sequencing uses the appropriate model

@GenotypeCalculationModel - instantiates with the new PoolSize argument




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1867 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 21:56:56 +00:00
andrewk d1a4cd2f73 Added ValidationData analysis type to VariantEvalWalker; this eval takes a GFF file with validated truth data positions (bound to "validation") and calculates the accuracy of the genotype calls bound to "eval".
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1862 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 15:39:08 +00:00
ebanks 418e007ca6 A cleaner interface: now everyone can use UG's initialize method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1860 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 14:09:16 +00:00
aaron 96972c3a5c a fix for a bug Eric found: if your first call contains fewer samples than calls at other loci, your VCFHeader got setup incorrectly.
Also moved a bunch of Lists over to Sets for consistency.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1859 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 04:57:50 +00:00
aaron a69ea9b57c Cleaning up the VCF code, adding lots of tests for a variety of edge cases. Two issues are still outstanding: updating the no call string with the standard 1000g decided on today, and fixing Eric's issue where not all the VCF sample names are present initially.
also: their, I hope your happy Eric, from now on I'll try not to flout my awesomest grammer in the future accept when I need to illicit a strong response :-)

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1858 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 04:11:34 +00:00
ebanks 993c567bd8 I had to remove some of my more aggressive optimizations, as they were causing us to get slightly different results from MSG. Results in only small cost to running time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1856 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-16 00:59:32 +00:00
asivache 7d7ff09f54 throw an exception if read has no associated read group
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1855 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 18:11:32 +00:00
depristo 0c2016c19a Improved error messages -- now easier to read, points to the GATK Error Messages wiki, and avoids double printing of stack traces
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1850 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 12:07:44 +00:00
ebanks a32470cea1 Deal with the fact that walkers can call UG's init/map functions directly.
We need to filter contexts in that case since the calling walkers don't get UG's traversal-level filters.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1848 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-15 02:31:45 +00:00
ebanks e740e7a7ce Because walkers call UG's map function, we need to move the actual writing out
to UG's reduce function.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1845 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 20:49:26 +00:00
ebanks 52d2e0ca07 All walkers now use read.getReadGroup()
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1839 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 19:27:40 +00:00
aaron eb90e5c4d7 changes to VCF output, and updated MD5's in the integration tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1836 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:42:48 +00:00
ebanks 89771fef05 -Use read.getReadGroup()
-Add another filter for read groups for Chris


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1835 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 18:08:32 +00:00
ebanks 311ab8da5a A helper class to create the masks for the sequenom design maker.
This project is now officially done.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1834 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:28:51 +00:00
ebanks 0c95d6906f Merge both versions of the Sequenom assay design maker: use Jared's base code and add in indels. [Jared, this still emits the same output for SNPs as your original version.]
Remove all sequenom stuff from the FastaAlternateReferenceMaker so it can just concentrate on making alternate references...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1831 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 17:11:45 +00:00
ebanks f2886d88e0 We now emit genotype calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1828 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-14 02:49:56 +00:00
ebanks 96b8499a31 Remodeled version of the UnifiedGenotyper.
We currently get identical lods and slods as MultiSampleCaller (except slods for ref calls, as I discussed with Jared) and are a bit faster in my few test cases.  Single-sample mode still emulates SSG.
The remaining to do items:
1. more testing still needed
2. we currently only output lods/slods, but I need to emit actual calls
3. stubs are in place for Mark's proposed version of the EM calculation and now I need to add the actual code.
More check-ins coming soon...


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1821 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 20:27:01 +00:00
aaron 77499e35ac fixes for GSA-199: Need easier way to write binary outputs to standard output. GLF and VCF now have stream constructors, and can get dumped to standard out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1818 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-13 15:50:20 +00:00
ebanks caf689821f added method to get normalized posteriors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1809 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 02:33:22 +00:00
ebanks cf7a26759d -use the getReadGroup() function that was added to picard for us
-clean up some include lines


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1808 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-12 01:39:32 +00:00
hanna d844d1c496 SAMFileWriters specified as command-line arguments were sometimes incorrectly altering the default short name. Make sure short name is not specified if shortName is not specified but fullName is.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1807 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 19:16:46 +00:00
hanna da084357db Fixed minor typo in output message.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1806 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-09 18:56:54 +00:00
ebanks a9f3d46fa8 Your time has come, SSG.
Fare thee well.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1799 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 20:27:56 +00:00
aaron 98e3a0bf1a VCF can now be emitted from SSG. The basics are there (the genotype, read depth, our error estimate), but more fields need to be added for each record as necessary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1797 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 19:50:04 +00:00
kiran 29ad6cd876 Made redundant by BCMMarkDupes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1795 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 18:47:20 +00:00
ebanks 15bf014e0b logger.info -> logger.debug (don't want to risk filling up my log on genome-wide calls)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1792 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 17:53:11 +00:00
ebanks 04fe50cadd *** We no longer have a separate model for the single-sample case. ***
For now, a single sample input will be special-cased in the EM model - but that will change when the EM model degenerates to the single sample output with a single sample as input.  For now, the EM code for multi-samples isn't finished; I'm planning on checking that in soon.

The SingleSampleIntegrationTest now uses the UnifiedCaller instead of SSG, and so should all of you.  More on that in a separate email.
Other minor cleanups added too.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1785 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 14:08:57 +00:00
kiran 829e99413b Rescores a variant after removing duplicates (defined very strictly as reads with the same start points).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1782 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-08 03:07:36 +00:00
ebanks 1905b5defa Hash by chromosome for now to reduce memory. This is a temporary solution until we decide how to retire the Injector for good.
Also, with Picard's latest changes, we need to make sure we don't double-close the sam writer.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1779 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 20:06:25 +00:00
ebanks 203c626fc2 A wrapper around the GenotypeLikelihoods class for the UnifiedGenotyper. This wrapper incorporates both strand-based likelihoods and a combined likelihood over both strands.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1777 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-07 19:57:37 +00:00
depristo 8dd0924b37 Minor performance improvements to VariantEval -- now all of the CPU time is spent dealing with the ROD system...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1772 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 23:40:30 +00:00
aaron 4554ca1b28 more cleanup, deprecated the old genotype, corrected SNPCallsFromGenotypes' imports and two other classes that depend on it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1771 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 19:09:27 +00:00
aaron 3aec76136f Removing the AllelicVariant interface, which is replaced by the Variation interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1770 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 17:44:24 +00:00
aaron 66fc8ea444 GSA-182: Adding support for BED interval files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1767 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-06 02:45:31 +00:00
hanna aec83b401d SSG multithreading doesn't play well with some I/O changes made since I last svn up'd. Reverting until I can find the reason.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1766 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 19:48:57 +00:00
hanna 8a503c86b6 Code supporting SSG proof-of-concept shared memory parallelism.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1765 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 18:56:16 +00:00
ebanks fb619bd593 -Refactoring: make GenotypeCalculationModel constructors empty so that they don't have to be updated every time we add a new parameter; instead put that logic in the super class's initialize method (making everything protected so that only the factory can access them)
-Adding initial version of Multi-sample calculation model.  This still needs much work: it needs to be cleaned up and finished.  Right now, it (purposely) throws a RuntimeException after completing the EM loop.

Also:
-Fix logic in GenotypeLikelihoods.setPriors
-Add logger to the models for output






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1764 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-05 18:10:36 +00:00
aaron 7fc4472e6d A big fix for MergingSamRecordIterator, where we weren't correctly handling the comparisons of SAMRecords (we weren't applying the new reference index first, so sometimes the MT contig would be ID 23, sometimes 24 in different records).
Also a fix to the GLF tests, and a correction to PrintReadsWalker to remove the close() on the output source; the source handles that itself (and otherwise you get a double close).

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1758 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 19:35:35 +00:00
ebanks 53a4bd7f51 A better understanding of what's going on means no need for clearing the cache
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1755 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-02 18:07:46 +00:00
aaron e885cc4b21 changes for corrected GLF likelihood output, along with better tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1754 348d0f76-0448-11de-a6fe-93d51630548a
2009-10-01 20:45:05 +00:00
aaron 2e4949c4d6 Rev'ing Picard, which includes the update to get all the reads in the query region (GSA-173). With it come a bunch of fixes, including retiring the FourBaseRecaller code, and updated md5 for some walker tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1751 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 20:37:59 +00:00
ebanks 303972aa4b Yup, I broke the build...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1750 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 20:20:43 +00:00
ebanks 841d25cc44 Added ability to set the priors after construction (and requiring a flushing of the likelihoods cache)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1749 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-30 19:55:49 +00:00
hanna 70e1aef550 Better integrate the @ArgumentCollection into the command-line argument parser. Walkers can now specify their own @ArgumentCollections. Also cleaned up a bit of the CommandLineProgram template method pattern to minimize duplicate code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1746 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 22:23:19 +00:00
aaron b1c321f161 Adjusted Genotype concordance to more accurately use the new Genotyping code, fixed the VCF rod, and temp. fix the build by reintroducing Shermans ReadCigarFormatter
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1745 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 21:28:21 +00:00
ebanks 9ef80e3c3c One minor addition: to incorporate Pooled calling (and to be as general as possible), we allow the genotype calculation model to use rods if it wants.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1741 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 17:05:59 +00:00
ebanks 19bfe43173 First pass at a unified caller, being checked in now so Mark can give feedback if he chooses and so Matt can debug issues with the ArgumentCollection class.
Some notes:
1. This design should be flexible enough to include pooled calling (for now) after discussions with Chris.
2. Using the unified caller with the SingleSampleCalculationModel emits the exact same output as SSG over all of chr20 for NA12878.  Additionally, when we include the "max deletions allowed at a locus" argument (so we don't try to call SNPs at deletion sites), it removed 233 SNP calls in chr20 that were clearly indel artifacts.
3. The MultiSampleEMCalculationModel is still a work in progress and will be checked in later this week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1740 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 16:48:15 +00:00
andrewk 5dab95aa5a Fix getMergedReadGroupsByReaders so that it provides read groups in the same way Picard does so that it works correctly when input read files have no clashes in their read groups and retain their original read group names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1737 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-29 06:35:50 +00:00
asivache bce2f0d7cf Now instantiates the list of alternative consensuses to evaluate as LinkedHashSet to guarantee iterator traversal order. Old implementation used HashSet and exhibited unstable behavior when two alt consensuses turned out to be equally good: depending on the run conditions (including size of the interval set being cleaned??), either one could be seen first and selected as the 'best' one.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1734 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-28 06:15:46 +00:00
asivache 663175e868 Bug fix: when jumping onto next contig (chromosome), the walker was erasing last mismatch interval from the previous chr it was still holding without printing it; now it gets printed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1733 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 22:24:34 +00:00
asivache aec61c558b moving IndelGenotyper out from playground
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1731 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 19:43:53 +00:00
aaron 2b7d39035a switched over the FastaAlternateReferenceWalker to the Variation system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1726 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 16:09:43 +00:00
aaron 7ffc1d97ef Cut DeNovoSNPWalker over to the new Variation system, some renaming of methods on the Variation interface, and some corrections on the interface.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1724 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-25 04:35:52 +00:00
aaron d2af26e81f Pooled EM SNP Rod converted over to the Variation interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1719 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:33:11 +00:00
ebanks 97105ac001 We need to return a null RODRecordList when the default value is null (as opposed to a list with a single null value), because that's what everyone is expecting.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1718 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:23:12 +00:00
ebanks d4b40bc06f Filter for reads with missing read groups so we can safely assume all reads have valid read groups
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1717 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 16:10:26 +00:00
ebanks 90de2e0cde Added ability to specify whether you want to use a point estimate or fair coin test calculation; for now you can use either but fair coin test is still experimental as it needs to be parametrized correctly. This job will hopefully be done by the future Bioinformatic Analyst...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1716 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 15:29:50 +00:00
aaron d262cbd41c changes to add VCF to the rod system, fix VCF output in VariantsToVCF, and some other minor changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1715 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-24 15:16:11 +00:00
ebanks 423a3ee894 Added a sequenom rod to empower Carrie to convert 1KG validation SNPs to sequenom format
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1706 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 20:22:09 +00:00
hanna 856bbd0320 Let Picard specify the default compression level.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1701 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 19:01:48 +00:00
aaron f783cb30e0 adding an interface so that the current @Requires with ROD annotations work in walkers like VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1700 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:24:05 +00:00
hanna ebfbe56b43 Make sure compression level always gets pushed into SAMFileWriterFactory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1699 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:20:26 +00:00
asivache bf7cd66d53 New, simpler rodRefSeq. Fully relies on the ROD system standard mechanisms. Multiple transcripts over a given location will be now returned by the ROD system itself as RodRecordList<rodRefSeq>; and yes, rodRefSeq does represent a single transcript record now and implements Transcript interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1697 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:18:25 +00:00
asivache 8fa4c93f5a Transcript is now simply an interface
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1696 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:13:31 +00:00
asivache 1bd4c0077c Now that ROD system supports overlapping RODs, we do not need rodRefSeq to be too smart and read in all the overlapping records (transcripts) on its own; leave it to the generic ROD mechanism.
PARTIAL commit; new, simpler rodRefSeq will reappear in a sec.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1694 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-23 18:11:16 +00:00
aaron 11c32b588f fixing VariantEvalWalkerIntegrationTest md5 sums, a couple comment changes, and a little bit of cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1690 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 20:54:47 +00:00
ebanks 0748d80baa Added a convenience method in rodDbSNP to deal with Andrey's changes to the rod. Now you can just ask for the first real SNP rod from the list and not have to think about how it works.
CountCovariates uses it.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1688 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 20:15:40 +00:00
ebanks 682b765536 bug: need to upper case chars so that == works throughout
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1684 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 18:20:43 +00:00
asivache 57d31b8e9b Filter that discards reads from specific lanes; and also its friend that makes blacklisting a set of lanes from the GATK command line a one-liner.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1681 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 16:46:06 +00:00
ebanks 5ce42cbab3 After thinking about this a bit more, it makes sense to pull this functionality out of my walker and into the GenomeLocParser where everyone else can benefit from it...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1677 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-22 01:32:35 +00:00
ebanks b1dc6d65e4 interval merging is now blazingly fast
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1674 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 21:15:04 +00:00
asivache 15135788ca OK, let's bite the bullet. Now rodDbSNP objects are 'isSNP()' only when they are annotated as 'exact', not a 'range'.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1673 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 19:25:16 +00:00
asivache 8ad181f46f Note to myself: do 'ant clean' now and then or old versions of the code that suddenly became invalid will stick around. The world is not perfect, and neither is automatic dependency resolution.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1672 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 17:40:52 +00:00
asivache d2d1354199 Now uses BrokenRODSimulator class to pass the test. CHANGE the code to use new ROD system directly and MODIFY MD5 in corresponding tests, since a few snps are seen differently now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1670 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 17:03:49 +00:00
asivache 29adc0ca1c Little class that can be used to simulate the results returned by the old ROD system. This is needed to keep a couple of tests from breaking. All the code that uses this class must be changed urgently to accommodate the data as returned by the new ROD system, and the corresponding tests (MD5 sums) have to be modified as well since some data as seen through the new ROD system is indeed different.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1668 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 16:58:56 +00:00
asivache a6bd509593 Changing the carpet under your feet!! New incremental update to the ROD system has arrived.
All the updated classes now make use of new SeekableRodIterator instead of RODIterator. RODIterator class deleted. This batch makes only trivial updates to tests dictated by the change in the ROD system interface. Few less trivial updates to follow. This is a partial commit; a few walkers also still need to be updated, hold on...

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1667 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 16:55:22 +00:00
asivache 4c67a49ccb Removed unused imports
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1666 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 16:45:22 +00:00
hanna e7f44ada98 Make unpackList public static so that Doug can use it in the scatter/gather framework.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1665 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 15:32:49 +00:00
ebanks 7b627fd622 Check for empty interval lists to merge
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1664 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-21 04:34:26 +00:00
hanna 7f5778c966 Update gsadevelopers -> gsahelp.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1663 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-20 23:36:54 +00:00
aaron 3a487dd64e little fixes; also fixed a tyPo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1662 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 22:38:51 +00:00
aaron b6d7d6acc6 fix for the eval tests, and a change to the backedbygenotypes interface, more changes to come
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1661 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 22:25:16 +00:00
depristo 4318f75910 tiny cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1660 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 21:04:25 +00:00
depristo 3a341b2f06 Fixes for VariantEval for genotyping mode
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1659 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 21:01:43 +00:00
aaron 7b39aa4966 Adding the VCF ROD. Also changed the VCF objects to be much more user friendly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1658 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 20:19:34 +00:00
ebanks b19fd4d45c Damn unit tests have a null Toolkit()...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1654 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 17:10:49 +00:00
ebanks 90626c843d oops - we don't need reference bases, but we still need reference
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1653 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 16:24:45 +00:00
ebanks 2b2df4e1ba - Fix the CleanedReadInjector to deal with -L intervals correctly.
- Some walkers don't use the ref base, so speed up traversals by not requiring it


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1652 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 16:17:58 +00:00
asivache 94618044e8 Starting an update of ROD system. These basic classes will completely replace old ones, but with this update they are not linked to anything, so this checkpoint should be safe.
The main reason for the change is that there can be (and are!) multiple RODs overlapping with a single reference base position in a single track. There can be two "trivial" RODs at the same location (e.g. samtools pileup will have two point-like records at putative indel sites: one for the reference, the other one for the indel itself). Or there can be one or more "extended" RODs (length > 1), e.g. dbSNP can report an indel at Z:510-525 AND a SNP at Z:515.

The ReferenceOrderedDatum object (and children) will not be changed, but it is now explicitly interpreted as a single data *record*, possibly out of many available from a given track for the current site. As long as a single data record occupies one line in a data file, the new ROD system will take care of loading and keeping multiple records, including extended (length > 1) ones, and will automatically drop the records when they finally go out of scope. For one-line-per-record, multiple-records-per-site RODs, there is no need anymore for the hack used so far that involved passing ROD's own implementation of iterator through the reflection mechanism (though it will still work).

* RODRecordList:
the ROD system (its iterators) will now always return a LIST of all RODs available at current position or at current query interval (see below). This class is a trivial wrapper for a list of ROD objects, with added location argument for the whole collection. The location of the RODRecordList is where the ROD system is currently sitting at: a single, current base on the reference (if next() traversal is performed), or the location of the query interval when returned by seekForward() (see below). The ROD objects themselves will have their locations set according to the original data in the file. Hence, in the above example of a dbSNP indel at Z:510-525 and SNP at Z:515, when moving to the position Z:515 the ROD system will return a RODRecordList with location Z:515, and with two ROD objects packaged inside, one with location Z:510-525, the other with Z:515.

* RODRecordIterator:
Almost identical to old SimpleRODIterator used by ReferenceOrderedData; this is a low-level iterator that walks over records in the data file (with a callback to ROD's ::parseLine() to parse real data)

* SeekableRODIterator:
a decorator class that wraps around Iterator<ROD> (such as RODRecordIterator) and makes the data traversable by reference position, rather than record by record. This is a reimplementation of the old RODIterator.  SeekableRODIterator's ::next() moves to the next position on the ref and returns all RODs overlapping with that position (as a RODRecordList). This iterator also adds a seekForward(loc) operation that allows fast forwarding to a specified position or interval. Length > 1 query arguments (extended intervals) are fully supported by seekForward(); the returned RODRecordList will contain all RODs overlapping with the specified interval, and the location of the returned RODRecordList object will be set to that query interval. NOTE: it is ILLEGAL to perform next() after a seekForward() query with length > 1 interval. seekForward() with point-like (length=1) interval reenables next().


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1650 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-18 15:58:37 +00:00
hanna 355136928e Play nice with other jobs in this VM -- don't close stdout / stderr.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1646 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-17 18:55:08 +00:00
ebanks 5d85bd9671 By default, VF should ask for deleted bases so that they show up in coverage.
The Strand filter then needs to ignore those bases when determining bias.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1636 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-16 16:46:09 +00:00
hanna 01a9b1c63b Fix for problem where err stream remapped to output stream in certain cases, (hopefully) completing Matt's hat trick of fail. Thanks, unit tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1634 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-16 08:33:56 +00:00
hanna 9f7cf73411 Output stream management fixes. I completely screwed up the output stream management system, but cleverly masked this fact by breaking some other stream management functionality that masked the problem.
Sigh.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1630 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 21:06:45 +00:00
hanna 17758b381c Properly initialize redirected output streams in case of out and err.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1629 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 19:47:43 +00:00
andrewk 00dfe014b7 Added option to FastaReferenceWalker to change output FASTA file format's line width and to remove header lines; allows dumping raw sequence using intervals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1628 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 18:00:30 +00:00
hanna b69eb208a6 Always create output files, even if no output was written to them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1627 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 17:58:14 +00:00
aaron b401929e41 incremental clean-up and changes for VariantEval, moved DiploidGenotype to a better home, and fixed a spelling error.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1624 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-15 04:48:42 +00:00
ebanks 01e7b39c8d 1. Don't print out values in filter field of the VCF.
2. Fix ratio printouts (for params file)
3. Rename ratio filter's get counts method to avoid confusion; more changes on the way this week.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1616 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-14 21:03:39 +00:00
ebanks 436f543b3b I owe Doug a beer for finding this:
don't print out intervals to be merged if they're not within the global -L intervals


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1615 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-14 20:22:30 +00:00
aaron e03fccb223 Changes to switch Variant Eval over to the new Variation system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1611 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-14 05:34:33 +00:00
aaron 5b41ef5f70 rod DBSNP had a bug where the reference wasn't calculated correctly under certain conditions. Fixed getRefBasesFWD and getRefSnpFWD so that they were more in line with getAltBasesFWD and getAltSnpFWD. Also updated Variant Eval tests to reflect this change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1609 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-13 23:48:58 +00:00
ebanks c669e8d5ad Use constant seed in the random generator so we can be stable (and thus unit tests will work)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1607 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-13 17:40:56 +00:00
depristo 6c7a300664 Missing file
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1601 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:17:09 +00:00
depristo 6e13a36059 Framework for ROD walkers -- totally experimental and not working right now
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1600 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:13:15 +00:00
depristo e8d544869d Alignment context now supports the idea of skipped bases -- not currently in use
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1598 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:11:38 +00:00
depristo 3949b4ac72 commented out version of next() and hasNext() that appear to be correct but are causing testing problems
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1596 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:09:21 +00:00
depristo 58105636c8 getBoundRods() convenience method
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1595 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:07:57 +00:00
depristo 4e1eded389 Fixed bad compareTo operator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1594 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-12 19:07:10 +00:00
depristo 7c8b17b456 fix for SSG with pl name
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1591 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-11 20:39:34 +00:00
chartl d6a0b65ac9 Changes:
Rollback of Variant-related changes of r1585, additional PGC code




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1586 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-11 16:23:01 +00:00
chartl 0c54aba92a Changes:
@VariantEvalWalker - added a command line option to input a file path to a pooled call file for pooled genotype concordance checking. This string is to be passed to the PooledGenotypeConcordance object.

@AllelicVariant - added a method isPooled() to distinguish pooled AllelicVariants from unpooled ones.

@ all the rest - implemented isPooled(); for everything other than PooledEMSNProd it simply returns false, for PooledEMSNProd it returns true.

Added:

@PooledGenotypeConcordance - takes in a filepath to a pool file with the names of hapmap individuals for concordance checking with pooled calls
 and does said concordance checking over all pools. Commented out as all the methods are as yet unwritten.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1585 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-11 15:01:50 +00:00
aaron 5a64a80ab5 changes to the variation class, updates to SSG, updated tests based on changes to the SSGenotypeCall, and added the ability to run a single integration test using the build script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1577 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-10 04:31:33 +00:00
depristo c988205884 Notes for Aaron in SSG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1576 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-10 03:18:51 +00:00
ebanks 1362a56227 Added fasta tests and small fix to cleaner test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1575 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-10 03:13:11 +00:00
depristo 0093482c62 N reference base fix for SSG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1572 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 21:19:36 +00:00
ebanks cb31d5a0ab VariantFiltration now outputs VCF. Important changes:
1. VariantsToVCF can now be called statically to output VCF for a single ROD instance; this is temporary until we have a VCF ROD.
2. VariantFiltration now outputs only 2 files, both mandatory: all variants that pass filters in geli text, and all variants in VCF.
If there are any problems, go find Aaron.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1569 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 20:04:32 +00:00
asivache dd0085c428 1) now is tolerant to sloppy cigar strings with 0-length elements (at the price of extra recursive call)
2) when reads with deletions are requested, adds to the pile just those: reads with 'D' over the current reference base, but not 'N'
3) next() now implements a loop: recursive forward calls to next() until a ref. position with non-zero coverage is encountered were OK for (short) deletions, but with long stretches of N's they ended in a stack overflow
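
The recursion-to-loop change in item 3 can be illustrated with a minimal sketch. It is not the committed code: the class CoverageScan, both method names, and the int[] coverage array are hypothetical stand-ins for the iterator's internal state.

```java
// Hypothetical illustration only; the real iterator walks reads, not an int array.
final class CoverageScan {
    // Recursive form: one stack frame per uncovered position, so a very long
    // uncovered stretch may exhaust the call stack.
    static int nextCoveredRecursive(int[] coverage, int pos) {
        if (pos >= coverage.length) return -1;   // ran off the end
        if (coverage[pos] > 0) return pos;       // found a covered position
        return nextCoveredRecursive(coverage, pos + 1);
    }

    // Iterative form: constant stack depth regardless of how long the gap is.
    static int nextCoveredIterative(int[] coverage, int pos) {
        while (pos < coverage.length && coverage[pos] == 0) {
            pos++;
        }
        return pos < coverage.length ? pos : -1;
    }
}
```

Both forms return the same position on short gaps; only the loop is safe when the gap spans millions of bases.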

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1568 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 20:04:04 +00:00
ebanks 542af6402e output correct format for Sequenom SNPs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1567 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 19:21:53 +00:00
kiran 3b1e966b4c Lowercases the sequencing platform so that a difference in case doesn't lead to the failure to look up an entry in the hash.
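
A minimal sketch of the idea, assuming the table is a plain HashMap keyed by platform name. PlatformTable, its methods, and the Double value type are hypothetical and not taken from the commit.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical table keyed by sequencing platform; keys are normalized on both
// insert and lookup so that "SOLEXA", "Solexa" and "solexa" hit the same entry.
final class PlatformTable {
    private final Map<String, Double> byPlatform = new HashMap<>();

    private static String normalize(String platform) {
        // Locale.ROOT keeps the result independent of the machine's default locale.
        return platform.toLowerCase(Locale.ROOT);
    }

    void put(String platform, double value) {
        byPlatform.put(normalize(platform), value);
    }

    Double get(String platform) {
        return byPlatform.get(normalize(platform));
    }
}
```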
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1565 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:35:45 +00:00
kiran d82d6c0665 Excludes variants that fall below a certain LOD that changes as a function of depth.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1564 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:34:16 +00:00
kiran 06eae52292 Throws an exception if you attempt to use a filter that doesn't exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1563 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:33:27 +00:00
asivache 1060b36288 Bug fix: 'N' cigar elements now treated properly; for all practical intents and purposes, N is the same as D and should be treated as such, the difference is only in logical interpretation.
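
To illustrate why N and D can share one code path: both consume reference bases without consuming read bases. The sketch below computes the reference span of a CIGAR string; CigarSpan and referenceLength are hypothetical names, and the parser is a simplification that assumes well-formed input.

```java
// Hypothetical helper, not the committed code.
final class CigarSpan {
    // Number of reference bases spanned by a CIGAR string such as "10M3D5M".
    static int referenceLength(String cigar) {
        int total = 0;
        int length = 0;
        for (int i = 0; i < cigar.length(); i++) {
            char c = cigar.charAt(i);
            if (Character.isDigit(c)) {
                length = length * 10 + (c - '0');   // accumulate the element length
                continue;
            }
            switch (c) {
                case 'M': case 'D': case 'N': case '=': case 'X':
                    total += length;                // N is handled exactly like D
                    break;
                default:                            // I, S, H, P consume no reference
                    break;
            }
            length = 0;                             // reset for the next element
        }
        return total;
    }
}
```

A zero-length element such as 0D contributes nothing, so "10M0D5M" spans the same 15 reference bases as "10M5M".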
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1562 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 17:08:35 +00:00
chartl 9d69bd2c84 Modifications:
@CoverageAndPowerWalker - removed a hanging colon that was being printed after the reference position

@VariantEvalWalker - added a command line argument for pool size for eventual use in doing pooled caller evaluations. As of now, the variable is unused.

@AlignmentContext - altered the scope of class variables from private to protected in order that child objects might have access to them


New Additions:

Filtered Contexts

Sometimes we want to filter or partition reads by some aspect (quality score, read direction, current base, whatever) and use only those reads as
part of the alignment context. Prior to this I've been doing the split externally and creating a new AlignmentContext object. This new approach makes
it a bit easier, as each of these objects is a child of AlignmentContext and can be instantiated from a "raw" AlignmentContext.

@FilteredAlignmentContext is an abstract class that defines the behavior. The abstract method 'filter' is called on the input AlignmentContext, filtering
those reads and offsets by whatever you can think of. The filtered reads/offsets are then maintained in the reads and offsets fields. These classes can
be passed around as AlignmentContexts themselves. Writing a new kind of read-filtered alignment context boils down to implementing the filter method.

@ReverseReadsContext - a FilteredAlignmentContext that takes only reads in the reverse direction

@ForwardReadsContext - a FilteredAlignmentContext that takes only reads in the forward direction

@QualityScoreThresholdContext - a FilteredAlignmentContext that takes only reads above a given quality score threshold (defaults to 22 if none provided).

A unit test bamfile and associated unit tests for these are in the works.
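
The pattern described above can be sketched roughly as follows. This is a simplified illustration, not the committed classes: every Simple* type is a hypothetical stand-in, reads are reduced to a strand flag and base qualities, the per-read keep() hook replaces the real 'filter' method, and comparing the base quality at the read's offset is an assumption about what "quality score" means here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a read; the toolkit itself works on SAM records.
final class SimpleRead {
    final boolean reverseStrand;
    final byte[] baseQualities;

    SimpleRead(boolean reverseStrand, byte[] baseQualities) {
        this.reverseStrand = reverseStrand;
        this.baseQualities = baseQualities;
    }
}

// Simplified context: the reads over a locus and each read's offset at that locus.
class SimpleAlignmentContext {
    protected List<SimpleRead> reads;     // protected so children can fill them in
    protected List<Integer> offsets;

    SimpleAlignmentContext(List<SimpleRead> reads, List<Integer> offsets) {
        this.reads = reads;
        this.offsets = offsets;
    }

    List<SimpleRead> getReads() { return reads; }
    List<Integer> getOffsets() { return offsets; }
}

// Children only decide which read/offset pairs to keep.
abstract class SimpleFilteredContext extends SimpleAlignmentContext {
    protected SimpleFilteredContext() {
        super(new ArrayList<>(), new ArrayList<>());
    }

    protected abstract boolean keep(SimpleRead read, int offset);

    // Called by each child once its own fields are set, so keep() never sees
    // uninitialized state.
    protected final void applyFilter(SimpleAlignmentContext raw) {
        for (int i = 0; i < raw.getReads().size(); i++) {
            SimpleRead read = raw.getReads().get(i);
            int offset = raw.getOffsets().get(i);
            if (keep(read, offset)) {
                reads.add(read);
                offsets.add(offset);
            }
        }
    }
}

final class SimpleReverseReadsContext extends SimpleFilteredContext {
    SimpleReverseReadsContext(SimpleAlignmentContext raw) {
        applyFilter(raw);
    }

    @Override
    protected boolean keep(SimpleRead read, int offset) {
        return read.reverseStrand;
    }
}

final class SimpleQualityThresholdContext extends SimpleFilteredContext {
    static final int DEFAULT_THRESHOLD = 22;   // default stated in the commit message
    private final int threshold;

    SimpleQualityThresholdContext(SimpleAlignmentContext raw) {
        this(raw, DEFAULT_THRESHOLD);
    }

    SimpleQualityThresholdContext(SimpleAlignmentContext raw, int threshold) {
        this.threshold = threshold;
        applyFilter(raw);
    }

    @Override
    protected boolean keep(SimpleRead read, int offset) {
        // Assumption: the threshold applies to the base quality at the offset.
        return read.baseQualities[offset] > threshold;
    }
}
```

Because each filtered context is itself a SimpleAlignmentContext, it can be handed to any code that expects the raw type, and filters can be chained by wrapping one in another.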


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1559 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 15:49:52 +00:00
depristo d9588e6083 bug fixes to LIBS and LIBH following ultra-aggressive regression testing across 454, solid, and solexa
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1558 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 15:36:12 +00:00
asivache df11618092 Set default value of useLocusIteratorByHanger to FALSE. Otherwise the -LIBH flag is useless and there'd be no way to "unset" the 'true' value. Old version was (always) using LocusIteratorByHanger. Now default iterator is indeed LocusIteratorByState, and -LIBH will switch back to the old one
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1556 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-09 15:09:09 +00:00