Commit Graph

1219 Commits (dab7b6e8257dcb558aaaad246d3aafd1414021d0)

Author SHA1 Message Date
ebanks 8e3c3324fa Added filter for SNPs cleaned out by the realigner.
It uses the realigner output for filtering; in addition, dbsnp indels partially work; IndelGenotyper calls don't yet work.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1489 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 04:32:32 +00:00
ebanks 8bc7afe781 Smarter SW penalties
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1488 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 04:29:19 +00:00
ebanks 463f80c03e Require each filter or feature to declare whether or not they want mapping quality zero reads in the alignment context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1487 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:37:24 +00:00
ebanks 1a299dd459 Require each filter or feature to declare whether or not they want mapping quality zero reads in the alignment context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1486 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:31:37 +00:00
ebanks e70101febc Add a VEC filter for clustered SNP calls that takes advantage of the new windowed approach; delete the old standalone walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1485 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:14:42 +00:00
ebanks 215e908a11 Reworking of the VariantFiltration system to allow for a windowed view of variants and inclusion of more data to the various filters.
This now allows us to incorporate both the clustered SNP filter and a SNP-near-indels filter, which otherwise wasn't possible.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1484 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 02:16:39 +00:00
depristo 813a4e838f Removing old code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1482 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 19:27:11 +00:00
depristo 49a7babb2c Better organization of Genotype likelihood calculations. NewHotness is now just GenotypeLikelihoods. There are 1, 3, and empirical base error models available as subclasses, along with a simple way to make this (see the factory).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1481 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 19:16:30 +00:00
depristo 522e4a77ae Caching support across multiple technologies
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1480 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 18:10:14 +00:00
depristo 5af4bb628b Intermediate check-in before code reorganization. Full-blown support for empirical transition probs in SSG for all platforms. Support for defaultPlatform arg in SSG. Renaming classes for final cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1479 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 17:34:43 +00:00
depristo 6ab9ddf9f5 Significant output formatting improvements. SNPs as indels analysis. heterozygosity rate calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1478 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-29 21:49:09 +00:00
depristo bde67428fd Better formatting of the code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1477 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-29 21:46:47 +00:00
aaron 8331c195fb changed the full name of maximum_reads to maximum_iterations for consistency
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1475 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 16:03:46 +00:00
depristo 8e129d76fd Support for original quality scores OQ flag. pQ flag in TableRecalibation to preserve quality scores below a threshold (defaulting to 5)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1474 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 14:14:21 +00:00
depristo f0179109fa Removing min confidence for on/off genotype
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1473 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 01:04:13 +00:00
depristo 4f7ed69242 toString() implemented
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1472 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 01:03:58 +00:00
depristo dc9d40eb9a Now requires a minimum genotype LOD before applying tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1471 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:19:23 +00:00
depristo 37a9b84276 corresponding test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1470 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:17:42 +00:00
depristo bf60980653 Experimental support for empirical P(B_true | B_miscall). --useEmpiricalTransitions flag to SSG enables this support. Much better implementation of Genotype likelihoods -- the system should scream along now. Continuing progress towards deleting old model
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1469 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:17:24 +00:00
depristo 7cf9a54b64 change for new char/byte in BaseUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1467 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 23:47:56 +00:00
depristo a639459112 Trivial consistency change from char in to char out, not char in to byte out
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1466 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 23:37:37 +00:00
chartl 6012f7602b @ minor fixes to CoverageAndPowerWalker and AnalyzePowerWalker (switching to By Reference traversal, spitting out Syzygy position for sanity check)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1465 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 21:44:18 +00:00
chartl bd1e679bc5 @ Fixed issues with AnalyzePowerWalker which depended on CoverageAndPowerWalker. The latter was changed but not the former. Now fixed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1464 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 20:23:41 +00:00
kiran a17dad5fa9 Converts from fastq.gz to unaligned BAM format. Accepts a single fastq (for single-end run) or two fastqs (for paired-end run). Also allows you to set certain BAM metadata (read groups, etc.).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1463 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 20:20:09 +00:00
chartl 8740124cda @ListUtils - Bugfix in getQScoreOrderStatistic: method would attempt to access an empty list fed into it. Now it checks for null pointers and returns 0.
@MathUtils - added a new method: cumBinomialProbLog which calculates a cumulant from any start point to any end point using the BinomProbabilityLog calculation.

@PoolUtils - added a new utility class specifically for items related to pooled sequencing. A major part of the power calculation is now to calculate powers
             independently by read direction. The only method in this class (currently) takes your reads and offsets, and splits them into two groups
             by read direction.

@CoverageAndPowerWalker - completely rewritten to split coverage, median qualities, and power by read direction. Makes use of cumBinomialProbLog rather than
                          doing that calculation within the object itself.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1462 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 19:31:53 +00:00
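The cumulative calculation described for cumBinomialProbLog in commit 8740124cda can be sketched roughly as below. This is a minimal illustration, not the MathUtils code: the signature, the inclusive treatment of both end points, and the helper logBinomialPmf are assumptions made for the example.

```java
public final class CumulativeBinomialSketch {
    private CumulativeBinomialSketch() {}

    /** Hypothetical helper: log P(X = k) for X ~ Binomial(n, p), with 0 < p < 1 and 0 <= k <= n. */
    public static double logBinomialPmf(int k, int n, double p) {
        double logCoefficient = 0.0;
        // log C(n, k) = sum over i = 1..k of [log(n - k + i) - log(i)]
        for (int i = 1; i <= k; i++) {
            logCoefficient += Math.log(n - k + i) - Math.log(i);
        }
        return logCoefficient + k * Math.log(p) + (n - k) * Math.log1p(-p);
    }

    /**
     * Sums P(X = k) for k from start to end, both inclusive (an assumption),
     * evaluating each term in log space before exponentiating.
     */
    public static double cumBinomialProbLog(int start, int end, int n, double p) {
        if (n < 0 || !(p > 0.0 && p < 1.0)) {
            throw new IllegalArgumentException("need n >= 0 and 0 < p < 1");
        }
        int first = Math.max(start, 0); // clamp the range to valid outcomes
        int last = Math.min(end, n);
        double total = 0.0;
        for (int k = first; k <= last; k++) {
            total += Math.exp(logBinomialPmf(k, n, p));
        }
        return total;
    }

    public static void main(String[] args) {
        // Probability of at most one success in two fair trials.
        System.out.println(cumBinomialProbLog(0, 1, 2, 0.5));
    }
}
```

Each term is exponentiated only after the coefficient and the probability powers have been combined as logs, so the coefficient never has to be represented on its own.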
chartl 1da45cffb3 New:
Minor changes to CoverageAndPowerWalker bootstrapping (faster selection of indices).

Entirely new Artificial Pool Walker (ArtificialPoolWalkerMk2), will likely replace ArtificialPoolWalker on the next commit. Adapted the method of sampling, and added a helper context class: ArtificialPoolContext which carries much of the burden of calculation and data handling for the walker. The walker itself maps and reduces ArtificialPoolContexts.

Cheers!






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1461 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-26 21:42:35 +00:00
chartl 92ea947c33 Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
       binomial pmf to avoid the coefficient blowing up and thus
       returning Infinity or NaN (or in some very strange cases
       -Infinity). The log calculation compares very well, it seems
       with our current method. It's in MathUtils but could stand
       testing against rigorous truth data before becoming standard.

Added median calculator functions to ListUtils

getQScoreMedian is a new utility I wrote that given reads and
       offsets will find the median Q score. While I was at it, I wrote
       a similar method, getMedian, which will return the median of any
       list of Comparables, independent of initial order. These are in
       ListUtils.

Added a new poolseq directory and three walkers

CoverageAndPowerWalker is built on top of the PrintCoverage walker
       and prints out the power to detect a mutant allele in a pool of
       2*(number of individuals in the pool) alleles. It can be flagged
       either to do this by bootstrapping, or by pure math with a
       probability of error based on the median Q-score. This walker
       compiles, runs, and gives quite reasonable outputs that compare
       visually well to the power calculation computed by Syzygy.

ArtificialPoolWalker is designed to take multiple single-sample
       .bam files and create a (random) artificial pool. The coverage of
       that pool is a user-defined proportion of the total coverage over
       all of the input files. The output is not only a new .bam file,
       but also an auxiliary file that has for each locus, the genotype
       of the individuals, the confidence of that call, and that person's
       representation in the artificial pool .bam at that locus. This
       walker compiles and, uhh, looks pretty. Needs some testing.

AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calculations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:27:50 +00:00
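The log-space binomial calculation described in commit 92ea947c33 can be illustrated with a short sketch. It assumes the same (int k, int n, double p) parameter order quoted in the message; the body, the logFactorial helper, and the edge-case handling are written for this example and are not taken from MathUtils.

```java
public final class BinomialLogSketch {
    private BinomialLogSketch() {}

    /** Hypothetical helper: log(n!) as a sum of logs, so no factorial is ever formed. */
    public static double logFactorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    /** Binomial pmf P(X = k) for X ~ Binomial(n, p), evaluated in log space. */
    public static double binomialProbabilityLog(int k, int n, double p) {
        if (k < 0 || k > n) {
            return 0.0; // outside the support
        }
        if (p <= 0.0) {
            return k == 0 ? 1.0 : 0.0; // avoid log(0)
        }
        if (p >= 1.0) {
            return k == n ? 1.0 : 0.0; // avoid log(0)
        }
        // log C(n, k) stays moderate even when C(n, k) itself overflows a double.
        double logCoefficient = logFactorial(n) - logFactorial(k) - logFactorial(n - k);
        double logProbability = logCoefficient + k * Math.log(p) + (n - k) * Math.log1p(-p);
        return Math.exp(logProbability);
    }

    public static void main(String[] args) {
        // C(2000, 1000) overflows a double, but the log-space result is finite.
        System.out.println(binomialProbabilityLog(1000, 2000, 0.5));
    }
}
```

Computing the coefficient directly for large n is what produces the Infinity and NaN results the message mentions; keeping everything as logs until the final Math.exp avoids that.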
kiran 478f426727 Fixed a missing method implementation in these two files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1459 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:21:58 +00:00
kiran f12ea3a27e Added ability for all filters to return a probability for a given variant - interpreted as the probability that the given variant should be included in the final set. The joint probability of all the filters is computed to determine whether a variant should stay or go. At the moment, this is only visible in verbose mode (specify -V). Also removed 'learning mode'; now, filters emit important stats no matter what. Various code cleanups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1458 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:17:56 +00:00
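Commit f12ea3a27e does not say how the joint probability is computed. One plausible reading, sketched below, multiplies the per-filter inclusion probabilities as if the filters were independent and compares the product with a cutoff. The independence assumption, the threshold parameter, and all names here are hypothetical and not taken from the VariantFiltration code.

```java
import java.util.Arrays;
import java.util.List;

public final class JointFilterSketch {
    private JointFilterSketch() {}

    /** Product of per-filter inclusion probabilities (assumes independent filters). */
    public static double jointInclusionProbability(List<Double> perFilterProbabilities) {
        double joint = 1.0;
        for (double p : perFilterProbabilities) {
            if (p < 0.0 || p > 1.0) {
                throw new IllegalArgumentException("probability out of range: " + p);
            }
            joint *= p;
        }
        return joint;
    }

    /** Keeps the variant when the joint probability reaches a hypothetical threshold. */
    public static boolean keepVariant(List<Double> perFilterProbabilities, double threshold) {
        return jointInclusionProbability(perFilterProbabilities) >= threshold;
    }

    public static void main(String[] args) {
        List<Double> probabilities = Arrays.asList(0.9, 0.5);
        System.out.println(jointInclusionProbability(probabilities));
        System.out.println(keepVariant(probabilities, 0.5));
    }
}
```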
hanna e5115409fa Force columnSpacing to be at least one. We need a general-purpose, working tool for outputting columnar data to a PrintStream; will add JIRA.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1457 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 19:54:54 +00:00
aaron 811503d67b vcf changes from Richard's comments, fixed a test case
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1456 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 14:32:16 +00:00
hanna ccdb4a0313 General-purpose management of output streams.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1454 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-23 00:56:02 +00:00
aaron b316abd20f catch a malformed column header name more gracefully
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1453 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-21 21:05:28 +00:00
aaron 0364f8e989 added the ability of the VCFReader to take in compressed gzipped files natively, which is really useful for the validator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1452 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-21 18:40:38 +00:00
aaron 647a367680 Made the size zero interval file checker emit a warnUser if we're not in unsafe mode.
Also changed the default logger level from error to warn.  Does anyone object?  It makes sense for users to always get their warn user statements in the default logging level.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1451 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-21 14:40:57 +00:00
aaron df9133c90b the doc on File.length states it returns 0L if it doesn't exist, added a check to make sure it exists (and length < 1)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1450 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-21 05:55:17 +00:00
aaron cd711d7697 Added detection of interval files with zero length to the GATK, and removed it from the interval merger walker: this was a critical blocking emergency issue for Eric.
Also fixed some verbiage in the GAEngine.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1449 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-21 05:35:49 +00:00
asivache 0bdecd8651 A most stupid bug. In cases when more than one indel variant was present in cleaned bam file, the "consensus" (max. # of occurrences) call was computed incorrectly, and most of the time the call itself was not made at all. Fortunately, the locations where we see multiple indels are a minority, and many of them are suspicious anyway (manifestation of alignment problems?). Could change results of POOLED calls though.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1448 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 22:31:44 +00:00
aaron 6313c465fb We want the RMS of the read qualities, not the RMS of the RMS of the read qualities.
Also the VCF version tag seems to be standardized as VCR.  Updated the VCF code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1447 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 21:56:29 +00:00
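The distinction drawn in commit 6313c465fb can be seen with a small numeric sketch. It assumes the qualities arrive in groups of unequal size (the grouping and all names are hypothetical); when every group has the same size the two quantities coincide, so the difference only shows up with uneven groups.

```java
public final class RmsSketch {
    private RmsSketch() {}

    /** Root mean square of a non-empty array. */
    public static double rms(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("values must not be empty");
        }
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        return Math.sqrt(sumOfSquares / values.length);
    }

    /** RMS over every value in every group, pooled together. */
    public static double pooledRms(double[][] groups) {
        double sumOfSquares = 0.0;
        int count = 0;
        for (double[] group : groups) {
            for (double v : group) {
                sumOfSquares += v * v;
                count++;
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("no values");
        }
        return Math.sqrt(sumOfSquares / count);
    }

    /** RMS of the per-group RMS values; weights every group equally regardless of size. */
    public static double rmsOfRms(double[][] groups) {
        double[] perGroup = new double[groups.length];
        for (int i = 0; i < groups.length; i++) {
            perGroup[i] = rms(groups[i]);
        }
        return rms(perGroup);
    }

    public static void main(String[] args) {
        double[][] groups = { {10, 10, 10, 10}, {40} };
        System.out.println(pooledRms(groups)); // sqrt(3200 / 5)
        System.out.println(rmsOfRms(groups));  // sqrt((100 + 1600) / 2)
    }
}
```

In this example the single high-quality value dominates the RMS of RMS because its one-element group counts as much as the four-element group.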
kcibul 6c0adc9145 reuse fasta file reader
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1446 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 16:01:58 +00:00
aaron 0386e110cf some documentation changes, add a couple of simple checks
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1445 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 05:20:27 +00:00
ebanks 10c98c418b Walker to determine the concordance of 2 genotype call sets.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1443 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 01:32:44 +00:00
ebanks 1d74143ef4 A convenience argument - for Mark - so that you don't have to specify all the output file names
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1442 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-20 00:49:12 +00:00
aaron 5725de56dc fixes in VCF, some changes to get it ready to move out of the GATK
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1441 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-19 23:31:03 +00:00
aaron 0b927f44fa created a better separation between instantiation of a VCF object and the object itself
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1440 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-19 20:32:50 +00:00
ebanks ed8c92a12a make isReference do the right thing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1439 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-19 20:32:29 +00:00
hanna 21091b9839 Fix for invalid format error when outputting BAM files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1438 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-19 19:42:39 +00:00
aaron 4cf9110468 Adding a lot of changes to the VCF code, plus a new basic validator. Also removing an extra copy of the Artificial SAM generator that got checked in at some point.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1437 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-19 05:08:28 +00:00
ebanks b3fe566c0c Fix descriptions of walker args
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1436 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 19:46:48 +00:00
ebanks 82e2b7017e Prevent array bounds errors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1435 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-18 16:54:31 +00:00