Commit Graph

1238 Commits (7d0a13d711f2b4fbb17b3c6221838ec3a4f5cd50)

Author SHA1 Message Date
depristo bdd0a6f9fa change to make build work
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1511 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 13:43:10 +00:00
depristo b01ac9de0c High performance LocusIterator implementation. Now with greatly reduced memory impact and 2x (and more potentially) speed ups of raw locus iteration. General performance improvements to SSG with empirical probs. You can enable high-performance locus iteration with the -LIBS arg. It's still testing but passes validing pileup.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1510 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 03:06:25 +00:00
jmaguire e2780c17af Checkin of the Multi-Sample SNP caller.
Doesn't work yet; same command I used to use now causes GATK to throw an exception.

Will check with Matt & Aaron tomorrow, then do a regression test.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1509 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-03 00:23:28 +00:00
hanna e2a79c5cd9 Checkpoint. The BWT that we generate now matches the first 16% of the BWT that BWT-SW generates. Cleaned up output streams to separate the byte packing / word packing from the data structure generation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1508 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 22:18:17 +00:00
ebanks 3dfc77dc89 Add an indel rod which represents the initial point of the indel only
(useful for alternate reference making)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1507 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 19:32:29 +00:00
asivache 58debd7e56 A convenience shortcut isReadUnmapped() added: thanks to SAM format specification, 'read unmapped' flag is not always required to be set for an unmapped read; this method checks both the flag and the alignment reference index/start (if those are set to '*' the flag is not required according to the spec!)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1506 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 17:00:39 +00:00
aaron 0e6feff8f2 fixed locus pile-up limiting problem
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1505 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 16:56:44 +00:00
hanna d8aff9a925 Bug fixes. Was ignoring the '$' character in a few places where I shouldn't have been.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1504 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 16:27:31 +00:00
ebanks 55013eff78 Re-revert back to point estimation for now. We need to do this right, just not yet.
Also, it's safer to let colt do the log factorial calculations for us. 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1503 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 15:33:18 +00:00
hanna 1ada085970 Cruddy implementation of BWT creation, for understanding and testing purposes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1501 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 02:16:56 +00:00
ebanks 24d809133d Oops - comment out the printouts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1500 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 01:45:56 +00:00
ebanks 91ccb0f8c5 Revert to having these filters use integration over binomial probs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1499 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-02 01:40:22 +00:00
aaron 05c164ec69 changing the default behavior to allow any sized read pile-up (which may exceed the memory limit); the user can then select their own read limit. The default of 100K was arbitrary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1498 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 14:46:00 +00:00
ebanks 54c0b6c430 Allow this ROD to consist of just the positions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1497 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 12:43:18 +00:00
aaron 4a1d79cd7b added a flag, maximum_reads_at_locus, shortName "mrl", which limits the number of reads we add to the locusByHanger. In some bam files misalignment produces pile-ups of 750K or more reads. We now limit this to the default of 100K reads.
The user is warned if a locus exceeds this threshold, and no more reads are added.

Also CombineDup walker had an incorrect package name.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1496 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 04:21:58 +00:00
ebanks 0addae967a IndelArtifact filter can now handle filtering false SNPs that occur within the span of an indel but after the first position
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1495 348d0f76-0448-11de-a6fe-93d51630548a
2009-09-01 03:34:39 +00:00
hanna 85ca68fab6 Initial version: creates a packed file from a fasta, suitable for consumption by BWT-SW. Works with E coli fasta, but will not work at this moment with multi-chr fastas.
Will be made into a utility routine when BWA comes together.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1494 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 18:39:19 +00:00
asivache 591f8eedbb Added setName() and getName() (however, not used anywhere yet). Now can set the name of the fasta record manually to whatever, however it will work only if done early enough. If the fasta record already started printing itself (i.e. the header line is already done), setName() will throw an exception. Could be too entangled, may reverse this back...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1493 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 18:09:55 +00:00
asivache c9eb193c7f Now recognizes a special name for a bound rod track: snpmask. If a rod with this name is bound, then ONLY snps from that track will be used (to set alt reference bases to N's), but indels will be ignored. This helps when an alt. ref has to be created for a set of indel calls, and another rod (e.g. dbSNP) is used to put N's in (for sequenom). If dbSNP rod is not marked as "snpmask", the indels reported there will make their way into the alt. reference output and mess it up.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1492 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 18:05:57 +00:00
ebanks 8e3c3324fa Added filter for SNPs cleaned out by the realigner.
It uses the realigner output for filtering; in addition, dbsnp indels partially work; IndelGenotyper calls don't yet work.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1489 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 04:32:32 +00:00
ebanks 8bc7afe781 Smarter SW penalties
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1488 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 04:29:19 +00:00
ebanks 463f80c03e Require each filter or feature to declare whether or not they want mapping quality zero reads in the alignment context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1487 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:37:24 +00:00
ebanks 1a299dd459 Require each filter or feature to declare whether or not they want mapping quality zero reads in the alignment context
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1486 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:31:37 +00:00
ebanks e70101febc Add a VEC filter for clustered SNP calls that takes advantage of the new windowed approach; delete the old standalone walker.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1485 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 03:14:42 +00:00
ebanks 215e908a11 Reworking of the VariantFiltration system to allow for a windowed view of variants and inclusion of more data to the various filters.
This now allows us to incorporate both the clustered SNP filter and a SNP-near-indels filter, which otherwise wasn't possible.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1484 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-31 02:16:39 +00:00
depristo 813a4e838f Removing old code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1482 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 19:27:11 +00:00
depristo 49a7babb2c Better organization of Genotype likelihood calculations. NewHotness is now just GenotypeLikelihoods. There are 1, 3, and empirical base error models available as subclasses, along with a simple way to make this (see the factory).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1481 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 19:16:30 +00:00
depristo 522e4a77ae Caching support across multiple technologies
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1480 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 18:10:14 +00:00
depristo 5af4bb628b Intermediate checking before code reorganization. Full blown support for empirical transition probs in SSG for all platforms. Support for defaultPlatform arg in SSG. Renaming classes for final cleanup
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1479 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-30 17:34:43 +00:00
depristo 6ab9ddf9f5 Significant output formatting improvements. SNPs as indels analysis. heterozygosity rate calculations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1478 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-29 21:49:09 +00:00
depristo bde67428fd Better formatting of the code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1477 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-29 21:46:47 +00:00
aaron 8331c195fb changed the full name of maximum_reads to maximum_iterations for consistancy
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1475 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 16:03:46 +00:00
depristo 8e129d76fd Support for original quality scores OQ flag. pQ flag in TableRecalibation to preserve quality scores below a threshold (defaulting to 5)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1474 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 14:14:21 +00:00
depristo f0179109fa Removing min confidence for on/off genotype
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1473 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 01:04:13 +00:00
depristo 4f7ed69242 toString() implemented
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1472 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 01:03:58 +00:00
depristo dc9d40eb9a Now requires a minimum genotype LOD before applying tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1471 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:19:23 +00:00
depristo 37a9b84276 corresponding test
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1470 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:17:42 +00:00
depristo bf60980653 Experitmental support for empirical P(B_true | B_miscall). --useEmpiricalTransitions flag to SSG enables this support. Much better implementation of Genotype likelihoods -- the system should scream along now. Continuing progress towards deleting old model
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1469 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-28 00:17:24 +00:00
depristo 7cf9a54b64 change for new char/byte in BaseUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1467 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 23:47:56 +00:00
depristo a639459112 Trival consistency change from char in to char out, not char in to byte out
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1466 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 23:37:37 +00:00
chartl 6012f7602b @ minor fixes to CoverageAndPowerWalker and AnalyzePowerWalker (switching to By Reference traversal, spitting out Syzygy position for sanity check)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1465 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 21:44:18 +00:00
chartl bd1e679bc5 @ Fixed issues with AnalyzePowerWalker which depended on CoverageAndPowerWalker. The latter was changed but not the former. Now fixed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1464 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 20:23:41 +00:00
kiran a17dad5fa9 Converts from fastq.gz to unaligned BAM format. Accepts a single fastq (for single-end run) or two fastqs (for paired-end run). Also allows you to set certain BAM metadata (read groups, etc.).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1463 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 20:20:09 +00:00
chartl 8740124cda @ListUtils - Bugfix in getQScoreOrderStatistic: method would attempt to access an empty list fed into it. Now it checks for null pointers and returns 0.
@MathUtils - added a new method: cumBinomialProbLog which calculates a cumulant from any start point to any end point using the BinomProbabilityLog calculation.

@PoolUtils - added a new utility class specifically for items related to pooled sequencing. A major part of the power calculation is now to calculate powers
             independently by read direction. The only method in this class (currently) takes your reads and offsets, and splits them into two groups
             by read direction.

@CoverageAndPowerWalker - completely rewritten to split coverage, median qualities, and power by read direction. Makes use of cumBinomialProbLog rather than
                          doing that calculation within the object itself.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1462 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 19:31:53 +00:00
chartl 1da45cffb3 New:
Minor changes to CoverageAndPowerWalker bootstrapping (faster selection of indeces).

Entirely new Aritifical Pool Walker (ArtificialPoolWalkerMk2), will likely replace ArtificialPoolWalker on the next commit. Adapted the method of sampling, and added a helper context class: ArtificialPoolContext which carries much of the burden of calculation and data handling for the walker. The walker itself maps and reduces ArtificialPoolContexts.

Cheers!






git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1461 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-26 21:42:35 +00:00
chartl 92ea947c33 Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
       binomial pmf to avoid the coefficient blowing up and thus
       returning Infinity or NaN (or in some very strange cases
       -Infinity). The log calculation compares very well, it seems
       with our current method. It's in MathUtils but could stand
       testing against rigorous truth data before becoming standard.

Added median calculator functions to ListUtils

getQScoreMedian is a new utility I wrote that given reads and
       offsets will find the median Q score. While I was at it, I wrote
       a similar method, getMedian, which will return the median of any
       list of Comparables, independent of initial order. These are in
       ListUtils.

Added a new poolseq directory and three walkers

CoverageAndPowerWalker is built on top of the PrintCoverage walker
       and prints out the power to detect a mutant allele in a pool of
       2*(number of individuals in the pool) alleles. It can be flagged
       either to do this by boostrapping, or by pure math with a
       probability of error based on the median Q-score. This walker
       compiles, runs, and gives quite reasonable outputs that compare
       visually well to the power calculation computed by Syzygy.

ArtificialPoolWalker is designed to take multiple single-sample
       .bam files and create a (random) artificial pool. The coverage of
       that pool is a user-defined proportion of the total coverage over
       all of the input files. The output is not only a new .bam file,
       but also an auxiliary file that has for each locus, the genotype
       of the individuals, the confidence of that call, and that person's
       representation in the artificial pool .bam at that locus. This
       walker compiles and, uhh, looks pretty. Needs some testing.

AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calcuations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:27:50 +00:00
kiran 478f426727 Fixed a missing method implementation in these two files.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1459 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:21:58 +00:00
kiran f12ea3a27e Added ability for all filters to return a probability for a given variant - interpreted as the probability that the given variant should be included in the final set. The joint probability of all the filters is computed to determine whether a variant should stay or go. At the moment, this is only visible in verbose mode (specify -V). Also removed 'learning mode'; now, filters emit important stats no matter what. Various code cleanups.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1458 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:17:56 +00:00
hanna e5115409fa Force columnSpacing to be at least one. We need a general-purpose, working tool for outputting columnar data to a PrintStream; will add JIRA.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1457 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 19:54:54 +00:00
aaron 811503d67b vcf changes from Richards comments, fixed a test case
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1456 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 14:32:16 +00:00