19 Commits (15183ed778a2cfc27aa275696a06c98bb4eb17a9)

Author SHA1 Message Date
depristo 7eeabe534a QSample walker for 1KG -- measures aggregate quality of sequencing. Includes misc. improvements throughout the code, including using the new Tribble GenotypeLikelihoods class for working with VCF GLs from the UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4211 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 18:21:43 +00:00
asivache d53d5ffbf6 A utility class that computes running average and standard deviation for a stream of numbers it is being fed with. Updates mean/stddev on the fly and does not cache the observations, so it uses no memory and also should be stable against overflow/loss of precision. Simple unit test is also provided (does *not* stress-test the engine with millions of numbers though).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3944 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 21:39:02 +00:00
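
Note on d53d5ffbf6: the commit message does not say which update rule the utility class uses. A common way to get a running mean and standard deviation without storing observations is Welford's online update, sketched below as an assumption. The class and method names (RunningStats, add, stddev) are hypothetical and are not taken from the commit.

```java
/** Running mean and standard deviation via Welford's online update (an assumed technique). */
final class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the current mean

    /** Folds one observation into the running statistics; nothing is cached. */
    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // uses the already-updated mean
    }

    long count() { return n; }

    double mean() { return mean; }

    /** Sample standard deviation (n - 1 denominator); NaN with fewer than two observations. */
    double stddev() {
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : Double.NaN;
    }
}
```

Because only three fields are kept, memory use is constant in the number of observations, and the update never forms a large sum of squares of raw values.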
depristo 504103bd15 Misc. additions to correct utilities
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3329 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-07 21:34:18 +00:00
hanna c1e53d407d The copyright tag that I copied/pasted from a LaTeX document into IntelliJ had
unicode quote characters embedded in it.  These characters were invisible inside
IntelliJ but cause compile warnings for Ryan and Aaron, who for whatever reason
have a different default charset.  Fixed.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3203 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-20 15:26:32 +00:00
hanna 1bc26f69e9 An attempt to clean up the Utils directory. Email to follow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3198 348d0f76-0448-11de-a6fe-93d51630548a
2010-04-19 23:00:08 +00:00
depristo b8ab74a6dc Minor useful changes to BaseUtils and MathUtils to support a new haplotype score annotation that determines the two most likely haplotypes over an interval and scores variants by their consistency with a diploid model. Appears to be useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3085 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-28 21:45:22 +00:00
depristo 56092a0fc2 Slight cleanup for mathutils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3042 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 13:18:08 +00:00
chartl b42fc905e8 Added - new tests (Hapmap was re-added)
Modified - Hapmap now takes a -q command to filter out variants by quality
Modified - MathUtils - cumBinomialProbLog now uses BigDecimal to handle some numerical imprecisions
Modified - PowerBelowFrequency - returns 0.0 if called with a negative number (can't be done from inside the walker itself, but since it's called elsewhere one can't be too careful)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2350 348d0f76-0448-11de-a6fe-93d51630548a
2009-12-14 21:57:20 +00:00
ebanks 4558375575 Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated.
VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper.  UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it).

This is a fairly all-encompassing check in.  It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout.  All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing.

Stage 2 of the process will happen later this week.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a
2009-11-16 02:41:20 +00:00
chartl 8740124cda @ListUtils - Bugfix in getQScoreOrderStatistic: method would attempt to access an empty list fed into it. Now it checks for null pointers and returns 0.
@MathUtils - added a new method: cumBinomialProbLog which calculates a cumulant from any start point to any end point using the BinomProbabilityLog calculation.

@PoolUtils - added a new utility class specifically for items related to pooled sequencing. A major part of the power calculation is now to calculate powers
             independently by read direction. The only method in this class (currently) takes your reads and offsets, and splits them into two groups
             by read direction.

@CoverageAndPowerWalker - completely rewritten to split coverage, median qualities, and power by read direction. Makes use of cumBinomialProbLog rather than
                          doing that calculation within the object itself.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1462 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-27 19:31:53 +00:00
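
Note on 8740124cda: a cumulative binomial probability over a range can be built by summing log-space pmf terms, which is the idea the message describes for cumBinomialProbLog. The sketch below is an illustration only. The class name, the method signatures, the inclusive treatment of both end points, and the clamping to 1.0 are assumptions, not the MathUtils code.

```java
/** Illustrative log-space binomial helpers; names and signatures are hypothetical. */
final class BinomialLogSketch {
    /** log of C(n, k), accumulated as a sum of logs so no factorial is ever formed. */
    static double logChoose(int n, int k) {
        double s = 0.0;
        for (int i = 1; i <= k; i++) {
            s += Math.log(n - k + i) - Math.log(i);
        }
        return s;
    }

    /** P(X = k) for X ~ Binomial(n, p), evaluated in log space and exponentiated once. */
    static double binomialPmf(int k, int n, double p) {
        if (k < 0 || k > n) return 0.0;
        if (p <= 0.0) return k == 0 ? 1.0 : 0.0;
        if (p >= 1.0) return k == n ? 1.0 : 0.0;
        double logP = logChoose(n, k) + k * Math.log(p) + (n - k) * Math.log1p(-p);
        return Math.exp(logP);
    }

    /** P(start <= X <= end); both end points are treated as inclusive (an assumption). */
    static double cumulativeBinomial(int start, int end, int n, double p) {
        double sum = 0.0;
        for (int k = Math.max(start, 0); k <= Math.min(end, n); k++) {
            sum += binomialPmf(k, n, p);
        }
        return Math.min(sum, 1.0); // guard against rounding slightly above 1
    }
}
```

Summing in plain double precision can still lose accuracy when many tiny terms are added, which may be why a later commit (b42fc905e8) moved the accumulation to BigDecimal.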
chartl 92ea947c33 Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
       binomial pmf to avoid the coefficient blowing up and thus
       returning Infinity or NaN (or in some very strange cases
       -Infinity). The log calculation compares very well, it seems,
       with our current method. It's in MathUtils but could stand
       testing against rigorous truth data before becoming standard.

Added median calculator functions to ListUtils

getQScoreMedian is a new utility I wrote that given reads and
       offsets will find the median Q score. While I was at it, I wrote
       a similar method, getMedian, which will return the median of any
       list of Comparables, independent of initial order. These are in
       ListUtils.

Added a new poolseq directory and three walkers

CoverageAndPowerWalker is built on top of the PrintCoverage walker
       and prints out the power to detect a mutant allele in a pool of
       2*(number of individuals in the pool) alleles. It can be flagged
       either to do this by bootstrapping, or by pure math with a
       probability of error based on the median Q-score. This walker
       compiles, runs, and gives quite reasonable outputs that compare
       visually well to the power calculation computed by Syzygy.

ArtificialPoolWalker is designed to take multiple single-sample
       .bam files and create a (random) artificial pool. The coverage of
       that pool is a user-defined proportion of the total coverage over
       all of the input files. The output is not only a new .bam file,
       but also an auxiliary file that has for each locus, the genotype
       of the individuals, the confidence of that call, and that person's
       representation in the artificial pool .bam at that locus. This
       walker compiles and, uhh, looks pretty. Needs some testing.

AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calculations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-25 21:27:50 +00:00
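
Note on 92ea947c33: the message describes getMedian as returning the median of any list of Comparables, independent of initial order. A minimal sketch of that idea follows. The class name, the generic signature, the copy-then-sort approach, and the choice of the upper middle element for even-sized lists are assumptions, not the ListUtils code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative median helper; not the ListUtils implementation. */
final class MedianSketch {
    /** Median of any list of Comparables; the input is copied so its order is left untouched. */
    static <T extends Comparable<? super T>> T median(List<T> values) {
        if (values == null || values.isEmpty()) {
            throw new IllegalArgumentException("median of an empty list is undefined");
        }
        List<T> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        // For even sizes this returns the upper middle element (an assumption),
        // because arbitrary Comparables cannot be averaged.
        return sorted.get(sorted.size() / 2);
    }
}
```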
depristo 5487ab0ee6 Added several useful routines to MathUtils for summing and bounds checking of doubles
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1379 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-05 00:41:31 +00:00
ebanks e3b08f245f Pull out RMS calculation into MathUtils for all to use
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1364 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-03 17:00:20 +00:00
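
Note on e3b08f245f: the root mean square of a set of values is the square root of the mean of their squares. The sketch below shows that calculation under assumed names (RmsSketch, rms). The actual MathUtils signature is not given in the commit message.

```java
/** Illustrative RMS helper; the name and signature are hypothetical. */
final class RmsSketch {
    /** Square root of the mean of the squared values. */
    static double rms(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("rms of an empty array is undefined");
        }
        double sumOfSquares = 0.0;
        for (double v : values) {
            sumOfSquares += v * v;
        }
        return Math.sqrt(sumOfSquares / values.length);
    }
}
```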
depristo 819862e04e Major restructuring of the generalized variant analysis framework. Now trivially easy to add additional analyses. Easy partitioning of all analyses by features, such as singleton status. Now has transition/transversion bias, counting, dbSNP coverage, HWE violation, and selection of variants by presence/absence in dbs. Also restructured the ROD system to make it easier to add tracks. Also added the interval track -- if you provide an interval list, the system automatically makes it available to you as a bound rod, so you can always find out where you are in the interval at every site. Python scripts improved to handle more merging, etc., into population snps.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@918 348d0f76-0448-11de-a6fe-93d51630548a
2009-06-05 23:34:37 +00:00
kiran 16467ae7cf A better (less overflow-y) implementation of multinomialProbability().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@579 348d0f76-0448-11de-a6fe-93d51630548a
2009-05-01 06:28:16 +00:00
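
Note on 16467ae7cf: the message does not say how overflow was reduced. One standard approach, shown here as an assumption, is to evaluate the multinomial probability n! / (x_1! ... x_k!) * p_1^x_1 ... p_k^x_k entirely in log space and exponentiate once at the end. The class and method names are hypothetical, and counts are assumed to be non-negative.

```java
/** Illustrative log-space multinomial probability; not the committed implementation. */
final class MultinomialSketch {
    /** log(n!) as a sum of logs, so the factorial itself is never formed. */
    static double logFactorial(int n) {
        double s = 0.0;
        for (int i = 2; i <= n; i++) {
            s += Math.log(i);
        }
        return s;
    }

    /** Probability of observing counts[i] outcomes of category i, given probs[i]. */
    static double multinomialProbability(int[] counts, double[] probs) {
        if (counts.length != probs.length) {
            throw new IllegalArgumentException("counts and probs must have the same length");
        }
        int total = 0;
        double logP = 0.0;
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] == 0) continue;       // p^0 = 1; also avoids 0 * log(0) = NaN
            if (probs[i] == 0.0) return 0.0;    // an impossible category was observed
            total += counts[i];
            logP += counts[i] * Math.log(probs[i]) - logFactorial(counts[i]);
        }
        return Math.exp(logP + logFactorial(total));
    }
}
```

With counts of {100, 100}, a direct evaluation would need 200!, which exceeds the range of a double, while the log-space form stays finite.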
kiran b9c9dbb1d7 Added multinomialProbability method.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@545 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-27 15:03:50 +00:00
jmaguire dd408a2a9a First draft of actual pooled EM caller.
Produces sane looking output on region of 1kG pilot1:

    CALL NA12813.SRP000031.2009_02.bam CC 0.609084 0.609084
    CALL NA12003.SRP000031.2009_02.bam CC 2.114234 2.114234 CCCCC
    CALL NA06994.SRP000031.2009_02.bam CC 0.910114 0.910114 C
    CALL NA18940.SRP000031.2009_02.bam CT 2.589749 0.910114 T
    CALL NA18555.SRP000031.2009_02.bam CC 0.609084 0.609084

Next up, eval vs. Baseline pilot1 calls and pilot3 deep-coverage truth.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@525 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 13:42:15 +00:00
kiran 3cda85f2e3 New implementation of binomial probability that accurately computes values down to around 1e-237.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@520 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-24 03:32:04 +00:00
kiran 77e1e9e2f1 Added a static class to house useful math methods. All this has at the moment are methods for comparing doubles and floats, but I suggest that the bulk of our little math methods should be added here to avoid filling up Utils.java with so much random stuff.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@505 348d0f76-0448-11de-a6fe-93d51630548a
2009-04-23 17:45:19 +00:00
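
Note on 77e1e9e2f1: comparing doubles usually means treating two values as equal when they differ by less than a small tolerance. The sketch below illustrates that idea. The class name, the method name, and the default tolerance are assumptions, since the commit message does not list the actual methods.

```java
/** Illustrative tolerance-based comparison; names and the default epsilon are hypothetical. */
final class DoubleCompareSketch {
    static final double DEFAULT_EPSILON = 1e-6; // assumed tolerance, not taken from the commit

    /** Returns 0 if a and b are within epsilon, 1 if a is larger, -1 otherwise. NaN is not handled specially. */
    static int compareDoubles(double a, double b, double epsilon) {
        if (Math.abs(a - b) < epsilon) {
            return 0;
        }
        return a > b ? 1 : -1;
    }

    static int compareDoubles(double a, double b) {
        return compareDoubles(a, b, DEFAULT_EPSILON);
    }
}
```

An absolute tolerance like this suits values of roughly unit magnitude; for very large or very small values a relative tolerance may be the better choice.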