2009-04-24 01:45:19 +08:00
|
|
|
package org.broadinstitute.sting.utils;
|
|
|
|
|
|
2009-04-24 11:32:04 +08:00
|
|
|
import cern.jet.math.Arithmetic;
|
|
|
|
|
|
2009-04-24 01:45:19 +08:00
|
|
|
/**
|
|
|
|
|
* MathUtils is a static class (no instantiation allowed!) with some useful math methods.
|
|
|
|
|
*
|
|
|
|
|
* @author Kiran Garimella
|
|
|
|
|
*/
|
|
|
|
|
public class MathUtils {
|
Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
binomial pmf to avoid the coefficient blowing up and thus
returning Infinity or NaN (or in some very strange cases
-Infinity). The log calculation compares very well, it seems
with our current method. It's in MathUtils but could stand
testing against rigorous truth data before becoming standard.
Added median calculator functions to ListUtils
getQScoreMedian is a new utility I wrote that given reads and
offsets will find the median Q score. While I was at it, I wrote
a similar method, getMedian, which will return the median of any
list of Comparables, independent of initial order. These are in
ListUtils.
Added a new poolseq directory and three walkers
CoverageAndPowerWalker is built on top of the PrintCoverage walker
and prints out the power to detect a mutant allele in a pool of
2*(number of individuals in the pool) alleles. It can be flagged
either to do this by boostrapping, or by pure math with a
probability of error based on the median Q-score. This walker
compiles, runs, and gives quite reasonable outputs that compare
visually well to the power calculation computed by Syzygy.
ArtificialPoolWalker is designed to take multiple single-sample
.bam files and create a (random) artificial pool. The coverage of
that pool is a user-defined proportion of the total coverage over
all of the input files. The output is not only a new .bam file,
but also an auxiliary file that has for each locus, the genotype
of the individuals, the confidence of that call, and that person's
representation in the artificial pool .bam at that locus. This
walker compiles and, uhh, looks pretty. Needs some testing.
AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calcuations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-26 05:27:50 +08:00
|
|
|
|
|
|
|
|
/** Public constants - used for the Lanczos approximation to the factorial function
|
|
|
|
|
* (for the calculation of the binomial/multinomial probability in logspace)
|
|
|
|
|
* @param LANC_SEQ[] - an array holding the constants which correspond to the product
|
|
|
|
|
* of Chebyshev Polynomial coefficients, and points on the Gamma function (for interpolation)
|
|
|
|
|
* [see A Precision Approximation of the Gamma Function J. SIAM Numer. Anal. Ser. B, Vol. 1 1964. pp. 86-96]
|
|
|
|
|
* @param LANC_G - a value for the Lanczos approximation to the gamma function that works to
|
|
|
|
|
* high precision
|
|
|
|
|
*/
|
|
|
|
|
|
|
|
|
|
|
2009-04-24 01:45:19 +08:00
|
|
|
/** Private constructor. No instantiating this class! */
|
|
|
|
|
private MathUtils() {}
|
|
|
|
|
|
2009-08-05 08:41:31 +08:00
|
|
|
public static double sum(double[] values) {
|
|
|
|
|
double s = 0.0;
|
|
|
|
|
for ( double v : values) s += v;
|
|
|
|
|
return s;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
public static double sumLog10(double[] log10values) {
|
|
|
|
|
double s = 0.0;
|
|
|
|
|
for ( double v : log10values) s += Math.pow(10.0, v);
|
|
|
|
|
return s;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
public static boolean wellFormedDouble(double val) {
|
|
|
|
|
return ! Double.isInfinite(val) && ! Double.isNaN(val);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
public static boolean isBounded(double val, double lower, double upper) {
|
|
|
|
|
return val >= lower && val <= upper;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
public static boolean isPositive(double val) {
|
|
|
|
|
return ! isNegativeOrZero(val);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
public static boolean isPositiveOrZero(double val) {
|
|
|
|
|
return isBounded(val, 0.0, Double.POSITIVE_INFINITY);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
public static boolean isNegativeOrZero(double val) {
|
|
|
|
|
return isBounded(val, Double.NEGATIVE_INFINITY, 0.0);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
public static boolean isNegative(double val) {
|
|
|
|
|
return ! isPositiveOrZero(val);
|
|
|
|
|
}
|
|
|
|
|
|
2009-04-24 01:45:19 +08:00
|
|
|
/**
|
|
|
|
|
* Compares double values for equality (within 1e-6), or inequality.
|
|
|
|
|
*
|
|
|
|
|
* @param a the first double value
|
|
|
|
|
* @param b the second double value
|
|
|
|
|
* @return -1 if a is greater than b, 0 if a is equal to be within 1e-6, 1 if b is greater than a.
|
|
|
|
|
*/
|
|
|
|
|
public static byte compareDoubles(double a, double b) { return compareDoubles(a, b, 1e-6); }
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Compares double values for equality (within epsilon), or inequality.
|
|
|
|
|
*
|
|
|
|
|
* @param a the first double value
|
|
|
|
|
* @param b the second double value
|
|
|
|
|
* @param epsilon the precision within which two double values will be considered equal
|
|
|
|
|
* @return -1 if a is greater than b, 0 if a is equal to be within epsilon, 1 if b is greater than a.
|
|
|
|
|
*/
|
|
|
|
|
public static byte compareDoubles(double a, double b, double epsilon)
|
|
|
|
|
{
|
|
|
|
|
if (Math.abs(a - b) < epsilon) { return 0; }
|
|
|
|
|
if (a > b) { return -1; }
|
|
|
|
|
return 1;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Compares float values for equality (within 1e-6), or inequality.
|
|
|
|
|
*
|
|
|
|
|
* @param a the first float value
|
|
|
|
|
* @param b the second float value
|
|
|
|
|
* @return -1 if a is greater than b, 0 if a is equal to be within 1e-6, 1 if b is greater than a.
|
|
|
|
|
*/
|
|
|
|
|
public static byte compareFloats(float a, float b) { return compareFloats(a, b, 1e-6f); }
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Compares float values for equality (within epsilon), or inequality.
|
|
|
|
|
*
|
|
|
|
|
* @param a the first float value
|
|
|
|
|
* @param b the second float value
|
|
|
|
|
* @param epsilon the precision within which two float values will be considered equal
|
|
|
|
|
* @return -1 if a is greater than b, 0 if a is equal to be within epsilon, 1 if b is greater than a.
|
|
|
|
|
*/
|
|
|
|
|
public static byte compareFloats(float a, float b, float epsilon)
|
|
|
|
|
{
|
|
|
|
|
if (Math.abs(a - b) < epsilon) { return 0; }
|
|
|
|
|
if (a > b) { return -1; }
|
|
|
|
|
return 1;
|
|
|
|
|
}
|
2009-04-24 11:32:04 +08:00
|
|
|
|
2009-04-24 21:42:15 +08:00
|
|
|
public static double NormalDistribution(double mean, double sd, double x)
|
|
|
|
|
{
|
|
|
|
|
double a = 1.0 / (sd*Math.sqrt(2.0 * Math.PI));
|
|
|
|
|
double b = Math.exp(-1.0 * (Math.pow(x - mean,2.0)/(2.0 * sd * sd)));
|
|
|
|
|
return a * b;
|
|
|
|
|
}
|
|
|
|
|
|
2009-04-24 11:32:04 +08:00
|
|
|
/**
|
2009-04-27 23:03:50 +08:00
|
|
|
* Computes a binomial probability. This is computed using the formula
|
|
|
|
|
*
|
|
|
|
|
* B(k; n; p) = [ n! / ( k! (n - k)! ) ] (p^k)( (1-p)^k )
|
|
|
|
|
*
|
|
|
|
|
* where n is the number of trials, k is the number of successes, and p is the probability of success
|
2009-04-24 11:32:04 +08:00
|
|
|
*
|
|
|
|
|
* @param k number of successes
|
|
|
|
|
* @param n number of Bernoulli trials
|
|
|
|
|
* @param p probability of success
|
|
|
|
|
*
|
|
|
|
|
* @return the binomial probability of the specified configuration. Computes values down to about 1e-237.
|
|
|
|
|
*/
|
2009-04-27 23:03:50 +08:00
|
|
|
public static double binomialProbability(int k, int n, double p) {
|
2009-04-24 11:32:04 +08:00
|
|
|
return Arithmetic.binomial(n, k)*Math.pow(p, k)*Math.pow(1.0 - p, n - k);
|
2009-04-27 23:03:50 +08:00
|
|
|
//return (new cern.jet.random.Binomial(n, p, cern.jet.random.engine.RandomEngine.makeDefault())).pdf(k);
|
2009-04-24 11:32:04 +08:00
|
|
|
}
|
2009-04-27 23:03:50 +08:00
|
|
|
|
Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
binomial pmf to avoid the coefficient blowing up and thus
returning Infinity or NaN (or in some very strange cases
-Infinity). The log calculation compares very well, it seems
with our current method. It's in MathUtils but could stand
testing against rigorous truth data before becoming standard.
Added median calculator functions to ListUtils
getQScoreMedian is a new utility I wrote that given reads and
offsets will find the median Q score. While I was at it, I wrote
a similar method, getMedian, which will return the median of any
list of Comparables, independent of initial order. These are in
ListUtils.
Added a new poolseq directory and three walkers
CoverageAndPowerWalker is built on top of the PrintCoverage walker
and prints out the power to detect a mutant allele in a pool of
2*(number of individuals in the pool) alleles. It can be flagged
either to do this by boostrapping, or by pure math with a
probability of error based on the median Q-score. This walker
compiles, runs, and gives quite reasonable outputs that compare
visually well to the power calculation computed by Syzygy.
ArtificialPoolWalker is designed to take multiple single-sample
.bam files and create a (random) artificial pool. The coverage of
that pool is a user-defined proportion of the total coverage over
all of the input files. The output is not only a new .bam file,
but also an auxiliary file that has for each locus, the genotype
of the individuals, the confidence of that call, and that person's
representation in the artificial pool .bam at that locus. This
walker compiles and, uhh, looks pretty. Needs some testing.
AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calcuations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-26 05:27:50 +08:00
|
|
|
/**
|
|
|
|
|
* Performs the calculation for a binomial probability in logspace, preventing the blowups of the binomial coefficient
|
|
|
|
|
* consistent with moderately large n and k.
|
|
|
|
|
*
|
|
|
|
|
* @param k - number of successes/observations
|
|
|
|
|
* @param n - number of Bernoulli trials
|
|
|
|
|
* @param p - probability of success/observations
|
|
|
|
|
*
|
|
|
|
|
* @return the probability mass associated with a binomial mass function for the above configuration
|
|
|
|
|
*/
|
|
|
|
|
public static double binomialProbabilityLog(int k, int n, double p) {
|
|
|
|
|
|
|
|
|
|
double log_coef = 0.0;
|
|
|
|
|
int min;
|
|
|
|
|
int max;
|
|
|
|
|
|
|
|
|
|
if(k < n - k) {
|
|
|
|
|
min = k;
|
|
|
|
|
max = n-k;
|
|
|
|
|
} else {
|
|
|
|
|
min = n - k;
|
|
|
|
|
max = k;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
for(int i=2; i <= min; i++) {
|
|
|
|
|
log_coef -= Math.log((double)i);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
for(int i = max+1; i <= n; i++) {
|
|
|
|
|
log_coef += Math.log((double)i);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
return Math.exp(log_coef + ((double)k)*Math.log(p) + ((double)(n-k))*Math.log(1-p));
|
|
|
|
|
|
|
|
|
|
// in the future, we may want to precompile a table of exact values, where the entry (a,b) indicates
|
|
|
|
|
// the sum of log(i) from i = (1+(50*a)) to i = 50+(50*b) to increase performance on binomial coefficients
|
|
|
|
|
// which may require many sums.
|
|
|
|
|
}
|
2009-08-28 03:31:53 +08:00
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Performs the cumulative sum of binomial probabilities, where the probability calculation is done in log space.
|
|
|
|
|
* @param start - start of the cumulant sum (over hits)
|
|
|
|
|
* @param end - end of the cumulant sum (over hits)
|
|
|
|
|
* @param total - number of attempts for the number of hits
|
|
|
|
|
* @param probHit - probability of a successful hit
|
|
|
|
|
* @return - returns the cumulative probability
|
|
|
|
|
*/
|
|
|
|
|
public static double cumBinomialProbLog(int start, int end, int total, double probHit) {
|
|
|
|
|
double cumProb = 0.0;
|
|
|
|
|
for(int hits = start; hits < end; hits++) {
|
|
|
|
|
cumProb += binomialProbabilityLog(hits, total, probHit);
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
return cumProb;
|
|
|
|
|
}
|
Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
binomial pmf to avoid the coefficient blowing up and thus
returning Infinity or NaN (or in some very strange cases
-Infinity). The log calculation compares very well, it seems
with our current method. It's in MathUtils but could stand
testing against rigorous truth data before becoming standard.
Added median calculator functions to ListUtils
getQScoreMedian is a new utility I wrote that given reads and
offsets will find the median Q score. While I was at it, I wrote
a similar method, getMedian, which will return the median of any
list of Comparables, independent of initial order. These are in
ListUtils.
Added a new poolseq directory and three walkers
CoverageAndPowerWalker is built on top of the PrintCoverage walker
and prints out the power to detect a mutant allele in a pool of
2*(number of individuals in the pool) alleles. It can be flagged
either to do this by boostrapping, or by pure math with a
probability of error based on the median Q-score. This walker
compiles, runs, and gives quite reasonable outputs that compare
visually well to the power calculation computed by Syzygy.
ArtificialPoolWalker is designed to take multiple single-sample
.bam files and create a (random) artificial pool. The coverage of
that pool is a user-defined proportion of the total coverage over
all of the input files. The output is not only a new .bam file,
but also an auxiliary file that has for each locus, the genotype
of the individuals, the confidence of that call, and that person's
representation in the artificial pool .bam at that locus. This
walker compiles and, uhh, looks pretty. Needs some testing.
AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calcuations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-26 05:27:50 +08:00
|
|
|
|
2009-04-27 23:03:50 +08:00
|
|
|
/**
|
2009-06-06 07:34:37 +08:00
|
|
|
* Computes a multinomial. This is computed using the formula
|
2009-04-27 23:03:50 +08:00
|
|
|
*
|
2009-06-06 07:34:37 +08:00
|
|
|
* M(x1,x2,...,xk; n) = [ n! / (x1! x2! ... xk!) ]
|
2009-04-27 23:03:50 +08:00
|
|
|
*
|
2009-06-06 07:34:37 +08:00
|
|
|
* where xi represents the number of times outcome i was observed, n is the number of total observations.
|
|
|
|
|
* In this implementation, the value of n is inferred as the sum over i of xi.
|
2009-04-27 23:03:50 +08:00
|
|
|
*
|
|
|
|
|
* @param x an int[] of counts, where each element represents the number of times a certain outcome was observed
|
2009-06-06 07:34:37 +08:00
|
|
|
* @return the multinomial of the specified configuration.
|
2009-04-27 23:03:50 +08:00
|
|
|
*/
|
2009-06-06 07:34:37 +08:00
|
|
|
public static double multinomial(int[] x) {
|
2009-05-01 14:28:16 +08:00
|
|
|
// In order to avoid overflow in computing large factorials in the multinomial
|
|
|
|
|
// coefficient, we split the calculation up into the product of a bunch of
|
|
|
|
|
// binomial coefficients.
|
2009-06-06 07:34:37 +08:00
|
|
|
|
2009-05-01 14:28:16 +08:00
|
|
|
double multinomialCoefficient = 1.0;
|
2009-04-27 23:03:50 +08:00
|
|
|
|
2009-05-01 14:28:16 +08:00
|
|
|
for (int i = 0; i < x.length; i++) {
|
|
|
|
|
int n = 0;
|
|
|
|
|
for (int j = 0; j <= i; j++) { n += x[j]; }
|
|
|
|
|
|
|
|
|
|
double multinomialTerm = Arithmetic.binomial(n, x[i]);
|
|
|
|
|
multinomialCoefficient *= multinomialTerm;
|
|
|
|
|
}
|
|
|
|
|
|
2009-06-06 07:34:37 +08:00
|
|
|
return multinomialCoefficient;
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* Computes a multinomial probability. This is computed using the formula
|
|
|
|
|
*
|
|
|
|
|
* M(x1,x2,...,xk; n; p1,p2,...,pk) = [ n! / (x1! x2! ... xk!) ] (p1^x1)(p2^x2)(...)(pk^xk)
|
|
|
|
|
*
|
|
|
|
|
* where xi represents the number of times outcome i was observed, n is the number of total observations, and
|
|
|
|
|
* pi represents the probability of the i-th outcome to occur. In this implementation, the value of n is
|
|
|
|
|
* inferred as the sum over i of xi.
|
|
|
|
|
*
|
|
|
|
|
* @param x an int[] of counts, where each element represents the number of times a certain outcome was observed
|
|
|
|
|
* @param p a double[] of probabilities, where each element represents the probability a given outcome can occur
|
|
|
|
|
* @return the multinomial probability of the specified configuration.
|
|
|
|
|
*/
|
|
|
|
|
public static double multinomialProbability(int[] x, double[] p) {
|
|
|
|
|
// In order to avoid overflow in computing large factorials in the multinomial
|
|
|
|
|
// coefficient, we split the calculation up into the product of a bunch of
|
|
|
|
|
// binomial coefficients.
|
|
|
|
|
double multinomialCoefficient = multinomial(x);
|
|
|
|
|
|
2009-05-01 14:28:16 +08:00
|
|
|
double probs = 1.0, totalprob = 0.0;
|
2009-04-27 23:03:50 +08:00
|
|
|
for (int obsCountsIndex = 0; obsCountsIndex < x.length; obsCountsIndex++) {
|
|
|
|
|
probs *= Math.pow(p[obsCountsIndex], x[obsCountsIndex]);
|
|
|
|
|
totalprob += p[obsCountsIndex];
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
assert(MathUtils.compareDoubles(totalprob, 1.0, 0.01) == 0);
|
|
|
|
|
|
2009-05-01 14:28:16 +08:00
|
|
|
return multinomialCoefficient*probs;
|
2009-04-27 23:03:50 +08:00
|
|
|
}
|
2009-08-04 01:00:20 +08:00
|
|
|
|
|
|
|
|
/**
|
|
|
|
|
* calculate the Root Mean Square of an array of integers
|
|
|
|
|
* @param x an int[] of numbers
|
|
|
|
|
* @return the RMS of the specified numbers.
|
|
|
|
|
*/
|
|
|
|
|
public static double rms(int[] x) {
|
|
|
|
|
if ( x.length == 0 )
|
|
|
|
|
return 0.0;
|
|
|
|
|
|
|
|
|
|
double rms = 0.0;
|
|
|
|
|
for (int i : x)
|
|
|
|
|
rms += i * i;
|
|
|
|
|
rms /= x.length;
|
|
|
|
|
return Math.sqrt(rms);
|
|
|
|
|
}
|
Added binomProbabilityLog(int k, int n, double p) to MathUtils:
binomialProbabilityLog uses a log-space calculation of the
binomial pmf to avoid the coefficient blowing up and thus
returning Infinity or NaN (or in some very strange cases
-Infinity). The log calculation compares very well, it seems
with our current method. It's in MathUtils but could stand
testing against rigorous truth data before becoming standard.
Added median calculator functions to ListUtils
getQScoreMedian is a new utility I wrote that given reads and
offsets will find the median Q score. While I was at it, I wrote
a similar method, getMedian, which will return the median of any
list of Comparables, independent of initial order. These are in
ListUtils.
Added a new poolseq directory and three walkers
CoverageAndPowerWalker is built on top of the PrintCoverage walker
and prints out the power to detect a mutant allele in a pool of
2*(number of individuals in the pool) alleles. It can be flagged
either to do this by boostrapping, or by pure math with a
probability of error based on the median Q-score. This walker
compiles, runs, and gives quite reasonable outputs that compare
visually well to the power calculation computed by Syzygy.
ArtificialPoolWalker is designed to take multiple single-sample
.bam files and create a (random) artificial pool. The coverage of
that pool is a user-defined proportion of the total coverage over
all of the input files. The output is not only a new .bam file,
but also an auxiliary file that has for each locus, the genotype
of the individuals, the confidence of that call, and that person's
representation in the artificial pool .bam at that locus. This
walker compiles and, uhh, looks pretty. Needs some testing.
AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power
calcuations (e.g. from Syzygy) and print them to the output file as well for direct
downstream comparisons.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a
2009-08-26 05:27:50 +08:00
|
|
|
|
|
|
|
|
|
2009-04-24 01:45:19 +08:00
|
|
|
}
|