gatk-3.8

Commit Graph

Author	SHA1	Message	Date
rpoplin	3224bbe750	New visualization output for VQSR. It creates the R script file on the fly and then runs Rscript on it. Adding 1000G Project consensus code. First pass of having VQSR work with missing data by marginalizing over the missing dimension for that data point (thanks Chris and Bob for ideas). Updated math functions to use apache math commons instead of approximations from wikipedia. New parameters available for the priors based on further reading in Bishop and looking at the new visualizations. Updated integration test to use more modern files. Updated MDCP to use new best practices w.r.t. annotations. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5723 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-02 19:14:42 +00:00
delangel	59dd79faab	One more optimization: don't use Math.round(), but do my own rounding/casting. UG now about 40% faster calling indels, 30-35% faster calling snp's+indels simultaneously. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5667 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 19:15:58 +00:00
delangel	246d8190b5	Round one of "easy" zero-effort optimizations to UG's indel caller. Mostly inline functions, avoid repeated computation and try to optimize SoftMaxPair() which is by far the biggest runtime hog. More to come... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5666 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 18:57:34 +00:00
chartl	efe6c539ac	Re-enabling disabled test. Apparently T-tests are very picky about your using an unbiased variance. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5622 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 03:05:50 +00:00
chartl	8125b8b901	Old changes to the exome VQSR search. SGA updated to include new proportion-based insert size test. Major fix for dichotomization test: MathUtils now optionally ignores NaN values for sums, averages, variances. In the future this feature can be pushed back into the AssociationContext object itself (e.g. no data? no entry), but it's kept like this for transparency for now. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5618 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-12 23:00:50 +00:00
ebanks	af09170167	As I threatened yesterday, I've moved the various and disparate randomization code out of the walkers. Now they all (except VQSRv1, whose days are numbered anyways) use a static generator available in the engine itself. Please use this from now on. The seed is reset before every individual integration test is run. I think there may still be an issue with the IndelRealigner but I need to confirm with the commit to see what testNG does. Integration tests are already broken anyways, so no big deal. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5589 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-07 17:03:48 +00:00
depristo	c1798a7dbc	Whitespace cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5460 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-17 18:04:08 +00:00
depristo	ccc773d175	Refactoring, cleanup, and performance improvements to ProduceBeagleInput. It's really a shame that there's no integration tests... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5418 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-11 13:55:30 +00:00
rpoplin	509daac9f7	Minor bug fix in k-means implementation. Updating VQSR integration tests in preparation for VQSRv2 by removing some unused features such as VariantDatum.weight and ti/tv cutting. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5410 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-09 00:26:28 +00:00
delangel	8c262eb605	Initial commit of new likelihood model to evaluate indel quality. Principle is simple, a plain Pair HMM with affine gap penalties (in log space) that does quasi-local alignment between reads and candidate haplotypes and which in theory should be more solid and more reliable than the older Dindel-based model. It is also easy to extend in the future if we decide to introduce either context-dependent and/or read-dependent gap penalties. Model is disabled by default and we're still using the old Dindel model until I'm more confident that new model is a definitive improvement, so right now this is enabled by hidden command line arguments, and it's not to be used yet. In detail: a) Several refactorings to share softMax() available to other modules, so it's now part of MathUtils. b) Refactored a couple of read utilities and moved from BAQ to ReadUtils. c) New PairHMMIndelErrorModel class implementing new likelihood model d) Several new hidden debug arguments in UAC. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5389 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-07 15:31:58 +00:00
depristo	ad51f30244	A trivial, but useful, sum of a list of integers git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5378 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-05 06:09:05 +00:00
chartl	9ca1dd5d62	Miscellaneous changes: - RefMetaDataTracker: grabbing variant contexts given a prefix (not sure where else this was implemented, if someone can show me I'll remove it) - VCFUtils: grabbing VCF headers given a prefix - MathUtils: Useful functions for calculating statistics on collections of Numbers - VariantAnnotator: Made isUniqueHeaderLine a public static method -- maybe this should go into a different class. Not sure. - Associations: PluginManager now used to propagate classes, implementations for Z,T,U tests, slight alteration to format to make the objects stored in the window optionally different from those returned by whatever statistic is run across the window Added: - MannWhitneyU. Started to fix up WilcoxonRankSum but there are comments in there questioning the validity of some of the code, and I'm sure that it's actually doing a U test. This implementation includes the direct calculation of p-values for small sample sizes, and a uniform approximation for when one of the sample sets is small, and the other large. Unit tests to follow. - BootstrapCallsMerger: takes n VCFs which have been called on the same samples; merges them together while averaging the annotations - BootstrapCalls.q: qscript for testing the effectiveness of bootstrap low-pass calling on the exome git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5372 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-03 22:43:36 +00:00
asivache	570186fa42	Added (deep) clone() and merge() to the RunningAverage utility class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5350 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-02 00:35:23 +00:00
chartl	0723b0f44c	Generalized association is now working. Output is in a horrific format. Implementation of T-testing. Improvements are to look for classes dynamically (a la VariantEval/VariantAnnotator), beautify output, and do optimizations where they exist. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5341 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-01 01:23:37 +00:00
ebanks	bb6999b032	Better documentation git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5057 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-23 03:36:09 +00:00
rpoplin	23dbc5ccf3	HaplotypeScore is revamped. It now uses reads' Cigar strings when building the haplotype blocks to skip over soft-clipped bases and factor in insertions and deletions. The statistic now uses only the reads from the filtered context to build the haplotypes but it scores all reads against the two best haplotypes. The score is now computed individually for each sample's reads and then averaged together. Bug fixes throughout. The math for the base quality and mapping quality rank sum tests is fixed. The annotations remain as ExperimentalAnnotations pending more investigation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4934 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-05 00:28:05 +00:00
depristo	a5b3aac864	Engine-level BAQ calculation now available in the GATK [totally experimental right now]. -baq argument to disable (NONE), to only use the tags in the BAM (USE_TAG_ONLY), use the tag when present but calculate on the fly as necessary (CALCULATE_AS_NECESSARY), and to always recalculate (RECALCULATE_ALWAYS). BAQ.java contains the complete implementation, for those interested. ValidateBAQWalker is a useful QC tool for verifying the BAQ is correct. BAQSamIterator applies BAQ to reads, as needed, in the engine. Let me know if you encounter any problems. Before prime-time, needs a caching implementation of IndexedFastaReader to avoid loading lots of reference data all of the time git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4787 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-04 20:23:06 +00:00
depristo	ef2f6d90d2	VQSR now operates on LOD scores in the INFO field directly, and doesn't adjust the QUAL field. New format for tranches file uses LOD score. Old file format no longer supported. log10sumlog10() function, a very useful utility in MathUtils. No more ExtendedPileupElement! Robust math calculations in GMM so that no infinities are generated! HaplotypeScore refactored to enable use of filtered context. Not yet enabled... InferredContext getDouble and getInteger arguments now parse values from Strings if necessary git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4684 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-15 22:19:22 +00:00
depristo	4759fdd2ac	V1 of read and variant simulator and assessor. SimulateReadsForVariants generates BAM and VCF with given combinations of variant and read properties. AssessSimulatedPerformance produces a table suitable for analysis in R git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4637 348d0f76-0448-11de-a6fe-93d51630548a	2010-11-08 21:01:33 +00:00
depristo	23cb399a88	Reasonable first pass at a correct SB calculation. Simple utilities to support it. VariantsToTable no longer prints filtered sites by default. New non-standard variant eval module to print comp sites not present in eval (FN finder) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4601 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-31 12:41:52 +00:00
ebanks	fe3cfb067c	very minor cleanup git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4590 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-28 02:11:33 +00:00
depristo	cbce3e3c83	General support for both GL (log10) and PL (phred-scaled) genotype likelihoods. All walkers now use the Tribble GenotypeLikelihoods object for parsing VCFs with genotype likelihood fields. Please use GenotypeLikelihoods object from now on for seamless support for GL and PL tags. UGv2 now uses PL by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4589 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-28 01:48:47 +00:00
depristo	7eeabe534a	QSample walker for 1KG -- measures aggregate quality of sequencing. Includes misc. improvements throughout the code, including using the new Tribble GenotypeLikelihoods class for working with VCF GLs from the UG git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4211 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-03 18:21:43 +00:00
asivache	d53d5ffbf6	A utility class that computes running average and standard deviation for a stream of numbers it is being fed with. Updates mean/stddev on the fly and does not cache the observations, so it uses no memory and also should be stable against overflow/loss of precision. Simple unit test is also provided (does not stress-test the engine with millions of numbers though). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3944 348d0f76-0448-11de-a6fe-93d51630548a	2010-08-04 21:39:02 +00:00
depristo	504103bd15	Misc. additions to correct utilities git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3329 348d0f76-0448-11de-a6fe-93d51630548a	2010-05-07 21:34:18 +00:00
hanna	c1e53d407d	The copyright tag that I copied/pasted from a LaTeX document into IntelliJ had unicode quote characters embedded in it. These characters were invisible inside IntelliJ but cause compile warnings for Ryan and Aaron, who for whatever reason have a different default charset. Fixed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3203 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-20 15:26:32 +00:00
hanna	1bc26f69e9	An attempt to cleanup the Utils directory. Email to follow. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3198 348d0f76-0448-11de-a6fe-93d51630548a	2010-04-19 23:00:08 +00:00
depristo	b8ab74a6dc	Minor useful changes to BaseUtils and MathUtils to support a new haplotype score annotation that determines the two most likely haplotypes over an interval and scores variants by their consistency with a diploid model. Appears to be useful. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3085 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-28 21:45:22 +00:00
depristo	56092a0fc2	Slight cleanup for mathutils git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3042 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-19 13:18:08 +00:00
chartl	b42fc905e8	Added - new tests (Hapmap was re-added) Modified - Hapmap now takes a -q command to filter out variants by quality Modified - MathUtils - cumBinomialProbLog now uses BigDecimal to handle some numerical imprecisions Modified - PowerBelowFrequency - returns 0.0 if called with a negative number (can't be done from inside the walker itself, but since it's called elsewhere one can't be too careful) git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2350 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-14 21:57:20 +00:00
ebanks	4558375575	Stage 1 of the VariantFiltration refactoring is now complete. There now exists a parallel tool called VariantAnnotator which simply takes variant calls and annotates them with the same type of data that we used to use for filtering (e.g. DoC, allele balance). The output is a VCF with the INFO field appropriately annotated. VariantAnnotator can be called as a standalone walker or by another walker, as it is by the UnifiedGenotyper. UG now no longer computes any of this meta data - it relegates the task completely to the annotator (assuming the output format accepts it). This is a fairly all-encompassing check in. It involves changes to all of the UG code, bug fixes to much of the VCF code as things popped up, and other changes throughout. All integration tests pass and I've tediously confirmed that the annotation values are correct, but this framework could use some more rigorous testing. Stage 2 of the process will happen later this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2053 348d0f76-0448-11de-a6fe-93d51630548a	2009-11-16 02:41:20 +00:00
chartl	8740124cda	@ListUtils - Bugfix in getQScoreOrderStatistic: method would attempt to access an empty list fed into it. Now it checks for null pointers and returns 0. @MathUtils - added a new method: cumBinomialProbLog which calculates a cumulant from any start point to any end point using the BinomProbabilityLog calculation. @PoolUtils - added a new utility class specifically for items related to pooled sequencing. A major part of the power calculation is now to calculate powers independently by read direction. The only method in this class (currently) takes your reads and offsets, and splits them into two groups by read direction. @CoverageAndPowerWalker - completely rewritten to split coverage, median qualities, and power by read direction. Makes use of cumBinomialProbLog rather than doing that calculation within the object itself. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1462 348d0f76-0448-11de-a6fe-93d51630548a	2009-08-27 19:31:53 +00:00
chartl	92ea947c33	Added binomProbabilityLog(int k, int n, double p) to MathUtils: binomialProbabilityLog uses a log-space calculation of the binomial pmf to avoid the coefficient blowing up and thus returning Infinity or NaN (or in some very strange cases -Infinity). The log calculation compares very well, it seems, with our current method. It's in MathUtils but could stand testing against rigorous truth data before becoming standard. Added median calculator functions to ListUtils: getQScoreMedian is a new utility I wrote that given reads and offsets will find the median Q score. While I was at it, I wrote a similar method, getMedian, which will return the median of any list of Comparables, independent of initial order. These are in ListUtils. Added a new poolseq directory and three walkers: CoverageAndPowerWalker is built on top of the PrintCoverage walker and prints out the power to detect a mutant allele in a pool of 2*(number of individuals in the pool) alleles. It can be flagged either to do this by bootstrapping, or by pure math with a probability of error based on the median Q-score. This walker compiles, runs, and gives quite reasonable outputs that compare visually well to the power calculation computed by Syzygy. ArtificialPoolWalker is designed to take multiple single-sample .bam files and create a (random) artificial pool. The coverage of that pool is a user-defined proportion of the total coverage over all of the input files. The output is not only a new .bam file, but also an auxiliary file that has for each locus, the genotype of the individuals, the confidence of that call, and that person's representation in the artificial pool .bam at that locus. This walker compiles and, uhh, looks pretty. Needs some testing. AnalyzePowerWalker extends CoverageAndPowerWalker so that it can read previous power calculations (e.g. from Syzygy) and print them to the output file as well for direct downstream comparisons. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1460 348d0f76-0448-11de-a6fe-93d51630548a	2009-08-25 21:27:50 +00:00
depristo	5487ab0ee6	Added several useful routines to MathUtils for summing and bounds checking of doubles git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1379 348d0f76-0448-11de-a6fe-93d51630548a	2009-08-05 00:41:31 +00:00
ebanks	e3b08f245f	Pull out RMS calculation into MathUtils for all to use git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@1364 348d0f76-0448-11de-a6fe-93d51630548a	2009-08-03 17:00:20 +00:00
depristo	819862e04e	Major restructuring of generalized variant analysis framework. Now trivially easy to add additional analyses. Easy partitioning of all analyses by features, such as singleton status. Now has transition/transversion bias, counting, dbSNP coverage, HWE violation, selecting of variants by presence/absence in dbs. Also restructured the ROD system to make it easier to add tracks. Also, added the interval track -- if you provide an interval list, then the system automatically makes this available to you as a bound rod -- you can always find out where you are in the interval at every site. Python scripts improved to handle more merging, etc, into population snps. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@918 348d0f76-0448-11de-a6fe-93d51630548a	2009-06-05 23:34:37 +00:00
kiran	16467ae7cf	A better (less overflow-y) implementation of multinomialProbability(). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@579 348d0f76-0448-11de-a6fe-93d51630548a	2009-05-01 06:28:16 +00:00
kiran	b9c9dbb1d7	Added multinomialProbability method. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@545 348d0f76-0448-11de-a6fe-93d51630548a	2009-04-27 15:03:50 +00:00
jmaguire	dd408a2a9a	First draft of actual pooled EM caller. Produces sane looking output on region of 1kG pilot1: CALL NA12813.SRP000031.2009_02.bam CC 0.609084 0.609084 CALL NA12003.SRP000031.2009_02.bam CC 2.114234 2.114234 CCCCC CALL NA06994.SRP000031.2009_02.bam CC 0.910114 0.910114 C CALL NA18940.SRP000031.2009_02.bam CT 2.589749 0.910114 T CALL NA18555.SRP000031.2009_02.bam CC 0.609084 0.609084 Next up, eval vs. Baseline pilot1 calls and pilot3 deep-coverage truth. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@525 348d0f76-0448-11de-a6fe-93d51630548a	2009-04-24 13:42:15 +00:00
kiran	3cda85f2e3	New implementation of binomial probability that accurately computes values down to around 1e-237. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@520 348d0f76-0448-11de-a6fe-93d51630548a	2009-04-24 03:32:04 +00:00
kiran	77e1e9e2f1	Added a static class to house useful math methods. All this has at the moment are methods for comparing doubles and floats, but I suggest that the bulk of our little math methods should be added here to avoid filling up Utils.java with so much random stuff. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@505 348d0f76-0448-11de-a6fe-93d51630548a	2009-04-23 17:45:19 +00:00

41 Commits (84dd72e6cb5d4cfd4a5031a75cda250142530d75)
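
Two of the commit messages above describe log-space arithmetic: 92ea947c33 mentions a log-space calculation of the binomial pmf that avoids the coefficient blowing up, and ef2f6d90d2 mentions a log10sumlog10() utility. The sketch below is only an illustration of those two ideas, not the code from either commit. The class name, method names, and bodies are assumptions made for this example; the log commits only name binomProbabilityLog(int k, int n, double p) and log10sumlog10().

```java
/**
 * Illustrative sketch only: hypothetical helpers, not the GATK MathUtils code.
 */
public final class LogSpaceSketch {
    private LogSpaceSketch() {}

    /** log10(n!) as a sum of logs, so the factorial itself is never formed. */
    public static double log10Factorial(int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log10(i);
        }
        return sum;
    }

    /** log10 of P(X = k) for X ~ Binomial(n, p), restricted here to 0 < p < 1. */
    public static double log10BinomialProbability(int k, int n, double p) {
        if (k < 0 || k > n || p <= 0.0 || p >= 1.0) {
            throw new IllegalArgumentException("need 0 <= k <= n and 0 < p < 1");
        }
        // log10 of the binomial coefficient n! / (k! (n-k)!)
        double log10Coefficient =
                log10Factorial(n) - log10Factorial(k) - log10Factorial(n - k);
        return log10Coefficient + k * Math.log10(p) + (n - k) * Math.log10(1.0 - p);
    }

    /** log10 of the sum of 10^v over the input values, shifted by the maximum for stability. */
    public static double log10SumLog10(double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return max; // empty input, or every term is log10(0)
        }
        double sum = 0.0;
        for (double v : log10Values) {
            sum += Math.pow(10.0, v - max); // each term is at most 1, so no overflow
        }
        return max + Math.log10(sum);
    }

    public static void main(String[] args) {
        // Small case that can be checked by hand: C(4,2) * 0.5^4 = 0.375
        System.out.println(Math.pow(10.0, log10BinomialProbability(2, 4, 0.5)));
        // Large case where the coefficient would overflow a double if formed directly
        System.out.println(log10BinomialProbability(5000, 10000, 0.5));
        // Two tiny probabilities whose direct sum would underflow to zero
        System.out.println(log10SumLog10(new double[] {-1000.0, -1000.0}));
    }
}
```

Working in log10 space keeps every intermediate value in a range a double can represent. The cost is a loop over n for each factorial, which a cached table or a log-gamma function could replace.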