gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Guillermo del Angel	08f7d47d7c	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-28 07:42:09 -04:00
Mark DePristo	12aa72f200	Merged bug fix from Stable into Unstable	2012-03-27 22:43:00 -04:00
Mark DePristo	979a84a252	Bugfix for thread unsafe PL cache -- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic -- Solution is to use a fixed cache that's never updated on the fly. My changes limit us to having no more than 500 alleles at a site, which I hope is ok but easy enough to up to a ridiculously large number.	2012-03-27 22:42:30 -04:00
Guillermo del Angel	8f34412fb8	First Pool Caller exact model: silly straightforward math implementation of biallelic pool caller exact likelihood model, no attempt at any smartness or optimization, no support yet for generalized multiallelic form, just hooking up for testing	2012-03-27 20:59:44 -04:00
Guillermo del Angel	ed322bd73f	Fix again merge issues	2012-03-27 15:03:13 -04:00
Guillermo del Angel	b4a7c0d98d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-27 15:01:03 -04:00
Guillermo del Angel	343a061b1c	Fix merge issues when incorporating new AF calculations changes	2012-03-27 15:00:44 -04:00
Mauricio Carneiro	1b75663178	BQSR Gatherer implementation and integration tests * restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers * optimized empirical qual calculation when merging recalibration reports * centralized the quality score quantization functionalities * unified the creating/loading of all the key manager/hash table structures. * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing) * added integration tests for BQSR and on-the-fly recalibration	2012-03-27 13:50:22 -05:00
Ryan Poplin	5dbd3625cd	Initial algorithm for choosing best alternate haplotypes to genotype based on the likelihoods from all samples instead of choosing for each sample independently. Simple tradeoff of penalty for increasing model complexity and likelihood of the data.	2012-03-27 13:38:52 -04:00
Eric Banks	c112e0824a	I was adding verbose output to the Pileup output for a one-off and decided that I might as well commit it as an option. Updated deprecated calls while I was in there.	2012-03-27 11:09:03 -05:00
Mark DePristo	a638996fe2	Cleanup of VariantEval, diatribe about performance problems with StateKey -- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear -- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.	2012-03-27 11:56:24 -04:00
Mark DePristo	679bb03014	Simple utility function for converting an Iterable<T> to Collection<T>	2012-03-27 11:54:58 -04:00
Mark DePristo	1f5f737c8b	Optimizing the GATKReportTable.write -- Better iteration, caching of strings, better printf calls, to improve the writing performance of GATKReportTables	2012-03-27 11:54:35 -04:00
Mark DePristo	913c8b231f	Fix ErrorRatePerCycle to overload equals and hashcode -- Fixes failing integration tests	2012-03-27 10:35:32 -04:00
Eric Banks	c07a577ba3	Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously).	2012-03-27 00:27:44 -05:00
Guillermo del Angel	e8bb8ade1a	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-26 16:42:03 -04:00
Guillermo del Angel	1a2a4848e8	Added integration test for ValidationSiteSelector, correct MD5's	2012-03-26 16:39:55 -04:00
Mark DePristo	34ea443cdb	Better algorithm for choosing which indel alleles are present in samples -- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors. -- This breakdown is producing spurious clustered indels (lots of these!) around real common indels -- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the count towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel-containing read, and it's counted. -- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale. -- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP	2012-03-26 16:28:49 -04:00
Mark DePristo	11b6fd990a	GATKReportColumn optimizations -- Was TreeMap even though the sorting wasn't used. Replaced with LinkedHashMap.	2012-03-26 16:28:49 -04:00
Mark DePristo	6be5e82860	VariantEval scalability optimizations -- StateKey no longer extends TreeMap. It's now a final immutable data structure that caches its toString and hashcode values. TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted tostring function. -- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils -- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects. Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE -- VariantEvaluator base class has a cached getSimpleName() function -- VEUtils: general cleanup due to refactoring of StateKey -- VEWalker: much better iteration of map data structures. If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet. This is far better than iterating over the keys and calling get() on each key.	2012-03-26 16:28:48 -04:00
Guillermo del Angel	1c424c0daf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-26 15:15:50 -04:00
Ryan Poplin	019145175b	Major optimizations to graph construction through better use of built in graph.containsVertex and vertex.equals methods. Minor optimizations to MathUtils.approximateLog10SumLog10 method	2012-03-26 11:32:44 -04:00
Ryan Poplin	1fa66f76c9	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-25 23:04:47 -04:00
Guillermo del Angel	ce617b2dfc	Bug fix to previous UnifiedGenotyperEngine refactoring, removed debug code	2012-03-25 10:20:21 -04:00
Guillermo del Angel	db54c2625f	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-25 09:53:35 -04:00
Guillermo del Angel	deb4586559	Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGenotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1	2012-03-24 21:49:43 -04:00
Mark DePristo	b063bcd38d	Removing update0 support in VariantEval -- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations. -- This allows us to avoid calling update0 at every genomic base in 100ks of evaluates when there are a lot of stratifications. -- No need to modify the integration tests, this optimization doesn't change the result of the calculation	2012-03-23 21:02:21 -04:00
Mauricio Carneiro	0509d316d9	More information in the recalibration report * added empirical quality counts to allow quantization during on-the-fly recalibration to any level * added number of observations and errors to all tables to enable plotting of all covariates	2012-03-23 16:15:19 -04:00
Mauricio Carneiro	9f74969e3a	BQSR with GATKReport implementation * restructured BQSR to report recalibrated tables. * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration) * linked quality score quantization to the BQSR stage, outputting a quantization histogram * included the arguments used in BQSR to the GATK Report * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities On-the-fly recalibration with GATK Report * loads all tables from the GATKReport using existing infrastructure (with minor updates) * implemented initialization of the covariates using BQSR's argument list * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key * applied quality quantization to the base recalibration * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions	2012-03-23 15:42:32 -04:00
Mauricio Carneiro	f421062b55	Updated read group covariate to use sample.lane instead of the id Added Unit test.	2012-03-23 15:24:07 -04:00
Mauricio Carneiro	539da9e3e1	Fixing GATKReport exception handling when loading a report * allowing tables with no description to go through * GATKReportTable should be more lenient with the format requirements (added to-dos for roger)	2012-03-23 15:23:13 -04:00
Eric Banks	2511839068	Merged bug fix from Stable into Unstable	2012-03-23 13:51:33 -04:00
Eric Banks	d3f2bc4361	Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles.	2012-03-23 13:51:00 -04:00
Mark DePristo	e4ec90cfce	Merged bug fix from Stable into Unstable	2012-03-23 11:27:34 -04:00
Mark DePristo	ff26f2bf68	HierarchicalMicroScheduler no longer attempts to wrap exceptions -- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors. Now HMS simply checks whether the throwable it received on error was a RuntimeException. If so, it is stored and rethrown without wrapping later. If it isn't, only in this case is the exception wrapped in a ReviewedStingException. -- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line -- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled. -- I believe this will finally put to rest all of these annoying HMS captures.	2012-03-23 11:27:21 -04:00
Ryan Poplin	9d22471b79	Merged bug fix from Stable into Unstable	2012-03-23 10:48:34 -04:00
Ryan Poplin	ab288354e9	Better error message for malformed input recal file.	2012-03-23 10:47:01 -04:00
Mark DePristo	fee8d86f63	VariantEval optimization -- Use a LinkedHashMap not a TreeMap so iteration is faster. -- Note that with a lot of stratifications the update0 is taking up a lot of time. For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call	2012-03-22 22:13:24 -04:00
Mark DePristo	6df96644d9	Unified, standard IndelSummary metrics for VariantEval -- Now you always get SNP and indel metrics with VariantEval! -- Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions -- Updated VE integration tests as appropriate	2012-03-22 21:24:37 -04:00
Mark DePristo	bcf80cc7b3	Cleanup in VariantEval. Example of molten VariantEval output -- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvaluator.java so everyone can share. Code updated to use these routines where appropriate -- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF -- TableType, which used to be an interface, is now an abstract class, allowing us to implement some general functionality and avoid duplication. -- This included creating a getRowName() function that used to be hardcoded as "row" but now can be overridden. -- This allows us to implement molten tables, which are vastly easier to use than multi-row data sets. See IndelHistogram class (in later commit) for example of molten VE output	2012-03-22 21:24:37 -04:00
Mark DePristo	9ddd5aec93	More eval modules being removed from VariantEval -- IndelStatistics is superseded by IndelStatistics	2012-03-22 21:24:36 -04:00
Mark DePristo	bd5b6d1aba	Remove no longer in use Eval modules from VariantEval -- No more IndelLengthHistogram (superseded by IndelSummary in subsequent commit) -- No more SamplePreviousGenotypes or PhaseStats -- No more MultiallelicAFs	2012-03-22 21:24:36 -04:00
Menachem Fromer	7faa9938b1	Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-22 17:43:44 -04:00
Menachem Fromer	b9b9219ac7	Added respectPhaseInInput flag to RBP and integration tests	2012-03-22 17:40:21 -04:00
Guillermo del Angel	f198cec5e2	Temp commit: new structure for pool caller, now all work is in the same framework as in UG. There's a new genotype calculation model, PoolGenotypeCalculationModel, that does all the work and plugs into UnifiedGenotyperEngine. A new AF module for pools is upcoming. Old pool caller will be removed once all work is migrated	2012-03-22 15:46:39 -04:00
Menachem Fromer	1dfaacfeb5	Check for consistency of the BAM and VCF sample names, with a command line disable to throw if you know what you are doing	2012-03-22 12:40:15 -04:00
Guillermo del Angel	b02ef95bcf	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-22 12:14:12 -04:00
Guillermo del Angel	92676c63ca	Make constructor of IndelGenotypeLikelihoodsCalculationModel public so it can be used in unit tests	2012-03-22 12:13:59 -04:00
Guillermo del Angel	58965d6a6e	Merged bug fix from Stable into Unstable	2012-03-22 11:04:11 -04:00
Guillermo del Angel	b8cd959461	Potential corner condition bug fix: protect against null pointer exceptions when computing consensus indel bases when UG is discovering alt alleles. If an alt allele has non-standard bases, skip allele gracefully instead of adding null object into list	2012-03-22 10:06:22 -04:00
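
Commit 6be5e82860 above recommends iterating over key/value pairs with Map.Entry and entrySet rather than iterating over keys and calling get() on each one. The following is a minimal sketch of that contrast only, not the VariantEval code itself; the class name EntrySetIteration, both method names, and the sample map contents are hypothetical and chosen purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of the GATK code base.
public class EntrySetIteration {

    // Pattern the commit message advises against: every key
    // triggers a second lookup through get().
    static long sumViaKeySet(Map<String, Integer> counts) {
        long total = 0;
        for (String key : counts.keySet()) {
            total += counts.get(key);
        }
        return total;
    }

    // Pattern the commit message recommends: each Map.Entry already
    // carries its value, so no extra lookup is needed.
    static long sumViaEntrySet(Map<String, Integer> counts) {
        long total = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order without the sorting cost of a TreeMap.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("SNP", 3);
        counts.put("INDEL", 2);
        System.out.println(sumViaKeySet(counts) + " " + sumViaEntrySet(counts));
    }
}
```

Both methods return the same total; the difference is only in how many map lookups are performed, which matters when, as the commit messages describe, the map is walked at very many sites.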


1794 Commits (08f7d47d7c45566e22c45bd6e436e433fc0dc352)