gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	2aa2d9aec0	Merged bug fix from Stable into Unstable	2012-04-13 09:25:43 -04:00
Mark DePristo	27e7e17dc7	New way to handle exceptions in multi-threaded GATK -- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now. -- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer -- Better printing of stack traces in WalkerTest	2012-04-13 09:23:33 -04:00
Mark DePristo	e85e9a8cf5	More extensive testing of type of error thrown in multi-threaded walker test -- Unfortunately the result of the multi-threaded test is non-deterministic so run the test 10 times to see if the right exception is always thrown -- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs	2012-04-13 09:23:33 -04:00
Eric Banks	297afc7911	Added unit test to ensure that we genotype correctly cases with really large GLs	2012-04-12 15:43:14 -04:00
Eric Banks	5b7da3831f	Not sure why this didn't make it into the last push, but here's a working MD5 for the NDA annotation in UG	2012-04-11 13:49:50 -04:00
Eric Banks	dc90508104	Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful.	2012-04-11 13:47:10 -04:00
Eric Banks	d2142c3aa7	Adding integration test for Flag Stat	2012-04-10 22:40:38 -04:00
Ryan Poplin	1df0adf862	Fixing ActivityProfile unit test.	2012-04-10 15:28:27 -04:00
Ryan Poplin	e3cc7cc59c	Resolving merge conflict.	2012-04-10 14:50:27 -04:00
Ryan Poplin	a4634624b7	There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function.	2012-04-10 14:48:23 -04:00
Eric Banks	10e74a71eb	We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior.	2012-04-10 12:30:35 -04:00
Eric Banks	f82986ee62	Adding unit tests for the very important log10sumLog10 util method.	2012-04-09 14:28:25 -04:00
Mauricio Carneiro	87e6bea6c1	Adding engine capability to quantize qualities. * Added parameter -qq to quantize qualities using a recalibration report * Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization. * Updated BQSR scripts to make use of the new parameters	2012-04-08 21:07:51 -04:00
Mark DePristo	c22a66870c	Modified UnitTests to respect reference padding	2012-04-06 16:27:20 -04:00
Mark DePristo	45fc0ea98d	Improvements to indel analysis capabilities of VariantEval -- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites -- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly: // - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a // downstream frameshift, if we make the simplifying assumptions that 3 bp ins // and 3bp del (adding/subtracting 1 AA in general) are roughly comparably // selected against, we should see a consistent 1+2 : 3 bp ratio for insertions // as for deletions, and certainly would expect consistency between in/dels that // multiple methods find and in/dels that are unique to one method (since deletions // are more common and the artifacts differ, it is probably worth looking at the totals, // overlaps and ratios for insertions and deletions separately in the methods // comparison and in this case don't even need to make the simplifying in = del functional assumption -- Added a new VEW argument to bind a gold standard track -- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do -- Deleted random unused functions in IndelUtils	2012-04-06 16:07:46 -04:00
Mark DePristo	52ef4a3e26	Function to compute whether a VariantContext indel is part of a TandemRepeat. Returns true iff VC is a non-complex indel where every allele represents an expansion or contraction of a series of identical bases in the reference. The logic of this function is pretty simple. Take all of the non-null alleles in VC. For each insertion allele of n bases, check if that allele matches the next n reference bases. For each deletion allele of n bases, check if this matches the reference bases at n - 2 n, as it must necessarily match the first n bases. If this test returns true for all alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the base differences between the ref and alt alleles	2012-04-06 16:07:46 -04:00
Mauricio Carneiro	7c3b3650bb	BQSR bug triage * fixed bug where some keys were using the same recal datum objects * fixed quantization qual calculations when combining multiple reports * fixed rounding error with empirical quality reported when combining reports * fixed combine routine in the gatk reports due to the primary keys being out of order * added auto-recalibration option to BQSR scala script * reduced the size of the recalibration report by ~15% * updated md5's	2012-04-05 09:32:18 -04:00
Mark DePristo	76e4100d89	By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots -- Updated integration tests as well	2012-04-04 18:48:03 -04:00
Ryan Poplin	bfad26353a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-04 16:04:50 -04:00
Ryan Poplin	dda2173c66	Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned.	2012-04-04 16:04:29 -04:00
Mark DePristo	1ccea866d8	VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses -- Updated EvalModules to work with new parameter -- adding test file for keepAC0 to public/testdata and integration tests	2012-04-04 15:37:12 -04:00
Eric Banks	337ff7887a	When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.	2012-04-04 10:57:05 -04:00
Guillermo del Angel	05d8400468	Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)	2012-04-03 20:51:24 -04:00
Guillermo del Angel	5a10f173ea	Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow)	2012-04-03 18:55:52 -04:00
Guillermo del Angel	63b1e737c6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-03 15:43:50 -04:00
Guillermo del Angel	9e11b4f9a7	Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.	2012-04-03 15:43:32 -04:00
Eric Banks	326220c91c	Removing extended event related unit tests	2012-04-02 14:40:36 -04:00
Eric Banks	99d27ddcc4	Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now.	2012-04-02 14:27:36 -04:00
Mark DePristo	4f73ea902f	Final update for VE. VCFStreaming wasn't yet updated	2012-03-30 21:52:01 -04:00
Mark DePristo	fbbb8509ad	Final commits to VariantEval -- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to. -- Cleanup code, reorganize a bit more. -- Fix for broken integrationtests	2012-03-30 20:11:06 -04:00
Mark DePristo	4b45a2c99d	Final version of new VariantEval infrastructure. * WAY FASTER * -- 3x performance for multiple sample analysis with 1000 samples -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2 -- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval -- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also. -- No longer allow Evaluators to use private and protected variables as @DataPoints. You get an error if you do. -- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter. -- Commented out GenotypePhasingEvaluator, as it uses the retired TableType -- Stratifications are all fully typed, so it's easy for GATKReports to format them. -- Removed old VE work around from GATKReportColumn -- General code cleanup throughout -- Updated integration tests	2012-03-30 15:31:56 -04:00
Mark DePristo	976bac0452	BaseTest now has a global variable to turn off network connection requirement	2012-03-30 15:31:55 -04:00
Mark DePristo	097ed4ecc4	Memory usage optimizations and safety improvements to StratNode and StratificationManager -- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable Hashmaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users. -- Added ability of a stratification to specify incompatible evaluation. The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals	2012-03-30 15:31:55 -04:00
Mark DePristo	c8086a79e3	New StratificationManager based VariantEval passes unmodified integration tests -- Now needs cleanup and optimizations	2012-03-30 15:31:55 -04:00
Mark DePristo	8971b54b21	Phase II of Stratification manager -- Renamed and reorganized infrastructure -- StratificationManager now a Map from List<Object> -> V. All key functions are implemented. Less commonly used TODO -- Ready for hookup to VE	2012-03-30 15:31:54 -04:00
Mark DePristo	9f1cd0ff66	Lots of new functionality for StratificationStates manager -- Really working according to unit tests -- A nCombination utils	2012-03-30 15:31:54 -04:00
Mark DePristo	a3d896d80e	Part I of creating a fast state space lookup for VE -- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map. -- Minor code cleanup throughout VE (removing unused headers, for example)	2012-03-30 15:31:53 -04:00
Eric Banks	6b49af253b	Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there.	2012-03-30 10:33:30 -04:00
Eric Banks	16bef191c6	UG integration tests updated. A handful of sites are lost because there are only 5 indels and one starts at the beginning of the read so it no longer passes our min threshold (now consistent with GGA), but mostly the depth changes ever so slightly once in a while between extended and normal pileups (I think the normal pileups are correct). I have looked thoroughly in IGV at ALL differences and am happy with the new results. As an aside, the AD is now calculated more accurately for indels.	2012-03-30 01:35:49 -04:00
Mauricio Carneiro	f80bd4276a	fixed estimated Q reported calculation in the gatherer	2012-03-29 12:28:43 -04:00
Guillermo del Angel	a0843f125e	Forgot to add file itself for new unit test	2012-03-28 21:08:18 -04:00
Roger Zurawicki	63cf7ec7ec	Added more primitives to GATK Report Column Type - The Integer column type now accepts byte and shorts - Updated Unit Tests and added a new testParse() test Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-28 09:07:54 -04:00
Guillermo del Angel	d2586911a4	Forgot to add tolerance to new MathUtils unit tests	2012-03-28 08:18:36 -04:00
Guillermo del Angel	b4a7c0d98d	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-27 15:01:03 -04:00
Guillermo del Angel	343a061b1c	Fix merge issues when incorporating new AF calculations changes	2012-03-27 15:00:44 -04:00
Mauricio Carneiro	1b75663178	BQSR Gatherer implementation and integration tests * restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers * optimized empirical qual calculation when merging recalibration reports * centralized the quality score quantization functionalities * unified the creating/loading of all the key manager/hash table structures. * added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing) * added integration tests for BQSR and on-the-fly recalibration	2012-03-27 13:50:22 -05:00
Eric Banks	c07a577ba3	Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously).	2012-03-27 00:27:44 -05:00
Guillermo del Angel	e8bb8ade1a	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-03-26 16:42:03 -04:00
Guillermo del Angel	1a2a4848e8	Added integration test for ValidationSiteSelector, correct MD5's	2012-03-26 16:39:55 -04:00
Mark DePristo	34ea443cdb	Better algorithm for choosing which indel alleles are present in samples -- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors. -- This breakdown is producing spurious clustered indels (lots of these!) around real common indels -- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the counting towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel containing read, and it's counted. -- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale. -- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP	2012-03-26 16:28:49 -04:00
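
The tandem-repeat test described in commit 52ef4a3e26 can be sketched roughly as follows. This is not the GATK code (which is Java and operates on a VariantContext); it is a minimal Python illustration of the logic the message spells out. The function name `is_tandem_repeat`, the `("ins" | "del", bases)` event tuples, and the `ref_after` argument (the reference bases immediately following the indel's anchor position) are all hypothetical, and "reference bases at n - 2 n" is read here as the window from offset n up to 2n.

```python
def is_tandem_repeat(ref_after, events):
    """Return True iff every indel event expands or contracts a repeat.

    ref_after: reference bases immediately after the indel anchor (assumed input).
    events:    list of ("ins", bases) or ("del", bases) tuples (hypothetical
               stand-in for the non-reference alleles of a variant).
    """
    if not events:
        return False
    for kind, bases in events:
        n = len(bases)
        if n == 0:
            return False
        if kind == "ins":
            # An inserted unit must match the next n reference bases.
            window = ref_after[:n]
        elif kind == "del":
            # Deleted bases are, by definition, the first n reference bases;
            # reject inputs that violate this.
            if ref_after[:n] != bases:
                return False
            # The repeat test looks at the following window, offsets n to 2n.
            window = ref_after[n:2 * n]
        else:
            return False
        if window != bases:
            return False
    return True


if __name__ == "__main__":
    print(is_tandem_repeat("CACACAGT", [("ins", "CA")]))
    print(is_tandem_repeat("CAGTGT", [("del", "CA")]))
```

A reference that is too short to fill the comparison window simply fails the equality check, so the sketch returns False rather than raising.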

668 Commits (2aa2d9aec0fb7bde7cd0ca38ab38f0d7ca276663)
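
Commit f82986ee62 adds unit tests for the log10sumLog10 utility, but the listing does not show its implementation. Going by the name alone, such a function presumably returns log10 of the sum of values that are supplied as log10 numbers. The sketch below shows one common, numerically stable way to compute that (shift by the maximum before exponentiating); it is an assumption about the general technique, not a transcription of the GATK method, and `log10_sum_log10` is a hypothetical Python name.

```python
import math


def log10_sum_log10(log10_values):
    """Compute log10(sum(10**v for v in log10_values)) without underflow."""
    if not log10_values:
        # An empty sum is 0, whose log10 is -infinity.
        return float("-inf")
    m = max(log10_values)
    if m == float("-inf"):
        # Every term is 10**-inf == 0, so the sum is 0.
        return m
    # Factor out 10**m so the largest term becomes 10**0 == 1.
    return m + math.log10(sum(10.0 ** (v - m) for v in log10_values))


if __name__ == "__main__":
    # Direct exponentiation of -1000 underflows to 0.0 in double precision,
    # but the shifted form still gives a finite answer.
    print(log10_sum_log10([-1000.0, -1000.0]))
```

Shifting by the maximum is what lets very small likelihoods, such as those stored in log10 space, be summed without collapsing to zero.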