gatk-3.8

Author	SHA1	Message	Date
Mauricio Carneiro	e39a59594a	BQSR triage and test routines * updated BQSR queue script for faster turnaround * implemented plot generation for scatter/gathered runs * adjusted output file names to be cooperative with the queue script * added the recalibration report file to the argument table in the report * added ReadCovariates unit test -- guarantees that all the covariates are being generated for every base in the read * added RecalibrationReport unit test -- guarantees the integrity of the delta tables	2012-04-23 11:23:00 -04:00
Eric Banks	cd63bcb1b8	Fixing unit tests to register the user exception being thrown (instead of the NumberFormatException)	2012-04-23 10:06:51 -04:00
Eric Banks	1f23d99dfa	If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did.	2012-04-20 17:00:05 -04:00
Eric Banks	4b81c75642	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-20 14:30:19 -04:00
Eric Banks	f1c5510ec0	When running SelectVariants with the excludeNonVariants option, remove alleles from the ALT field that are no longer polymorphic.	2012-04-20 14:30:04 -04:00
Ryan Poplin	a1596791af	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-20 14:03:04 -04:00
Ryan Poplin	a57295eb75	Fixing a bug when breaking up active regions where the resulting regions would overlap by one base. Adding quality score manipulation from the UG into the haplotype caller (qual capped by mapping quality, min qual threshold).	2012-04-20 14:02:55 -04:00
Guillermo del Angel	de68363c23	Removed experimental feature (aka hack) that was meant for 1000G consensus but remained in VQSR data manager - QD was being scaled by indel length. There's no evidence any more that QD is length-dependent, neither in CEU trio data nor in latest 1000G P2 calls	2012-04-20 10:58:34 -04:00
Mauricio Carneiro	0f8c77391d	BQSR bug triage #3 * fixed context covariate famous "off by one" error * reduced maximum quality score to Q50 (following Eric/Ryan's suggestion) * remove context downsampling in BQSR R script	2012-04-19 17:31:04 -04:00
Khalid Shakir	df5dd841af	AC strat now checks if evals will be merged before throwing an error on multiple eval files. Minor tweaks to WGP script based on new recal VCF format.	2012-04-19 16:08:55 -04:00
Guillermo del Angel	1ae2ab5b63	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-19 12:50:29 -04:00
Guillermo del Angel	02ff930f6a	My changes	2012-04-19 12:45:18 -04:00
Mauricio Carneiro	eb22cd7222	Unit test to guarantee BQSR sequential calculation accuracy This test brings together the old and the new BQSR, building a recalibration table using the two separate frameworks and performing the recalibration calculation using the two different frameworks for 10,000+ bases and asserting that the calculations match in every case.	2012-04-19 09:33:40 -04:00
Mauricio Carneiro	68d0211fa1	Improved BQSR plotting and some new parameters * Refactored CycleCovariate to be a fragment covariate instead of a per read covariate * Refactored the CycleCovariateUnitTest to test the pairing information * Updated BQSR Integration tests accordingly * Made quantization levels parameter not hidden anymore * Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted) * Added hidden option not to generate the plots automatically (important for scatter/gathering)	2012-04-19 09:31:41 -04:00
Guillermo del Angel	143e92b797	Rebasing	2012-04-18 20:05:43 -04:00
Ryan Poplin	dcc4871468	minor misc optimizations to PairHMM	2012-04-18 15:02:26 -04:00
Eric Banks	4448a3ea76	Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position.	2012-04-17 23:54:10 -04:00
Eric Banks	c1f52b773a	Minor tweaks and updated integration tests MD5s	2012-04-17 23:17:28 -04:00
Eric Banks	ea793d8e27	Khalid pressured me into adding an integration test that makes sure we don't fail on reads with adjacent I and D events.	2012-04-17 21:21:29 -04:00
Mauricio Carneiro	f0c81b59b0	Implementation of the new BQSR plotting infrastructure * removed low quality bases from the recalibration report. * refactored the Datum (Recal and Accuracy) class structure * created a new plotting csv table for optimized performance with the R script * added a datum object that carries the accuracy information (AccuracyDatum) for plotting * added mean reported quality score to all covariates * added QualityScore as a covariate for plotting purposes * added unit test to the key manager to operate with one required covariate and multiple optional covariates * integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)	2012-04-17 19:23:55 -04:00
Khalid Shakir	91cb654791	AggregateMetrics: - By porting from jython to java now accessible to Queue via automatic extension generation. - Better handling for problematic sample names by using PicardAggregationUtils. GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name. CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering. Added SelectHeaders walker for filtering headers for dbGAP submission. Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter. Latest WholeGenomePipeline. Other minor cleanup to utility methods.	2012-04-17 11:45:32 -04:00
Mark DePristo	3f6b2423d8	Update VE IT to reflect new fields and bugfixes	2012-04-13 17:00:37 -04:00
Mark DePristo	f9190b6fcd	VariantEvalUnitTest is better named VariantEvalWalkerUnitTest	2012-04-13 17:00:37 -04:00
Mark DePristo	84d1e8713a	Infrastructure for combining VariantEvaluations -- Not hooked up yet, so the output of VariantEval should be the same as before -- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines -- Better docs throughout	2012-04-13 17:00:36 -04:00
Mark DePristo	2aa2d9aec0	Merged bug fix from Stable into Unstable	2012-04-13 09:25:43 -04:00
Mark DePristo	27e7e17dc7	New way to handle exceptions in multi-threaded GATK -- HMS no longer tries to grab and throw all exceptions. Exceptions are just thrown directly now. -- Proper error handling is handled by functions in HMS, which are used by ShardTraverser and TreeReducer -- Better printing of stack traces in WalkerTest	2012-04-13 09:23:33 -04:00
Mark DePristo	e85e9a8cf5	More extensive testing of type of error thrown in multi-threaded walker test -- Unfortunately the result of the multi-threaded test is non-deterministic so run the test 10 times to see if the right exception is always thrown -- Now prints the stack trace and exception message of the caught exception of the wrong type, if this occurs	2012-04-13 09:23:33 -04:00
Eric Banks	297afc7911	Added unit test to ensure that we genotype correctly cases with really large GLs	2012-04-12 15:43:14 -04:00
Eric Banks	5b7da3831f	Not sure why this didn't make it into the last push, but here's a working MD5 for the NDA annotation in UG	2012-04-11 13:49:50 -04:00
Eric Banks	dc90508104	Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful.	2012-04-11 13:47:10 -04:00
Eric Banks	d2142c3aa7	Adding integration test for Flag Stat	2012-04-10 22:40:38 -04:00
Ryan Poplin	1df0adf862	Fixing ActivityProfile unit test.	2012-04-10 15:28:27 -04:00
Ryan Poplin	e3cc7cc59c	Resolving merge conflict.	2012-04-10 14:50:27 -04:00
Ryan Poplin	a4634624b7	There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function.	2012-04-10 14:48:23 -04:00
Eric Banks	10e74a71eb	We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior.	2012-04-10 12:30:35 -04:00
Eric Banks	f82986ee62	Adding unit tests for the very important log10sumLog10 util method.	2012-04-09 14:28:25 -04:00
Mauricio Carneiro	87e6bea6c1	Adding engine capability to quantize qualities. * Added parameter -qq to quantize qualities using a recalibration report * Added options to quantize using the recalibration report quantization levels, new nLevels and no quantization. * Updated BQSR scripts to make use of the new parameters	2012-04-08 21:07:51 -04:00
Mark DePristo	c22a66870c	Modified UnitTests to respect reference padding	2012-04-06 16:27:20 -04:00
Mark DePristo	45fc0ea98d	Improvements to indel analysis capabilities of VariantEval -- Now calculates the number of Indels overlapping gold standard sites, as well as the percent of indels overlapping gold standard sites -- Removed insertion : deletion ratio for 1 bp event, replaced it with 1 + 2 : 3 bp ratio for insertions and deletions separately. This is based on an old email from Mark Daly: // - Since 1 & 2 bp insertions and 1 & 2 bp deletions are equally likely to cause a // downstream frameshift, if we make the simplifying assumptions that 3 bp ins // and 3bp del (adding/subtracting 1 AA in general) are roughly comparably // selected against, we should see a consistent 1+2 : 3 bp ratio for insertions // as for deletions, and certainly would expect consistency between in/dels that // multiple methods find and in/dels that are unique to one method (since deletions // are more common and the artifacts differ, it is probably worth looking at the totals, // overlaps and ratios for insertions and deletions separately in the methods // comparison and in this case don't even need to make the simplifying in = del functional assumption -- Added a new VEW argument to bind a gold standard track -- Added two new stratifications: OneBPIndel and TandemRepeat which do exactly what you imagine they do -- Deleted random unused functions in IndelUtils	2012-04-06 16:07:46 -04:00
Mark DePristo	52ef4a3e26	Function to compute whether a VariantContext indel is part of a TandemRepeat. Returns true iff VC is a non-complex indel where every allele represents an expansion or contraction of a series of identical bases in the reference. The logic of this function is pretty simple. Take all of the non-null alleles in VC. For each insertion allele of n bases, check if that allele matches the next n reference bases. For each deletion allele of n bases, check if this matches the reference bases at positions n to 2n, as it must necessarily match the first n bases. If this test returns true for all alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the base difference between the ref and alt alleles.	2012-04-06 16:07:46 -04:00
Mauricio Carneiro	7c3b3650bb	BQSR bug triage * fixed bug where some keys were using the same recal datum objects * fixed quantization qual calculations when combining multiple reports * fixed rounding error with empirical quality reported when combining reports * fixed combine routine in the gatk reports due to the primary keys being out of order * added auto-recalibration option to BQSR scala script * reduced the size of the recalibration report by ~15% * updated md5's	2012-04-05 09:32:18 -04:00
Mark DePristo	76e4100d89	By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots -- Updated integration tests as well	2012-04-04 18:48:03 -04:00
Ryan Poplin	bfad26353a	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-04 16:04:50 -04:00
Ryan Poplin	dda2173c66	Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned.	2012-04-04 16:04:29 -04:00
Mark DePristo	1ccea866d8	VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses -- Updated EvalModules to work with new parameter -- adding test file for keepAC0 to public/testdata and integration tests	2012-04-04 15:37:12 -04:00
Eric Banks	337ff7887a	When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.	2012-04-04 10:57:05 -04:00
Guillermo del Angel	05d8400468	Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)	2012-04-03 20:51:24 -04:00
Guillermo del Angel	5a10f173ea	Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow)	2012-04-03 18:55:52 -04:00
Guillermo del Angel	63b1e737c6	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-04-03 15:43:50 -04:00
Guillermo del Angel	9e11b4f9a7	Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.	2012-04-03 15:43:32 -04:00
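
Commit 52ef4a3e26 describes its tandem-repeat test only in words. The following is a minimal Python sketch of that description, not the GATK implementation. The function name `is_tandem_repeat` and its parameters are hypothetical. It assumes unpadded alleles (an empty string stands for the null allele) and that `ref_bases_after` holds the reference bases starting at the first base affected by the event, so for a deletion it begins with the deleted bases.

```python
def is_tandem_repeat(ref_allele, alt_alleles, ref_bases_after):
    """Sketch of the check described in commit 52ef4a3e26 (hypothetical API).

    ref_allele / alt_alleles: unpadded allele strings, "" for the null allele.
    ref_bases_after: reference bases starting at the event position.
    """
    if not alt_alleles:
        return False
    for alt in alt_alleles:
        # Assumption: a simple (non-complex) indel has exactly one empty side.
        if bool(ref_allele) == bool(alt):
            return False
        if not ref_allele:
            # Insertion of n bases: must match the next n reference bases.
            n = len(alt)
            if ref_bases_after[:n] != alt:
                return False
        else:
            # Deletion of n bases: the deleted bases are the first n reference
            # bases by construction, so compare them with bases n to 2n.
            n = len(ref_allele)
            if ref_bases_after[n:2 * n] != ref_allele:
                return False
    return True
```

For example, under these assumptions `is_tandem_repeat("", ["AC"], "ACACGT")` treats the insertion of `AC` as an expansion of the `AC` repeat, while a deletion of `AC` in front of `ACGTGT` is rejected because the following two bases are `GT`.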

695 Commits (d6277b70d8e860ac4ef37d7438687480e79eb111)