Commit Graph

1685 Commits (68d0211fa1cf8e0f9803e64d60672b800716a20c)

Author SHA1 Message Date
Mauricio Carneiro 68d0211fa1 Improved BQSR plotting and some new parameters
* Refactored CycleCovariate to be a fragment covariate instead of a per read covariate
* Refactored the CycleCovariateUnitTest to test the pairing information
* Updated BQSR Integration tests accordingly
* Made quantization levels parameter not hidden anymore
* Added hidden option to keep intermediate plotting files for debug purposes (they're automatically deleted)
* Added hidden option not to generate the plots automatically (important for scatter/gathering)
2012-04-19 09:31:41 -04:00
Ryan Poplin 4999ae87ad Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-18 15:02:42 -04:00
Ryan Poplin dcc4871468 minor misc optimizations to PairHMM 2012-04-18 15:02:26 -04:00
Eric Banks d3c84e7b1f This should be a User Error since it's provided from the DoC command-line arguments 2012-04-18 13:09:23 -04:00
Eric Banks 392f1903f7 Handling some of the NumberFormatExceptions seen via Tableau that are really user errors. 2012-04-18 12:57:37 -04:00
Ryan Poplin 8a84456626 Following Eric's awesome update to change the VQSR recal file into a VCF file, the ApplyRecalibration step is now scatter/gather-able and tree reducible. 2012-04-18 11:24:04 -04:00
Eric Banks 4448a3ea76 Final tweaks. Added an integration test to cover the case of SNPs and indels that start at the same position. 2012-04-17 23:54:10 -04:00
Eric Banks c1f52b773a Minor tweaks and updated integration tests MD5s 2012-04-17 23:17:28 -04:00
Eric Banks 6d03bce0d3 Important refactoring of the VQSR recal file format: we now use a VCF instead of a CSV file.
The most important reason for this change is that we no longer need to read the entire recal file into memory up front in ApplyRecalibration.  For 1000G calling this was prohibitive in terms of memory requirements.  Now we go through the rod system and pull in just the records we need at a given position.

As an added bonus, once BCF2 is live we can drastically cut down the sizes of these recal files (which can grow large for whole genome calling).
2012-04-17 22:38:18 -04:00
Mauricio Carneiro 46a212d8e9 Added "simplify reads" option to PrintReads. 2012-04-17 19:32:34 -04:00
Mauricio Carneiro f0c81b59b0 Implementation of the new BQSR plotting infrastructure
* removed low quality bases from the recalibration report
* refactored the Datum (Recal and Accuracy) class structure
* created a new plotting csv table for optimized performance with the R script
* added a datum object that carries the accuracy information (AccuracyDatum) for plotting
* added mean reported quality score to all covariates
* added QualityScore as a covariate for plotting purposes
* added unit test to the key manager to operate with one required covariate and multiple optional covariates
* integrated the plotting into BQSR (automatically generates the pdf with the recalibration tearsheet)
2012-04-17 19:23:55 -04:00
Ryan Poplin 952280bef1 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-17 17:00:14 -04:00
Ryan Poplin cf705f6c62 Adding read position rank sum test to the list of annotations that get produced with the HaplotypeCaller 2012-04-17 17:00:00 -04:00
Eric Banks 13c800417e Handle NPE in UG indel code: deletions immediately preceding insertions were not handled well in the code. 2012-04-17 15:51:23 -04:00
Khalid Shakir 91cb654791 AggregateMetrics:
- Ported from Jython to Java, so it is now accessible to Queue via automatic extension generation.
- Better handling for problematic sample names by using PicardAggregationUtils.
GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name.
CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering.
Added SelectHeaders walker for filtering headers for dbGAP submission.
Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter.
Latest WholeGenomePipeline.
Other minor cleanup to utility methods.
2012-04-17 11:45:32 -04:00
Ryan Poplin 1a2e92f8db Merged bug fix from Stable into Unstable 2012-04-17 10:23:05 -04:00
Ryan Poplin adad76b36f Fixing NPE in VQSR for the case of very small callsets. 2012-04-17 10:20:43 -04:00
Mark DePristo 23ccf772d4 IndelSummary now emits all of the underlying counts for ratios, percentages, etc it computes 2012-04-13 17:00:36 -04:00
Mark DePristo 84d1e8713a Infrastructure for combining VariantEvaluations
-- Not hooked up yet, so the output of VariantEval should be the same as before
-- Implemented a VariantEvalUnitTest that tests the low level strat / eval combinatorics and counting routines
-- Better docs throughout
2012-04-13 17:00:36 -04:00
Mark DePristo 38986e4240 Documentation for StratificationManager 2012-04-13 17:00:36 -04:00
Mark DePristo ab06d53867 Useful test constructor for unit tests in RefMetaDataTracker 2012-04-13 17:00:36 -04:00
Mark DePristo 285e61a227 Bugfix for IndelSummary
-- multi allelic count should be % not ratio
2012-04-13 17:00:35 -04:00
Mark DePristo e6d5cb46d2 Improvements and bugfixes to IndelSummary
-- Now properly includes both bi- and multi-allelic variants.  These are actually counted as well, and emitted as counts and % of sites with multiple alleles
-- Bug fix for gold standard rate
2012-04-13 17:00:35 -04:00
Mark DePristo bfa966a4e9 Bugfix for OneBPIndel
-- Previously was only including 1 bp insertions in stratification
2012-04-13 17:00:35 -04:00
Mark DePristo 2aa2d9aec0 Merged bug fix from Stable into Unstable 2012-04-13 09:25:43 -04:00
Mark DePristo 27e7e17dc7 New way to handle exceptions in multi-threaded GATK
-- HMS no longer tries to grab and throw all exceptions.  Exceptions are just thrown directly now.
-- Error handling is now done by functions in HMS, which are used by ShardTraverser and TreeReducer
-- Better printing of stack traces in WalkerTest
2012-04-13 09:23:33 -04:00
Eric Banks 818e8c2fb9 Resolving merge conflicts 2012-04-12 15:19:44 -04:00
Eric Banks 0dd571928d Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones). 2012-04-12 15:16:29 -04:00
Eric Banks f77a6d18b8 Bad conflict merge before 2012-04-12 09:56:49 -04:00
Eric Banks 33a8bdd75f Resolving merge conflicts 2012-04-12 09:51:55 -04:00
Eric Banks b659b16b31 Generate User Error for bad POS value 2012-04-12 09:49:35 -04:00
Eric Banks cc71baf691 Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing. 2012-04-12 09:18:44 -04:00
Eric Banks 5bf9dd2def A framework to get annotations working in the HaplotypeCaller (and ART walkers in general).
Adding support for active-region-based annotation for most standard annotations.  I need to discuss with Ryan what to do about tests that require offsets into the reads (since I don't have access to the offsets), such as the ReadPosRankSumTest.

IMPORTANT NOTE: this is still very much a dev effort and can only be accessed through private walkers (i.e. the HaplotypeCaller).  The interface is in flux and so we are making no attempt at all to make it clean or to merge this with the Locus-Traversal-based annotation system.  When we are satisfied that it's working properly and have settled on the proper interface, we will clean it up then.
2012-04-11 16:22:12 -04:00
Eric Banks 7aa654d13f New interface for some dev work that Ryan and I are doing; only accessible from private walkers right now 2012-04-11 13:49:09 -04:00
Eric Banks dc90508104 Adding a new annotation to UG calls: NDA = number of discovered (but not necessarily genotyped) alleles for the site. This could help downstream analysis esp. of indels for wonky sites (since we only use the top 2-3 alleles). Not enabled by default but we can change that if this turns out to be useful. 2012-04-11 13:47:10 -04:00
Eric Banks f560611fe8 Merged bug fix from Stable into Unstable 2012-04-10 22:26:53 -04:00
Eric Banks f46f7d0590 Fix the stats coming out of FlagStat. I will add an integration test in unstable 2012-04-10 22:26:10 -04:00
Mauricio Carneiro cd842b650e Optimizing DiagnoseTargets
* Fixed output format to get a valid vcf
* Optimized the per sample pileup routine O(n^2) => O(n) pileup for samples
* Added support for overlapping intervals
* Removed expand target functionality (for now)
* Removed total depth (pointless metric)
2012-04-10 17:43:59 -04:00
Ryan Poplin e3cc7cc59c Resolving merge conflict. 2012-04-10 14:50:27 -04:00
Ryan Poplin a4634624b7 There are now three triggering options in the HaplotypeCaller. The default (mismatches, insertions, deletions, high quality soft clips), an external alleles file (from the UG for example), or extended triggers which include low quality soft clips, bad mates and unmapped mates. Added better algorithm for band pass filtering an ActivityProfile and breaking them apart when they get too big. Greatly increased the specificity of the caller by battening down the hatches on things like base quality and mapping quality thresholds for both the assembler and the likelihood function. 2012-04-10 14:48:23 -04:00
Eric Banks 10e74a71eb We now allow arbitrary annotations other than dbSNP (e.g. HM3) to come out of the Unified Genotyper. This was already set up in the Variant Annotator Engine and was just a matter of hooking UG up to it. Added integration test to ensure correct behavior. 2012-04-10 12:30:35 -04:00
Mark DePristo b43d21056b Merged bug fix from Stable into Unstable 2012-04-10 09:42:09 -04:00
Mark DePristo 6885e2d065 UserException fixes for GATK_logs recent errors
-- SamFileReader.java:525
-- BlockCompressedInputStream:376

These were both instances where we weren't catching and rethrowing Picard exceptions as UserExceptions.
2012-04-10 07:37:42 -04:00
Mark DePristo 8507cd7440 Throw UserException for bad dict / chain files 2012-04-10 07:22:43 -04:00
Ryan Poplin cd9bf1bfc3 Changing IndelSummary eval module so that PostCallingQC.scala can run with MIXED-record VCFs. 2012-04-10 00:22:40 -04:00
Roger Zurawicki 9ece93ae9c DiagnoseTargets now outputs a VCF file
- refactored the statistics classes
- concurrent callable statuses by sample are now available.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-04-09 16:40:20 -04:00
Guillermo del Angel 719ec9144a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-09 14:53:19 -04:00
Guillermo del Angel 550179a1f7 Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors. 2012-04-09 14:53:05 -04:00
Eric Banks ea4300d583 Refactoring so that Unified Argument Collection doesn't use deprecated classes. 2012-04-09 13:45:17 -04:00
Eric Banks 6ddf2170b6 More efficient implementation of the sum of the allele frequency posteriors matrix using a pre-allocated cache as discussed in group meeting last week. Now, when the cache is filled, we safely collapse down to a single value in real space and put the un-re-centered log10 value back into the front of the cache. Thanks to all for the help and advice. 2012-04-09 11:46:16 -04:00
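
Illustrative note on commit 6ddf2170b6: the cached log10 summation it describes can be sketched roughly as follows. This is a minimal sketch written for this listing, not the GATK code. The class name Log10SumCache, its methods, and the capacity parameter are all hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step assumed here rather than taken from the commit.

```python
import math


class Log10SumCache:
    """Hypothetical sketch: sums values supplied in log10 space using a fixed-size cache."""

    def __init__(self, capacity=4):
        if capacity < 2:
            raise ValueError("capacity must be at least 2")
        # Pre-allocated once; never resized.
        self._cache = [0.0] * capacity
        self._size = 0

    def add(self, log10_value):
        # When the cache is full, collapse it to a single log10 value
        # and store that value at the front of the cache.
        if self._size == len(self._cache):
            self._cache[0] = self._collapse()
            self._size = 1
        self._cache[self._size] = log10_value
        self._size += 1

    def _collapse(self):
        # Sum the cached entries in real space, then return to log10 space.
        if self._size == 0:
            return float("-inf")
        values = self._cache[:self._size]
        largest = max(values)
        if largest == float("-inf"):
            return largest
        # Assumption: shift by the largest entry to avoid underflow, then shift back.
        total = sum(10.0 ** (v - largest) for v in values)
        return largest + math.log10(total)

    def log10_sum(self):
        """Return log10 of the sum of everything added so far."""
        return self._collapse()
```

For example, adding the log10 values of 1 through 5 to a cache of capacity 3 forces one collapse along the way, and log10_sum() then agrees with math.log10(15) up to floating-point rounding.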