Commit Graph

9114 Commits (ce617b2dfc8a24a71c4afaea01b4d885bdbd62a7)

Author SHA1 Message Date
Guillermo del Angel ce617b2dfc Bug fix to previous UnifiedGenotyperEngine refactoring, removed debug code 2012-03-25 10:20:21 -04:00
Guillermo del Angel db54c2625f Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-25 09:53:35 -04:00
Eric Banks c4f0d7eb73 Added mode to call SNPs and indels separately. Added functional annotation step back in. 2012-03-25 09:06:28 -04:00
Guillermo del Angel deb4586559 Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned per each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGentotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1 2012-03-24 21:49:43 -04:00
Mark DePristo b063bcd38d Removing update0 support in VariantEval
-- Now the only use for update0, calculating the number of processed loci, is centrally tracked in the walker itself not the evaluations.
-- This allows us to avoid calling update0 are every genomic base in 100ks of evaluates when there are a lot of stratifications.
-- No need to modify the integration tests, this optimization doesn't change the result of the calculation
2012-03-23 21:02:21 -04:00
Mauricio Carneiro 0509d316d9 More information in the recalibration report
* added empirical quality counts to allow quantization during on-the-fly recalibration to any level
   * added number of observations and errors to all tables to enable plotting of all covariates
2012-03-23 16:15:19 -04:00
Mauricio Carneiro 9f74969e3a BQSR with GATKReport implementation
* restructured BQSR to report recalibrated tables.
   * implemented empirical quality calculation to the BQSR stage (instead of on-the-fly recalibration)
   * linked quality score quantization to the BQSR stage, outputting a quantization histogram
   * included the arguments used in BQSR to the GATK Report
   * included all three tables (RG, QUAL and COVARIATES) to the GATK Report with empirical qualities

On-the-fly recalibration with GATK Report

   * loads all tables from the GATKReport using existing infrastructure (with minor updates)
   * implemented initialiazation of the covariates using BQSR's argument list
   * reduced memory usage significantly by loading only the empirical quality and estimated quality reported for each bit set key
   * applied quality quantization to the base recalibration
   * excluded low quality bases from on-the-fly recalibration for mismatches, insertions or deletions
2012-03-23 15:42:32 -04:00
Mauricio Carneiro f421062b55 Updated read group covariate to use sample.lane instead of the id
Added Unit test.
2012-03-23 15:24:07 -04:00
Mauricio Carneiro 539da9e3e1 Fixing GATKReport exception handling when loading a report
* allowing tables with no description to go through
   * GATKReportTable should be more lenient with the format requirements (added to-dos for roger)
2012-03-23 15:23:13 -04:00
Mark DePristo 4a54a4b11c Bugfixes to variantCallQC.R; improved visual style
-- Still need to deal with the need to have a gold-standard track that isn't a stratification but is actually just provided as is for use to eval modules.  This is currently stopping us from computing the % gold standard calls discovered, and the % of gold standard calls in the call set.
2012-03-23 15:06:13 -04:00
Eric Banks 2511839068 Merged bug fix from Stable into Unstable 2012-03-23 13:51:33 -04:00
Eric Banks d3f2bc4361 Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles. 2012-03-23 13:51:00 -04:00
Mark DePristo e4ec90cfce Merged bug fix from Stable into Unstable 2012-03-23 11:27:34 -04:00
Mark DePristo ff26f2bf68 HierarchicalMicroScheduler no longer attempts to wrap exceptions
-- This behavior, which isn't obviously valuable at all, continued to grab and rethrow exceptions in the HMS that, if run without NT, would show up as more meaningful errors.  Now HMS simply checks whether the throwable it received on error was a RuntimeException.  If so, it is stored and rethrow without wrapping later.  If it isn't, only in this case is the exception wrapped in a ReviewedStingException.
-- Added a QC walker ErrorThrowingWalker that will throw a UserException, ReviewedStingException, and NullPointerException from map as specified on the command line
-- Added IT that ensures that all three types are thrown properly (i.e., you catch a NullPointerException when you ask for one to be thrown) with and without threading enabled.
-- I believe this will finally put to rest all of these annoying HMS captures.
2012-03-23 11:27:21 -04:00
Ryan Poplin 9d22471b79 Merged bug fix from Stable into Unstable 2012-03-23 10:48:34 -04:00
Ryan Poplin ab288354e9 Better error message for malformed input recal file. 2012-03-23 10:47:01 -04:00
Mark DePristo a42fd5e935 More informative stack trace formatting for analyzeRunReports.py 2012-03-23 09:03:49 -04:00
Mark DePristo fee8d86f63 VariantEval optimization
-- Use a LinkedHashMap not a TreeMap so iteration is faster.
-- Note that with a lot of stratifications the update0 is taking up a lot of time.  For example, with 822 samples and functional class and sample on there are 100K contexts and 30% of the runtime is just in the update0 call
2012-03-22 22:13:24 -04:00
Mark DePristo a0602e534c PostCallingQC and variantCallQC.R vastly improved
-- All new IndelSummary features used by default with variantCallQC.R script with PostCallingQC.scala
-- This includes many new per sample and per AC indel QC metrics.
-- Better plotting style throughout, with nice histograms, white backgrounds, etc
-- PostCallingQC is super easy to use now, just give it --evalvcfs and go.  Tell it to run with 8 threads to finish quick.  It's really easy and produces beautiful new figures (will replace with one attached)
2012-03-22 21:24:37 -04:00
Mark DePristo 6df96644d9 Unified, standard IndelSummary metrics for VariantEval
-- Now you always get SNP and indel metrics with VariantEval!
--   Includes Number of SNPs, Number of singleton SNPs, Number of Indels, Number of singleton Indels, Percent of indel sites that are multi-allelic, SNP to indel ratio, Singleton SNP to indel ratio, Indel novelty rate, 1 to 2 bp indel ratio, 1 to 3 bp indel ratio, 2 to 3 bp indel ratio, 1 and 2 to 3 bp indel ratio, Frameshift percent, Insertion to deletion ratio, Insertion to deletion ratio for 1 bp events, Number of indels in protein-coding regions labeled as frameshift, Number of indels in protein-coding regions not labeled as frameshift, Het to hom ratio for SNPs, Het to hom ratio for indels, a Histogram of indel lengths, Number of large (>10 bp) deletions, Number of large (>10 bp) insertions, Ratio of large (>10 bp) insertions to deletions
-- Updated VE integration tests as appropriate
2012-03-22 21:24:37 -04:00
Mark DePristo bcf80cc7b3 Cleanup in VariantEval. Example of molten VariantEval output
-- Moved a variety of useful formatting routines for ratios, percentages, etc, into VariantEvalator.java so everyone can share.  Code updated to use these routines where appropriate
-- Added variantWasSingleton() to VariantEvaluator, which can be used to determine if a site, even after subsetting to specific samples, was a singleton in the original full VCF
-- TableType, which used to be an interface, is now an abstract class, allowing us to implement some generally functionality and avoid duplication.
-- This included creating a getRowName() function that used to be hardcoded as "row" but how can be overridden.
-- #### This allows us implement molten tables, which are vastly easier to use than multi-row data sets.  See IndelHistogram class (in later commit) for example of molten VE output
2012-03-22 21:24:37 -04:00
Mark DePristo e4d49357ce Further cleanup of R 2012-03-22 21:24:37 -04:00
Mark DePristo 503e2ea29e Cleanup R directory 2012-03-22 21:24:37 -04:00
Mark DePristo 5725f72904 Cleanup unused python programs
-- If you happen to use one of these files you can always revert it.
2012-03-22 21:24:36 -04:00
Mark DePristo 9ddd5aec93 More eval modules being removed from VariantEval
-- IndelStatistics is superceded by IndelStatistics
2012-03-22 21:24:36 -04:00
Mark DePristo bd5b6d1aba Remove no longer in use Eval modules from VariantEval
-- No more IndelLengthHistogram (superceded by IndelSummary in subsequent commit)
-- No more SamplePreviousGenotypes or PhaseStats
-- No more MultiallelicAFs
2012-03-22 21:24:36 -04:00
Mark DePristo 6c2290fb6e Performance optimization for gsa.read.gatkreport.R
-- instead of using y = rbind(x, y), which is O(n^2) in a loop when processing lines into a data structure in R, preallocate a matrix and explicitly assign each row to x.  This results in a radical performance improvement when reading large tables into R.  It's possible with this optimization to read in a 70MB table for variantQCReport.R with 200K lines for 800 samples.
2012-03-22 21:24:36 -04:00
Menachem Fromer 7faa9938b1 Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-22 17:43:44 -04:00
Menachem Fromer b9b9219ac7 Added respectPhaseInInput flag to RBP and integration tests 2012-03-22 17:40:21 -04:00
Guillermo del Angel 0a56a14d09 Build fixes to merge pool calculation models with latest interface changes. Reverted build.xml's private debug changes 2012-03-22 16:07:07 -04:00
Guillermo del Angel b92fee711b Added missing new files from previous commit 2012-03-22 15:47:23 -04:00
Guillermo del Angel f198cec5e2 Temp commit: new structure for pool caller, now all work is in the same framework as in UG. There's a new genotype calculation model, PoolGenotypeCalculationModel, that does all the work and plugs into UnifiedGenotyperEngine. A new AF module for pools is upcoming. Old pool caller will be removed once all work is migrated 2012-03-22 15:46:39 -04:00
Menachem Fromer 1dfaacfeb5 Check for consistency of the BAM and VCF sample names, with a command line disable to throw if you know what you are doing 2012-03-22 12:40:15 -04:00
Guillermo del Angel b02ef95bcf Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-22 12:14:12 -04:00
Guillermo del Angel 92676c63ca Make constructor of IndelGenotypeLikelihoodsCalculationModel public so it can be used in unit tests 2012-03-22 12:13:59 -04:00
Guillermo del Angel 58965d6a6e Merged bug fix from Stable into Unstable 2012-03-22 11:04:11 -04:00
Mark DePristo 256e9f001e analyzeRunReports now includes full stack traces with causes 2012-03-22 10:15:44 -04:00
Guillermo del Angel b8cd959461 Potential corner condition bug fix: protect against null pointer exceptions when computing consensus indel bases when UG is discovering alt alleles. If an alt allele has non-standard bases, skip allele gracefully instead of adding null object into list 2012-03-22 10:06:22 -04:00
Eric Banks 8c09ff9459 Merged bug fix from Stable into Unstable 2012-03-21 12:44:43 -04:00
Eric Banks 58245bfa2f Bug fix: check to see whether there's a BasePileup before asking for one. 2012-03-21 12:44:09 -04:00
Eric Banks 07c3bd32b3 Bug fix: merge NO_VARIATION records with those of another type. The sad part is that this WAS covered by integration tests but someone updated the MD5s without actually paying attention... 2012-03-21 12:42:13 -04:00
Eric Banks dcf2fa361d Minor cleanup 2012-03-21 12:14:31 -04:00
Eric Banks ab1c48745b Need to catch RuntimeExceptions coming out of Picard too so that they show up as UserErrors (some BAM errors are thrown as REs). 2012-03-21 12:13:52 -04:00
Ryan Poplin 9e10779fa7 Caching log calculations cut the non-Map runtime of HaplotypeCaller in half. Moved the qual log cache used in HC and PairHMM into a common place and added unit tests. 2012-03-21 08:45:42 -04:00
Mauricio Carneiro 0e93cf5297 Taking care of bad cigars in the GATK
* fixed BadCigarFilter to filter out reads starting/ending in deletion and that have adjacent I/D events.
   * added Unit tests for BadCigarFilter
   * updated all exceptions in LocusIteratorByState to tell the user that he can instead run with -rf BadCigar
   * added the BadCigar filter to ReduceReads and RealignTargetCreator (if your walker blows up with these malformed reads, you may want to add it too)
2012-03-20 14:32:57 -04:00
Eric Banks b290152542 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-20 08:56:04 -04:00
Eric Banks 5e79046c98 Minor change but I realized from Mark's commit that the code I stole it from was flawed 2012-03-20 08:55:56 -04:00
Mark DePristo 5ecfc49f74 Minor cleanup of MergeIntervalLists (example, please look)
-- Note that isDone() is override to return true.  This causes the GATK to cleanly stop processing early.
2012-03-20 07:49:27 -04:00
Mark DePristo 36636eb323 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-20 07:47:24 -04:00
Eric Banks ade1971581 Since we allow any generic header types, there's no longer any reason to check for supported types 2012-03-20 00:12:17 -04:00