75 Commits (c0261f75ce67b35dfb6c6308785633bf95a7be24)

Author SHA1 Message Date
Eric Banks 66cbaaee31 Fixed nasty bug in BQSR csv file creation:
numbers larger than 999 in the Errors column were printed out with commas (which looks like a separate column).

This wasn't caught earlier because there are no integration tests covering the csv.  I'll add one into unstable in a sec.
2012-11-09 08:33:55 -05:00
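The comma problem described in the commit above can be reproduced with a minimal sketch like the following. The class and method names are hypothetical and this is not the GATK code; it only assumes the value was formatted with a grouping separator somewhere along the way.

```java
import java.util.Locale;

// Hypothetical helper, not the GATK code: shows how a grouping flag corrupts a CSV row.
class CsvErrorsColumnDemo {
    // "%,d" inserts a locale-specific grouping separator (a comma under Locale.US),
    // so 1234 becomes "1,234" and looks like two CSV fields.
    static String groupedRow(String readGroup, long errors) {
        return String.format(Locale.US, "%s,%,d", readGroup, errors);
    }

    // "%d" prints plain digits, so the row keeps exactly two fields.
    static String plainRow(String readGroup, long errors) {
        return String.format(Locale.US, "%s,%d", readGroup, errors);
    }

    public static void main(String[] args) {
        System.out.println(groupedRow("rg1", 1234)); // three comma-separated fields
        System.out.println(plainRow("rg1", 1234));   // two comma-separated fields
    }
}
```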
David Roazen 422e16c62e BaseRecalibration: don't cache instances of ReadCovariates across reads
Caching and reusing ReadCovariates instances across reads sounds good in theory, but:

-it doesn't work unless you zero out the internal arrays before each read
-the internal arrays must be sized proportionally to the maximum POSSIBLE
recalibrated read length (5000!!!), instead of the ACTUAL read lengths

By contrast, creating a new instance per read is basically equivalent to doing an
efficient low-level memset-style clear on a much smaller array (since we use the actual
rather than the maximum read length to create it). So this should be faster than caching
instances and calling clear() but slower than caching instances and not calling clear().

Credit to Ryan for proposing this approach.
2012-10-25 17:02:55 -04:00
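The trade-off argued in the commit above can be sketched roughly as follows. The class is a hypothetical stand-in for the per-read storage, not the real ReadCovariates; only the 5000 figure comes from the message.

```java
import java.util.Arrays;

// Hypothetical stand-in for per-read covariate key storage; names are illustrative.
class CovariateBufferDemo {
    static final int MAX_READ_LENGTH = 5000; // maximum possible length cited in the message

    private final int[] cached = new int[MAX_READ_LENGTH];

    // Cached strategy: reuse one maximum-sized array and clear all of it before each read.
    int[] cachedBufferFor(int readLength) {
        Arrays.fill(cached, 0); // touches MAX_READ_LENGTH entries regardless of readLength
        return cached;
    }

    // Per-read strategy: the JVM returns a zero-initialized array of the actual length,
    // so the clearing cost scales with the read rather than with the maximum.
    static int[] freshBufferFor(int readLength) {
        return new int[readLength];
    }
}
```

With 100-base reads, the fresh array zeroes 100 entries while the cached one clears 5000 on every call.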
Mark DePristo cc8c12b954 Committing a broken version of BaseRecalibration
-- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests run PrintReads with -nct set to 1, 2, and 4, and the 4-thread case fails.  But that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
2012-10-25 14:46:35 -04:00
David Roazen 32a6d7000a Thread-safe ReadGroupCovariate
The ReadGroupCovariate class was not thread-safe. This led to horrible race conditions
in multithreaded runs of the BQSR where (for example) the same read group could get
inserted into the reverse lookup table twice with different IDs.

Should fix the intermittent crash reported in GSA-492.
2012-10-24 15:22:50 -04:00
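One way to avoid the duplicate-ID race described in the commit above is to make ID assignment atomic per read group. This is a sketch only: the class name is hypothetical, and it uses ConcurrentHashMap.computeIfAbsent, which needs Java 8 or later and so cannot be what the 2012 fix actually did.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical lookup table, not the GATK ReadGroupCovariate implementation.
class ReadGroupIdTable {
    private final ConcurrentHashMap<String, Integer> idByName = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<Integer, String> nameById = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger(0);

    // computeIfAbsent runs the mapping function at most once per key, so two threads
    // asking for the same read group cannot both allocate an ID for it.
    int idFor(String readGroup) {
        return idByName.computeIfAbsent(readGroup, name -> {
            int id = nextId.getAndIncrement();
            nameById.put(id, name); // reverse lookup entry, written once per ID
            return id;
        });
    }

    String nameFor(int id) {
        return nameById.get(id);
    }
}
```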
David Roazen ac87ed47bb BQSR: allow logging recal table updates to a file
For testing/debugging purposes only
2012-10-01 14:18:34 -04:00
Eric Banks 1316b579f0 Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table. 2012-09-20 14:14:34 -04:00
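The invariant checked in the commit above (gathered output identical to a single run) can be illustrated with a simplified sketch. The commit does not say what the gathering bug was, so the table layout, key strings, and method names here are all assumptions; the sketch only shows that summing raw counts per key and deriving qualities afterwards is independent of how the data was split.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified recalibration table: key -> {observations, errors}.
class RecalTableGatherDemo {
    // Merge per-shard tables by summing the raw counts for each key.
    static Map<String, long[]> gather(List<Map<String, long[]>> shards) {
        Map<String, long[]> merged = new TreeMap<>();
        for (Map<String, long[]> shard : shards) {
            for (Map.Entry<String, long[]> entry : shard.entrySet()) {
                long[] total = merged.computeIfAbsent(entry.getKey(), k -> new long[2]);
                total[0] += entry.getValue()[0]; // observations
                total[1] += entry.getValue()[1]; // errors
            }
        }
        return merged;
    }

    // Derive the quality from the merged counts. Averaging per-shard qualities instead
    // would generally give a different answer. Assumes observations > 0 and errors > 0.
    static double phredFromCounts(long observations, long errors) {
        return -10.0 * Math.log10((double) errors / observations);
    }
}
```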
Mark DePristo 087247f1f0 Allow longs and doubles in recalibration report to allow some backward compatibility 2012-09-19 19:23:44 -04:00
Ryan Poplin 7a7103a757 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-19 10:39:18 -04:00
Ryan Poplin 0ea543e1fd Removing testing scaffolding from delocalized BQSR. The output recal table reports the data as doubles instead of integers. This changes the mapping-based BQSR integration tests. Final intermediate push before delocalized BQSR replaces previous BQSR. 2012-09-19 10:39:06 -04:00
Eric Banks d94d0d15c2 Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout. 2012-09-12 15:15:40 -04:00
Eric Banks 994a4ff387 Track all outputs from BQSR (.table, .csv, and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings). 2012-09-12 11:24:53 -04:00
David Roazen d2f3d6d22f Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)"
This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.
2012-09-10 15:52:39 -04:00
Menachem Fromer 0b717e2e2e Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:32:41 -04:00
Mark DePristo c9ea213c9b Make BaseRecalibration thread-safe
-- In the process uncovered two strange things
    1 -- qualityScoreByFullCovariateKey was created but never used.  Seems like a cache?
    2 -- Discovered nasty bug in BaseRecalibrator: https://jira.broadinstitute.org/browse/GSA-534
2012-08-31 13:42:42 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, or handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR throws an exception when run in parallel, which a subsequent commit will fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Mark DePristo dcc972a557 Usability cleanup for BQSR
-- I'm seeing a lot of people trying to use BinaryTagCovariate in the community.  They really shouldn't do this, so I moved it to private.
-- Throw an exception if its required bintag argument is missing
-- Check explicitly if user is requesting DinucCovariate and tell them that it's been retired in favor of ContextCovariate
-- Show the type (Required, Experimental, Standard) of the covariates when running --list
2012-08-25 14:53:00 -04:00
Eric Banks ded0e11b45 Killing off some FindBugs 'Reliability' issues 2012-08-16 14:00:48 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Eric Banks f368e568db Implementing support in BaseRecalibrator for SOLiD no call strategies other than throwing an exception. For some reason we never transferred these capabilities into BQSRv2 earlier. 2012-08-15 22:52:56 -04:00
Mark DePristo f032e0aba4 A bit better output for ContextCovariate context size logging 2012-08-12 13:45:52 -04:00
Mark DePristo 243af0adb1 Expanded the BQSR reporting script
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R now loads all relevant libraries, including gplots, grid, and gsalib, to run correctly
2012-08-12 13:45:14 -04:00
Mark DePristo 458bbdee8f Add useful logger.info telling us the mismatch and indel context sizes 2012-08-12 10:27:05 -04:00
Eric Banks 0a2a646a52 Other random FindBugs fixes 2012-08-08 14:56:27 -04:00
Eric Banks a0196c9f5b Quick pass of FindBugs 'method invokes inefficient Number constructor' fixes. 2012-08-08 14:34:16 -04:00
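The FindBugs warning named in the commit above is about boxing numbers with a constructor such as new Integer(x), which always allocates. The usual fix is the valueOf factory, which may return a cached instance. A minimal sketch with a hypothetical class name:

```java
// Hypothetical demo class; the valueOf calls are standard java.lang API.
class NumberBoxingDemo {
    // Integer.valueOf reuses cached instances for values in -128..127 (and may cache more),
    // whereas "new Integer(value)" would allocate a fresh object on every call.
    static Integer boxed(int value) {
        return Integer.valueOf(value);
    }

    // The same factory pattern exists for the other wrapper types.
    static Long boxedLong(long value) {
        return Long.valueOf(value);
    }
}
```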
Mark DePristo 80b94a4f9a AdaptiveContexts implement pruning to a given chi2 p value
-- Added Bonferroni-corrected p-value pruning, so you tell it how significant a difference you are willing to collapse in the tree, and it prunes the tree down to this maximum threshold
-- Penalty is now a phred-scaled p-value not the raw chi2 value
-- Split command line arguments in VisualizeContextTree into separate arguments for each type of pruning
2012-08-07 17:22:39 -04:00
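The two quantities mentioned in the commit above can be sketched as follows. These are the textbook definitions of a Bonferroni adjustment and of phred scaling; the class name is hypothetical, and how the GATK tool combines them (for example, what counts as the number of tests) is not stated in the message.

```java
// Hypothetical helper showing the standard formulas only.
class PruningPenaltyDemo {
    // Bonferroni adjustment: multiply the raw p-value by the number of tests, capped at 1.
    static double bonferroni(double pValue, int numTests) {
        return Math.min(1.0, pValue * numTests);
    }

    // Phred scaling: -10 * log10(p), so p = 0.001 maps to a penalty of about 30.
    static double phredScaled(double pValue) {
        return -10.0 * Math.log10(pValue);
    }
}
```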
Mark DePristo 982c735c76 VisualizeAdaptiveTree now considers only leaf nodes when computing max/min penalty 2012-08-07 17:22:39 -04:00
Mark DePristo 2f004665fb Fixing public -> private dep 2012-08-06 11:42:55 -04:00
Mark DePristo 7bf5ca51ee Major bugfix for adaptive contexts
-- Basically I was treating the context history in the wrong direction, effectively predicting the further bases in the context based on the closer one.  Totally backward.  Updated the code to build the tree in the right direction.
-- Added a few more useful outputs for analysis (minPenalty and maxPenalty)
-- Misc. cleanup of the code
-- Overall I'm not 100% certain this is even the right way to think about the problem.  Clearly this is producing a reasonable output but the sum of chi2 values over the entire tree is just enormous.  Perhaps a MCMC convergence / sampling criterion would be a better way to think about this problem?
2012-08-06 11:42:55 -04:00
Mark DePristo b4841548f1 Bug fixes and misc. improvements to running the adaptive context tools
-- Better output file name defaults
-- Fixed nasty bug where I included nonexistent quals in the contexts to process because they showed up in the Cycle covariate
-- Data is processed in qual order now, so it's easier to see progress
-- Logger messages explaining where we are in the process
-- When in UPDATE mode we still write out the information for an equivalent prune by depth for post analysis
2012-08-06 11:42:55 -04:00
Mark DePristo e1bba91836 Ready for full-scale evaluation adaptive BQSR contexts
-- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees
-- Docs, algorithm descriptions, etc so that it makes sense what's going on
-- VisualizeContextTree should really be simplified into a single tool that just visualizes the trees when / if we decide to make adaptive contexts a standard part of BQSR
-- Misc. cleaning, organization of the code (recalibration tests were in private but corresponding actual files were public)
2012-08-03 16:02:53 -04:00
Mark DePristo 0c4e729e13 Working version of adaptive context calculations
-- Uses a chi2 test for independence to determine whether a subcontext is worth representing.  Gives excellent visual results
-- Writes out analysis output file producing excellent results in R
-- Trivial reformatting of MathUtils
2012-07-31 08:11:04 -04:00
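For reference, the chi2 statistic for a test of independence on a contingency table can be computed as in the sketch below. This is the standard Pearson formula; the class name is hypothetical, and what the rows and columns represent in the adaptive context code (for example, errors versus non-errors per subcontext) is an assumption.

```java
// Hypothetical helper: Pearson chi-square statistic for an r x c table of counts.
class ChiSquareDemo {
    // Assumes every row and column has a positive total; otherwise an expected
    // count is zero and the statistic is undefined.
    static double chiSquare(long[][] observed) {
        int rows = observed.length;
        int cols = observed[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rowSum[r] += observed[r][c];
                colSum[c] += observed[r][c];
                total += observed[r][c];
            }
        }
        double statistic = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                // Expected count under independence of rows and columns.
                double expected = rowSum[r] * colSum[c] / total;
                double diff = observed[r][c] - expected;
                statistic += diff * diff / expected;
            }
        }
        return statistic;
    }
}
```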
Mark DePristo 93640b382e Preliminary version of adaptive context covariate algorithm
-- Works according to visual inspection of output tree
2012-07-31 08:11:04 -04:00
Mark DePristo 315d25409f Improvement to RecalDatum and VisualizeContextTree
-- Reorganize functions in RecalDatum so that error rate can be computed independently.  Added unit tests.  Removed equals() method, which is buggy without its associated hashCode() implementation
-- New class RecalDatumTree based on QualIntervals that inherits from RecalDatum but includes the concept of sub data
-- VisualizeContextTree now uses RecalDatumTree and can trivially compute the penalty function for merging nodes, which it displays in the graph
2012-07-31 08:11:04 -04:00
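The equals()/hashCode() concern in the commit above is a general Java contract issue: objects that are equal must report the same hash code, or hash-based collections can fail to find them. A minimal sketch with hypothetical classes (not the real RecalDatum):

```java
import java.util.Objects;

// Hypothetical value classes holding an observation count and an error count.
class EqualsHashCodeDemo {
    // Overrides equals() only: two equal instances usually land in different hash
    // buckets, so HashSet.contains or HashMap.get can miss an equal object.
    static final class EqualsOnly {
        final long observations;
        final long errors;

        EqualsOnly(long observations, long errors) {
            this.observations = observations;
            this.errors = errors;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof EqualsOnly)) return false;
            EqualsOnly that = (EqualsOnly) other;
            return observations == that.observations && errors == that.errors;
        }
    }

    // Overrides both methods consistently, which honors the contract.
    static final class Consistent {
        final long observations;
        final long errors;

        Consistent(long observations, long errors) {
            this.observations = observations;
            this.errors = errors;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Consistent)) return false;
            Consistent that = (Consistent) other;
            return observations == that.observations && errors == that.errors;
        }

        @Override
        public int hashCode() {
            return Objects.hash(observations, errors);
        }
    }
}
```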
Mark DePristo 57b45bfb1e Extensive unit tests, contracts, and documentation for RecalDatum 2012-07-31 08:11:03 -04:00
Mark DePristo e00ed8bc5e Cleanup BQSR classes
-- Moved most of the BQSR classes (which are used throughout the codebase) to utils.recalibration.  It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers.  As code becomes embedded throughout GATK it should be refactored to live in utils
-- Removed unnecessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces.  This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
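The deterministic plugin ordering mentioned in the commit above amounts to sorting discovered classes by a stable key before returning them. A sketch under that assumption follows; the class and method names are hypothetical, and sorting by fully qualified name is a guess at the key rather than what PluginManager necessarily uses.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper, not the GATK PluginManager.
class DeterministicPluginOrderDemo {
    // Classpath scanning can return classes in an arbitrary order. Sorting by the fully
    // qualified name makes the result independent of discovery order.
    static List<Class<?>> sortedByName(Collection<Class<?>> discovered) {
        List<Class<?>> sorted = new ArrayList<>(discovered);
        sorted.sort((a, b) -> a.getName().compareTo(b.getName()));
        return sorted;
    }
}
```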
Eric Banks 8dbc9cb29c Add the ability to emit the original quals in the OQ tag 2012-07-17 15:52:56 -04:00
Eric Banks 305db8c0d1 Total rewrite of the isGATKLite() functionality with help of Khalid/David. PluginManager was not working for us. 2012-07-17 15:11:03 -04:00
Eric Banks 62c5228048 1) Revert previous change - indel recalibration is turned on by default and users of the Lite version will need to turn it off to avoid a User Error. 2) Implemented the engine.isGATKLite() method. 2012-07-17 12:23:40 -04:00
Eric Banks 40618ac471 A bunch of BQSR changes: 1) by default we do not emit indel quals, but they can be turned on with --enable_indel_quals. 2) We check whether or not we are running in Lite mode (not done yet) and if so and the user is trying to recalibrate indels, we throw a User Error (not supported). 3) Like v1 we now allow the user to set the qual value below which we don't recalibrate (this was the remaining source of differences in the v1 vs. v2 plots). 2012-07-17 10:52:43 -04:00
Eric Banks dd571d9aa0 Added a --no_indel_quals argument that when used with -BQSR inhibits the writing of base insertion and base deletion quality tags. 2012-07-04 01:22:20 -04:00
Eric Banks a4670113bd Refactored/renamed the nested integer array; cleaned up code a bit. 2012-07-03 00:12:33 -04:00
Eric Banks cac72bce91 Initial version of int indexed mapping for BQSR. Will be cleaned up in a bit. 2012-07-02 14:33:33 -04:00
Eric Banks 96ea334bf2 Disable caching in BQSR for now since it significantly slows down computation; will look into this in a bit. 2012-06-28 15:27:44 -04:00
Eric Banks 1fafd9f6c8 NestedHashMap-based implementation of BQSRv2 along with a few minor optimizations. Not a huge runtime upgrade over the long bitset approach, but it allows us to implement further optimizations going forward. Integration test change because the original version had a bug in the quantized qual table creation. 2012-06-27 16:55:49 -04:00
Eric Banks 783b7f6899 Misc cleanup 2012-06-15 10:39:19 -04:00
Eric Banks 0c218e4822 Refactoring mostly for readability (and small performance improvement) 2012-06-15 10:36:41 -04:00
Eric Banks 4895fe2289 No more extraneous array creation in BQSR covariate classes; now covariates push their data directly to the ReadCovariates class as it's calculated (no more going through CovariateValues.java) 2012-06-15 02:32:00 -04:00
Eric Banks 5c3c6cbc40 Long -> long conversions in BQSR 2012-06-14 09:07:02 -04:00