Commit Graph

97 Commits (04e3978b04f6ca0d63ee5c16664b4d604c763a52)

Author SHA1 Message Date
Eric Banks ea21dc9cfb I just committed this - why didn't it work before? Trying again... 2013-01-06 12:44:13 -05:00
Eric Banks 52067f0549 Handle merge conflicts 2013-01-06 12:29:12 -05:00
Eric Banks bf25e151ff Handle long->int precision in Bayesian estimate 2013-01-06 12:26:32 -05:00
Mark DePristo 69bf70c42e Cleanup and more unit tests for RecalibrationTables in BQSR
-- Added unit tests for combining RecalibrationTables.  As a side effect now has serious tests for incrementDatumOrPutIfNecessary
-- Removed unnecessary enum.index system from RecalibrationTables.
-- Moved what were really static utility methods out of RecalibrationEngine and into RecalUtils.
2013-01-05 12:50:27 -05:00
Eric Banks bce6fce58d Resolving merge conflicts after Mark's latest push 2013-01-04 14:46:39 -05:00
Eric Banks dd7f5e2be7 Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests. 2013-01-04 14:43:11 -05:00
Tad Jordan fe06912a87 Removed sorting by row from walkers 2013-01-04 11:52:33 -05:00
Mark DePristo 810e2da1d4 Cleanup and unit tests for EventType and ReadRecalibrationInfo in BQSR
-- Added unit tests for EventType and ReadRecalibrationInfo
-- Simplified interface of EventType.  Previously this enum carried an index with it, but this is redundant with the enum.ordinal function.  Now just using that function instead.
2013-01-04 11:39:25 -05:00
Mark DePristo 7df47418d8 BQSR optimization: make RecalibrationTables thread-local, and merge results in onTraversalDone
-- With the newer, faster BQSR, scaling was limited by the NestedIntegerArray.  The solution to this is to make the entire table thread-local, so that each nct thread has its own data and doesn't have any collisions.
-- Removed the previous partial solution of having a thread-local quality score table
-- Added a new argument -lowMemory
2013-01-04 11:39:24 -05:00
Tad Jordan c1ba12d71a Added unit test for outputting sorted GATKReport Tables
- Made few small modifications to code
- Replaced the two arguments in GATKReportTable constructor with an enum used to specify way of sorting the table
2013-01-03 16:53:59 -05:00
Eric Banks c7039a9b71 Pushing in implementation of the Bayesian estimate of Qemp for the BQSR.
This isn't hooked up yet with BQSR; it's just a static method used in my testing walker.  I'll hook this into BQSR after more testing and the addition of unit tests.
Most of the changes in this commit are actually documentation-related.
2013-01-02 15:21:44 -05:00
Eric Banks 75d5b88a3d Enabling the Recal Report unit test (which looks like it was never ever enabled) 2012-12-26 15:35:50 -05:00
Mark DePristo 04cc75aaec Minor cleanup and expansion of the RecalDatum unit tests 2012-12-24 13:35:58 -05:00
Mark DePristo 7bf1f67273 BQSR optimization: read group x quality score calibration table is thread-local
-- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table.  Radically reduces the contention for RecalDatum in this very highly used table
-- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables.  Updated combine in RecalibrationReport to use this table combiner function
-- Made several core functions in RecalDatum into final methods for performance
-- Added RecalibrationTestUtils, a home for recalibration testing utilities
2012-12-24 13:35:58 -05:00
Tad Jordan b491c177ff Added functionality of outputting sorted GATKReport Tables
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
2012-12-20 14:02:21 -05:00
David Roazen 07b369ca7e Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.

-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)

-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)

-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
2012-12-19 10:25:22 -05:00
Mark DePristo b33f804cdc Inline increment function in RecalDatum to avoid minor duplication of work and multiple synchronized method calls 2012-12-17 16:47:27 -05:00
Mark DePristo 1de2f527b9 Optimization of recalibrateRead
-- Refactor calculation so that upfront constant values are pre-computed, and cached, and their values just looked up during application
-- Trivial comment on how we might use BAQ better in BaseRecalibrator
2012-12-17 16:47:27 -05:00
Mark DePristo a481d006f0 Optimizations for applying BQSR table with PrintReads
-- Cleaned up code in updateDataForRead so that constant values where not computed in inner loops
-- BaseRecalibrator doesn't create it's own fasta index reader, it just piggy backs on the GATK one
-- ReadCovariates <init> now uses a thread local cache for it's int[][][] keys member variable.  This stops us from recreating an expensive array over and over.  In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values its skipping because of low quality base clipping.  Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.
2012-12-17 16:47:27 -05:00
Mark DePristo 5ec25797b3 Optimizations for BaseRecalibrator
-- No longer computes at each update the overall read group table.  Now computes this derived table only at the end of the computation, using the ByQual table as input.  Reduces BQSR runtime by 1/3 in my test
2012-12-17 16:47:27 -05:00
Ryan Poplin f0395b457a Adding the work-in-progress, experimental RepeatLengthCovariate to the BQSR so Chris can continue the development. 2012-11-28 13:56:32 -05:00
Eric Banks 72e2d569c5 The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality. 2012-11-20 22:41:57 -05:00
Eric Banks 66cbaaee31 Fixed nasty bug in BQSR csv file creation:
numbers larger than 999 in the Errors column were printed out with commas (which looks like a separate column).

This wasn't caught earlier because there are no integration tests covering the csv.  I'll add one into unstable in a sec.
2012-11-09 08:33:55 -05:00
David Roazen 422e16c62e BaseRecalibration: don't cache instances of ReadCovariates across reads
Caching and reusing ReadCovariates instances across reads sounds good in theory, but:

-it doesn't work unless you zero out the internal arrays before each read
-the internal arrays must be sized proportionally to the maximum POSSIBLE
recalibrated read length (5000!!!), instead of the ACTUAL read lengths

By contrast, creating a new instance per read is basically equivalent to doing an
efficient low-level memset-style clear on a much smaller array (since we use the actual
rather than the maximum read length to create it). So this should be faster than caching
instances and calling clear() but slower than caching instances and not calling clear().

Credit to Ryan to proposing this approach.
2012-10-25 17:02:55 -04:00
Mark DePristo cc8c12b954 Committing a broken version of BaseRecalibration
-- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails.  But that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
2012-10-25 14:46:35 -04:00
David Roazen 32a6d7000a Thread-safe ReadGroupCovariate
The ReadGroupCovariate class was not thread-safe. This led to horrible race conditions
in multithreaded runs of the BQSR where (for example) the same read group could get
inserted into the reverse lookup table twice with different IDs.

Should fix the intermittent crash reported in GSA-492.
2012-10-24 15:22:50 -04:00
David Roazen ac87ed47bb BQSR: allow logging recal table updates to a file
For testing/debugging purposes only
2012-10-01 14:18:34 -04:00
Eric Banks 1316b579f0 Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table. 2012-09-20 14:14:34 -04:00
Mark DePristo 087247f1f0 Allow longs and doubles in recalibration report to allow some backward compatibility 2012-09-19 19:23:44 -04:00
Ryan Poplin 7a7103a757 Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-09-19 10:39:18 -04:00
Ryan Poplin 0ea543e1fd Removing testing scaffolding from delocalized BQSR. The output recal table reports the data as doubles instead of integers. This changes the mapping-based BQSR integration tests. Final intermediate push before delocalized BQSR replaces previous BQSR. 2012-09-19 10:39:06 -04:00
Eric Banks d94d0d15c2 Complete overhaul of previous commits to make it all work with scatter-gather. Now tracks output files correctly and can print to stdout. 2012-09-12 15:15:40 -04:00
Eric Banks 994a4ff387 Track all outputs from BQSR (.table, .csv., and .pdf) as @Output arguments. Updated integration tests because we no longer have command-line options not to generate plots (now just don't provide a pdf) or to keep the intermediate csv (now, just provide a filename on the command-line). This is currently busted because we can't access the original filenames from the Engine's storage/stub system and therefore cannot call out to the Rscript with the executor (which requires filename strings). 2012-09-12 11:24:53 -04:00
David Roazen d2f3d6d22f Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)"
This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.
2012-09-10 15:52:39 -04:00
Menachem Fromer 0b717e2e2e Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples) 2012-09-10 15:32:41 -04:00
Mark DePristo c9ea213c9b Make BaseRecalibration thread-safe
-- In the process uncovered two strange things
    1 -- qualityScoreByFullCovariateKey was created but never used.  Seems like a cache?
    2 -- Discovered nasty bug in BaseRecalibrator: https://jira.broadinstitute.org/browse/GSA-534
2012-08-31 13:42:42 -04:00
Mark DePristo 817ece37a2 General infrastructure for ReadTransformers
-- These are like read filters but can be applied either on input, on output, of handled by the walker
-- Previous example of BAQ now uses the general framework
    -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties!  Yeah!
-- BQSR now uses this framework.  We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR is excepting in parallel, which subsequent commit with fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
2012-08-31 13:42:41 -04:00
Ryan Poplin e12ae65d33 Changing the commenting style in the BQSR 2012-08-29 11:27:45 -04:00
Ryan Poplin 18eca3544e Initial commit of the delocalized BQSR written as a read walker. 2012-08-28 15:24:20 -04:00
Mark DePristo dcc972a557 Usability cleanup for BQSR
-- I'm seeing a lot of people trying to use BinaryTagCovariate in the community.  They really shouldn't do this, so I moved it to private.
-- Throw an exception if its required bintag argument is missing
-- Check explicitly if user is requesting DinucCovariate and tell them that its been retired in favor of ContextCovariate
-- Show the type (Required, Experimental, Standard) of the covariates when running --list
2012-08-25 14:53:00 -04:00
Eric Banks ded0e11b45 Killing off some FindBugs 'Realiability' issues 2012-08-16 14:00:48 -04:00
Eric Banks dac3958461 Killing off some FindBugs 'Usability' issues 2012-08-16 13:32:44 -04:00
Eric Banks f368e568db Implementing support in BaseRecalibrator for SOLiD no call strategies other than throwing an exception. For some reason we never transfered these capabilities into BQSRv2 earlier. 2012-08-15 22:52:56 -04:00
Mark DePristo f032e0aba4 A bit better output for ContextCovariate context size logging 2012-08-12 13:45:52 -04:00
Mark DePristo 243af0adb1 Expanded the BQSR reporting script
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libaries now, include gplots, grid, and gsalib to run correctly
2012-08-12 13:45:14 -04:00
Mark DePristo 458bbdee8f Add useful logger.info telling us the mismatch and indel context sizes 2012-08-12 10:27:05 -04:00
Eric Banks 0a2a646a52 Other random FindBugs fixes 2012-08-08 14:56:27 -04:00
Eric Banks a0196c9f5b Quick pass of FindBugs 'method invokes inefficient Number constructor' fixes. 2012-08-08 14:34:16 -04:00
Mark DePristo 80b94a4f9a AdaptiveContexts implement pruning to a given chi2 p value
-- Added bonferroni corrected p-value pruning, so you tell it how significant of a different you are willing to collapse in the tree, and it prunes the tree down to this maximum threshold
-- Penalty is now a phred-scaled p-value not the raw chi2 value
-- Split command line arguments in VisualizeContextTree into separate arguments for each type of pruning
2012-08-07 17:22:39 -04:00
Mark DePristo 982c735c76 VisualizeAdaptiveTree now considers only leaf nodes when computing max/min penalty 2012-08-07 17:22:39 -04:00