Commit Graph

1222 Commits (f2b7c6f0e19d84ed4b07c79de6327409a13634d5)

Author SHA1 Message Date
Mark DePristo 04cc75aaec Minor cleanup and expansion of the RecalDatum unit tests 2012-12-24 13:35:58 -05:00
Mark DePristo 7bf1f67273 BQSR optimization: read group x quality score calibration table is thread-local
-- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table.  Radically reduces the contention for RecalDatum in this very highly used table
-- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables.  Updated combine in RecalibrationReport to use this table combiner function
-- Made several core functions in RecalDatum into final methods for performance
-- Added RecalibrationTestUtils, a home for recalibration testing utilities
2012-12-24 13:35:58 -05:00
Mark DePristo 295455eee2 NanoScheduler optimizations and simplification
-- The previous model was to enqueue individual map jobs (with a resolution of 1 map job per map call), to track the number of map calls submitted via a counter and a semaphore, and to use this information in each map job and reduce to control the number of map jobs, when reduce was complete, etc.  All hideously complex.
-- This new model is vastly simply.  The reducer basically knows nothing about the control mechanisms in the NanoScheduler.  It just supports multi-threaded reduce.  The NanoScheduler enqueues exactly nThread jobs to be run, which continually loop reading, mapping, and reducing until they run out of material to read, when they shut down.  The master thread of the NS just holds a CountDownLatch, initialized to nThreads, and when each thread exits it reduces the latch by 1.  The master thread gets the final reduce result when its free by the latch reaching 0.  It's all super super simple.
-- Because this model uses vastly fewer synchronization primitives within the NS itself, it's naturally much faster at getting things done, without any of the overhead obvious in profiles of BQSR -nct 2.
2012-12-24 13:35:57 -05:00
Mark DePristo aa3ee29929 Handle case where the ReadGroup is null in GATKSAMRecord 2012-12-24 13:35:57 -05:00
Mark DePristo bf81db40f7 NanoScheduler reducer optimizations
-- reduceAsMuchAsPossible no longer blocks threads via synchronization, but instead uses an explicit lock to manage access.  If the lock is already held (because some thread is doing reduce) then the thread attempting to reduce immediately exits the call and continues doing productive work.  They removes one major source of blocking contention in the NanoScheduler
2012-12-24 13:35:57 -05:00
Mark DePristo 940816f16a GATKSamRecord now checks that the read group is a GATKReadGroupRecord, and if not makes one 2012-12-24 13:35:57 -05:00
Mark DePristo 14944b5d73 Incorporating clover into build.xml
-- See http://gatkforums.broadinstitute.org/discussion/2002/clover-coverage-analysis-with-ant for use docs
-- Fix for artificial reads not having proper read groups, causing NPE in some tests
-- Added clover itself to private/resources
2012-12-24 13:35:57 -05:00
Mark DePristo 7796ba7601 Minor optimizations for NanoScheduler
-- Reducer.maybeReleaseLatch is no longer synchronized
-- NanoScheduler only prints progress every 100 or so map calls
2012-12-24 13:35:56 -05:00
Mark DePristo 0f04485c24 NanoScheduler optimization: don't use a PriorityBlockingQueue for the MapResultsQueue
-- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue.
-- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N.  This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step
-- Had to update ReducerUnitTest because the test itself was brittle, and broken when I changed the underlying code.
-- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc)
-- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs
2012-12-24 13:35:56 -05:00
Mark DePristo 0f0188ddb1 Optimization of BQSR
-- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine.  This object has a restrictive interface to just get information about specific qual and error values at offset and for event type.  This restrict allows us to avoid creating an vector of byte 45 for each read to represent BI and BD values not in the reads.  Shaves 5% of the runtime off the entire code.
-- Cleaned up code and added lots more docs
-- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR.  95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.
2012-12-24 13:35:09 -05:00
Mark DePristo f6d5499582 The GATK engine now ensures that incoming GATKSAMRecords have GATKSAMReadGroupRecord objects in their header
-- Update SAMDataSource so that the merged header contains GATKSAMReadGroupRecord
-- Now getting the NGSPlatform for a GATKSAMRecord is actually efficient, instead of computing the NGS platform over and over from the PL string
-- Updated a few places in the code where the input argument is actually a GATKSAMRecord, not a SAMRecord for type safety
2012-12-24 13:35:09 -05:00
Tad Jordan b491c177ff Added functionality of outputting sorted GATKReport Tables
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
2012-12-20 14:02:21 -05:00
Ryan Poplin 54e5c84018 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-19 11:31:40 -05:00
David Roazen 07b369ca7e Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.

-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)

-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)

-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
2012-12-19 10:25:22 -05:00
Ryan Poplin cda0c48570 auto-merge 2012-12-19 10:12:49 -05:00
Mark DePristo 1ca13f9581 Fundamentally better model for the NanoScheduler
-- Now each map job reads a value, performs map, and does as much reducing as possible.  This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc.  All of this is accomplished using exactly NCT% of the CPU of the machine.
-- Has the additional value of actually simplifying the code
-- Resolves a long-standing annoyance with the nano scheduler.
2012-12-19 09:31:31 -05:00
Ryan Poplin 902ca7ea70 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-18 15:45:33 -05:00
Ryan Poplin 3950f7b3e3 Increasing the INFORMATIVE_LIKELIHOOD_THRESHOLD value to 0.2 2012-12-18 15:45:12 -05:00
Ryan Poplin b5d590ba92 Based on NA12878 knowledge base experiments updating HC to allow for a much smaller minimum kmer length in the assembly graph. 2012-12-18 15:43:56 -05:00
Mark DePristo b33f804cdc Inline increment function in RecalDatum to avoid minor duplication of work and multiple synchronized method calls 2012-12-17 16:47:27 -05:00
Mark DePristo 66d32f646b Minor cleanup of BAQ calculation (final variables, etc) 2012-12-17 16:47:27 -05:00
Mark DePristo 67fe81391c ProgressMeter optimization: don't do genome loc formatting, but instead create an object that only formats when printing is actually needed 2012-12-17 16:47:27 -05:00
Mark DePristo 1de2f527b9 Optimization of recalibrateRead
-- Refactor calculation so that upfront constant values are pre-computed, and cached, and their values just looked up during application
-- Trivial comment on how we might use BAQ better in BaseRecalibrator
2012-12-17 16:47:27 -05:00
Mark DePristo a481d006f0 Optimizations for applying BQSR table with PrintReads
-- Cleaned up code in updateDataForRead so that constant values where not computed in inner loops
-- BaseRecalibrator doesn't create it's own fasta index reader, it just piggy backs on the GATK one
-- ReadCovariates <init> now uses a thread local cache for it's int[][][] keys member variable.  This stops us from recreating an expensive array over and over.  In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values its skipping because of low quality base clipping.  Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.
2012-12-17 16:47:27 -05:00
Mark DePristo 5ec25797b3 Optimizations for BaseRecalibrator
-- No longer computes at each update the overall read group table.  Now computes this derived table only at the end of the computation, using the ByQual table as input.  Reduces BQSR runtime by 1/3 in my test
2012-12-17 16:47:27 -05:00
Eric Banks e6f468b647 Refactored the quasi-useful IndelType annotation into the more useful VariantType.
The indels are still annotated as before, but now all other variant types are annotated too.
I'm doing this because of requests on the forum but am not making it standard.  If we find it to be useful we can turn it on by default later.
2012-12-17 11:54:47 -05:00
Mauricio Carneiro 74344a3871 Bringing in the changes from the CMI repo 2012-12-13 21:59:37 -05:00
Mark DePristo aeab932c63 Actual working version of unflushing VCFWriter
-- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
2012-12-13 16:15:08 -05:00
Mauricio Carneiro a52e3c7e15 Revert "Bug fix for RR: don't let the softclip start position be less than 1"
this introduced a bug in reduce reads by de-activating it's hard clipping of the out of bounds soft-clips (specially in the MT).
DEV-322 #resolve #time 4m

This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.
2012-12-12 13:09:39 -05:00
Mark DePristo 5632c13bf2 Resolves GSA-681 / Compressed VCF.gz output is too big because of unnecessary call to flush().
-- Now compressed output VCFs are properly blocked compressed (i.e., they are actually smaller than the uncompressed VCF)
2012-12-12 10:27:07 -05:00
Ami Levy-Moonshine 2e3284f306 Continue to fix the case where PRIORITIZE is used but no priority list is given. While fixing that case I also removed unnecessary sorting, when the prioeity list is not provied. When the priority list is not provided, it will continue to be null. Thus, the number of original Variant Contexts should be given as a new parameter to simpleMerge (since priority might be null). This new parameter is used for checking if there are filtered VC, when annotationOrigin is true. 2012-12-10 22:23:58 -05:00
Ami Levy-Moonshine 573ace4403 restore the right version of VariantContextUtils.java in my unstable dir 2012-12-10 10:28:56 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Ami Levy-Moonshine 3a420d163e (1) changes in catVariants (work still under development) (2) changes to CV to throw an error when GenotypeMergeType is PRIORITIZE but no priority (rod_priority_list) is not given. Reported by TechnicalVault on the forum on Nov 14 2012 2012-12-09 23:40:03 -05:00
Ami Levy-Moonshine 5d78a61f7a Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-05 15:07:12 -05:00
Mark DePristo 465694078e Major performance improvement to the GATK engine
-- The NanoSchedule timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers.  Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests.
-- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x.  For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement).
-- Took this opportunity to improve the GATK ProgressMeter.  NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress.  This removes all inner loop calls to the GATK timers.
-- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402
2012-12-05 14:49:22 -05:00
Mark DePristo 2b601571e7 Better error handling in NanoScheduler
-- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown.  Errors, like out of memory, would cause the whole system to die.  This bugfix resolves that issue
2012-12-05 14:49:22 -05:00
Mauricio Carneiro 30f013aeb0 Added a copy() method for ReadBackedPileups
necessary to create new alignment contexts with hard-copies of the pileup.
2012-12-05 01:32:18 -05:00
Randal Moore 8d2d0253a2 introduce a level of indirection for the forum URLs - this new function will allow me a place to morph the URL into something that is supported by Confluence
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
2012-12-03 22:33:02 -05:00
Eric Banks 67932b357d Bug fix for RR: don't let the softclip start position be less than 1 2012-12-03 15:59:14 -05:00
Ryan Poplin a47da9bb2f Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 14:30:14 -05:00
Eric Banks 5fed9df295 Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh). 2012-12-03 12:18:20 -05:00
Eric Banks b6839b3049 Added checking in the GATK for mis-encoded quality scores.
The check is performed by a Read Transformer that samples (currently set to once
every 1000 reads so that we don't hurt overall GATK performance) from the input
reads and checks to make sure that none of the base quals is too high (> Q60). If
we encounter such a base then we fail with a User Error.

* Can be over-ridden with --allow_potentially_misencoded_quality_scores.
* Also, the user can choose to fix his quals on the fly (presumably using PrintReads
  to write out a fixed bam) with the --fix_misencoded_quality_scores argument.

Added unit tests.
2012-12-03 11:18:41 -05:00
Ryan Poplin 18b002c99c Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-03 10:08:56 -05:00
Ryan Poplin 1bdf17ef53 Reworking of how the likelihood calculation is organized in the HaplotypeCaller to facilitate the inclusion of per allele downsampling. We now use the downsampling for both the GL calculations and the annotation calculations. 2012-12-02 11:58:32 -05:00
Ami Levy-Moonshine d0b8cc7773 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2012-12-01 00:08:25 -05:00
Ami Levy-Moonshine 969c995298 work under development - catVariants. Changes to AssessRRQuals based on Eric todo comments. bug fix in CombineVariants 2012-12-01 00:08:19 -05:00
Mauricio Carneiro fc7fab5f3b Fixed ReadBackedPileup downsampling
Downsampling in the PerSampleReadBackedPileup was broken, it didn't downsample anything, always returning a copy the original pileup.
2012-11-30 00:42:05 -05:00
Joel Thibault 198923b597 Add ActiveRegionReadState handling 2012-11-28 13:59:57 -05:00
Ryan Poplin f0395b457a Adding the work-in-progress, experimental RepeatLengthCovariate to the BQSR so Chris can continue the development. 2012-11-28 13:56:32 -05:00