-- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue.
-- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N. This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step
-- Had to update ReducerUnitTest because the test itself was brittle, and broken when I changed the underlying code.
-- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc)
-- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs
-- Pre-read MapData into a list, which is actually faster than dealing with future lock contention issues with lots of map threads
-- Increase the ReadShard default size to 100K reads by default
-- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine. This object has a restrictive interface to just get information about specific qual and error values at offset and for event type. This restrict allows us to avoid creating an vector of byte 45 for each read to represent BI and BD values not in the reads. Shaves 5% of the runtime off the entire code.
-- Cleaned up code and added lots more docs
-- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR. 95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.
-- Update SAMDataSource so that the merged header contains GATKSAMReadGroupRecord
-- Now getting the NGSPlatform for a GATKSAMRecord is actually efficient, instead of computing the NGS platform over and over from the PL string
-- Updated a few places in the code where the input argument is actually a GATKSAMRecord, not a SAMRecord for type safety
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
the function ls_getLicenseUsage() is not supported by LSF v8.x, comment the line:
public static native lsfLicUsage.ByReference ls_getLicenseUsage()
Signed-off-by: Eric Banks <ebanks@broadinstitute.org>
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.
-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)
-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)
-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
-- Now each map job reads a value, performs map, and does as much reducing as possible. This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc. All of this is accomplished using exactly NCT% of the CPU of the machine.
-- Has the additional value of actually simplifying the code
-- Resolves a long-standing annoyance with the nano scheduler.
-- Don't just read all inputs into a list, and then provide an iterator to that list, actually make a real iterator so NanoScheduler input thread can contribute meaningfully to the work load
-- Use NanoScheduler progress function, instead of home-grown updater
-- Refactor calculation so that upfront constant values are pre-computed, and cached, and their values just looked up during application
-- Trivial comment on how we might use BAQ better in BaseRecalibrator
-- Cleaned up code in updateDataForRead so that constant values where not computed in inner loops
-- BaseRecalibrator doesn't create it's own fasta index reader, it just piggy backs on the GATK one
-- ReadCovariates <init> now uses a thread local cache for it's int[][][] keys member variable. This stops us from recreating an expensive array over and over. In order to make this really work had to update recordValues in ContextCovariate so it writes 0s over base values its skipping because of low quality base clipping. Previously the values in the ReadCovariates keys were 0 because they were never modified by ContextCovariates. Now these values are actually zero'd out explicitly by the covariates.
-- No longer computes at each update the overall read group table. Now computes this derived table only at the end of the computation, using the ByQual table as input. Reduces BQSR runtime by 1/3 in my test
The indels are still annotated as before, but now all other variant types are annotated too.
I'm doing this because of requests on the forum but am not making it standard. If we find it to be useful we can turn it on by default later.
-- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
this introduced a bug in reduce reads by de-activating it's hard clipping of the out of bounds soft-clips (specially in the MT).
DEV-322 #resolve #time 4m
This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.