Reverting back to the original implementation, but now including write N's and write Q0's due to walkers that look at the same read multiple times in different reference windows
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration. It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers. As code becomes embedded throughout GATK its should be refactored to live in utils
-- Removed unncessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces. This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
When hard-clipping predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.
* Sites with more soft clipped bases than regular will force-trigger a variant region
* No more unclipping/reclipping, RR machinery now handles soft clips natively.
* implemented support for base insertion and base deletion quality scores in synthetic and regular reads.
* GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present.
note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!
* Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it.
* Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute
* Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads
* Updated all integration tests
Infrastructure:
* Added static interface to all different clipping algorithms of low quality tail clipping
* Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
* Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
* EventType is now an independent enum with added capabilities. All functionality is now centralized.
BQSR and RecalibrateBases:
* On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
* Refactored the object creation to take advantage of the compact key structure
* Replaced nested hash maps with single hash maps indexed by bitsets
* Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
* Excluded contexts with N's from the output file.
* Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
* Redfined error for indels to look at the previous base in negative strand reads (using new PE functionality)
* Added the covariate ID (for optional covariates) to the output for disambiguation purposes
* Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
* Reduced memory usage of the BQSR script to 4
Tests:
* Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
* Added tests for keys without optional covariates
* Added tests for on-the-fly recalibration (but more tests are necessary)
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
* Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
* Updated all ReadClipper unit tests
* ReduceReads does not hard clip leading insertions by default anymore
* SlidingWindow adjusts start location if read starts with insertion
* SlidingWindow creates an empty element with insertions to the right
* Fixed all potential divide by zero with totalCount() (from BaseCounts)
* Updated all Integration tests
* Added new integration test for multiple interval reducing
* Creates an empty GATKSAMRecord with empty (not null) Cigar, bases and quals. Allows empty reads to be probed without breaking.
* All ReadClipper utilities now emit empty reads for fully clipped reads
* Described the ReadClipper contract in the top of the class
* Added contracts where applicable
* Added descriptive information to all tools in the read clipper
* Organized public members and static methods together with the same javadoc
For simplified access to the hard clipping utilities. No need to create a ReadClipper object if you are not doing multiple complicated clipping operations, just use the static methods.
examples:
ReadClipper.hardClipLowQualEnds(2);
ReadClipper.hardClipAdaptorSequence();
* New ClippingOp REVERT_SOFTCLIPPED_BASES turns soft clipped bases into matches.
* Added functionality to clipping op to revert all soft clip bases in a read into matches
* Added revertSoftClipBases function to the ReadClipper for public use
* Wrote systematic unit tests