With LegacyLocusIteratorByState deleted, the legacy downsampling implementation
was already non-functional. This commit removes all remaining code in the
engine belonging to the legacy implementation.
-- New algorithm will only try to create an active region if there's at least maxREgionSize + propagation distance states in the list. When that's true, we are guaranteed to actually find a region. So this algorithm is not only truly correct but as super fast, as we only ever do the search for the end of the region when we will certainly find one, and actually generate a region.
-- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions. This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined. This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile
-- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with
-- Now records the position of the current locus, as well as that of the last read. Necessary when passing through regions with no reads. The previous version would keep accumulating empty active regions, and never discharge them until end of traversal (if there was no reads in the future) or until a read was finally found
-- Protected a call to logger.debug with if ( logger.isDebugEnabled()) to avoid a lot of overhead in writing unseen debugger logging information
-- Allows us to avoid doing a lot of misc. work to set up the genotype when we don't have any data to genotype. Valuable in the case where we are passing through large regions without any data
- directionality only influences 'chop' operation (since split will maintain all bases of the original read)
- added directional unit test
GSATDG-25 #resolved
- Fixed shell glob limitation that was failing license updates on big commits
- Hook will now force user to re-commit after updating the licenses (pre-commit hook can't update and commit in the same process)
- Moved all scripts to bash/zsh
- Separated the license utilities in a separate python module to avoid copying code
GSATDG-28 #resolve
-- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS
-- Added unittests to cover adapter clipping
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
-- Previously we allowed band pass filter size to be specified along with the sigma. But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel. This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
- walker simulates sequencing with different lengths to evaluate mapping/alignment biases relative to read length
- split : splits reads n-ways generating 2^n reads for each read of the same length.
- chop : chops the right end tail of the read creating 1 smaller read as if the sequencer stopped short.
- mate information is preserved for chopped reads, and re-indexed for split reads so that each split still points at the corresponding split on the mate.
- added systematic unit tests
GSATDG-23