With LegacyLocusIteratorByState deleted, the legacy downsampling implementation
was already non-functional. This commit removes all remaining code in the
engine belonging to the legacy implementation.
-- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions. This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined. This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile
-- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with
-- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS
-- Added unittests to cover adapter clipping
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
-- Previously we allowed band pass filter size to be specified along with the sigma. But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel. This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
-- Based on the new incremental activity profile
-- Unit Tested! Fixed a few bugs with the old band pass filter
-- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties
-- Add new UnitTest for BandPassIncrementalActivityProfile
-- Added normalizeFromRealSpace to MathUtils
-- Cleanup unused code in new activity profiles
-- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version. This means that ART walkers will now take much less memory than previously. On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory. Over the whole chr20 only 5K reads were ever held in ART at one time.
-- Fixed bug in the way active regions worked with shard boundaries. The new implementation no longer see shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries. This behavior was actually encoded in the unit tests, so those needed to be updated as well.
-- Changed the way that preset regions work in ART. The new contract ensures that you get exactly the regions you requested. the isActive function is still called, but its result has no impact on the regions. With this functionality is should be possible to use the HC as a generic assembly by forcing it to operate over very large regions
-- Added a few misc. useful functions to IncrementalActivityProfile
-- Required before I jump in an redo the entire activity profile so it's can be run imcrementally
-- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class.
-- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality. Almost all of the misc. walker changes are due to this name update
-- Code cleanup and docs for TraverseActiveRegions
-- Expanded unit tests for ActivityProfile and ActivityProfileState
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
-- Allows us to make a stream of reads or an index BAM file with read having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start)
-- This stream can be handed back to the caller immediately, or written to an indexed BAM file
-- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0? When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms.
Updated SelectHeaders to print out full interval arguments.
Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).
As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases. This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests. Fortunately that walker is being deprecated soon...
-- Run an iterator with 100Ks of reads, each carrying MBs of byte[] data, through LIBS, all starting at the same position. Will crash with an out-of-memory error if we're holding reads anywhere in the system.
-- Is there a better way to test this behavior?
-- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager
-- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager
-- Make LocusIteratorByState final
-- Made LIBSPerformance a full featured CommandLineProgram, and it can be used to assess the LIBS performance by reading a provided BAM
-- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls
-- Moved updateReadStates to ReadStateManager
-- Removed the unnecessary wrapping of an iterator in ReadStateManager
-- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get iterating over samples. Now those are just map native iterations
-- Restructured collectPendingReads for simplicity, removing redundant and consolidating common range checks. The new piece is code is much clearer and avoids several unnecessary function calls
-- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl
-- ReadBackedPileupImpl was literally a shell class after we removed extended events. A few bits of code cleanup and we reduced a bunch of class complexity in the gatk
-- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed
-- Created optimized calculation routines to iterator over all of the reads in the pileup in whatever order is most efficient as well.
-- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups.
-- Added commons-collections for IteratorChain
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
-- Optimizations to AlignmentStateMachine
-- Properly count deletions. Added unit test for counting routines
-- AlignmentStateMachine.java is no longer recursive
-- Traversals now use new LIBS, not the old one
-- AlignmentStateMachine does what SAMRecordAlignmentState should really do. It's correct in that it's more accurate than the LIB_position tests themselves. This is a non-broken, correct implementation. Needs cleanup, contracts, etc.
-- This version is like 6x slower than the original implementation (according to the google caliper benchmark here). Obvious optimizations for future commit