This class, being unused, was no longer getting packaged into the
GATK release jar by bcel, and so attempting to run its unit test
on the release jar was producing an error.
-Only used when experimental downsampling is enabled
-Persists read iterators across shards, creating a new set only when we've exhausted
the current BAM file region(s). This prevents the engine from revisiting regions discarded
by the downsamplers / filters, as could happen in the old implementation.
-SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip
out all related code when the engine fork is collapsed.
-Defensive implementation that assumes BAM file regions coming out of the BAM Schedule
can overlap; should be able to improve performance if we can prove they cannot possibly
overlap.
-Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back
once confidence in the code is gained
When aggregating raw BAM file spans into shards, the IntervalSharder tries to combine
file spans when it can. Unfortunately, the method that combines two BAM file
spans was seriously flawed, and would produce a truncated union if the file spans
overlapped in certain ways. This could cause entire regions of the BAM file containing
reads within the requested intervals to be dropped.
Modified GATKBAMFileSpan.union() to correct this problem, and added unit tests
to verify that the correct union is produced regardless of how the file spans
happen to overlap.
Thanks to Khalid, who did at least as much work on this bug as I did.
This change and my previous change have dropped runtime when dynamically merging 2k BAM files from 72.6min/1M reads to 46.8sec/1M reads.
Note that many of these changes are stopgaps -- the real problem is the way ReadWalkers interface with Picard, and I'll have to work with
Tim&Co to produce a more maintainable patch.