Merge all FilePointers for each contig into a single, merged, optimized FilePointer
representing all regions to visit in all BAM files for a given contig.
This helps us in several ways:
-It allows us to create a single, persistent set of iterators for each contig,
finally and definitively eliminating all Shard/FilePointer boundary issues for
the new experimental ReadWalker downsampling
-We no longer need to track low-level file positions in the sharding system (which
was no longer possible anyway given the new experimental downsampling system)
-We no longer revisit BAM file chunks that we've visited in the past -- all BAM
file access is purely sequential
-We no longer need to constantly recreate our full chain of read iterators
There are also potential dangers:
-We hold more BAM index data in memory at once. Given that we merge and optimize
the index data during the merge, and only hold one contig's worth of data at a
time, this does not appear to be a major issue. TODO: confirm this!
-With a huge number of samples and intervals, the FilePointer merge operation
might become expensive. With the latest implementation, this does not
appear to be an issue even with a huge number of intervals (for one sample, at least),
but if it turns out to be a problem for > 1 sample there are things we can do.
Still TODO: unit tests for the new FilePointer.union() method
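A minimal sketch of the kind of merge described above, assuming each BAM file region can be modelled as a half-open [start, end) pair of file offsets. The class FileSpanUnionSketch and its union() method are hypothetical names for illustration only; this is not the actual FilePointer.union() implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only: not the real FilePointer code.
final class FileSpanUnionSketch {
    /**
     * Merges half-open [start, end) spans, coalescing any that overlap or touch,
     * so the result can be read in one strictly sequential pass.
     */
    static List<long[]> union(List<long[]> spans) {
        List<long[]> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparingLong((long[] s) -> s[0]));

        List<long[]> merged = new ArrayList<>();
        for (long[] span : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && span[0] <= last[1]) {
                // Overlapping or adjacent: extend the previous span.
                last[1] = Math.max(last[1], span[1]);
            } else {
                // Copy so the caller's arrays are never mutated.
                merged.add(new long[] { span[0], span[1] });
            }
        }
        return merged;
    }
}
```

Sorting first keeps the merge at O(n log n) in the number of spans, which is one reason a merge like this need not become expensive for a large interval list.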
-- Calls NA12878 with and without the experimental downsampler on chr1
-- Creates combined vcf, annotating sites as overlapping omni SNPs and Mills indels
-- Creates simple combined.table that has chr, pos, set, and type to easily ID missed good sites with the new downsampler
-- Fixes monster bug in the way that traversal engines interacted with the NanoScheduler via the output tracker.
-- ThreadLocalOutputTracker is now a ThreadBasedOutputTracker that maps each master thread to its storage map. Lookups walk through the threads in the same thread group rather than checking only the current thread (TBD: should key the map by ThreadGroup instead)
-- Removed unnecessary debug statement in GenomeLocParser
-- nt and nct officially work together now
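A rough sketch of the thread-group lookup idea from the ThreadBasedOutputTracker item above, assuming storage is registered against a master thread and worker threads live in the same ThreadGroup. ThreadGroupLookupSketch and its methods are hypothetical; the real tracker may differ in every detail.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration only: not the real ThreadBasedOutputTracker.
final class ThreadGroupLookupSketch<T> {
    private final Map<Thread, T> storageByMaster = new ConcurrentHashMap<>();

    void register(Thread master, T storage) {
        storageByMaster.put(master, storage);
    }

    /** Looks up the caller directly, then falls back to peers in its thread group. */
    T lookup(Thread caller) {
        T direct = storageByMaster.get(caller);
        if (direct != null) return direct;

        ThreadGroup group = caller.getThreadGroup();
        if (group == null) return null; // the thread has already terminated

        // activeCount() is only an estimate, so leave some headroom.
        Thread[] peers = new Thread[group.activeCount() + 8];
        int count = group.enumerate(peers, false);
        for (int i = 0; i < count; i++) {
            T found = storageByMaster.get(peers[i]);
            if (found != null) return found;
        }
        return null;
    }

    /** Demo helper: performs the lookup from a freshly started child thread. */
    T lookupFromNewChild() {
        AtomicReference<T> seen = new AtomicReference<>();
        Thread child = new Thread(() -> seen.set(lookup(Thread.currentThread())));
        child.start();
        try {
            child.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }
}
```

Walking the group is linear in the number of live threads on every miss, which is presumably why keying the map by ThreadGroup is noted as the better design.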
TestNG skips tests when an exception occurs in a data provider,
which is what was happening here.
This was due to an AWFUL AWFUL use of a non-final static for
ReadShard.MAX_READS. This is fine if you assume only one instance
of SAMDataSource, but with multiple tests creating multiple SAMDataSources,
and each one overwriting ReadShard.MAX_READS, you have a recipe for
problems. As a result, the test ran fine individually, but not as
part of the unit test suite.
Quick fix for now to get the tests running -- this "mutable static"
interface should really be refactored away though, when I have time.
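A small sketch of the hazard, with hypothetical stand-in classes (Shard, DataSource) rather than the real ReadShard and SAMDataSource. It assumes only what is described above: each data source overwrites a shared non-final static when it is constructed.

```java
// Hypothetical stand-ins: these are not the real ReadShard / SAMDataSource classes.
final class MutableStaticSketch {
    static final class Shard {
        static int MAX_READS = 10000; // non-final static shared by every instance
    }

    static final class DataSource {
        DataSource(int maxReads) {
            Shard.MAX_READS = maxReads; // silently clobbers any earlier setting
        }

        int shardCapacity() {
            return Shard.MAX_READS;
        }
    }
}
```

If one test builds a DataSource with a capacity of 100 and a later test builds one with 5, the first instance now reports 5 as well, which is the kind of cross-test interference that only shows up when the whole suite runs in one JVM.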
It's now possible to run with experimental downsampling enabled
using the --enable_experimental_downsampling engine argument.
This is scheduled to become the GATK-wide default next week after
diff engine output for failing tests has been examined.
Notify all downsamplers in our pool of the current global genomic position once every
DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL position changes, rather than on every
positional change once that threshold has first been reached.
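A minimal sketch of the intended behaviour, assuming the fix amounts to resetting a change counter after each notification. Only the constant name comes from the text above; the class, the fields, and the interval value of 3 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not the real downsampling iterator code.
final class PositionalUpdateSketch {
    static final int DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL = 3; // illustrative value

    private int changesSinceLastUpdate = 0;
    private int lastPosition = -1;

    /** Positions at which the (imaginary) downsampler pool would have been notified. */
    final List<Integer> notifiedPositions = new ArrayList<>();

    void observe(int position) {
        if (position == lastPosition) return; // same position: not a change
        lastPosition = position;

        if (++changesSinceLastUpdate >= DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL) {
            notifiedPositions.add(position);
            // Without this reset, every later change would also trigger a notification.
            changesSinceLastUpdate = 0;
        }
    }
}
```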
-Only used when experimental downsampling is enabled
-Persists read iterators across shards, creating a new set only when we've exhausted
the current BAM file region(s). This prevents the engine from revisiting regions discarded
by the downsamplers / filters, as could happen in the old implementation.
-SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip
out all related code when the engine fork is collapsed.
-Defensive implementation that assumes BAM file regions coming out of the BAM Schedule
can overlap; should be able to improve performance if we can prove they cannot possibly
overlap.
-Tests are a bit on the extreme side (~8 minute runtime) for now; will scale these back
once confidence in the code is gained
-- See https://jira.broadinstitute.org/browse/GSA-573
-- Uses InheritableThreadLocal storage so that child threads created by the NanoScheduler see the parent stubs in the main thread.
-- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.
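A short sketch of how inherited thread-local storage behaves, using the JDK class java.lang.InheritableThreadLocal. The class InheritedStubSketch, the STUBS field, and the string value are hypothetical; the real stubs are engine objects, not strings.

```java
// Hypothetical illustration only: the real output stubs are not plain strings.
final class InheritedStubSketch {
    static final InheritableThreadLocal<String> STUBS = new InheritableThreadLocal<>();

    /** Starts a child thread and returns the value it sees in STUBS. */
    static String readFromNewChild() {
        final String[] seen = new String[1];
        // The child copies the parent's value when the Thread object is constructed.
        Thread child = new Thread(() -> seen[0] = STUBS.get());
        child.start();
        try {
            child.join(); // join() makes the child's write visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Because the value is copied at thread construction, a worker thread created before the parent sets its value will not see it, which matters for pooled threads.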