Problem:
-Downsamplers were treating reduced reads the same as normal reads,
with occasionally catastrophic results on variant calling when an
entire reduced read happened to get eliminated.
Solution:
-Since reduced reads lack the information we need to do position-based
downsampling on them, the best available option for now is to simply
exempt all reduced reads from elimination during downsampling.
Details:
-Add generic capability of exempting items from elimination to
the Downsampler interface via new doNotDiscardItem() method.
The default inherited version of this method exempts all reduced reads
(or objects encapsulating reduced reads) from elimination.
-Switch from interfaces to abstract classes to facilitate this change,
and do some minor refactoring of the Downsampler interface (push
implementation of some methods into the abstract classes, improve
names of the confusing clear() and reset() methods).
-Rewrite TAROrderedReadCache. This class was incorrectly relying
on the ReservoirDownsampler to preserve the relative ordering of
items in some circumstances, which was behavior not guaranteed by
the API and only happened to work due to implementation details
which no longer apply. Restructured this class around the assumption
that the ReservoirDownsampler will not preserve relative ordering
at all.
-Add disclaimer to description of -dcov argument explaining that
coverage targets are approximate goals that will not always be
precisely met.
-Add unit tests for all individual downsamplers to verify that reduced
reads are exempted from elimination.
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
since low dcov values can result in problematic downsampling artifacts for locus-based
traversals.
-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
controls the number of reads per alignment start position, and even a dcov value of 1 might
be safe/desirable in some circumstances.
-Also reorganize the global downsampling defaults so that they are specified as annotations
to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
DownsamplingMethod class.
-The default downsampling settings have not been changed: they are still -dcov 1000
for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
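The exemption hook described under Details above (doNotDiscardItem() on an abstract
Downsampler base class) can be sketched roughly as follows. This is a minimal
illustration under stated assumptions, not the actual GATK code: SketchRead,
SketchDownsampler, SketchFractionalDownsampler, submitNonExempt() and keptItems()
are hypothetical names invented for the example; only doNotDiscardItem() is taken
from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for a read; the real code operates on SAM records.
final class SketchRead {
    final String name;
    final boolean reduced;

    SketchRead(String name, boolean reduced) {
        this.name = name;
        this.reduced = reduced;
    }
}

// Abstract base class (rather than an interface) so the default exemption
// rule can be inherited by every concrete downsampler.
abstract class SketchDownsampler {
    protected final List<SketchRead> kept = new ArrayList<>();

    // Default rule: reduced reads are never eliminated.
    protected boolean doNotDiscardItem(SketchRead read) {
        return read.reduced;
    }

    // Exempt items bypass the subclass's elimination logic entirely.
    public final void submit(SketchRead read) {
        if (doNotDiscardItem(read)) {
            kept.add(read);
        } else {
            submitNonExempt(read);
        }
    }

    // Subclasses decide the fate of non-exempt items only.
    protected abstract void submitNonExempt(SketchRead read);

    public List<SketchRead> keptItems() {
        return new ArrayList<>(kept);
    }
}

// Toy downsampler: keeps each non-exempt read with the given probability.
final class SketchFractionalDownsampler extends SketchDownsampler {
    private final double fraction;
    private final Random rng;

    SketchFractionalDownsampler(double fraction, long seed) {
        this.fraction = fraction;
        this.rng = new Random(seed);
    }

    @Override
    protected void submitNonExempt(SketchRead read) {
        if (rng.nextDouble() < fraction) {
            kept.add(read);
        }
    }

    public static void main(String[] args) {
        SketchDownsampler d = new SketchFractionalDownsampler(0.0, 42L);
        d.submit(new SketchRead("normal1", false));
        d.submit(new SketchRead("reduced1", true));
        for (SketchRead r : d.keptItems()) {
            System.out.println(r.name);
        }
    }
}
```

With the fraction set to 0.0 every normal read is dropped, yet the reduced read still
survives, which is the behavior the new unit tests are described as verifying.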
Reduced read (RR) counts are represented as offsets from the first count, but that wasn't
being done correctly when counts were adjusted on the fly. Also, we were triggering the
expensive conversion and writing to binary tags even when we weren't going to write the read
to disk.
The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly. Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).
Also:
1. counts are now maintained as ints whenever possible. Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
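To illustrate the offset representation mentioned above, here is a minimal sketch of one
way to store counts as byte offsets from the first count with a 0..255 range. The exact
byte layout used by GATKSAMRecord is not given in this message, so the scheme below
(capping at 255 and letting offsets wrap modulo 256) is an assumption, and
ReducedCountCodecSketch, encode() and decode() are hypothetical names.

```java
import java.util.Arrays;

// Hypothetical codec: NOT the actual GATKSAMRecord encoding, only a sketch.
final class ReducedCountCodecSketch {
    static final int MAX_COUNT = 255; // assumed upper bound of the new range

    // Encodes non-negative int counts: byte 0 holds the (capped) first count,
    // every later byte holds (count - first), wrapped modulo 256.
    static byte[] encode(int[] counts) {
        if (counts.length == 0) {
            throw new IllegalArgumentException("counts must not be empty");
        }
        byte[] encoded = new byte[counts.length];
        int first = Math.min(counts[0], MAX_COUNT);
        encoded[0] = (byte) first;
        for (int i = 1; i < counts.length; i++) {
            int capped = Math.min(counts[i], MAX_COUNT);
            encoded[i] = (byte) (capped - first);
        }
        return encoded;
    }

    // Reverses encode(): reads byte 0 as unsigned, then undoes each offset
    // modulo 256 so the result always lands back in 0..255.
    static int[] decode(byte[] encoded) {
        int[] counts = new int[encoded.length];
        int first = encoded[0] & 0xFF;
        counts[0] = first;
        for (int i = 1; i < encoded.length; i++) {
            counts[i] = (first + encoded[i]) & 0xFF;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = {200, 255, 0, 130};
        System.out.println(Arrays.toString(decode(encode(counts))));
    }
}
```

Under this assumed scheme, an array whose counts all lie in 0..127 produces the same bytes
as a plain signed-byte offset encoding would, which is one way the backwards compatibility
mentioned in item 2 could hold. Callers work with ints throughout and only the codec
touches bytes, mirroring item 1.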
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased downsampling for RR just cannot work for haplotypes.
Added lots of unit tests for new functionality.
-- This method provides clients with the current number of elements, without having to retrieve the underlying list<T>. Added unit tests for LevelingDownsampler and ReservoirDownsampler, as these are the only two complex ones. All of the others are trivially correct.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly).
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match the location of the class it is testing (due to the way Bamboo operates).
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet
-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.
-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
This is in response to a request from Mauricio to make it easier
to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords)
without having to do any cumbersome typecasting. Sadly, Java
language limitations make this sort of solution the best choice.
Thanks to Khalid for his feedback on this issue.
Also:
-added a unit test to verify GATKSAMRecord support with no typecasting required
-added some unit tests for the FractionalDownsampler that Mauricio may be using
-moved classes from private to public to better sync up with my local development
branch for engine integration
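The message above does not spell out the mechanism used to avoid typecasting. One common
Java approach is a bounded type parameter on the downsampler, sketched below as an
assumption. BaseRecordSketch and ExtendedRecordSketch are hypothetical stand-ins for
SAMRecord and GATKSAMRecord, and TypedFractionalDownsampler is an invented name, not the
real FractionalDownsampler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the base record type.
class BaseRecordSketch {
    final String name;

    BaseRecordSketch(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for the richer subclass that clients actually use.
class ExtendedRecordSketch extends BaseRecordSketch {
    final int extraField;

    ExtendedRecordSketch(String name, int extraField) {
        super(name);
        this.extraField = extraField;
    }
}

// The type parameter lets callers get back exactly the type they submitted.
class TypedFractionalDownsampler<T extends BaseRecordSketch> {
    private final double fraction;
    private final Random rng;
    private final List<T> kept = new ArrayList<>();

    TypedFractionalDownsampler(double fraction, long seed) {
        this.fraction = fraction;
        this.rng = new Random(seed);
    }

    // Keeps each submitted item with probability `fraction`.
    void submit(T item) {
        if (rng.nextDouble() < fraction) {
            kept.add(item);
        }
    }

    List<T> keptItems() {
        return new ArrayList<>(kept);
    }

    public static void main(String[] args) {
        TypedFractionalDownsampler<ExtendedRecordSketch> d =
                new TypedFractionalDownsampler<>(1.0, 7L);
        d.submit(new ExtendedRecordSketch("r1", 3));
        // No cast needed: the element type is ExtendedRecordSketch.
        ExtendedRecordSketch r = d.keptItems().get(0);
        System.out.println(r.name + " " + r.extraField);
    }
}
```

The cost of this design is that every declaration has to name the record type, which may
be the kind of Java limitation the message alludes to.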