The previous version of TribbleIndexedFeatureReader.query() would open a RandomAccessFile each time the GATK crossed a shard boundary. When running with -L wex.intervals (or any time there were lots of intervals) we'd be opening and closing enormous numbers of files, radically slowing down the GATK. With these patched versions of Tribble we see something like the following performance improvements:
SelectVariants with -L wex.intervals on my local machine against non-local file
pre-patch => 3 hours
post-patch => 30 seconds
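A minimal sketch of the idea behind the patch, with hypothetical names (the real Tribble reader is more involved): keep one RandomAccessFile open and reuse it across queries instead of reopening the file at every shard boundary.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Illustrative only -- not the actual Tribble API. The point is that the
// underlying file is opened once and every subsequent query just seeks.
public class CachedFeatureReader implements AutoCloseable {
    private final String path;
    private RandomAccessFile raf;  // lazily opened, then reused

    public CachedFeatureReader(String path) {
        this.path = path;
    }

    /** Position the reader at a block offset, opening the file only once. */
    public RandomAccessFile seek(long offset) throws IOException {
        if (raf == null) {
            // The pre-patch behavior effectively did this on every query.
            raf = new RandomAccessFile(path, "r");
        }
        raf.seek(offset);
        return raf;
    }

    @Override
    public void close() throws IOException {
        if (raf != null) raf.close();
    }
}
```

With thousands of intervals, amortizing the open/close over all queries is what turns hours of file-handle churn into seconds.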
Calculates reference bias based on the AD genotype field instead of AB. This is slightly more meaningful for indels and still a good estimator for SNPs.
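As a sketch of the estimator (illustrative code, not the actual GATK annotation class): AD carries per-allele depths, so reference bias can be read off directly as the fraction of reads supporting the reference allele, which is well defined for indels as well as SNPs.

```java
// Hypothetical helper, not the GATK implementation.
public class ReferenceBias {
    /**
     * @param ad per-allele depths from the AD genotype field; ad[0] is the
     *           reference allele by VCF convention
     * @return fraction of reads supporting the reference, or NaN at zero depth
     */
    public static double fromAD(int[] ad) {
        int total = 0;
        for (int depth : ad) total += depth;
        return total == 0 ? Double.NaN : (double) ad[0] / total;
    }
}
```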
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.
Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This held only until we added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time consecutive haplotypes differ in length, the cache has to be wiped. A lexicographic sort interleaves haplotypes of different lengths, therefore wasting *tons* of re-compute.
Solution
-------
Very simple: sort the haplotypes first by LENGTH and then in lexicographic order.
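The proposed ordering can be sketched with a simple comparator (treating haplotypes as plain strings for illustration):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch: group haplotypes of equal length together, so the length-dependent
// initial condition of the deletion matrix stays valid across consecutive
// haplotypes, then sort lexicographically within each length to maximize
// shared-prefix cache hits.
public class HaplotypeSort {
    public static final Comparator<String> BY_LENGTH_THEN_LEX =
            Comparator.comparingInt(String::length)
                      .thenComparing(Comparator.naturalOrder());

    public static String[] sorted(String[] haplotypes) {
        String[] copy = haplotypes.clone();
        Arrays.sort(copy, BY_LENGTH_THEN_LEX);
        return copy;
    }
}
```

The cache now only has to be wiped once per length group rather than potentially at every haplotype transition.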
1. Removing old legacy code that was capping the positional depth for reduced reads to 127.
Unfortunately this cap effectively performs biased down-sampling and throws off annotations such as FS.
Added an end-to-end unit test verifying that depth counts in RR can be higher than the maximum byte value.
Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127.
2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads.
Refactored it so that it can only remove non-consensus reads.
Now only filtered reads are unstranded. All consensus reads carry a strand, so in general we now
emit two consensus reads: one for each strand.
This involved some refactoring of the sliding window which cleaned it up a lot.
Also included is a bug fix:
insertions downstream of a variant region weren't triggering a stop to the compression.
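A sketch of the consensus-safe policy, with hypothetical names (the real sliding-window code differs): when the downsampling cap is hit, only non-consensus reads are eligible for removal, so het-compressed consensus reads can never be dropped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative only. Consensus reads are always kept, even if that means the
// result slightly exceeds the cap; only regular reads are randomly discarded.
public class ConsensusSafeDownsampler {
    public interface Read {
        boolean isConsensus();
    }

    public static <T extends Read> List<T> downsample(List<T> reads, int maxReads, Random rng) {
        List<T> consensus = new ArrayList<>();
        List<T> regular = new ArrayList<>();
        for (T r : reads) {
            (r.isConsensus() ? consensus : regular).add(r);
        }
        Collections.shuffle(regular, rng);  // random selection among removable reads
        int keep = Math.max(0, maxReads - consensus.size());
        List<T> result = new ArrayList<>(consensus);
        result.addAll(regular.subList(0, Math.min(keep, regular.size())));
        return result;
    }
}
```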
1. Fix for the -okayToMiss argument for indels.
In cases where we make calls with different alleles, it wasn't allowing us to skip the site for FNs.
2. Add confidence and isComplexEvent attributes to the equality and duplicate checks in MVC.
3. Treat unknown confidences as reviewed for now.
We need this until IGV gets updated to use confidences for reviews.
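Item 2 above can be illustrated with a hypothetical record (class and field names are assumptions, not the actual MVC code): equality and duplicate detection now take confidence and isComplexEvent into account alongside the rest of the record.

```java
import java.util.Objects;

// Illustrative stand-in for the MVC record type.
public class MvcRecord {
    final String site;
    final double confidence;
    final boolean isComplexEvent;

    MvcRecord(String site, double confidence, boolean isComplexEvent) {
        this.site = site;
        this.confidence = confidence;
        this.isComplexEvent = isComplexEvent;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MvcRecord)) return false;
        MvcRecord other = (MvcRecord) o;
        return site.equals(other.site)
                && confidence == other.confidence          // newly included
                && isComplexEvent == other.isComplexEvent; // newly included
    }

    @Override
    public int hashCode() {
        return Objects.hash(site, confidence, isComplexEvent);
    }
}
```

Without these fields in equals(), two calls at the same site with different confidences would be flagged as duplicates.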
Try to reduce the number of tests failing with file not found
errors due to random automount failures by cd'ing into a
preset list of directories at the start of each job in an
effort to trigger automount.
ant -p outputs only targets that have description attributes.
Modify build.xml so only important targets that users might actually
want to use are output by ant -p.
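For illustration (hypothetical targets, not taken from the actual build.xml), ant -p lists only the first target below, because only it has a description attribute:

```xml
<!-- Listed by "ant -p" because it carries a description -->
<target name="dist" description="Build the GATK distribution jar">
    <echo message="building dist"/>
</target>

<!-- Internal helper: no description, so "ant -p" hides it -->
<target name="-resolve.dependencies">
    <echo message="resolving dependencies"/>
</target>
```

As a side note, prefixing internal target names with "-" (a common ant convention) also prevents them from being invoked directly from the command line.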