Pipeline runs end-to-end using example metadata, but has been tested only on ideal-case inputs.
The next step is to bring this to the cloud and test all the different scenarios (multiple tumors, single-ended reads, missing parameters, etc.).
A parallel next step is to add QC metrics.
Nasty, nasty bug -- if we were extremely unlucky with shard boundaries, we might
end up with a shard containing only the unmapped mates of mapped reads. In this case,
ReadShard.getReadsSpan() would not behave correctly: the shard as a whole would
be marked "mapped" (because it refers to mapped intervals) yet consist only of unmapped
mates of mapped reads located within those intervals.
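The failure mode can be sketched as follows. `Read` and the check itself are simplified, hypothetical stand-ins for illustration, not the real GATK classes or the actual fix:

```java
import java.util.Arrays;
import java.util.List;

// Simplified sketch of the shard-boundary failure mode described above.
// Read is a hypothetical stand-in for the real read class.
public class ShardSpanSketch {
    static class Read {
        final boolean mapped;
        Read(boolean mapped) { this.mapped = mapped; }
    }

    // A shard over mapped intervals cannot assume its contents are mapped:
    // the span should count as "mapped" only if at least one read actually is.
    static boolean spanIsMapped(List<Read> shardReads) {
        return shardReads.stream().anyMatch(r -> r.mapped);
    }

    public static void main(String[] args) {
        // Unlucky shard: only unmapped mates of mapped reads fell inside it.
        List<Read> unlucky = Arrays.asList(new Read(false), new Read(false));
        System.out.println(spanIsMapped(unlucky)); // prints "false"
    }
}
```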
1) ValidateVariants removed in favor of direct validation of VariantContexts. Integration test added for broken contexts.
2) Enabled indel and SV output. Still bi-allelic sites only. Integration tests added for these cases.
3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.
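For reference, the recalculation can be sketched like this. The method name is hypothetical, but the rule (GQ is the gap between the smallest and second-smallest PL, capped at 99) follows standard VCF semantics:

```java
import java.util.Arrays;

// Hypothetical sketch of deriving GQ from PLs when a genotype has PLs but
// no GQ: GQ is the difference between the second-smallest and smallest PL,
// capped at 99 by convention. Method name is illustrative only.
public class GQFromPLs {
    static int gqFromPLs(int[] pls) {
        int[] sorted = pls.clone();
        Arrays.sort(sorted);
        return Math.min(99, sorted[1] - sorted[0]);
    }

    public static void main(String[] args) {
        // Normalized PLs: the best genotype has PL 0, so GQ is the runner-up PL.
        System.out.println(gqFromPLs(new int[]{0, 45, 600})); // prints "45"
    }
}
```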
Merge all FilePointers for each contig into a single, merged, optimized FilePointer
representing all regions to visit in all BAM files for a given contig.
This helps us in several ways:
-It allows us to create a single, persistent set of iterators for each contig,
finally and definitively eliminating all Shard/FilePointer boundary issues for
the new experimental ReadWalker downsampling
-We no longer need to track low-level file positions in the sharding system (which
was no longer possible anyway given the new experimental downsampling system)
-We no longer revisit BAM file chunks that we've visited in the past -- all BAM
file access is purely sequential
-We no longer need to constantly recreate our full chain of read iterators
There are also potential dangers:
-We hold more BAM index data in memory at once. Given that we merge and optimize
the index data during the merge, and only hold one contig's worth of data at a
time, this does not appear to be a major issue. TODO: confirm this!
-With a huge number of samples and intervals, the FilePointer merge operation
might become expensive. With the latest implementation, this does not
appear to be an issue even with a huge number of intervals (for one sample, at least),
but if it turns out to be a problem for > 1 sample there are things we can do.
Still TODO: unit tests for the new FilePointer.union() method
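The core of the merge can be sketched as interval coalescing per contig. This is a simplified, hypothetical illustration: the real FilePointer also merges BAM index chunk data, which is omitted here:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of merging two per-contig region lists into a single
// optimized list, in the spirit of FilePointer.union(). Intervals are
// [start, end] pairs; overlapping or adjacent regions are coalesced so each
// stretch of the BAM file is visited exactly once, in order.
public class FilePointerUnionSketch {
    static List<int[]> union(List<int[]> a, List<int[]> b) {
        List<int[]> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingInt(iv -> iv[0]));
        List<int[]> merged = new ArrayList<>();
        for (int[] iv : all) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && iv[0] <= last[1] + 1) {
                last[1] = Math.max(last[1], iv[1]); // coalesce overlap/adjacency
            } else {
                merged.add(new int[]{iv[0], iv[1]});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> merged = union(
            Arrays.asList(new int[]{1, 100}, new int[]{300, 400}),
            Arrays.asList(new int[]{50, 350}));
        for (int[] iv : merged) System.out.println(iv[0] + "-" + iv[1]); // prints "1-400"
    }
}
```

Sorting once and coalescing in a single pass keeps the merge near-linear in the number of intervals, which is why the operation stays cheap even with a huge number of intervals per sample.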
-- Calls NA12878 with and without the expt. downsampler on chr1
-- Creates combined vcf, annotating sites as overlapping omni SNPs and Mills indels
-- Creates simple combined.table that has chr, pos, set, and type to easily ID missed good sites with the new downsampler
-- Fixes monster bug in the way that traversal engines interacted with the NanoScheduler via the output tracker.
-- ThreadLocalOutputTracker is now a ThreadBasedOutputTracker that associates each master thread with its storage map. Lookups walk through the threads in the same thread group, not just the current thread itself (TBD -- should have a map from ThreadGroup instead)
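The thread-group lookup can be sketched as below. Names and structure are illustrative, not the actual ThreadBasedOutputTracker implementation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lookup described above: storage is registered
// under a master thread, and any thread in the same thread group finds it by
// enumerating the group rather than looking only at itself.
public class ThreadGroupLookupSketch {
    static final Map<Thread, Map<String, Object>> storageByMaster = new ConcurrentHashMap<>();

    static void registerMaster(Thread master) {
        storageByMaster.put(master, new HashMap<>());
    }

    // Walk the threads in the current thread group, not just the current
    // thread, so workers spawned in the master's group find its storage.
    static Map<String, Object> findStorage() {
        ThreadGroup group = Thread.currentThread().getThreadGroup();
        Thread[] active = new Thread[group.activeCount() + 8];
        int n = group.enumerate(active);
        for (int i = 0; i < n; i++) {
            Map<String, Object> storage = storageByMaster.get(active[i]);
            if (storage != null) return storage;
        }
        throw new IllegalStateException("no registered master thread in group");
    }

    public static void main(String[] args) {
        registerMaster(Thread.currentThread());
        findStorage().put("calls", "ok");
        System.out.println(findStorage().get("calls")); // prints "ok"
    }
}
```

Mapping from ThreadGroup directly (the TBD above) would replace the enumeration with a single map lookup.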
-- Removed unnecessary debug statement in GenomeLocParser
-- nt and nct officially work together now