The indels are still annotated as before, but now all other variant types are annotated too.
I'm doing this because of requests on the forum but am not making it standard. If we find it to be useful we can turn it on by default later.
Reads that are soft-clipped off the contig (before the beginning of the contig) were being soft-clipped to position 0 instead of 1 because of an off-by-one issue. Fixed and included in the integration test.
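A minimal sketch of the kind of clamp this fix implies, assuming 1-based contig coordinates. `SoftClipBounds` and `clampedSoftStart` are hypothetical names for illustration, not the actual GATK code.

```java
// Hypothetical helper for illustration only; not the actual GATK implementation.
final class SoftClipBounds {
    // Contig coordinates are assumed to be 1-based, so the first valid position is 1, not 0.
    static final int FIRST_CONTIG_POSITION = 1;

    /**
     * Returns the start of a read including its leading soft clips,
     * limited to the first valid position on the contig.
     *
     * @param alignmentStart   1-based position of the first aligned base
     * @param leadingSoftClips number of soft-clipped bases before that base
     */
    static int clampedSoftStart(int alignmentStart, int leadingSoftClips) {
        int softStart = alignmentStart - leadingSoftClips;
        // Clamping to 0 here would be the off-by-one described above.
        return Math.max(softStart, FIRST_CONTIG_POSITION);
    }
}
```

A read aligned at position 3 with 5 leading soft clips would otherwise start at -2; the clamp moves it to 1.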
-- Uses a high-performance local writer backed by a byte array that writes each entire VCF line to the underlying output stream in a single write operation.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
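A minimal sketch of the buffering idea described above: fields are accumulated in a byte array, and each finished line is handed to the underlying stream in one write call. `LineBufferedWriter` and its methods are hypothetical names and do not reflect the real IndexingVariantContextWriter API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the actual GATK writer.
final class LineBufferedWriter implements AutoCloseable {
    private final OutputStream out;
    // Local byte-array buffer holding the line currently being assembled.
    private final ByteArrayOutputStream lineBuffer = new ByteArrayOutputStream(1024);

    LineBufferedWriter(OutputStream out) {
        this.out = out;
    }

    /** Appends one piece of the current line to the local buffer only. */
    void append(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        lineBuffer.write(bytes, 0, bytes.length);
    }

    /** Terminates the line and pushes it downstream in a single write call. */
    void endLine() throws IOException {
        lineBuffer.write('\n');
        lineBuffer.writeTo(out); // one write(byte[], int, int) on the underlying stream
        lineBuffer.reset();
    }

    /** Closing the writer also closes the underlying stream. */
    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Because the underlying stream only ever sees complete lines, anything that tracks its position (an indexer, for example) never observes a partially written record.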
This introduced a bug in reduce reads by deactivating its hard clipping of the out-of-bounds soft clips (especially in the MT).
DEV-322 #resolve #time 4m
This reverts commit 42acfd9d0bccfc0411944c342a5b889f5feae736.
-Switch back to the old implementation, if needed, with --use_legacy_downsampler
-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState
-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer
-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.
-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.
-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
-- Was screwing up mixed reviewed / non-reviewed sites. Now only considers reviewed calls, if any are present, or all calls if no reviewed sites are found
-- Was just taking the first genotype, now it properly looks at all of the genotype calls and makes a reasonable guess at what the answer should be
-- Added unit tests for the consensus creation algorithm
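A minimal sketch of the reviewed / non-reviewed selection rule described above, under the assumption that each call carries a simple reviewed flag. `ConsensusInput`, `Call`, and `callsToConsider` are hypothetical names; the step that guesses the consensus genotype from the selected calls is not shown because the message does not specify it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the actual consensus code.
final class ConsensusInput {
    /** Simplified stand-in for a single call at a site. */
    static final class Call {
        final String genotype;
        final boolean reviewed;

        Call(String genotype, boolean reviewed) {
            this.genotype = genotype;
            this.reviewed = reviewed;
        }
    }

    /**
     * Returns only the reviewed calls if any are present,
     * otherwise returns all calls unchanged.
     */
    static List<Call> callsToConsider(List<Call> calls) {
        List<Call> reviewed = new ArrayList<>();
        for (Call call : calls) {
            if (call.reviewed) {
                reviewed.add(call);
            }
        }
        return reviewed.isEmpty() ? calls : reviewed;
    }
}
```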
-- The current implementation of AFCalcResult contains a map from allele -> log10pNonRef. The only use of this field is to support the isPolymorphic function per allele. The call to this function looks like isPolymorphic(allele, QUAL). The QUAL is a phred-scaled threshold where you want to include alleles where the log10pNonRef >= QUAL (appropriately transformed). The problem is that when log10pNonRef is large, it quickly gets set to 0, while its complementary log10pRef value has a meaningful log10 value. For example, if log10pRef = -100 (not an uncommonly large value), log10pNonRef = 0.0.
-- In order to preserve precision and allow us to more finely differentiate high QUAL from low QUAL (but still poly) sites we should store log10pRef values instead, and test that log10pRef <= threshold.
-- See https://jira.broadinstitute.org/browse/GSA-671 for more information.
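A small sketch of the precision problem, assuming log10pNonRef is derived as log10(1 - 10^log10pRef) and that the phred-scaled QUAL maps to a log10 threshold of -QUAL / 10. `Log10Precision` and its methods are hypothetical names, not the AFCalcResult API.

```java
// Hypothetical illustration of the precision argument; not the AFCalcResult code.
final class Log10Precision {
    /**
     * Naive complement in log10 space: log10(1 - 10^log10pRef).
     * Once 10^log10pRef drops below double precision relative to 1.0,
     * the subtraction yields exactly 1.0 and the result collapses to 0.0.
     */
    static double log10pNonRefFromRef(double log10pRef) {
        return Math.log10(1.0 - Math.pow(10.0, log10pRef));
    }

    /**
     * Polymorphism test expressed on log10pRef instead.
     * Assumes the phred-scaled QUAL converts to a log10 threshold of -QUAL / 10.
     */
    static boolean isPolymorphic(double log10pRef, double phredQual) {
        double log10Threshold = -phredQual / 10.0;
        return log10pRef <= log10Threshold;
    }
}
```

With these definitions, log10pRef values of -20 and -100 both give a log10pNonRef of 0.0, so they cannot be told apart on that scale, whereas a QUAL threshold of 500 still separates them when the test is done on log10pRef directly.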
-- The previous approach tried to remove the entire MongoVariantContext, which was error-prone when the record was malformed. Now it just grabs the _id and uses it to remove the bad record.