Testing for moltenized output, and for genotype-level filtering. This tool is now fully functional. There are three todo items:
1) Docs
2) An additional output table that gives concordance proportions normalized by records in both eval and comp (not just total in eval or total in comp)
3) Code cleanup for table creation (putting a table together the way I do takes -way- too many lines of code)
Completed todo item: for sites like
(eval)
20 12345 A C
20 12345 A AC
(comp)
20 12345 A C
20 12345 A ACCC
the records will be matched by the presence of a non-empty intersection of alleles. Any leftover records are then paired with an empty variant context (as though the call was unique). This has one somewhat counterintuitive feature, which is that normally
(eval)
20 12345 A AC
(comp)
20 12345 A ACCC
would be classified as 'ALLELES_DO_NOT_MATCH' (and not counted in genotype tables), in the presence of the SNP, they're counted as EVAL_ONLY and TRUTH_ONLY respectively.
+ integration test
-- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in input list. Not ideal)
-- Trivial read count differences in HC
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
-- Added HCPerformance evaluation Qscript
-- Added some docs about one of the HC integration tests
-- HaplotypeCaller / ART performance evaluation script
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
The weird part is that the comments claimed it was doing what it was supposed to, but it didn't actually do it.
Now we maintain the last header element of the consensus (but without bases and quals) if it adjoins an element with an insertion.
Added the user's test file as an integration test.
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.
-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)
-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)
-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
Reads that are soft-clipped off the contig (before the beginning of the contig) were being soft-clipped to position 0 instead of 1 because of an off-by-one issue. Fixed and included in the integration test.
-- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
-Switch back to the old implementation, if needed, with --use_legacy_downsampler
-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState
-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer
-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.
-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.
-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.