a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
Testing for moltenized output, and for genotype-level filtering. This tool is now fully functional. There are three todo items:
1) Docs
2) An additional output table that gives concordance proportions normalized by records in both eval and comp (not just total in eval or total in comp)
3) Code cleanup for table creation (putting a table together the way I do takes -way- too many lines of code)
2. While making the previous fix and unifying FS for SNPs and indels, I noticed that FS was slightly broken in the general case for indels too; fixed.
3. I also fixed a minor bug in the Allele Biased Downsampling code for reduced reads.
I've resigned myself instead to create a mapping from Allele to Haplotype. It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0? When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).
As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases. This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests. Fortunately that walker is being deprecated soon...
Completed todo item: for sites like
(eval)
20 12345 A C
20 12345 A AC
(comp)
20 12345 A C
20 12345 A ACCC
the records will be matched by the presence of a non-empty intersection of alleles. Any leftover records are then paired with an empty variant context (as though the call was unique). This has one somewhat counterintuitive feature, which is that normally
(eval)
20 12345 A AC
(comp)
20 12345 A ACCC
would be classified as 'ALLELES_DO_NOT_MATCH' (and not counted in genotype tables), in the presence of the SNP, they're counted as EVAL_ONLY and TRUTH_ONLY respectively.
+ integration test
2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs. Until the HC uses meta data though, this won't work.
-- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in input list. Not ideal)
-- Trivial read count differences in HC
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
-- Added HCPerformance evaluation Qscript
-- Added some docs about one of the HC integration tests
-- HaplotypeCaller / ART performance evaluation script