Now that Ron updated the GATK so that we use star to represent spanning
deletions, we need to catch those cases in the code that remaps alleles.
Otherwise, we try to pad the stars and that's just bad.
Added test from actual failing data.
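A minimal sketch of the idea, not the actual GATK code: when alleles are remapped onto a longer reference allele, ordinary alleles get the extra reference bases appended, while the star allele is passed through untouched. The function name `pad_alleles` and the `SPAN_DEL` constant are hypothetical names used only for illustration.

```python
SPAN_DEL = "*"  # VCF symbol for a spanning deletion

def pad_alleles(alleles, padding):
    """Append `padding` bases to each allele, leaving the star allele as-is."""
    padded = []
    for allele in alleles:
        if allele == SPAN_DEL:
            # Padding "*" would produce an invalid allele such as "*T".
            padded.append(allele)
        else:
            padded.append(allele + padding)
    return padded

remapped = pad_alleles(["A", "C", "*"], "T")
```

Here `remapped` keeps `*` in its original position, so downstream allele indices are unchanged.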
When a sample has multiple spanning deletions and we are asked to assign
likelihoods to the spanning deletion allele, we currently choose the first
deletion. Valentin pointed out that this isn't desired behavior. I
promised Valentin that I would address this issue, so here it is.
I do not believe that the correct thing to do is to sum the likelihoods
over all spanning deletions (I came up with problematic cases where this
breaks down).
So instead I'm using a simple heuristic approach: using the hom alt PLs, find
the most likely spanning deletion for this position and use its likelihoods.
In the 10K-sample VCF from Monkol there were only 2 cases where this problem
popped up. In both cases the heuristic approach works well.
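The heuristic can be sketched roughly as follows, assuming diploid genotypes and PLs in the standard VCF ordering (genotype j/k with j <= k sits at index k*(k+1)/2 + j). The function names and the example PL values are made up for illustration and are not taken from the GATK code or from the 10K-sample VCF.

```python
def pl_index(j, k):
    """Index of diploid genotype j/k in a VCF-ordered PL array."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j

def most_likely_spanning_deletion(pls, deletion_indices):
    """Pick the deletion allele whose hom-alt genotype has the lowest PL.

    PLs are phred-scaled, so a lower value means a more likely genotype.
    """
    return min(deletion_indices, key=lambda a: pls[pl_index(a, a)])

def likelihoods_for_allele(pls, allele):
    """Return the [hom-ref, het, hom-alt] PLs for a single alt allele."""
    return [pls[pl_index(0, 0)], pls[pl_index(0, allele)], pls[pl_index(allele, allele)]]

# Hypothetical sample: allele 0 is the reference, alleles 1 and 2 are
# two different deletions that both span the current position.
# PL order: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
pls = [50, 20, 90, 10, 30, 0]
best = most_likely_spanning_deletion(pls, [1, 2])
star_pls = likelihoods_for_allele(pls, best)
```

In this example the second deletion wins because its hom-alt PL is the lowest, and its three PLs are what would be assigned to the spanning deletion allele.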
Add oxoG read count annotation and make it a default annotation
Add ##SAMPLE VCF header line in accordance with the TCGA VCF spec, with a "File" field set to the BAM file name and a "SampleName" field set to the BAM sample name (the sample file path is not printed if --no_cmdline_in_header is specified, to help with test consistency)
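As a rough illustration of the header line described above: only the "File" and "SampleName" keys and the --no_cmdline_in_header behavior come from this commit message. The `ID` key, the field order, the function name, and the example values are assumptions, and this sketch does not handle quoting of values.

```python
def sample_header_line(sample_id, sample_name, bam_file, no_cmdline_in_header=False):
    """Build a ##SAMPLE header line; omit the file path when requested."""
    fields = [("ID", sample_id), ("SampleName", sample_name)]
    if not no_cmdline_in_header:
        # File paths differ between machines, so tests suppress them.
        fields.append(("File", bam_file))
    body = ",".join(f"{key}={value}" for key, value in fields)
    return f"##SAMPLE=<{body}>"

with_file = sample_header_line("TUMOR", "sample1", "/data/sample1.bam")
without_file = sample_header_line("TUMOR", "sample1", "/data/sample1.bam",
                                  no_cmdline_in_header=True)
```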
Turn on active region assembly-based physical phasing for M2
Clean up M2-related annotations so UG doesn't crash if M2 annotations are called
increased runtime java memory, changed default PON for NN to be new ICE PON
updated FP rates when using new default PON: SNPs up by ~3%, INDELs down by 40%
updated git hash reference
added "str_contraction" artifact filter (improves specificity, especially in exomes)
refactored out VCF constants and added descriptions
added "artifact detection mode" for PON creation
added new dream evaluation markdown
added results for SMC 4
fixed up documentation, moved location to /dsde/working/mutect/dream_smc, and checked in scala script
fixed bug which would overwrite germline_risk filter errors
updated "how to" documents and records
fixed license text
thinned down FP regression test from 700 sites to 100. we have better ways (DREAM, NN) to check accuracy of the method and 100 is good enough to catch regressions
why oh why do the MD5-based unit tests produce different results on different machine architectures? I hate that :/
Thanks to GG, LDG and DR -- test should now produce the same results regardless of machine architecture
disabled downsampling... hopefully in the final attempt to make this work cross architecture!
enforced LOGLESS_CACHING... hopefully in the final final attempt to make this work cross architecture!