When using CatVariants, VCF files were being sorted solely on the base
pair position of the first record, ignoring the chromosome. This can
become problematic when merging files from different chromosomes,
especially if you have multiple VCFs per chromosome.
As an example, assume the following 3 lines are all in separate files:
1 10
1 100
2 20
The merged VCF from CatVariants (without -assumeSorted) would read:
1 10
2 20
1 100
This has the potential to break tools that expect chromosomes to be
contiguous within a VCF file.
This commit changes the comparator from one of Pair<Integer, File> to
one of Pair<VariantContext, File>. We construct a
VariantContextComparator from the provided reference, which will sort
the first record by chromosome and position properly. Additionally, if
-assumeSorted is given, we simply use a null VariantContext as the first
record for every file; these all compare as equal (as all are null).
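The ordering change can be illustrated with a minimal sketch. It is written in Python for brevity and is not the GATK code: `CONTIG_ORDER`, `compare_first_records` and `order_files` are hypothetical names, and the contig order is hard-coded where the real comparator would take it from the provided reference.

```python
from functools import cmp_to_key

# Hypothetical contig order; the real comparator derives this from the reference.
CONTIG_ORDER = {"1": 0, "2": 1}


def compare_first_records(a, b):
    """Compare (first_record, file_name) pairs.

    first_record is a (contig, position) tuple, or None when the first
    record was not read (the -assumeSorted case).
    """
    rec_a, rec_b = a[0], b[0]
    if rec_a is None or rec_b is None:
        # Assumption of this sketch: records are either all None or all present,
        # so a None on either side means every pair ties.
        return 0
    key_a = (CONTIG_ORDER[rec_a[0]], rec_a[1])
    key_b = (CONTIG_ORDER[rec_b[0]], rec_b[1])
    return (key_a > key_b) - (key_a < key_b)


def order_files(pairs):
    """Return file names ordered by the contig and position of their first record."""
    return [name for _, name in sorted(pairs, key=cmp_to_key(compare_first_records))]


# The three single-record files from the example above.
files = [(("1", 10), "a.vcf"), (("2", 20), "c.vcf"), (("1", 100), "b.vcf")]
ordered = order_files(files)
```

Sorting on position alone would place c.vcf (2 20) between a.vcf and b.vcf; comparing contig first keeps chromosome 1 contiguous. In this sketch the all-None case keeps the input order only because Python's sort is stable.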
Now that Ron updated the GATK so that we use star to represent spanning
deletions, we need to catch those cases in the code that remaps alleles.
Otherwise, we try to pad the stars and that's just bad.
Added test from actual failing data.
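A rough sketch of the guard, in Python rather than the GATK's Java. The helper name `pad_alleles` and the padding-by-suffix model are assumptions made for illustration; the actual remapping code is not shown in this message.

```python
SPAN_DEL = "*"  # the star allele that represents a spanning deletion


def pad_alleles(alleles, extra_bases):
    """Append extra reference bases to each allele, leaving the star allele alone.

    Hypothetical helper: models remapping alleles onto a longer reference
    allele by suffix padding.
    """
    return [a if a == SPAN_DEL else a + extra_bases for a in alleles]


padded = pad_alleles(["A", "G", "*"], "T")
```

Without the check, the star would be padded to "*T", which is not a valid allele.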
When a sample has multiple spanning deletions and we are asked to assign
likelihoods to the spanning deletion allele, we currently choose the first
deletion. Valentin pointed out that this isn't desired behavior. I
promised Valentin that I would address this issue, so here it is.
I do not believe that the correct thing to do is to sum the likelihoods
over all spanning deletions (I came up with problematic cases where this
breaks down).
So instead I'm using a simple heuristic approach: using the hom alt PLs, find
the most likely spanning deletion for this position and use its likelihoods.
In the 10K-sample VCF from Monkol there were only 2 cases where this problem
popped up. In both cases the heuristic approach works well.
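The heuristic can be sketched as follows, in Python and under stated assumptions: each candidate spanning deletion is represented as a hypothetical (allele_index, pls) pair, genotypes are diploid, and PLs follow the VCF ordering in which a lower value means a more likely genotype. The function names are illustrative, not the GATK's.

```python
def hom_index(allele_index):
    """PL index of the homozygous genotype for an allele (diploid).

    In the VCF genotype ordering, genotype j/k with j <= k sits at
    k*(k+1)/2 + j, so the hom genotype k/k sits at k*(k+1)/2 + k.
    """
    return allele_index * (allele_index + 1) // 2 + allele_index


def most_likely_spanning_deletion(candidates):
    """Pick the candidate whose hom-alt PL for its deletion allele is smallest.

    candidates: list of (allele_index, pls) pairs, one per spanning deletion.
    Ties go to the first candidate, since min() returns the first minimum.
    """
    return min(candidates, key=lambda c: c[1][hom_index(c[0])])


# Two deletions span this position; PLs are ordered 0/0, 0/1, 1/1.
candidates = [(1, [0, 30, 200]), (1, [50, 0, 10])]
chosen = most_likely_spanning_deletion(candidates)
```

Here the second deletion is chosen because its hom-alt PL (10) is lower than the first's (200), and its likelihoods would then be used for the spanning deletion allele.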
Add oxoG read count annotation and add as default annotation
Add ##SAMPLE VCF header line in accordance with TCGA VCF spec, specifying "File" line in sample header with BAM file name and "SampleName" with BAM sample name (Don't print sample file path if --no_cmdline_in_header is specified to help with test consistency)
Turn on active region assembly-based physical phasing for M2
Clean up M2-related annotations so UG doesn't crash if M2 annotations are called
increased runtime java memory, changed default PON for NN to be new ICE PON
updated FP rates, when using new default PON. SNPs up by ~3%, INDELs down by 40%
updated git hash reference
added "str_contraction" artifact filter (improves specificity, especially in exomes)
refactored out VCF constants and added descriptions
added "artifact detection mode" for PON creation
added new dream evaluation markdown
added results for SMC 4
fixed up documentation, moved location to /dsde/working/mutect/dream_smc, and checked in scala script
fixed bug which would overwrite germline_risk filter errors
updated "how to" documents and records
fixed license text
thinned down FP regression test from 700 sites to 100. we have better ways (DREAM, NN) to check accuracy of the method and 100 is good enough to catch regressions
why oh why do the MD5-based unit tests produce different results on different machine architectures? I hate that :/
Thanks to GG, LDG and DR -- test should now produce the same results regardless of machine architecture
disabled downsampling... hopefully in the final attempt to make this work cross architecture!
enforced LOGLESS_CACHING... hopefully in the final final attempt to make this work cross architecture!