When a sample has multiple spanning deletions and we are asked to assign
likelihoods to the spanning deletion allele, we currently choose the first
deletion. Valentin pointed out that this isn't desired behavior. I
promised Valentin that I would address this issue, so here it is.
I do not believe that the correct thing to do is to sum the likelihoods
over all spanning deletions (I came up with problematic cases where this
breaks down).
So instead I'm using a simple heuristic approach: using the hom alt PLs, find
the most likely spanning deletion for this position and use its likelihoods.
In the 10K-sample VCF from Monkol there were only 2 cases where this problem
popped up. In both cases the heuristic approach works well.
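To illustrate the heuristic, here is a minimal sketch (class and method names are
hypothetical, not the actual GATK code). Since PLs are phred-scaled, the smallest
hom-alt PL identifies the most likely spanning deletion:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical sketch of the heuristic: PLs are phred-scaled, so the
    // smallest hom-alt PL marks the most likely spanning deletion.
    public class SpanningDeletionHeuristic {
        static int indexOfMostLikelyDeletion(final List<Integer> homAltPLs) {
            int best = 0;
            for (int i = 1; i < homAltPLs.size(); i++) {
                if (homAltPLs.get(i) < homAltPLs.get(best)) {
                    best = i;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // Two spanning deletions with hom-alt PLs of 45 and 12: the second
            // is the more likely one, so its likelihoods would be used.
            System.out.println(indexOfMostLikelyDeletion(Arrays.asList(45, 12))); // prints 1
        }
    }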
-We now pull htsjdk and picard from Maven Central.
-Updated the GATK codebase as necessary to adapt to changes in the Feature
interface.
-Since VCFHeader now requires that all header lines have unique keys, uniquified
the keys of GVCFBlock header lines by including the min/max GQ in the key.
Updated MD5s accordingly.
-Other MD5s changed as a result of an htsjdk fix to eliminate "-0" in VCF output.
In the case where there's a low-quality SNP under a spanning deletion in the GVCFs:
if the SNP is not genotyped by GenotypeGVCFs (because it's just noise), we used to
emit a record with just the symbolic DEL allele (because that allele is high quality).
We no longer do that.
Previously, if a SNP occurred in sample A at a position that was in the middle of a deletion for sample B,
sample B would be genotyped as homozygous reference there (but it's NOT reference - there's a deletion).
Now, sample B is genotyped as having a symbolic DEL allele.
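For illustration only (coordinates, qualities, and values are made up; VCF 4.2
writes the spanning-deletion allele as "*"), a joint call might now look like:
1 12345 . A G,* 50 . . GT:GQ 0/1:99 2/2:60
where sample A carries the SNP and sample B carries the spanning-deletion allele
rather than a bogus hom-ref call.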
Some minor cleanup is included. Note that I also removed Laura's previous fix for this problem.
Existing integration tests change because I've added a new header line to the VCF being output.
I also added several tests for the new functionality showing:
1. genotyping from separate and already-combined GVCFs gives the same output
2. genotyping over multiple spanning deletions works
3. combining works too
Existing unit tests also cover this case.
New unit test for deprecated mergeVariantsViaLD
Update HaplotypeCallerIntegrationTest.java
Delete duplicate testHaplotypeCallerMergeVariantsViaLDException test.
Exclude MQ0BySample
Move SD and TRA to new StandardUGAnnotation interface
There is now an annotation interface (StandardUGAnnotation) holding annotations that are standard in UG but shouldn't be used as-is with HC. This allows us to avoid excluding these annotations explicitly in HC while still being able to use them for development purposes. (A sketch of the pattern follows.)
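As a rough, self-contained sketch of the pattern (every name except StandardUGAnnotation
is invented for the example), a marker interface lets the engine select annotation sets
by type rather than via an explicit exclusion list:

    import java.util.Arrays;
    import java.util.List;

    // Self-contained sketch; only StandardUGAnnotation matches a real name.
    interface Annotation { String key(); }
    interface StandardAnnotation extends Annotation {}      // standard for all callers
    interface StandardUGAnnotation extends Annotation {}    // standard for UG only

    class SpanningDeletionsAnnot implements StandardUGAnnotation {
        public String key() { return "SD"; }
    }
    class CoverageAnnot implements StandardAnnotation {
        public String key() { return "DP"; }
    }

    public class AnnotationFilterDemo {
        public static void main(String[] args) {
            List<Annotation> all = Arrays.asList(new SpanningDeletionsAnnot(), new CoverageAnnot());
            // HC picks up only StandardAnnotation; the UG-only annotations are
            // skipped automatically, with no hard-coded exclusion list.
            all.stream()
               .filter(a -> a instanceof StandardAnnotation)
               .forEach(a -> System.out.println("HC default: " + a.key()));
        }
    }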
Build a ReferenceContext in ActiveRegionWalkers to pass in to annotation engine so we can call the TandemRepeatAnnotator from M2
Make TandemRepeatAnnotator default annotation for M2.
Set up (but don't yet use) HC-style contamination downsampling.
New HC integration test with TandemRepeatAnnotator
Now, instead of stripping out the GQs for mono sites, we transfer them to the RGQ.
This is extremely useful for people who want to know how confident the hom ref genotype calls are.
Perhaps this is just what CRSP needs for pertinent negatives.
Note that I also changed the tool to no longer use the GenotypeSummaries annotation by default,
since it was adding some seemingly unnecessary annotations (like mean GQ, now that we keep the
GQ around, and the number of no-calls). Let me know if this was a mistake (although Laura gave me a thumbs up).
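For illustration (values made up), a monomorphic site whose GQ used to be stripped:
1 10000 . A . 50 . . GT 0/0
would now carry that confidence in the RGQ field instead:
1 10000 . A . 50 . . GT:RGQ 0/0:99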
Using --breakBandsAtMultiplesOf N will ensure that no reference blocks span across
genomic positions that are multiples of N. This is especially important in the
case of scatter-gather where you don't want your scatter intervals to start in the
middle of blocks (because of a limitation in the way -L works in the GATK for VCF
records with the END tag).
For example, running with --breakBandsAtMultiplesOf 5 on this record:
1 69491 . G <NON_REF> . . END=69523 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
will produce the following records:
1 69491 . G <NON_REF> . . END=69494 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
1 69495 . C <NON_REF> . . END=69499 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
1 69500 . T <NON_REF> . . END=69504 GT:DP:GQ:MIN_DP:MIN_GQ:PL ./.:94:99:82:99:0,120,1800
etc.
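The banding arithmetic is simple; here is a minimal sketch (hypothetical helper, not
the GATK implementation) that reproduces the blocks above:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: split [start, end] so that every position that is
    // a multiple of n starts a new block, mirroring --breakBandsAtMultiplesOf.
    public class BandBreaker {
        static List<int[]> breakBands(final int start, final int end, final int n) {
            final List<int[]> blocks = new ArrayList<>();
            int cur = start;
            while (cur <= end) {
                final int nextMultiple = (cur / n + 1) * n;  // first multiple of n after cur
                final int blockEnd = Math.min(end, nextMultiple - 1);
                blocks.add(new int[]{cur, blockEnd});
                cur = blockEnd + 1;
            }
            return blocks;
        }

        public static void main(String[] args) {
            // Reproduces the example: 69491-69494, 69495-69499, 69500-69504, ...
            for (int[] b : breakBands(69491, 69523, 5)) {
                System.out.println(b[0] + "-" + b[1]);
            }
        }
    }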
Added docs and a new test.
GenotypeGVCFs now has the ability to unique-ify samples, so I can genotype together two different datasets containing the same sample.
Modify InbreedingCoeff so that it works when genotyping uniquified samples
* TextCigarCodec.decode() is now static, and the getSingleton() method is gone (see the sketch after this list)
* MergingSamRecordIterator now wants a Collection<SamReader> rather than Collection<SAMFileReader> in the constructor
* SeekableBufferedStream now correctly reads the requested number of bytes, removed workaround in GATKBAMIndex
* Removed unused annotations (CCC and HWP)
* Renamed one of the two GC annotations to "IGC" (for Interval GC)
* Revved picard & htsjdk (GATK constants are now removed from htsjdk)
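For the first item above, the call-site change looks roughly like this (the CIGAR
string is arbitrary):

    import htsjdk.samtools.Cigar;
    import htsjdk.samtools.TextCigarCodec;

    public class CigarDecodeDemo {
        public static void main(String[] args) {
            // Old form, no longer available:
            //   final Cigar c = TextCigarCodec.getSingleton().decode("10M1I5M");
            // New static form:
            final Cigar cigar = TextCigarCodec.decode("10M1I5M");
            System.out.println(cigar.getReadLength()); // 16
        }
    }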
* PT 82046038
-- Active Region Traversal was using per-sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused high-confidence variants to be dropped in high-coverage data.
-- HaplotypeCallerGVCFIntegrationTest PL/annotation changes due to using more reads in those tests
-- Removed a CountReadsInActiveRegionsIntegrationTest test for excessive coverage because the read coverage no longer goes over the limits in ART
Story:
=====
- https://www.pivotaltracker.com/story/show/83803796
Changes:
=======
- Changed the indel RCM likelihood cache from one with a fixed maximum
ploidy to a dynamically resizable one (see the sketch below).
- Used the occasion to remove an unused and deprecated method from ReferenceConfidenceModel.
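A minimal sketch of the resizing idea (hypothetical class and method names, not the
actual RCM code): grow the ploidy-indexed cache on demand rather than failing past
a fixed maximum.

    import java.util.Arrays;

    // Hypothetical sketch: a ploidy-indexed likelihood cache that grows on
    // demand instead of imposing a fixed maximum ploidy.
    public class ResizableLikelihoodCache {
        private double[][] byPloidy = new double[8][]; // initial capacity, not a cap

        public double[] get(final int ploidy, final int alleleCount) {
            if (ploidy >= byPloidy.length) {
                // Double the capacity until the requested ploidy fits.
                byPloidy = Arrays.copyOf(byPloidy, Math.max(byPloidy.length * 2, ploidy + 1));
            }
            if (byPloidy[ploidy] == null) {
                byPloidy[ploidy] = computeLikelihoods(ploidy, alleleCount);
            }
            return byPloidy[ploidy];
        }

        // Placeholder for the real likelihood computation.
        private double[] computeLikelihoods(final int ploidy, final int alleleCount) {
            return new double[alleleCount];
        }
    }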
Testing:
=======
- Added an integration test to check ploidies larger than the previous limit of 20.