So the compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which holds the 10 values; all other statistics and the regular PLs are computed as before.
Note that the integration test had to be disabled, as the BCF2 codec apparently doesn't support writing genotype fields other than PL, DP, AD, GQ, FT and GT.
1. All NA12878DBWalkers that export/emit sites need to do so in order; also one should be able
to use -L with them and not have it iterate over all possible sites.
Updated ExportReviews and ExtractConsensusSites to adhere to these constraints.
2. Added the option to AssessNA12878 to have it ignore FNs that overlap with a provided VCF.
This is useful if you have a list of sites from reviews that are okay to be missed in
particular techs only (because for some reason there is coverage but no evidence of the
alternate allele in them) - intended to be used with Jenkins.
3. Hooked up the logic of complex events all the way through the KB.
Now the consensus incorporates whether a call is complex and the assessor does not penalize for them.
4. Fixed a long-standing bug that I found by accident:
AssessNA12878 was closing its DB connection before its final call to includeMissingCalls().
5. Hooked up the per-call confidences through the KB.
We no longer have a 2-tiered priority system in the KB (reviews and everything else) but instead
use a quasi-Bayesian estimator (will update to proper Bayesian treatment if needed).
Now ImportCallset and ImportReviews assign confidences as appropriate.
Also needed to fix up the consensus logic for calls with UNKNOWN status.
-Two SAMReaderIDs that pointed at the same underlying bam file through
a relative vs. an absolute path were not being treated as equal, and
had different hash codes. This was causing problems in the engine, since
SAMReaderIDs are often used as the keys of HashMaps.
-Fix: explicitly use the absolute path to the encapsulated bam file in
hashCode() and equals()
-Added tests to ensure this doesn't break again
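
A minimal sketch of the fix described above. `ReaderIdSketch` is a hypothetical stand-in for SAMReaderID (the real class's fields and constructor are not shown in this message); the only point it illustrates is keying both equals() and hashCode() on the absolute path:

```java
import java.io.File;

// Hypothetical stand-in for SAMReaderID; NOT the actual GATK class.
final class ReaderIdSketch {
    private final File bamFile;

    ReaderIdSketch(File bamFile) {
        this.bamFile = bamFile;
    }

    // Single source of truth for identity, so equals() and hashCode() cannot disagree.
    private String key() {
        return bamFile.getAbsolutePath();
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ReaderIdSketch)) return false;
        return key().equals(((ReaderIdSketch) other).key());
    }

    @Override
    public int hashCode() {
        return key().hashCode();
    }
}
```

With this, an ID built from `new File("sample.bam")` and one built from the same file's absolute path land in the same HashMap bucket and compare equal. Note that getAbsolutePath() does not collapse `..` segments or resolve symlinks; whether that matters here is not stated in the message.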
Problem
-------
Qualify Missing Intervals only accepted GATK-formatted interval files for its coding sequence and bait parameters.
Solution
-------
There is no reason for such a limitation. I removed all the code that did the parsing and used IntervalUtils to parse the files instead, so it now handles any type of interval file that the GATK can handle.
PS: Also added an average depth column to the output.
- Added integration test to show that providing a contamination value and providing the same value via a file result in the same VCF
- overrode default contamination value in test
1. Some minor refactorings and cleanup (e.g. removing unused imports) throughout.
2. Updates to the KB assessment functionality:
a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.
3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived
the DeBruijn assembler.
4. Improvements to the likelihoods for the HC
a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need
to divide e by 3 because the observed base could have come from any of the non-observed alleles.
b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is
dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more
than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
turned into Q0s for now.
c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real
sites with off-kilter allele balances and we occasionally lose them to contamination removal.
5. Improved the dangling tail merging implementation.
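
A rough sketch of items 4a and 4b above, not the actual PairHMM or HC code. The class and method names are hypothetical, the overlapping region is assumed to be already extracted into equal-length aligned arrays, and the cap is taken as half of a caller-supplied phred-scaled PCR error rate (e.g. 40 gives Q20):

```java
// Hypothetical illustration only; names and array layout are assumptions.
final class LikelihoodSketch {
    // Convert a phred-scaled quality to an error probability e.
    static double qualToErrorProb(int phred) {
        return Math.pow(10.0, -phred / 10.0);
    }

    // "Tristate" correction: on a mismatch the observed base is one of three
    // possible non-matching bases, so the error mass e is divided by 3.
    static double emission(byte readBase, byte hapBase, int qual) {
        double e = qualToErrorProb(qual);
        return readBase == hapBase ? 1.0 - e : e / 3.0;
    }

    // Down-weight overlapping mate bases in place rather than merging fragments:
    // agreeing bases are capped at pcrErrorQual / 2, disagreeing bases become Q0.
    static void adjustOverlap(byte[] bases1, byte[] quals1,
                              byte[] bases2, byte[] quals2,
                              int pcrErrorQual) {
        int cap = pcrErrorQual / 2;
        for (int i = 0; i < bases1.length; i++) {
            if (bases1[i] == bases2[i]) {
                quals1[i] = (byte) Math.min(quals1[i], cap);
                quals2[i] = (byte) Math.min(quals2[i], cap);
            } else {
                quals1[i] = 0;
                quals2[i] = 0;
            }
        }
    }
}
```

The cap only ever lowers a quality; bases already below it are left alone.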
-Add ability to manually specify dependencies on the command line. This allows one
to specify, for example, that all walkers depend on the GeneralCallingPipeline
QScript, even though they don't have any compile-time dependencies on that QScript.
-Check that the provided walker class is valid in DependencyAnalyzer.xml
-Check ant exit status in the front-end script
-Fix bug where analyzer would give incorrect results if the list of changed
Java classes was empty
-This test is failing intermittently for unexplained reasons (see GSA-943)
-In the interest of keeping the rest of the pipeline test suite running, it's
best to disable this one test until GSA-943 is resolved
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was successfully built but didn't have variation, or truly failed in construction. This fixes an annoying bug where you'd perfectly assemble the sequence into the reference graph but then return a null graph because it had no variation, and then you'd increase your kmer size because null was also used to indicate assembly failure
--
-- Output format looks like:
20 10026072 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026073 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026074 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,121
20 10026075 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026076 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026077 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217
20 10026079 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,240
20 10026080 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,268
20 10026081 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,267
We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values. Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but don't yet account properly for alignment uncertainty.
-- Can be enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning each read to its most likely haplotype, and then evaluating the resulting pileups over the active region interval. The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures. Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as it's no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line. Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unit tests for ReferenceConfidenceModel
-- Unit tests for new MathUtils functions
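
For the first item in this list (the three-way assembly outcome), a minimal sketch of the idea. The class, enum constants, and method names below are illustrative assumptions, not necessarily those used in the HC code; the graph type is left generic:

```java
// Hypothetical result wrapper: distinguishes "reference only" from "failed"
// instead of overloading a null graph to mean both.
final class AssemblyResultSketch<G> {
    enum Status {
        FAILED,                    // graph construction truly failed
        JUST_ASSEMBLED_REFERENCE,  // built fine, but no variation found
        ASSEMBLED_SOME_VARIATION   // built fine and contains variation
    }

    private final Status status;
    private final G graph; // may be null only when status == FAILED

    AssemblyResultSketch(Status status, G graph) {
        if (status != Status.FAILED && graph == null) {
            throw new IllegalArgumentException("a successful result needs a graph");
        }
        this.status = status;
        this.graph = graph;
    }

    Status getStatus() { return status; }

    G getGraph() { return graph; }

    boolean hasVariation() { return status == Status.ASSEMBLED_SOME_VARIATION; }

    // Only a genuine failure should trigger a retry with a larger kmer.
    boolean shouldRetryWithLargerKmer() { return status == Status.FAILED; }
}
```

A caller that previously tested `graph == null` would instead check `shouldRetryWithLargerKmer()`, so a clean reference-only assembly no longer bumps the kmer size.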