Changes:
-------
* Updated the existing unit and integration tests to use the new API components.
* Added unit tests for new classes AFPriorProvider and AFCalculatorProviders.
* Added integration test for mixed ploidy GenotypeGVCFs and CombineGVCFs.
Changes:
-------
* GenotypingEngine now uses an AFCalc provider instead of
its own thread-local, one-time-initialized, fixed
AF calculator.
* All walkers that use a GenotypingEngine now pass
the appropriate AF calculator provider. For now most
just use a fixed calculator (FixedAFCalculatorProvider),
except GenotypeGVCFs, which can now cope with a
mixture of ploidies by failing over to a general-ploidy
calculator when the preferred implementation is not
able to handle a site's analysis.
to the total-ploidy (added ploidy across samples).
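The provider idea described above can be sketched roughly as follows. This is only an illustration of the fixed vs. fail-over pattern; every type name here (AlleleFrequencyCalculator, CalculatorProvider, FailOverCalculatorProvider, and so on) is made up for the sketch and is not the actual GATK API.

```java
// All type names below are hypothetical; none of this is the GATK API.
interface AlleleFrequencyCalculator {
    // True if this implementation can analyze a site with the given shape.
    boolean supports(int ploidy, int altAlleleCount);
    String name();
}

interface CalculatorProvider {
    AlleleFrequencyCalculator getInstance(int ploidy, int altAlleleCount);
}

// Always hands back the same calculator, whatever the site looks like.
final class FixedCalculatorProvider implements CalculatorProvider {
    private final AlleleFrequencyCalculator calculator;

    FixedCalculatorProvider(AlleleFrequencyCalculator calculator) {
        this.calculator = calculator;
    }

    @Override
    public AlleleFrequencyCalculator getInstance(int ploidy, int altAlleleCount) {
        return calculator;
    }
}

// Uses the preferred calculator when it can handle the site and
// falls back to a more general one otherwise.
final class FailOverCalculatorProvider implements CalculatorProvider {
    private final AlleleFrequencyCalculator preferred;
    private final AlleleFrequencyCalculator fallback;

    FailOverCalculatorProvider(AlleleFrequencyCalculator preferred,
                               AlleleFrequencyCalculator fallback) {
        this.preferred = preferred;
        this.fallback = fallback;
    }

    @Override
    public AlleleFrequencyCalculator getInstance(int ploidy, int altAlleleCount) {
        return preferred.supports(ploidy, altAlleleCount) ? preferred : fallback;
    }
}

// Toy calculators used only to exercise the providers.
final class DiploidOnlyCalculator implements AlleleFrequencyCalculator {
    @Override public boolean supports(int ploidy, int altAlleleCount) { return ploidy == 2; }
    @Override public String name() { return "diploid-only"; }
}

final class AnyPloidyCalculator implements AlleleFrequencyCalculator {
    @Override public boolean supports(int ploidy, int altAlleleCount) { return ploidy >= 1; }
    @Override public String name() { return "any-ploidy"; }
}
```

With a FailOverCalculatorProvider built from the two toy calculators, a diploid site gets the diploid-only calculator and a tetraploid site gets the general one.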
Changes:
--------
* Instead of calculating a fixed log10 prior array with a fixed
total likelihood, we use a new component, the AFPriorProvider,
to generate the priors for different total ploidies on
demand; these are cached, however, so there is no unnecessary
recomputation involved.
with mixed ploidies and max-alt-allele number changes dynamically.
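A minimal sketch of on-demand, cached priors keyed by total ploidy. The class name CachingPriorProvider is hypothetical, and the prior used here (heterozygosity / i for an allele count of i, with the remainder assigned to allele count 0) is an assumption made for illustration, not a statement of what AFPriorProvider computes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; the prior formula is an assumption for illustration.
final class CachingPriorProvider {
    private final double heterozygosity;
    private final Map<Integer, double[]> cache = new HashMap<>();

    CachingPriorProvider(double heterozygosity) {
        if (heterozygosity <= 0.0 || heterozygosity >= 1.0) {
            throw new IllegalArgumentException("heterozygosity must be in (0, 1)");
        }
        this.heterozygosity = heterozygosity;
    }

    // Returns the log10 priors for allele counts 0..totalPloidy.
    // The array is computed once per total ploidy and then reused.
    // The cached array itself is returned, so callers must not modify it.
    double[] forTotalPloidy(int totalPloidy) {
        if (totalPloidy < 1) {
            throw new IllegalArgumentException("total ploidy must be positive");
        }
        return cache.computeIfAbsent(totalPloidy, this::compute);
    }

    private double[] compute(int totalPloidy) {
        final double[] log10Priors = new double[totalPloidy + 1];
        double sum = 0.0;
        for (int i = 1; i <= totalPloidy; i++) {
            final double prior = heterozygosity / i;
            log10Priors[i] = Math.log10(prior);
            sum += prior;
        }
        if (sum >= 1.0) {
            throw new IllegalArgumentException("heterozygosity too large for this total ploidy");
        }
        // Whatever probability mass is left goes to "no alternative allele".
        log10Priors[0] = Math.log10(1.0 - sum);
        return log10Priors;
    }
}
```

Asking twice for the same total ploidy returns the same array instance, so nothing is recomputed; a new total ploidy triggers exactly one new computation.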
Changes:
--------
* Moved the AFCalcFactory.Calculation enum into a top-level class,
AFCalculatorImplementation.
* Gave the enum more responsibilities, such as resolving the constructor
method once per implementation and the best-model selection algorithm.
* Removed test-only fields and methods from AFCalc; they were used only for
unit testing and provided no actual functionality of this component.
* Removed the fixed-ploidy constraint of the GeneralPloidyExactAFCalc
implementation; it can now deal with mixed ploidies that may change
per site and sample.
* Removed the fixed maxAltAllele restriction by allowing resizing of
the stateTracker structures.
* Due to the previous two points, calls to the AFCalc object are now passed
the default ploidy to assume in case some genotype in the input
VC does not have it, and the max-alt-allele.
* Also due to those changes, removed the now totally useless 3 int
parameters from all AFCalc constructors.
* Removed components and methods that are no longer used.
Dangling head merging (like with tails) is now enabled by default.
The --recoverDanglingHeads argument is now deprecated so that users know not to use it anymore.
We now also allow the user to set the minimum branch length for merging. This will be different
for exomes and RNA (see below).
The other changes in the code itself:
1. We no longer allow an arbitrarily large number of mismatches in the dangling head for merging
2. The max number of mismatches allowed in a dangling head is proportional to the kmer size
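As a rough illustration of a mismatch cap that scales with kmer size: the sketch below assumes the cap is kmerSize divided by a fixed integer divisor. The class name, the method names, and the divisor are all hypothetical; the actual constant and formula used in the code are not stated here.

```java
// Hypothetical helper; the divisor-based formula is an assumption.
final class DanglingHeadMismatchLimit {
    private DanglingHeadMismatchLimit() {}

    // Cap on mismatches, proportional to the kmer size.
    static int maxMismatches(int kmerSize, int divisor) {
        if (kmerSize < 1 || divisor < 1) {
            throw new IllegalArgumentException("kmerSize and divisor must be positive");
        }
        return kmerSize / divisor;
    }

    // Counts positions at which the two sequences differ, over their common length.
    static int countMismatches(byte[] danglingHead, byte[] reference) {
        final int length = Math.min(danglingHead.length, reference.length);
        int mismatches = 0;
        for (int i = 0; i < length; i++) {
            if (danglingHead[i] != reference[i]) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // A dangling head is only a merge candidate if it stays within the cap.
    static boolean withinLimit(byte[] danglingHead, byte[] reference, int kmerSize, int divisor) {
        return countMismatches(danglingHead, reference) <= maxMismatches(kmerSize, divisor);
    }
}
```

Under these assumptions a larger kmer size tolerates more mismatches, while a dangling head that differs at most positions is rejected regardless.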
There will be a difference in the RNA calling pipeline. Instead of invoking '--recoverDanglingHeads'
the user will instead want to use '--minDanglingBranchLength 0'.
Below are the knowledgebase results of the master branch vs. this one.
For NA12878 DNA Exome:
master SNPS TRUE_POSITIVE 36722
master SNPS CALLED_NOT_IN_DB_AT_ALL 2699
master SNPS REASONABLE_FILTERS_WOULD_FILTER_FP_SITE 292
master SNPS FALSE_POSITIVE_SITE_IS_FP 70
branch SNPS TRUE_POSITIVE 36867
branch SNPS CALLED_NOT_IN_DB_AT_ALL 2952
branch SNPS REASONABLE_FILTERS_WOULD_FILTER_FP_SITE 387
branch SNPS FALSE_POSITIVE_SITE_IS_FP 94
As I discussed with Ryan in person, there are a good number of FPs that are called in the new
code, but they nearly all have bad strand bias and should be easily filtered by VQSR.
Note that there is no change for indels.
For NA12878 RNA from Ami:
master SNPS TRUE_POSITIVE 11055
master SNPS CALLED_NOT_IN_DB_AT_ALL 831
master SNPS REASONABLE_FILTERS_WOULD_FILTER_FP_SITE 44
master SNPS FALSE_POSITIVE_SITE_IS_FP 96
branch SNPS TRUE_POSITIVE 11113
branch SNPS CALLED_NOT_IN_DB_AT_ALL 874
branch SNPS REASONABLE_FILTERS_WOULD_FILTER_FP_SITE 47
branch SNPS FALSE_POSITIVE_SITE_IS_FP 92
Again, there's basically no change for indels.
* Arguments involved are --no_cmdline_in_header, --sites_only, and --bcf for VCF files and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files
* PT 52740563
* Removed ReadUtils.createSAMFileWriterWithCompression(), replaced with ReadUtils.createSAMFileWriter(), which applies all appropriate engine-level arguments
* Replaced hard-coded field names in ArgumentDefinitionField (Queue extension generator) with a Reflections-based lookup that will fail noisily during extension generation if there's an error
Explicitly including gatk/queue test-jar artifacts in package test classpaths.
SelectVariantsIntegrationTest#testInvalidJexl now resets the JexlEngine silent flag that VariantFiltration.initialize() toggles.
External example no longer tries to unpack nonexistent gatk artifact jars during package tests.
Same changes fixed the problem for GenotypeGVCFs and CombineGVCFs.
Stories:
- https://www.pivotaltracker.com/story/show/77626044
- https://www.pivotaltracker.com/story/show/77626854
Changes:
- Generalized the code for the merging in GATKVariantContextUtils to cope
with ploidy != 2.
- GenotypeGVCFs now checks that the input's ploidy conforms to the '-ploidy'
argument.
- Moved the Reference Confidence VC merging code out of GATKVariantContextUtils
so that we can keep new code in protected.
Caveats:
- GenotypeGVCFs can only deal with input files that have the same ploidy at
all positions: the one that the user MUST indicate in the -ploidy argument
(if different from the default of 2).
- CombineGVCFs won't necessarily complain if it's passed mixed-ploidy
inputs, but you won't be able to genotype its output with GenotypeGVCFs.
Test:
- Removed deprecated unit tests for GATKVariantContextUtils.
- Moved unit-tests regarding GVCF merging from GATKVariantContextUtilsUnitTest
to ReferenceConfidenceVariantContextUtilsUnitTest.
- Added unit test for new code for mapping genotype indices between allele
index encodings in GenotypeLikelihoodCalculator.
- The original GenotypeGVCFs and CombineGVCFs integration tests are unaffected
by the change.
- Added tetraploid run integration tests to check on non-diploid execution
of GenotypeGVCFs and CombineGVCFs.
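One thing any ploidy-generic genotype handling has to account for is that the number of possible unordered genotypes, and hence the length of a per-genotype array such as PL, depends on both ploidy and allele count: it is C(ploidy + alleles - 1, ploidy). The helper below is a small illustrative sketch with a hypothetical class name; it is not taken from the GATK code.

```java
// Hypothetical helper illustrating how genotype counts grow with ploidy.
final class GenotypeCounts {
    private GenotypeCounts() {}

    // Number of unordered genotypes for the given ploidy and number of alleles
    // (reference included): C(ploidy + alleleCount - 1, ploidy).
    static long genotypeCount(int ploidy, int alleleCount) {
        if (ploidy < 1 || alleleCount < 1) {
            throw new IllegalArgumentException("ploidy and alleleCount must be positive");
        }
        long result = 1;
        // After step i, result equals C(alleleCount - 1 + i, i), so each
        // division is exact.
        for (int i = 1; i <= ploidy; i++) {
            result = result * (alleleCount - 1 + i) / i;
        }
        return result;
    }
}
```

For example, a diploid biallelic site has 3 genotypes, while a tetraploid site with 3 alleles has 15.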
Changed tests and scripts to use gatkdir full path instead of relative testdata/qscripts symbolic links.
Although the symlinks are not created, left the symlink deletion script execution in place with a comment about future removal.
Re-enabled example UG pipeline queue test.
Replaced all hardcoded strings of {public,private}/testdata with BaseTest variables.
Refactored temp list creation method from ListFileUtilsUnitTest to BaseTest.createTempListFile.
Removed list files with hardcoded paths, now using createTempListFile instead with private test dir variable.
We do this for technical reasons, mostly because we don't genotype in the HC anymore; it's all
done downstream by GenotypeGVCFs so we can't be sure that the genotype will be hom var. Also,
there are steps in the downstream pipeline where genotypes can change, so assuming anything in
the HC is a bad idea, and if we have phasing info in the het state, we want to propagate that forward.
Now, PGT tag fixing happens downstream in GenotypeGVCFs.
While I was in there I also cleaned up the code a bit and fixed a bug where annotation was happening
before genotype creation when using the --includeNonVariantSites argument.
Added tests accordingly.
* This is a shortcut for people who have multi-sample BAMs but would like to use GVCF mode. Rather than creating single-sample BAMs with PrintReads, one could use the --sample_name argument to HaplotypeCaller to specify the single sample to make calls on
* Completes PT 73075482
Story:
https://www.pivotaltracker.com/story/show/77250524
Changes:
- Removed the annotating code in the GeneralPloidyExactAFCalc (GPEAFC) class.
- Added asAlleleList to the GenotypeAlleleCounts class and got GPEAFC to use that instead of implementing its own (nicer and more reusable code).
- Removed the explicit addition of AlleleCountBySample fields to the VCF header by the walker initialize.
- Added utility methods in Utils to wrap an int[] array into a List<Integer>, and a double[] array into a List<Double>, efficiently.
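One common way to wrap a primitive array as a List without copying it is a read-only view built on java.util.AbstractList. The sketch below shows that technique under a hypothetical class name (ArrayViews); it is not necessarily how the Utils methods are implemented.

```java
import java.util.AbstractList;
import java.util.List;

// Hypothetical class name; shows a copy-free, read-only view over an array.
final class ArrayViews {
    private ArrayViews() {}

    // Wraps an int[] as a List<Integer>. Elements are boxed on access only,
    // and later changes to the array are visible through the list.
    static List<Integer> asList(final int[] values) {
        if (values == null) {
            throw new IllegalArgumentException("values must not be null");
        }
        return new AbstractList<Integer>() {
            @Override public Integer get(int index) { return values[index]; }
            @Override public int size() { return values.length; }
        };
    }

    // Same idea for double[] and List<Double>.
    static List<Double> asList(final double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("values must not be null");
        }
        return new AbstractList<Double>() {
            @Override public Double get(int index) { return values[index]; }
            @Override public int size() { return values.length; }
        };
    }
}
```

Because AbstractList leaves set, add and remove unsupported by default, the returned lists cannot be modified through the List interface; creating the view is O(1) regardless of array length.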
Test:
- Added unit-testing for asAlleleList in GenotypeAlleleCountsUnitTest (within testFirst and testNext).
- Added unit-testing for new methods in Utils : asList(int[]) and asList(double[])
- Changed UG General Ploidy test to add explicitly those annotations.
- Non-trivial changes in integration tests involving non-diploid runs (namely haploid and tetraploid) as they no longer show
those annotations, so the MD5s have been changed accordingly.
It turns out that there can be some really complex situations even with a single sample where
there are lots of unphasable hets around a hom. Previously we were trying to phase each of the
hets against the hom, but that wasn't correct. Instead we now detect that situation and don't
attempt to phase anything.
Added a unit test to cover this situation.
New annotation for low- and high-confidence de novos (only annotates biallelics)
FamilyLikelihoodsUtils now adds joint likelihood and joint posterior annotations
Restrict population priors based on discovered allele count to be valid for 10 or more samples.
Fix for the GeneralPloidyExactAFCalc implementation that was preventing -ploidy != 2 GVCF/BP_RESOLUTION output from working.
Story:
https://www.pivotaltracker.com/story/show/74471252
Tests:
Enabled GVCF tests with ploidy != 2 and others checking for the original ArrayIndexOutOfBounds exception.