As reported by Menachem Fromer: a critical bug in AFCalcResult:
Specifically, the implementation:
public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) {
return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef;
}
seems incorrect and should probably be:
getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef
The issue here is that the threshold (e.g., 30) represents a Phred-scaled probability of *error*, and it's currently being compared to a log10 probability of *non-error*.
Instead, we need to require that our probability of error be less than the error threshold.
This bug has only a minor impact on the calls -- hardly any sites change -- which is good. But the inverted logic significantly affects multi-allelic sites: you only hit this logic with multiple alleles, and in that case it's including extra alt alleles incorrectly and throwing out good ones.
The change was to create a new function that properly handles thresholds that are Phred-scaled quality scores:
/**
* Same as #isPolymorphic but takes a phred-scaled quality score as input
*/
public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) {
if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 ");
final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual));
return isPolymorphic(allele, log10Threshold);
}
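As a sanity check on the conversion, here's the arithmetic for a Phred 30 threshold (standard Phred math, not code from this commit):
final double phredThreshold = 30.0;
final double pError = Math.pow(10.0, -phredThreshold / 10.0);  // 0.001 = P(error)
final double log10PNonRef = Math.log10(1.0 - pError);          // log10(0.999) ~= -0.000434
// The old comparison held log10 P(AF > 0) against the log10 *error* threshold
// (-3.0), a bar so low that nearly every allele clears it.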
ReduceReads now co-reduces BAMs if they're passed in together with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples.
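In effect the rule is a union of variant regions across samples. A minimal sketch of the idea (illustrative only -- variantRegionsBySample is an invented name, and this is not the SlidingWindow code):
// A region marked variant in ANY input sample is treated as variant in ALL samples.
final Set<GenomeLoc> coReducedVariantRegions = new HashSet<GenomeLoc>();
for ( final Set<GenomeLoc> sampleVariantRegions : variantRegionsBySample )
    coReducedVariantRegions.addAll(sampleVariantRegions);  // union across samples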
Also:
* Added integration test for co-reduction
* Fixed bug with the new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5s accordingly)
DEV-200 #resolve #time 8m
-- The logic for determining active regions was a bit broken in the HC when intervals were used
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determining active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive (see the sketch after this list)
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
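A hedged sketch of the per-interval rule mentioned above (helper names are invented; this is not the actual TAR code):
// Each engine interval is evaluated on its own: an exon's active/inactive call
// depends only on the evidence inside that interval, never on its neighbors.
for ( final GenomeLoc interval : nonOverlappingIntervals ) {
    final boolean active = isActiveOverInterval(interval);  // looks only within this interval
    activityByInterval.put(interval, active);
}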
The MD5s for these tests were changed in commit 87435f1074615b2cd016f042980109fd53962c8d
to match the output of a broken version of BaseRecalibration. With the patch in
commit c397102ecc1fd1d2cd8f209a8f358ab4a60b50a7, the output once again matches the
*original* MD5s for these tests, and does not vary as you increase -nct.
Final resolution to GSA-632
-- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests run PrintReads with -nct 1, 2, and 4; the 4-thread run fails, but that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
The old BaseRecalibrator walker is not and never will be thread-safe, since it's a
LocusWalker that uses read attributes to track state.
ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe
at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1
for testing purposes, but wait for further testing to be done before using it
for production purposes in multithreaded mode.
-With this change, BQSR performance scales properly with thread count rather
than gaining nothing from additional threads.
-Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct
(NanoScheduler)
-Removes high-level locks in the recalibration engines and NestedIntegerArray
in favor of maximally-granular locks on and around manipulation of the leaf
nodes of the NestedIntegerArray.
-NestedIntegerArray now creates all interior nodes upfront rather than on
the fly to avoid the need for locking during tree traversals. This uses
more memory in the initial part of BQSR runs, but the BQSR would eventually
converge to use this memory anyway over the course of a typical run.
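A condensed sketch of the locking scheme (deliberately much simpler than the real NestedIntegerArray; Leaf, MAX_QUAL, and MAX_CYCLE are invented for illustration):
// All interior structure is allocated upfront, so traversals never mutate
// shared state and need no locks; only the leaf update itself synchronizes.
static final class Leaf { long observations; long errors; }

final Leaf[][] leaves = new Leaf[MAX_QUAL][MAX_CYCLE];  // built once, before any threads start
// ... every leaves[q][c] is filled with new Leaf() during construction ...

void update(final int qual, final int cycle, final boolean isError) {
    final Leaf leaf = leaves[qual][cycle];  // lock-free traversal
    synchronized ( leaf ) {                 // maximally-granular lock on the leaf
        leaf.observations++;
        if ( isError ) leaf.errors++;
    }
}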
IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator
walker with multiple threads. The BaseRecalibrator walker is not and never will be
thread-safe, as it's a LocusWalker that uses read attributes to track
state information. ONLY the newer DelocalizedBaseRecalibrator can be made
thread-safe (and will hopefully be made so in my subsequent commits). This
commit addresses performance, not correctness.
Bringing in the following relevant changes:
* Fixes the indel realigner N-Way out null pointer exception DEV-10
* Optimizations to ReduceReads that cut the run time to one third.
Conflicts:
protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java
DEV-10 #resolve #time 2m
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. The previous argument is deprecated with a meaningful doc message pointing people to maxAltAlleles (see the sketch after this list)
-- All constructors, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic. That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele. The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to do a separate discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating. This would have the benefit of the QUAL staying constant as more alleles are genotyped, at the cost of some added code complexity and runtime. Might be worth it, though.
-- Created a JIRA ticket https://jira.broadinstitute.org/browse/GSA-623 for Guillermo to look at the differences as the multi-allelic nature of many sites seems to change with the new more protected infrastructure. This may be due to implementation issues in the pooled caller, problems with my interface, or could be a genuine improvement.
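The sketch referenced in the first item of this list -- one plausible shape for the deprecation (flag and field names here are illustrative, not the committed code):
@Deprecated
@Argument(fullName = "max_alternate_alleles_for_indels", shortName = "maxAltAllelesForIndels",
          doc = "DEPRECATED -- use maxAltAlleles for both SNPs and indels", required = false)
public Integer MAX_ALTERNATE_ALLELES_FOR_INDELS = null;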
-- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case
-- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations
-- Added systematic unit tests covering this implementation, and comparing it to others
-- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0 even after summing up multiple likelihoods, which isn't safe to do and was causing us to lose likelihood in some cases (see the numeric illustration at the end of this list)
-- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right
-- Changed UG / HC to use this one via the StandardCallerArgumentCollection
-- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums
-- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests
-- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel, return a null VC when the only allele we get from our final alleles to use method is the reference base
-- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF > 0) = X for some meaningful value of X. Added unit test to ensure this behavior is correct
-- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order. Add contract to ensure this doesn't ever happen again
-- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model
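A quick numeric illustration of why that cap was unsafe (plain arithmetic, not code from the commit):
// Two individually valid log10 likelihoods can legitimately sum past 0 before
// normalization, so clamping the sum at 0 silently discards probability mass:
// log10(10^-0.1 + 10^-0.1) = log10(0.794 + 0.794) = log10(1.588) ~= 0.201
final double sum = Math.log10(Math.pow(10, -0.1) + Math.pow(10, -0.1));  // ~0.201 > 0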