Problem
--------
The logless HMM scale factor (used to avoid double underflows) was 10^300. Although this serves the purpose, the value has a complex mantissa that further complicates CPU calculations.
Solution
---------
Initialize with 2^1020 (2^1023 is the largest power of two a double can represent), and adjust the scale factor accordingly.
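As a rough illustration of why a power-of-two scale is friendlier than 10^300, the sketch below inspects the mantissa bits of both constants and checks that scaling by 2^1020 and back is exact. The class and method names are hypothetical and are not taken from the PairHMM code.

```java
final class ScaleFactorSketch {
    /** 2^1020, built exactly: only the exponent field of the double is set. */
    static final double POWER_OF_TWO_SCALE = Math.scalb(1.0, 1020);

    /** The previous scale factor; not a power of two, so its mantissa is non-trivial. */
    static final double DECIMAL_SCALE = 1e300;

    /** Returns the 52 explicit mantissa (fraction) bits of a double. */
    static long mantissaBits(double value) {
        return Double.doubleToLongBits(value) & 0x000FFFFFFFFFFFFFL;
    }

    public static void main(String[] args) {
        double p = 0.1234567;
        // Multiplying and dividing by a power of two only shifts the exponent,
        // so (absent overflow or underflow) the round trip loses nothing.
        double roundTrip = (p * POWER_OF_TWO_SCALE) / POWER_OF_TWO_SCALE;
        System.out.println("mantissa of 2^1020: 0x" + Long.toHexString(mantissaBits(POWER_OF_TWO_SCALE)));
        System.out.println("mantissa of 1e300:  0x" + Long.toHexString(mantissaBits(DECIMAL_SCALE)));
        System.out.println("round trip exact:   " + (roundTrip == p));
    }
}
```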
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:
ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
in which the alt and ref haplotypes differ in having an indel at both the start and end of the bubble. The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in either the above alignment or the following alignment, depending on exactly how many GA units were present in the bubble.
ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls. I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.
-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check somewhat systematically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all
General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly. Deprecated old call to inline constants. This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default. Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
-- Ensure that BQSR works properly for an Ion Torrent BAM. (Added integration test and bam)
-- Improve the error message when an unknown platform is found (integration test added)
-- Moved R^2 LD haplotype merging system to the utils.haplotype package
-- New LD merging only enabled with HC argument.
-- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enable it via a static variable
-- A few misc. bug fixes in LDMerger itself
-- Refactoring of Haplotype event splitting and merging code
-- Renamed EventExtractor to EventMap
-- EventMap has a static method that computes the event maps among n haplotypes
-- Refactor Haplotype score and base comparators into their own classes and unit tested them
-- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it.
-- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator. For those who haven't seen this testing idiom, have a look; it is very useful
-- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population.
-- Fixed fundamental bug in the way the previous R^2 implementation worked
-- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are
-- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed. Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events
-- Bugfixes for EventMap in the HaplotypeCaller as well. The previous version was overly restrictive, requiring that the first event to be made into a block substitution be a SNP. In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M. The new code supports this. UnitTested and documented as well. LDMerger handles the case where merging two alleles results in a no-op event: merging CA/C + A/AA -> CAA/CAA -> no-op. It handles this case by removing the two events. UnitTested
-- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled
-- This new version does a much more specific test (that's actually right). Here's the new algorithm:
* Compute probability that two variants are in phase with each other and that no
* compound hets exist in the population.
*
* Implemented as a likelihood ratio test of the hypothesis:
*
* x11 and x22 are the only haplotypes in the populations
*
* vs.
*
* all four haplotype combinations (x11, x12, x21, and x22) all exist in the population.
*
* Now, since we have to have both variants in the population, we exclude the x11 & x11 state. So the
* p of having just x11 and x22 is P(x11 & x22) + P(x22 & x22).
*
* Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are:
*
* - P(x11 & x12 & x21) -- we have hom-ref and both hets
* - P(x22 & x12 & x21) -- we have hom-alt and both hets
* - P(x22 & x12) -- one haplotype is 22 and the other is het 12
* - P(x22 & x21) -- one haplotype is 22 and the other is het 21
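The comparison described in the comment block above can be sketched as follows. This is only a minimal reading of that text: the class and method names are hypothetical, the inputs are assumed to be already-computed probabilities for each configuration, and normalizing the two hypotheses into a single probability is an assumption of this sketch rather than something the commit message states.

```java
final class PhasingLikelihoodRatioSketch {
    /** Hypothesis 1: only x11 and x22 exist, i.e. P(x11 & x22) + P(x22 & x22). */
    static double phasedOnly(double pX11X22, double pX22X22) {
        return pX11X22 + pX22X22;
    }

    /** Hypothesis 2: sum over the four configurations that include a het haplotype. */
    static double withCompoundHets(double pX11X12X21, double pX22X12X21,
                                   double pX22X12, double pX22X21) {
        return pX11X12X21 + pX22X12X21 + pX22X12 + pX22X21;
    }

    /** Assumption of this sketch: turn the ratio into a probability by normalizing. */
    static double probabilityInPhase(double phasedOnly, double withCompoundHets) {
        return phasedOnly / (phasedOnly + withCompoundHets);
    }

    public static void main(String[] args) {
        double h1 = phasedOnly(0.6, 0.2);
        double h2 = withCompoundHets(0.05, 0.05, 0.05, 0.05);
        System.out.println("P(in phase) = " + probabilityInPhase(h1, h2));
    }
}
```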
Problem:
--------
PairHMM was generating positive likelihoods (even after the re-work of the model)
Solution:
---------
The caching indices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM.
Summarized Changes:
------------------
* Redesign the matrices to have only 1 col/row of padding instead of 2.
* PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start)
* Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version)
* Adjust the prior and probability matrices to be one based (logless)
* Update Log10PairHMM to work with prior and probability matrices as well
* Move prior and probability matrices to parent class
* Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations
* Simple cleanup of PairHMMUnitTest class for a little speedup
* Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base).
* Create static indices for the transition probabilities (for better readability)
[fixes#47399227]
The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because:
# Initial conditions were not being set properly
# Emission probabilities in the last row were not adding up to 1
The following commit fixes both by
# averaging all potential start locations (giving an equal prior to the state machine in its first iteration -- allowing the read to start its alignment anywhere in the haplotype with equal probability)
# discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state)
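The two fixes above can be sketched in a heavily simplified, non-log pair HMM. Everything here is illustrative: the class name, the constant error and gap probabilities, and the string-based signature are assumptions of this sketch, and the real implementations use per-base qualities and caching that are omitted.

```java
final class PairHmmSketch {
    // Illustrative constants only; not the GATK defaults.
    static final double BASE_ERROR = 0.001;
    static final double GAP_OPEN = 0.001;
    static final double GAP_EXTEND = 0.1;

    static double readLikelihood(String read, String haplotype) {
        int readLen = read.length();
        int hapLen = haplotype.length();
        double[][] match = new double[readLen + 1][hapLen + 1];
        double[][] insertion = new double[readLen + 1][hapLen + 1];
        double[][] deletion = new double[readLen + 1][hapLen + 1];

        // Fix 1: equal prior 1/N over the N potential start sites, stored in
        // row 0 of the deletion matrix. The match matrix is NOT pre-seeded.
        for (int j = 0; j < hapLen; j++) {
            deletion[0][j] = 1.0 / hapLen;
        }

        double matchToMatch = 1.0 - 2.0 * GAP_OPEN;
        double indelToMatch = 1.0 - GAP_EXTEND;

        for (int i = 1; i <= readLen; i++) {
            for (int j = 1; j <= hapLen; j++) {
                double prior = read.charAt(i - 1) == haplotype.charAt(j - 1)
                        ? 1.0 - BASE_ERROR
                        : BASE_ERROR / 3.0;
                match[i][j] = prior * (match[i - 1][j - 1] * matchToMatch
                        + (insertion[i - 1][j - 1] + deletion[i - 1][j - 1]) * indelToMatch);
                insertion[i][j] = match[i - 1][j] * GAP_OPEN + insertion[i - 1][j] * GAP_EXTEND;
                deletion[i][j] = match[i][j - 1] * GAP_OPEN + deletion[i][j - 1] * GAP_EXTEND;
            }
        }

        // Fix 2: sum the last row of the match and insertion matrices only;
        // paths ending in a deletion are left out, so no end-state matrix is needed.
        double total = 0.0;
        for (int j = 1; j <= hapLen; j++) {
            total += match[readLen][j] + insertion[readLen][j];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("exact match: " + readLikelihood("ACGT", "ACGT"));
        System.out.println("one mismatch: " + readLikelihood("ACTT", "ACGT"));
    }
}
```

Because the outgoing transitions of each state sum to one and the start prior sums to one, the returned value stays at or below 1 for a single read, which is the property the probability check mentioned below guards.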
Summarized changes:
* Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm
* Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1.
* Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now)
* Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity
* Rename metric lengths to read and haplotype lengths for clarity
* Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity
* Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix
* Remove unnecessary parameters from updateCell()
* Fix the expected probabilities coming from the exact model in PairHMMUnitTest
* Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables)
* Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s.
[fix 47164949]
--Mostly doc block tweaks
--Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
* ReadTransformers can say they must be first, must be last, or don't care.
* By default, none of the existing ones care about ordering except BQSR (must be first).
* This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
* The engine now orders the read transformers up front before applying iterators.
* The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
* Added unit tests.
* Removed from codebase NestedHashMap since it is unused and untested.
* Integration tests change because the BQSR CSV is now sorted automatically.
* Resolves GSA-732
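A minimal sketch of the ordering rule described above, assuming a three-valued constraint and a stable sort. The type names, the constraint enum, and the choice of IllegalStateException are hypothetical; the engine's real classes and exception types are not shown in this commit message.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ReadTransformerOrderingSketch {
    /** Declared in the order in which transformers should be applied. */
    enum OrderingConstraint { MUST_BE_FIRST, DO_NOT_CARE, MUST_BE_LAST }

    static final class Transformer {
        final String name;
        final OrderingConstraint constraint;

        Transformer(String name, OrderingConstraint constraint) {
            this.name = name;
            this.constraint = constraint;
        }
    }

    /** Orders the enabled transformers up front, failing on incompatible constraints. */
    static List<Transformer> order(List<Transformer> enabled) {
        long first = enabled.stream().filter(t -> t.constraint == OrderingConstraint.MUST_BE_FIRST).count();
        long last = enabled.stream().filter(t -> t.constraint == OrderingConstraint.MUST_BE_LAST).count();
        if (first > 1 || last > 1) {
            throw new IllegalStateException("Incompatible read transformers: more than one must be first or last");
        }
        List<Transformer> sorted = new ArrayList<>(enabled);
        // List.sort is stable, so "don't care" transformers keep their relative order.
        sorted.sort(Comparator.comparing((Transformer t) -> t.constraint));
        return sorted;
    }

    public static void main(String[] args) {
        List<Transformer> enabled = new ArrayList<>();
        enabled.add(new Transformer("BAQ", OrderingConstraint.DO_NOT_CARE));
        enabled.add(new Transformer("BQSR", OrderingConstraint.MUST_BE_FIRST));
        for (Transformer t : order(enabled)) {
            System.out.println(t.name);
        }
    }
}
```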
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's a BQSR-specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
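For readers unfamiliar with the conversions named in the QualityUtils items above, here is a rough sketch of capped probability-to-Phred conversion. The method names mirror those mentioned in the commit, but the bodies are assumptions of this sketch: the use of Math.log1p, the restriction of the cap to 0..127, and the argument checks are illustrative stand-ins, not the QualityUtils or MathUtils implementations.

```java
final class QualConversionSketch {
    /** log10(1 - x); log1p avoids the rounding in (1 - x) when x is very small. */
    static double log10OneMinusX(double x) {
        return Math.log1p(-x) / Math.log(10.0);
    }

    /** Phred-scales an error probability and caps it at maxQual. */
    static byte errorProbToQual(double errorProb, int maxQual) {
        checkArgs(errorProb, maxQual);
        return capAndRound(-10.0 * Math.log10(errorProb), maxQual);
    }

    /** Phred-scales the complement of a probability of being correct. */
    static byte trueProbToQual(double trueProb, int maxQual) {
        checkArgs(trueProb, maxQual);
        return capAndRound(-10.0 * log10OneMinusX(trueProb), maxQual);
    }

    private static byte capAndRound(double phred, int maxQual) {
        // Cap before rounding so that +Infinity (probability of error 0) maps to maxQual.
        long rounded = Math.round(Math.min(phred, (double) maxQual));
        // Safe narrowing: checkArgs guarantees 0 <= maxQual <= 127.
        return (byte) rounded;
    }

    private static void checkArgs(double prob, int maxQual) {
        if (Double.isNaN(prob) || prob < 0.0 || prob > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]: " + prob);
        }
        if (maxQual < 0 || maxQual > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("cap must fit in a signed byte: " + maxQual);
        }
    }

    public static void main(String[] args) {
        System.out.println(errorProbToQual(0.001, 93));
        System.out.println(trueProbToQual(0.999, 93));
        System.out.println(errorProbToQual(1e-20, 93));
    }
}
```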
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now
-- Added lots of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now computes constants, if they haven't been initialized yet, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
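The split between the public computeReadLikelihoodGivenHaplotypeLog10 and the protected subComputeReadLikelihoodGivenHaplotypeLog10 mentioned in the list above is a template-method arrangement. The sketch below shows the shape only: the class name is hypothetical, the signature is reduced to two byte arrays (the real methods take more arguments), and the exception types are assumptions.

```java
abstract class PairHmmTemplateSketch {
    private boolean initialized = false;
    private int maxReadLength;
    private int maxHaplotypeLength;

    public void initialize(int maxReadLength, int maxHaplotypeLength) {
        this.maxReadLength = maxReadLength;
        this.maxHaplotypeLength = maxHaplotypeLength;
        this.initialized = true;
    }

    /** Public entry point: checks arguments, delegates, then checks the result. */
    public final double computeReadLikelihoodGivenHaplotypeLog10(byte[] haplotype, byte[] read) {
        if (!initialized) {
            throw new IllegalStateException("initialize() must be called first");
        }
        if (read.length > maxReadLength) {
            throw new IllegalArgumentException("read longer than the configured maximum");
        }
        if (haplotype.length > maxHaplotypeLength) {
            throw new IllegalArgumentException("haplotype longer than the configured maximum");
        }
        double result = subComputeReadLikelihoodGivenHaplotypeLog10(haplotype, read);
        // A log10 likelihood above 0 would mean a probability greater than 1.
        if (Double.isNaN(result) || result > 0.0) {
            throw new IllegalStateException("log10 likelihood must be <= 0 but was " + result);
        }
        return result;
    }

    /** Subclasses implement the actual HMM here; no argument checking needed. */
    protected abstract double subComputeReadLikelihoodGivenHaplotypeLog10(byte[] haplotype, byte[] read);
}
```

Keeping the checks in one final public method means every subclass gets the same argument and result validation without repeating it.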
- Throws user exception if it is.
- Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
- Added test to check this is working.
- Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
- Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
- Added comment about not changing the program record name, as per reviewer comments.
- Removed unused variable.
- It's now written into the recal report so that it can be used in the PrintReads step.
- Note that we also now write the --deletions_default_quality value which accidentally wasn't being written before!
- Added tests to make sure that the value of the --maximum_cycle_value is being used properly by PR with -BQSR.
(This is my last non-branch commit; all future pushes will follow new GATK practices)
Please check that your commit hook is properly pointing at ../../private/shell/pre-commit
Conflicts:
public/java/test/org/broadinstitute/variant/VariantBaseTest.java
-Moved some of the more specialized / complex VariantContext and VCF utility
methods back to the GATK.
-Due to this re-shuffling, was able to return things like the Pair class back
to the GATK as well.
a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with a given error profile - for now only a SNP/indel error rate can be given. A bad context can be specified, and if such a context is present then the error rate is increased to the given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
Instead of the GATK Engine creating a new BaseRecalibrator (not clean), it just keeps track of the arguments (clean).
There are still some dependency issues, but it looks like they are related to Ami's code. Need to look into it further.
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.
-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)
-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)
-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.