Commit Graph

4 Commits (d55dddfdbab3a0faac4fa562fbc4090d8b7b331b)

Author SHA1 Message Date
Mauricio Carneiro ebe2edbef3 Fix caching indices in the PairHMM
Problem:
--------
PairHMM was generating positive likelihoods (even after the re-work of the model)

Solution:
---------
The caching idices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM.

Summarized Changes:
------------------
* Redesign the matrices to have only 1 col/row of padding instead of 2.
* PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start)
* Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version)
* Adjust the prior and probability matrices to be one based (logless)
* Update Log10PairHMM to work with prior and probability matrices as well
* Move prior and probability matrices to parent class
* Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations
* Simple cleanup of PairHMMUnitTest class for a little speedup
* Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base).
* Create static indices for the transition probabilities (for better readability)

[fixes #47399227]
2013-04-08 11:05:12 -04:00
Mauricio Carneiro 0de6f55660 PairHMM rework
The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because:
   # Initial conditions were not being set properly
   # Emission probabilities in the last row were not adding up to 1

The following commit fixes both by
   # averaging all potential start locations (giving an equal prior to the state machine in it's first iteration -- allowing the read to start it's alignment anywhere in the haplotype with equal probability)
   # discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state)

Summarized changes:
   * Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm
   * Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1.
   * Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now)
   * Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity
   * Rename metric lengths to read and haplotype lengths for clarity
   * Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity
   * Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix
   * Remove unnecessary parameters from updateCell()
   * Fix the expected probabilities coming from the exact model in PairHMMUnitTest
   * Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables)
   * Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s.

[fix 47164949]
2013-03-30 10:50:06 -04:00
David Roazen 12a3d7ecad Fix licenses on files modified in 2.4-1 2013-02-26 21:53:17 -05:00
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00