gatk-3.8/protected
Valentin Ruano-Rubio 2e964c59b4 Improved criteria to select best haplotypes out from the assembly graph.
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.

The edge *mulitplicity* is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely map to some vertex up-stream from the edge and the following base calls extend across that edge to vertices downstream from it.

Despite that it is obvious that higher multiplicties correlated with haplotype probability this criterion fails short in some regards of which the most relevant is:

As it is evaluated in condensed seq-graph (as supposed to uncompressed read-threading-graphs) it is bias to haplotypes that have more short-sequence vetices
  ( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly result of how we modify the edge multiplicities when we merge vertices from a linear chain.

This pull-request addresses the problem by changing to a new scoring schema based in likelihood estimates:

Each haplotype's likelihood can be calculated as the multiplication of the likelihood of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divide by the sum of multiplicity of edges that share the same source vertex.

This pull-request addresses the following stories:

https://www.pivotaltracker.com/story/show/66691418
https://www.pivotaltracker.com/story/show/64319760

Change Summary:

1. Change to the new scoring schema.
2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformation have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0 edges assuming that they are in fact 0.5)
2014-03-14 18:37:01 -04:00
..
gatk-protected Improved criteria to select best haplotypes out from the assembly graph. 2014-03-14 18:37:01 -04:00
pom.xml Update pom versions to mark the start of GATK 3.1 development 2014-03-06 00:05:58 -05:00