gatk-3.8/public
Valentin Ruano-Rubio 2e964c59b4 Improved criteria to select best haplotypes out from the assembly graph.
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.

The edge *multiplicity* is equal to the number of reads that span that edge, i.e. reads that have a kmer that uniquely maps to some vertex upstream from the edge and whose following base calls extend across that edge to vertices downstream from it.

Although higher multiplicities clearly correlate with haplotype probability, this criterion falls short in some regards, of which the most relevant is:

As it is evaluated in the condensed seq-graph (as opposed to the uncompressed read-threading graphs), it is biased toward haplotypes that have more short-sequence vertices
  ( -> ATGC -> CA -> has a worse score than -> A -> T -> G -> C -> C -> A ->). This is partly a result of how we modify the edge multiplicities when we merge vertices from a linear chain.
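
As a rough illustration of the bias (a sketch only: the multiplicities are invented, `absolute_score` is a hypothetical helper, and it assumes for simplicity that every edge keeps the same multiplicity whether or not vertices are merged, which is not necessarily how the graph code adjusts them):

```python
def absolute_score(edge_multiplicities):
    """Old criterion: sum of the edge multiplicities along a path."""
    return sum(edge_multiplicities)

# Hypothetical haplotype supported by 10 reads on every edge.
# Condensed:   -> ATGC -> CA ->                  (3 edges)
# Uncondensed: -> A -> T -> G -> C -> C -> A ->  (7 edges)
condensed = [10] * 3
uncondensed = [10] * 7

print(absolute_score(condensed))    # 30
print(absolute_score(uncondensed))  # 70
```

The same sequence with the same read support scores higher simply because it is split across more vertices.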

This pull-request addresses the problem by changing to a new scoring schema based on likelihood estimates:

Each haplotype's likelihood can be calculated as the product of the likelihoods of "taking" each of its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divided by the sum of the multiplicities of all edges that share the same source vertex.
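
The calculation described above might be sketched as follows. This is not the code in this commit; the function names, the dict-based graph representation and the example multiplicities are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def edge_likelihoods(edges):
    """edges maps (source, target) -> multiplicity.

    Returns a dict mapping each edge to its multiplicity divided by the
    total multiplicity of all edges leaving the same source vertex.
    Assumes every source has a non-zero total.
    """
    out_total = defaultdict(int)
    for (src, _), mult in edges.items():
        out_total[src] += mult
    return {edge: mult / out_total[edge[0]] for edge, mult in edges.items()}

def haplotype_likelihood(path, likelihoods):
    """Product of the edge likelihoods along a path given as a vertex list."""
    return math.prod(likelihoods[(a, b)] for a, b in zip(path, path[1:]))

# Hypothetical graph: two branches leave "s"; the first is a linear chain.
edges = {
    ("s", "A"): 30, ("A", "A2"): 30, ("A2", "t"): 30,
    ("s", "B"): 10, ("B", "t"): 10,
}
lk = edge_likelihoods(edges)
print(haplotype_likelihood(["s", "A", "A2", "t"], lk))  # 0.75
print(haplotype_likelihood(["s", "B", "t"], lk))        # 0.25
```

Every edge inside a linear chain is the only edge leaving its source, so its likelihood is 1 and the score does not depend on whether the chain has been merged into a single vertex.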

This pull-request addresses the following stories:

https://www.pivotaltracker.com/story/show/66691418
https://www.pivotaltracker.com/story/show/64319760

Change Summary:

1. Change to the new scoring schema.
2. Added graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformations have been modified so that they generate no 0-multiplicity edges. (Nevertheless, the schema above should work with 0-multiplicity edges, assuming that their multiplicity is in fact 0.5.)
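
For diagnosing scores, a DOT dump that labels each edge with its multiplicity and transition likelihood can help. The sketch below is not the output format of KBestHaplotypeFinder; `to_dot`, `effective_multiplicity` and the label layout are hypothetical. It reads the note in item 3 as "treat a 0-multiplicity edge as if its multiplicity were 0.5", which is an interpretation rather than something the message spells out.

```python
def effective_multiplicity(mult):
    """Assumed reading of item 3: a 0-multiplicity edge counts as 0.5."""
    return 0.5 if mult == 0 else mult

def to_dot(edges):
    """edges maps (source, target) -> multiplicity; returns DOT text."""
    totals = {}
    for (src, _), mult in edges.items():
        totals[src] = totals.get(src, 0) + effective_multiplicity(mult)
    lines = ["digraph assembly {"]
    for (src, dst), mult in edges.items():
        p = effective_multiplicity(mult) / totals[src]
        lines.append(f'  "{src}" -> "{dst}" [label="m={mult} p={p:.3f}"];')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical vertex with one supported and one unsupported outgoing edge.
dot = to_dot({("ATGC", "CA"): 9, ("ATGC", "TA"): 0})
print(dot)
```

The resulting text can be rendered with Graphviz, e.g. `dot -Tpng graph.dot -o graph.png`.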
2014-03-14 18:37:01 -04:00
..
VectorPairHMM Removed g_haplotype* global variables in native code so that it works 2014-03-06 22:08:35 -08:00
c At chartl's request, add the bwa aln -N and bwa aln -m parameters to the bindings. 2012-01-17 14:47:53 -05:00
chainFiles
doc Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs) 2013-03-12 10:57:14 -04:00
external-example Update pom versions to mark the start of GATK 3.1 development 2014-03-06 00:05:58 -05:00
gatk-framework Improved criteria to select best haplotypes out from the assembly graph. 2014-03-14 18:37:01 -04:00
gatk-package Unconditionally include all of commons-httpclient in the GATK/Queue jars 2014-03-14 10:50:15 -04:00
gatk-queue-extgen Update pom versions to mark the start of GATK 3.1 development 2014-03-06 00:05:58 -05:00
gsalib Update pom versions to mark the start of GATK 3.1 development 2014-03-06 00:05:58 -05:00
java/config Moved files to maven directories. 2014-02-03 13:50:44 -05:00
package-tests Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests 2014-03-10 21:24:03 -04:00
perl Fixing the liftover script to not require strict VCF header validation. 2013-11-07 09:02:17 -05:00
queue-framework Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests 2014-03-10 21:24:03 -04:00
queue-package Unconditionally include all of commons-httpclient in the GATK/Queue jars 2014-03-14 10:50:15 -04:00
repo Replaced local drmaa and Jama artifacts with versions from maven central. 2014-02-22 01:21:35 +08:00
sting-root Unconditionally include all of commons-httpclient in the GATK/Queue jars 2014-03-14 10:50:15 -04:00
sting-utils Merge remote-tracking branch 'origin/master' into intel 2014-03-10 14:07:36 -04:00
pom.xml Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests 2014-03-10 21:24:03 -04:00