a bit more on short read mapping

The tech note still needs improvement. Will do that after the release of v2.3.
2017-10-22 18:38:35 -04:00 · 2017-10-22 18:38:35 -04:00 · 1dd221ad82
parent c6b6392b70
commit 1dd221ad82
1 changed files with 19 additions and 9 deletions
--- a/tex/minimap2.tex
+++ b/tex/minimap2.tex
@@ -68,7 +68,7 @@ approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
 generating base-level alignment, which in turn inspired us to develop minimap2
 towards higher accuracy and more practical functionality.

-Both SMRT and ONT have been applied to sequence spliced mRNAs (RNA-seq). While
+Both SMRT and ONT have been applied to the sequencing of spliced mRNAs (RNA-seq). While
 traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
 optimized for long noisy sequence reads and are tens of times slower than
 dedicated long-read aligners. When developing minimap2 initially for aligning
@@ -111,8 +111,11 @@ distance between two anchors is too large); otherwise
 \begin{equation}\label{eq:chain-gap}
 \beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big)
 \end{equation}
-In implementation, a gap of length $l$ costs $\gamma_c(l)=0.01\cdot \bar{w}\cdot
-|l|+0.5\log_2|l|$, where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
+In implementation, a gap of length $l$ costs
+\[
+\gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l|
+\]
+where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
 Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster
 chaining algorithms exist~\citep{Abouelhoda:2005aa}, they
 are inapplicable to generic gap cost, complex to implement and usually
@@ -363,12 +366,19 @@ alignment.
 \subsection{Aligning short paired-end reads}

 During chaining, minimap2 takes a pair of reads as one read with a gap of
-unknown length in the middle. It does not break a chain if there is a long
-reference gap between seeds on different reads. After identifying primary
-chains (Section~\ref{sec:primary}), we split each fragment chain into two read
-chains and perform alignment for each read as in Section~\ref{sec:genomic}.
-Finally, we pair hits of each read end to find consistent paired-end
-alignments.
+unknown length in the middle. It applies a normal gap cost between seeds on the
+same read but a more permissive gap cost between seeds on different reads.
+More precisely, the gap cost during chaining is:
+\[
+\gamma_c(l)=\left\{\begin{array}{ll}
+0.01\cdot\bar{w}\cdot|l|+0.5\log_2|l| & \mbox{if two seeds on the same read} \\
+\min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & \mbox{otherwise}
+\end{array}\right.
+\]
+After identifying primary chains (Section~\ref{sec:primary}), we split each
+fragment chain into two read chains and perform alignment for each read as in
+Section~\ref{sec:genomic}.  Finally, we pair hits of each read end to find
+consistent paired-end alignments.

 \end{methods}
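
The piecewise chaining gap cost added in the last hunk can be sketched in Python as follows. This is only an illustration of the formula as written in the patched text, not minimap2's actual C implementation: the function name `chain_gap_cost`, its parameters, and the zero cost returned for a gap of length 0 are assumptions made here, and the average seed length of 15 in the usage lines is a hypothetical value.

```python
import math


def chain_gap_cost(gap_len, avg_seed_len, same_read=True):
    """Chaining gap cost gamma_c(l) following the formula in the patch.

    gap_len      -- gap length l (may be negative; |l| is used)
    avg_seed_len -- average seed length, \\bar{w} in the text
    same_read    -- True if both seeds lie on the same read of the pair
    """
    l = abs(gap_len)
    if l == 0:
        # Assumption: a zero-length gap costs nothing (log2(0) is undefined).
        return 0.0
    linear = 0.01 * avg_seed_len * l
    if same_read:
        # Normal gap cost: linear term plus a logarithmic term.
        return linear + 0.5 * math.log2(l)
    # Seeds on different reads: the more permissive cost, the smaller
    # of the linear term and the logarithmic term.
    return min(linear, math.log2(l))


# Hypothetical usage: average seed length 15, gap of 100 bp.
within = chain_gap_cost(100, 15, same_read=True)
across = chain_gap_cost(100, 15, same_read=False)
print(within, across)
```

For long gaps the cross-read cost grows only logarithmically, which is what lets a chain span the unsequenced insert between the two reads of a pair without being broken.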