a bit more on short read mapping

The tech note still needs improvement. Will do that after the release of v2.3.
Heng Li 2017-10-22 18:38:35 -04:00
parent c6b6392b70
commit 1dd221ad82
1 changed files with 19 additions and 9 deletions

@@ -68,7 +68,7 @@ approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
 generating base-level alignment, which in turn inspired us to develop minimap2
 towards higher accuracy and more practical functionality.
-Both SMRT and ONT have been applied to sequence spliced mRNAs (RNA-seq). While
+Both SMRT and ONT have been applied to the sequencing of spliced mRNAs (RNA-seq). While
 traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
 optimized for long noisy sequence reads and are tens of times slower than
 dedicated long-read aligners. When developing minimap2 initially for aligning
@@ -111,8 +111,11 @@ distance between two anchors is too large); otherwise
 \begin{equation}\label{eq:chain-gap}
 \beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big)
 \end{equation}
-In implementation, a gap of length $l$ costs $\gamma_c(l)=0.01\cdot \bar{w}\cdot
-|l|+0.5\log_2|l|$, where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
+In implementation, a gap of length $l$ costs
+\[
+\gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l|
+\]
+where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
 Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster
 chaining algorithms exist~\citep{Abouelhoda:2005aa}, they
 are inapplicable to generic gap cost, complex to implement and usually
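The quadratic-time computation mentioned in this hunk can be illustrated with a minimal Python sketch. It is not minimap2's implementation. Eq. (chain) is not shown in this hunk, so the recurrence used here, f(i) = max(w_i, max_j f(j) + alpha(j,i) - beta(j,i)) with alpha(j,i) = min(y_i-y_j, x_i-x_j, w_i), is an assumption. The anchor layout (x, y, w), the names `gap_cost`, `chain_scores` and `max_dist`, and the zero cost for a zero-length gap are likewise hypothetical.

```python
import math


def gap_cost(l, avg_seed_len):
    """gamma_c(l) = 0.01 * w_bar * |l| + 0.5 * log2|l|.

    Assumption: a zero-length gap costs nothing (log2 is undefined at 0).
    """
    l = abs(l)
    if l == 0:
        return 0.0
    return 0.01 * avg_seed_len * l + 0.5 * math.log2(l)


def chain_scores(anchors, avg_seed_len, max_dist=5000):
    """Return f, where f[i] is the best chain score ending at anchor i.

    anchors: list of (x, y, w) tuples sorted by x, where x and y are the
    reference and query end positions and w is the seed length (assumed layout).
    max_dist: hypothetical cutoff beyond which two anchors are never chained.
    """
    f = []
    for i, (xi, yi, wi) in enumerate(anchors):
        best = float(wi)  # a chain consisting of anchor i alone
        for j in range(i):  # all predecessors: O(m^2) pairs overall
            xj, yj, _ = anchors[j]
            dx, dy = xi - xj, yi - yj
            if dy <= 0 or max(dx, dy) > max_dist:
                continue  # beta is treated as infinite here
            alpha = min(dx, dy, wi)  # bases newly covered by anchor i
            beta = gap_cost(dy - dx, avg_seed_len)  # Eq. (chain-gap)
            best = max(best, f[j] + alpha - beta)
        f.append(best)
    return f


if __name__ == "__main__":
    # Three collinear-ish anchors; the last one implies a 10-bp gap.
    print(chain_scores([(10, 10, 15), (30, 30, 15), (60, 50, 15)], 15))
```

The nested loop visits every pair (j, i) with j < i, which is where the O(m^2) cost comes from.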
@@ -363,12 +366,19 @@ alignment.
 \subsection{Aligning short paired-end reads}
 During chaining, minimap2 takes a pair of reads as one read with a gap of
-unknown length in the middle. It does not break a chain if there is a long
-reference gap between seeds on different reads. After identifying primary
-chains (Section~\ref{sec:primary}), we split each fragment chain into two read
-chains and perform alignment for each read as in Section~\ref{sec:genomic}.
-Finally, we pair hits of each read end to find consistent paired-end
-alignments.
+unknown length in the middle. It applies a normal gap cost between seeds on the
+same read but a more permissive gap cost between seeds on different reads.
+More precisely, the gap cost during chaining is:
+\[
+\gamma_c(l)=\left\{\begin{array}{ll}
+0.01\cdot\bar{w}\cdot|l|+0.5\log_2|l| & \mbox{if the two seeds are on the same read} \\
+\min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & \mbox{otherwise}
+\end{array}\right.
+\]
+After identifying primary chains (Section~\ref{sec:primary}), we split each
+fragment chain into two read chains and perform alignment for each read as in
+Section~\ref{sec:genomic}. Finally, we pair hits of each read end to find
+consistent paired-end alignments.
 \end{methods}
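To make the effect of the two-case gap cost in this hunk concrete, here is a small Python sketch. It is an illustration only, not minimap2's code: the function name `paired_gap_cost`, the `same_read` flag, and the zero cost for a zero-length gap are assumptions introduced for this example.

```python
import math


def paired_gap_cost(l, avg_seed_len, same_read):
    """Chaining gap cost for a read pair treated as one fragment.

    same_read=True : 0.01 * w_bar * |l| + 0.5 * log2|l|
    same_read=False: min(0.01 * w_bar * |l|, log2|l|)
    Assumption: a zero-length gap costs nothing (log2 is undefined at 0).
    """
    l = abs(l)
    if l == 0:
        return 0.0
    linear = 0.01 * avg_seed_len * l
    if same_read:
        return linear + 0.5 * math.log2(l)
    # Between mates the insert size is unknown, so the cost is capped
    # by the logarithmic term and long gaps stay cheap.
    return min(linear, math.log2(l))


if __name__ == "__main__":
    # Hypothetical numbers: average seed length 21, gap of 300 bp.
    print(round(paired_gap_cost(300, 21, same_read=True), 2))   # 67.11
    print(round(paired_gap_cost(300, 21, same_read=False), 2))  # 8.23
```

With these illustrative numbers, a 300-bp gap costs about 67.1 between seeds on the same read but only about 8.2 between seeds on different reads, so a chain can span the unsequenced middle of the fragment without being broken.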