the first draft

This commit is contained in:
Heng Li 2017-08-03 12:12:55 -04:00
parent 28ab4d1f72
commit 9ec5f51afb
2 changed files with 85 additions and 36 deletions

View File

@ -160,6 +160,22 @@
Author = {Lau, Bayo and others}, Author = {Lau, Bayo and others},
Journal = {Bioinformatics}, Journal = {Bioinformatics},
Pages = {3829-3832}, Pages = {3829-3832},
Title = {LongISLND: in silico sequencing of lengthy and noisy datatypes}, Title = {{LongISLND}: in silico sequencing of lengthy and noisy datatypes},
Volume = {32}, Volume = {32},
Year = {2016}} Year = {2016}}
@article{Robinson:2011aa,
Author = {Robinson, James T and others},
Journal = {Nat Biotechnol},
Pages = {24-6},
Title = {Integrative genomics viewer},
Volume = {29},
Year = {2011}}
@article {Suzuki130633,
author = {Suzuki, Hajime and Kasahara, Masahiro},
title = {Acceleration Of Nucleotide Semi-Global Alignment With Adaptive Banded Dynamic Programming},
year = {2017},
note = {doi:10.1101/130633},
publisher = {Cold Spring Harbor Labs Journals},
journal = {bioRxiv}}

View File

@ -60,8 +60,8 @@ to develop minimap2 towards higher accuracy and more practical functionality.
\section{Methods} \section{Methods}
Minimap2 is the successor of minimap~\citep{Li:2016aa}. It uses similar Minimap2 is the successor of minimap~\citep{Li:2016aa}. It uses similar
indexing and seeding algorithms, and further a more accurate chaining algorithm indexing and seeding algorithms, and furthers it with a more accurate chaining
and adds the ability to produce detailed alignment. algorithm and adds the ability to produce detailed alignment.
\subsection{Chaining} \subsection{Chaining}
@ -92,26 +92,31 @@ chaining.
We note that if anchor $i$ is appended to $j$, appending $i$ to a predecessor We note that if anchor $i$ is appended to $j$, appending $i$ to a predecessor
of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}), of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}),
we start from anchor $i-1$ and stop the evaluation if we cannot find a better we start from anchor $i-1$ and stop the evaluation if we cannot find a better
score after up to $h$ iterations. This heuristic reduces the average time to score after up to $h$ iterations. This approach reduces the average time to
$O(h\cdot m)$. In practice, we can almost always find the optimal chain with $O(h\cdot m)$. In practice, we can almost always find the optimal chain with
$h=50$; even if the heuristic fails, the optimal chain often looks dubious. $h=50$; even if the heuristic fails, the optimal chain is often not
trustworthy, either.
\subsubsection{Backtracking} \subsubsection{Backtracking}
Let $P(i)$ be the index of the best predecessor of anchor $i$. It equals 0 if Let $P(i)$ be the index of the best predecessor of anchor $i$. It equals 0 if
$f(i)=w_i$ or $\argmax_j\{f(j)+\eta(j,i)-\gamma(j,i)\}$ otherwise. For each $f(i)=w_i$ or $\argmax_j\{f(j)+\eta(j,i)-\gamma(j,i)\}$ otherwise. For each
anchor $i$ in the descending order of $f(i)$, we apply $P(\cdot)$ repeatedly to anchor $i$ in the descending order of $f(i)$, we apply $P(\cdot)$ repeatedly to
find its predecessor and mark each visited $i$ as `used'. This process stops at find its predecessor and mark each visited $i$ as `used', until $P(i)=0$ or we
$P(j)=0$ or at a `used' $j$. This way we find all chains with no anchors used reach an already `used' $i$. This way we find all chains with no anchors used
in more than one chains. in more than one chains.
\subsubsection{Identifying primary chains} \subsubsection{Identifying primary chains}
Primary chains are chains that do not greatly overlap on the query sequence. In the absence of copy number changes, each query segment should not be mapped
Minimap2 uses a greedy algorithm to identify them. Let $Q$ be the set of to two places in the reference. However, chains found at the previous step may
primary chains, which is an empty set initially. For each chain from the best have significant or complete overlaps due to repeats in the reference.
to the worst according to their chaining scores: if on the query, the chain Minimap2 used the following procedure to identify \emph{primary chains} that do
overlaps with a chain in $Q$ by 50\% (by default) or higher fraction of the not greatly overlap on the query. Let $Q$ be an empty set initially. For each
shorter chain, mark the chain as secondary to the chain in $Q$; otherwise, add chain from the best to the worst according to their chaining scores: if on the
the chain to $Q$. query, the chain overlaps with a chain in $Q$ by 50\% or higher percentage of
the shorter chain, mark the chain as secondary to the chain in $Q$; otherwise,
add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did
not choose a more sophisticated data structure (e.g. range tree or k-d tree)
because this step is not the performance bottleneck.
\subsection{Alignment} \subsection{Alignment}
@ -136,7 +141,7 @@ On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, this
cost function is concave. It applies cost $q+l\cdot e$ to gaps shorter than cost function is concave. It applies cost $q+l\cdot e$ to gaps shorter than
$\lceil(\tilde{q}-q)/(e-\tilde{e})\rceil$ and applies $\lceil(\tilde{q}-q)/(e-\tilde{e})\rceil$ and applies
$\tilde{q}+l\cdot\tilde{e}$ to longer gaps. This scheme helps to recover $\tilde{q}+l\cdot\tilde{e}$ to longer gaps. This scheme helps to recover
longer insertions and deletions~\citep{Gotoh:1990aa}. longer insertions and deletions~(INDEL; \citealp{Gotoh:1990aa});
With global alignment, minimap2 may force to align unrelated sequences between With global alignment, minimap2 may force to align unrelated sequences between
two adjacent anchors. To avoid such an artifact, we compute accumulative two adjacent anchors. To avoid such an artifact, we compute accumulative
@ -151,15 +156,16 @@ S(i',j')-S(i,j)>Z+e\cdot(\max\{i-i',j-j'\}-\min\{i-i',j-j'\})
where $e$ is the gap extension penalty and $Z$ is an aribitrary threshold. where $e$ is the gap extension penalty and $Z$ is an aribitrary threshold.
This strategy is similar to X-drop employed in BLAST~\citep{Altschul:1997vn}. This strategy is similar to X-drop employed in BLAST~\citep{Altschul:1997vn}.
However, unlike X-drop, it would not break the alignment in the presence of a However, unlike X-drop, it would not break the alignment in the presence of a
single long gap. When minimap2 breaks a global alignment between two anchors, single long gap.
it performs local alignment between the two subsequences involved in the global
alignment, but this time with the one subsequence reverse complemented. This When minimap2 breaks a global alignment between two anchors, it performs local
additional alignment step may identify short inversions that are missed during alignment between the two subsequences involved in the global alignment, but
chaining. this time with the one subsequence reverse complemented. This additional
alignment step may identify short inversions that are missed during chaining.
\end{methods} \end{methods}
\section{Results and discussions} \section{Results}
\begin{figure}[!tb] \begin{figure}[!tb]
\centering \centering
\includegraphics[width=.5\textwidth]{roc-color.pdf} \includegraphics[width=.5\textwidth]{roc-color.pdf}
@ -183,20 +189,47 @@ NGMLR~\citep{Sedlazeck169557}. We excluded rHAT~\citep{Liu:2016ab},
LAMSA~\citep{Liu:2017aa} and Kart~\citep{Lin:2017aa} because they either LAMSA~\citep{Liu:2017aa} and Kart~\citep{Lin:2017aa} because they either
crashed or produced malformatted SAM. In this evaluation, Minimap2 has a crashed or produced malformatted SAM. In this evaluation, Minimap2 has a
higher power to distinguish unique and repetitive hits, and achieves overall higher power to distinguish unique and repetitive hits, and achieves overall
higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and NGMLR provide higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
better mapping quality estimate: they rarely give repetitive hits high mapping even if we skip DP-based alignment (data not shown), suggesting chaining alone
quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may occasionally is sufficient to achieve high accuracy for approaximate mapping. Minimap2 and
miss close suboptimal hits and be overconfident in wrong mappings. On run time, NGMLR provide better mapping quality estimate: they rarely give repetitive hits
minialign is slightly faster than minimap2. They are over 30 times faster than high mapping quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may
the rest. occasionally miss close suboptimal hits and be overconfident in wrong mappings.
On run time, minialign is slightly faster than minimap2. They are over 30 times
faster than the rest. Minimap2 consumed 6.1GB memory at the peak, more than
BWA-MEM but less than others.
On real SMRT reads from human, the relative performance and sensitivity of On real human SMRT reads, the relative performance and sensitivity of
these aligners are broadly similar to those on simulated data. We are unable to these aligners are broadly similar to those on simulated data. We are unable to
provide a good estimate of mapping error rate due to the lack of the truth. On provide a good estimate of mapping error rate due to the lack of the truth. On
ONT ultra-long human reads~\citep{Jain128835}, BWA-MEM failed. Minialign and ONT ultra-long human reads~\citep{Jain128835}, BWA-MEM failed. Minialign and
minimap2 are over 70 times faster than others. In addition to reference-based minimap2 are over 70 times faster than others. We have also examined tens of
read mapping, minimap2 can also find overlaps between long reads and align $\ge$100bp INDELs in IGV~\citep{Robinson:2011aa} and can confirm the
long-read assemblies. observation by~\citet{Sedlazeck169557} that BWA-MEM often breaks them into
shorter gaps. Minimap2 does not have this issue.
\section{Discussions}
Minialign and minimap2 are fast because a) with chaining, they can quickly
filter out most false seed hits~\citep{Li:2016aa} and reduce unsuccessful but
costly DP-based alignments; b) they implemented so far the fastest DP-based
alignment algorithm for long sequences~\citep{Suzuki:2016}. It is possible to
further accelerate minimap2 with a few other tweaks such as adaptive
banding~\citep{Suzuki130633} or incremental banding.
In addition to reference-based read mapping, minimap2 inherits minimap's
ability to search against huge multi-species data and to find read overlaps. On
a few test data sets, minimap2 appears to yield slightly better miniasm
assembly. Minimap2 can also align long assemblies or closely related genomes,
though more thorough evaluation is needed. Genome alignment is an intricate
topic.
\section*{Acknowledgements}
We owe a debt of gratitude to Hajime Suzuki for releasing his masterpiece and
insightful notes before formal publication. We thank M. Schatz, P. Rescheneder
and F. Sedlazeck for pointing out the limitation of BWA-MEM. We are also
grateful to early minimap2 testers who have greatly helped to fix various
issues.
\bibliography{minimap2} \bibliography{minimap2}
@ -264,10 +297,10 @@ In conclusion, all values in Eq.~(\ref{eq:suzuki}) are bounded: $x$ and $y$ by
$[-q-e,-e]$ and $\tilde{x}$, $\tilde{y}$ by $[-q-e,-e]$ and $\tilde{x}$, $\tilde{y}$ by
$[-\tilde{q}-\tilde{e},-\tilde{e}]$, and $u$ and $v$ by $[-q-e,M+q+e]$. When $[-\tilde{q}-\tilde{e},-\tilde{e}]$, and $u$ and $v$ by $[-q-e,M+q+e]$. When
matching score and gap cost are small, each of them can be stored as a 8-bit matching score and gap cost are small, each of them can be stored as a 8-bit
integer. This enables efficient SSE vectorization regardless of the peak score integer. This enables 16-way SSE vectorization regardless of the peak score of
of the alignment. the alignment.
For more efficient SSE implementation, we transform the row-column coordinate For a more efficient SSE implementation, we transform the row-column coordinate
to diagonal-anti-diagonal coordinate by letting $r\gets i+j$ and $t\gets i$. to diagonal-anti-diagonal coordinate by letting $r\gets i+j$ and $t\gets i$.
Eq.~(\ref{eq:suzuki}) becomes: Eq.~(\ref{eq:suzuki}) becomes:
\begin{equation*} \begin{equation*}
@ -283,8 +316,8 @@ y_{rt}&=&\max\{0,y_{r-1,t}+u_{r-1,t}-z_{rt}+q\}-q-e\\
\end{array}\right. \end{array}\right.
\end{equation*} \end{equation*}
In this formulation, cells with the same row index $r$ are independent of each In this formulation, cells with the same row index $r$ are independent of each
other. This allows us to vectorize the computation of all cells on the same other. This allows us to fully vectorize the computation of all cells on the
anti-diagonal in one inner loop. same anti-diagonal in one inner loop.
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
condition of this equation in the diagonal-anti-diagonal coordinate is condition of this equation in the diagonal-anti-diagonal coordinate is