the first draft
This commit is contained in:
parent
28ab4d1f72
commit
9ec5f51afb
|
|
@ -160,6 +160,22 @@
|
||||||
Author = {Lau, Bayo and others},
|
Author = {Lau, Bayo and others},
|
||||||
Journal = {Bioinformatics},
|
Journal = {Bioinformatics},
|
||||||
Pages = {3829-3832},
|
Pages = {3829-3832},
|
||||||
Title = {LongISLND: in silico sequencing of lengthy and noisy datatypes},
|
Title = {{LongISLND}: in silico sequencing of lengthy and noisy datatypes},
|
||||||
Volume = {32},
|
Volume = {32},
|
||||||
Year = {2016}}
|
Year = {2016}}
|
||||||
|
|
||||||
|
@article{Robinson:2011aa,
|
||||||
|
Author = {Robinson, James T and others},
|
||||||
|
Journal = {Nat Biotechnol},
|
||||||
|
Pages = {24-6},
|
||||||
|
Title = {Integrative genomics viewer},
|
||||||
|
Volume = {29},
|
||||||
|
Year = {2011}}
|
||||||
|
|
||||||
|
@article {Suzuki130633,
|
||||||
|
author = {Suzuki, Hajime and Kasahara, Masahiro},
|
||||||
|
title = {Acceleration Of Nucleotide Semi-Global Alignment With Adaptive Banded Dynamic Programming},
|
||||||
|
year = {2017},
|
||||||
|
note = {doi:10.1101/130633},
|
||||||
|
publisher = {Cold Spring Harbor Labs Journals},
|
||||||
|
journal = {bioRxiv}}
|
||||||
|
|
|
||||||
103
tex/minimap2.tex
103
tex/minimap2.tex
|
|
@ -60,8 +60,8 @@ to develop minimap2 towards higher accuracy and more practical functionality.
|
||||||
\section{Methods}
|
\section{Methods}
|
||||||
|
|
||||||
Minimap2 is the successor of minimap~\citep{Li:2016aa}. It uses similar
|
Minimap2 is the successor of minimap~\citep{Li:2016aa}. It uses similar
|
||||||
indexing and seeding algorithms, and further a more accurate chaining algorithm
|
indexing and seeding algorithms, and furthers it with a more accurate chaining
|
||||||
and adds the ability to produce detailed alignment.
|
algorithm and adds the ability to produce detailed alignment.
|
||||||
|
|
||||||
\subsection{Chaining}
|
\subsection{Chaining}
|
||||||
|
|
||||||
|
|
@ -92,26 +92,31 @@ chaining.
|
||||||
We note that if anchor $i$ is appended to $j$, appending $i$ to a predecessor
|
We note that if anchor $i$ is appended to $j$, appending $i$ to a predecessor
|
||||||
of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}),
|
of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}),
|
||||||
we start from anchor $i-1$ and stop the evaluation if we cannot find a better
|
we start from anchor $i-1$ and stop the evaluation if we cannot find a better
|
||||||
score after up to $h$ iterations. This heuristic reduces the average time to
|
score after up to $h$ iterations. This approach reduces the average time to
|
||||||
$O(h\cdot m)$. In practice, we can almost always find the optimal chain with
|
$O(h\cdot m)$. In practice, we can almost always find the optimal chain with
|
||||||
$h=50$; even if the heuristic fails, the optimal chain often looks dubious.
|
$h=50$; even if the heuristic fails, the optimal chain is often not
|
||||||
|
trustworthy, either.
|
||||||
|
|
||||||
\subsubsection{Backtracking}
|
\subsubsection{Backtracking}
|
||||||
Let $P(i)$ be the index of the best predecessor of anchor $i$. It equals 0 if
|
Let $P(i)$ be the index of the best predecessor of anchor $i$. It equals 0 if
|
||||||
$f(i)=w_i$ or $\argmax_j\{f(j)+\eta(j,i)-\gamma(j,i)\}$ otherwise. For each
|
$f(i)=w_i$ or $\argmax_j\{f(j)+\eta(j,i)-\gamma(j,i)\}$ otherwise. For each
|
||||||
anchor $i$ in the descending order of $f(i)$, we apply $P(\cdot)$ repeatedly to
|
anchor $i$ in the descending order of $f(i)$, we apply $P(\cdot)$ repeatedly to
|
||||||
find its predecessor and mark each visited $i$ as `used'. This process stops at
|
find its predecessor and mark each visited $i$ as `used', until $P(i)=0$ or we
|
||||||
$P(j)=0$ or at a `used' $j$. This way we find all chains with no anchors used
|
reach an already `used' $i$. This way we find all chains with no anchors used
|
||||||
in more than one chains.
|
in more than one chains.
|
||||||
|
|
||||||
\subsubsection{Identifying primary chains}
|
\subsubsection{Identifying primary chains}
|
||||||
Primary chains are chains that do not greatly overlap on the query sequence.
|
In the absence of copy number changes, each query segment should not be mapped
|
||||||
Minimap2 uses a greedy algorithm to identify them. Let $Q$ be the set of
|
to two places in the reference. However, chains found at the previous step may
|
||||||
primary chains, which is an empty set initially. For each chain from the best
|
have significant or complete overlaps due to repeats in the reference.
|
||||||
to the worst according to their chaining scores: if on the query, the chain
|
Minimap2 used the following procedure to identify \emph{primary chains} that do
|
||||||
overlaps with a chain in $Q$ by 50\% (by default) or higher fraction of the
|
not greatly overlap on the query. Let $Q$ be an empty set initially. For each
|
||||||
shorter chain, mark the chain as secondary to the chain in $Q$; otherwise, add
|
chain from the best to the worst according to their chaining scores: if on the
|
||||||
the chain to $Q$.
|
query, the chain overlaps with a chain in $Q$ by 50\% or higher percentage of
|
||||||
|
the shorter chain, mark the chain as secondary to the chain in $Q$; otherwise,
|
||||||
|
add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did
|
||||||
|
not choose a more sophisticated data structure (e.g. range tree or k-d tree)
|
||||||
|
because this step is not the performance bottleneck.
|
||||||
|
|
||||||
\subsection{Alignment}
|
\subsection{Alignment}
|
||||||
|
|
||||||
|
|
@ -136,7 +141,7 @@ On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, this
|
||||||
cost function is concave. It applies cost $q+l\cdot e$ to gaps shorter than
|
cost function is concave. It applies cost $q+l\cdot e$ to gaps shorter than
|
||||||
$\lceil(\tilde{q}-q)/(e-\tilde{e})\rceil$ and applies
|
$\lceil(\tilde{q}-q)/(e-\tilde{e})\rceil$ and applies
|
||||||
$\tilde{q}+l\cdot\tilde{e}$ to longer gaps. This scheme helps to recover
|
$\tilde{q}+l\cdot\tilde{e}$ to longer gaps. This scheme helps to recover
|
||||||
longer insertions and deletions~\citep{Gotoh:1990aa}.
|
longer insertions and deletions~(INDEL; \citealp{Gotoh:1990aa});
|
||||||
|
|
||||||
With global alignment, minimap2 may force to align unrelated sequences between
|
With global alignment, minimap2 may force to align unrelated sequences between
|
||||||
two adjacent anchors. To avoid such an artifact, we compute accumulative
|
two adjacent anchors. To avoid such an artifact, we compute accumulative
|
||||||
|
|
@ -151,15 +156,16 @@ S(i',j')-S(i,j)>Z+e\cdot(\max\{i-i',j-j'\}-\min\{i-i',j-j'\})
|
||||||
where $e$ is the gap extension penalty and $Z$ is an aribitrary threshold.
|
where $e$ is the gap extension penalty and $Z$ is an aribitrary threshold.
|
||||||
This strategy is similar to X-drop employed in BLAST~\citep{Altschul:1997vn}.
|
This strategy is similar to X-drop employed in BLAST~\citep{Altschul:1997vn}.
|
||||||
However, unlike X-drop, it would not break the alignment in the presence of a
|
However, unlike X-drop, it would not break the alignment in the presence of a
|
||||||
single long gap. When minimap2 breaks a global alignment between two anchors,
|
single long gap.
|
||||||
it performs local alignment between the two subsequences involved in the global
|
|
||||||
alignment, but this time with the one subsequence reverse complemented. This
|
When minimap2 breaks a global alignment between two anchors, it performs local
|
||||||
additional alignment step may identify short inversions that are missed during
|
alignment between the two subsequences involved in the global alignment, but
|
||||||
chaining.
|
this time with the one subsequence reverse complemented. This additional
|
||||||
|
alignment step may identify short inversions that are missed during chaining.
|
||||||
|
|
||||||
\end{methods}
|
\end{methods}
|
||||||
|
|
||||||
\section{Results and discussions}
|
\section{Results}
|
||||||
\begin{figure}[!tb]
|
\begin{figure}[!tb]
|
||||||
\centering
|
\centering
|
||||||
\includegraphics[width=.5\textwidth]{roc-color.pdf}
|
\includegraphics[width=.5\textwidth]{roc-color.pdf}
|
||||||
|
|
@ -183,20 +189,47 @@ NGMLR~\citep{Sedlazeck169557}. We excluded rHAT~\citep{Liu:2016ab},
|
||||||
LAMSA~\citep{Liu:2017aa} and Kart~\citep{Lin:2017aa} because they either
|
LAMSA~\citep{Liu:2017aa} and Kart~\citep{Lin:2017aa} because they either
|
||||||
crashed or produced malformatted SAM. In this evaluation, Minimap2 has a
|
crashed or produced malformatted SAM. In this evaluation, Minimap2 has a
|
||||||
higher power to distinguish unique and repetitive hits, and achieves overall
|
higher power to distinguish unique and repetitive hits, and achieves overall
|
||||||
higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and NGMLR provide
|
higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
|
||||||
better mapping quality estimate: they rarely give repetitive hits high mapping
|
even if we skip DP-based alignment (data not shown), suggesting chaining alone
|
||||||
quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may occasionally
|
is sufficient to achieve high accuracy for approaximate mapping. Minimap2 and
|
||||||
miss close suboptimal hits and be overconfident in wrong mappings. On run time,
|
NGMLR provide better mapping quality estimate: they rarely give repetitive hits
|
||||||
minialign is slightly faster than minimap2. They are over 30 times faster than
|
high mapping quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may
|
||||||
the rest.
|
occasionally miss close suboptimal hits and be overconfident in wrong mappings.
|
||||||
|
On run time, minialign is slightly faster than minimap2. They are over 30 times
|
||||||
|
faster than the rest. Minimap2 consumed 6.1GB memory at the peak, more than
|
||||||
|
BWA-MEM but less than others.
|
||||||
|
|
||||||
On real SMRT reads from human, the relative performance and sensitivity of
|
On real human SMRT reads, the relative performance and sensitivity of
|
||||||
these aligners are broadly similar to those on simulated data. We are unable to
|
these aligners are broadly similar to those on simulated data. We are unable to
|
||||||
provide a good estimate of mapping error rate due to the lack of the truth. On
|
provide a good estimate of mapping error rate due to the lack of the truth. On
|
||||||
ONT ultra-long human reads~\citep{Jain128835}, BWA-MEM failed. Minialign and
|
ONT ultra-long human reads~\citep{Jain128835}, BWA-MEM failed. Minialign and
|
||||||
minimap2 are over 70 times faster than others. In addition to reference-based
|
minimap2 are over 70 times faster than others. We have also examined tens of
|
||||||
read mapping, minimap2 can also find overlaps between long reads and align
|
$\ge$100bp INDELs in IGV~\citep{Robinson:2011aa} and can confirm the
|
||||||
long-read assemblies.
|
observation by~\citet{Sedlazeck169557} that BWA-MEM often breaks them into
|
||||||
|
shorter gaps. Minimap2 does not have this issue.
|
||||||
|
|
||||||
|
\section{Discussions}
|
||||||
|
|
||||||
|
Minialign and minimap2 are fast because a) with chaining, they can quickly
|
||||||
|
filter out most false seed hits~\citep{Li:2016aa} and reduce unsuccessful but
|
||||||
|
costly DP-based alignments; b) they implemented so far the fastest DP-based
|
||||||
|
alignment algorithm for long sequences~\citep{Suzuki:2016}. It is possible to
|
||||||
|
further accelerate minimap2 with a few other tweaks such as adaptive
|
||||||
|
banding~\citep{Suzuki130633} or incremental banding.
|
||||||
|
|
||||||
|
In addition to reference-based read mapping, minimap2 inherits minimap's
|
||||||
|
ability to search against huge multi-species data and to find read overlaps. On
|
||||||
|
a few test data sets, minimap2 appears to yield slightly better miniasm
|
||||||
|
assembly. Minimap2 can also align long assemblies or closely related genomes,
|
||||||
|
though more thorough evaluation is needed. Genome alignment is an intricate
|
||||||
|
topic.
|
||||||
|
|
||||||
|
\section*{Acknowledgements}
|
||||||
|
We owe a debt of gratitude to Hajime Suzuki for releasing his masterpiece and
|
||||||
|
insightful notes before formal publication. We thank M. Schatz, P. Rescheneder
|
||||||
|
and F. Sedlazeck for pointing out the limitation of BWA-MEM. We are also
|
||||||
|
grateful to early minimap2 testers who have greatly helped to fix various
|
||||||
|
issues.
|
||||||
|
|
||||||
\bibliography{minimap2}
|
\bibliography{minimap2}
|
||||||
|
|
||||||
|
|
@ -264,10 +297,10 @@ In conclusion, all values in Eq.~(\ref{eq:suzuki}) are bounded: $x$ and $y$ by
|
||||||
$[-q-e,-e]$ and $\tilde{x}$, $\tilde{y}$ by
|
$[-q-e,-e]$ and $\tilde{x}$, $\tilde{y}$ by
|
||||||
$[-\tilde{q}-\tilde{e},-\tilde{e}]$, and $u$ and $v$ by $[-q-e,M+q+e]$. When
|
$[-\tilde{q}-\tilde{e},-\tilde{e}]$, and $u$ and $v$ by $[-q-e,M+q+e]$. When
|
||||||
matching score and gap cost are small, each of them can be stored as a 8-bit
|
matching score and gap cost are small, each of them can be stored as a 8-bit
|
||||||
integer. This enables efficient SSE vectorization regardless of the peak score
|
integer. This enables 16-way SSE vectorization regardless of the peak score of
|
||||||
of the alignment.
|
the alignment.
|
||||||
|
|
||||||
For more efficient SSE implementation, we transform the row-column coordinate
|
For a more efficient SSE implementation, we transform the row-column coordinate
|
||||||
to diagonal-anti-diagonal coordinate by letting $r\gets i+j$ and $t\gets i$.
|
to diagonal-anti-diagonal coordinate by letting $r\gets i+j$ and $t\gets i$.
|
||||||
Eq.~(\ref{eq:suzuki}) becomes:
|
Eq.~(\ref{eq:suzuki}) becomes:
|
||||||
\begin{equation*}
|
\begin{equation*}
|
||||||
|
|
@ -283,8 +316,8 @@ y_{rt}&=&\max\{0,y_{r-1,t}+u_{r-1,t}-z_{rt}+q\}-q-e\\
|
||||||
\end{array}\right.
|
\end{array}\right.
|
||||||
\end{equation*}
|
\end{equation*}
|
||||||
In this formulation, cells with the same row index $r$ are independent of each
|
In this formulation, cells with the same row index $r$ are independent of each
|
||||||
other. This allows us to vectorize the computation of all cells on the same
|
other. This allows us to fully vectorize the computation of all cells on the
|
||||||
anti-diagonal in one inner loop.
|
same anti-diagonal in one inner loop.
|
||||||
|
|
||||||
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
|
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
|
||||||
condition of this equation in the diagonal-anti-diagonal coordinate is
|
condition of this equation in the diagonal-anti-diagonal coordinate is
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue