387 lines
18 KiB
TeX
387 lines
18 KiB
TeX
\documentclass{bioinfo}
|
|
\copyrightyear{2017}
|
|
\pubyear{2017}
|
|
|
|
\usepackage{graphicx}
|
|
\usepackage{hyperref}
|
|
\usepackage{url}
|
|
\usepackage{amsmath}
|
|
\usepackage[ruled,vlined]{algorithm2e}
|
|
\newcommand\mycommfont[1]{\footnotesize\rmfamily{\it #1}}
|
|
\SetCommentSty{mycommfont}
|
|
\SetKwComment{Comment}{$\triangleright$\ }{}
|
|
|
|
\usepackage{natbib}
|
|
\bibliographystyle{apalike}
|
|
\usepackage{hyperref}
|
|
|
|
\DeclareMathOperator*{\argmax}{argmax}
|
|
|
|
\begin{document}
|
|
\firstpage{1}
|
|
|
|
\title[Long DNA sequence alignment with minimap2]{Minimap2: fast pairwise alignment for long DNA sequences}
|
|
\author[Li]{Heng Li}
|
|
\address{Broad Institute, 415 Main Street, Cambridge, MA 02142, USA}
|
|
|
|
\maketitle
|
|
|
|
\begin{abstract}
|
|
\section{Summary:} Minimap2 is a general-purpose mapper to align long noisy DNA
|
|
sequences against a large reference database. It targets query sequences of
|
|
1kb--100Mb in length with per-base divergence typically below 25\%. Minimap2 is
|
|
$\sim$30 times faster than many mainstream long-read aligners and achieves
|
|
higher accuracy on simulated data. It also employs concave gap cost and rescues
|
|
inversions for improved alignment around potential structural variations.
|
|
|
|
\section{Availability and implementation:}
|
|
\href{https://github.com/lh3/minimap2}{https://github.com/lh3/minimap2}
|
|
|
|
\section{Contact:} hengli@broadinstitute.org
|
|
\end{abstract}
|
|
|
|
\section{Introduction}
|
|
|
|
Single Molecule Real-Time (SMRT) sequencing technology and Oxford Nanopore
|
|
technologies (ONT) produce reads over 10kbp in length at an error rate
|
|
$\sim$15\%. Several aligners have been developed for such
|
|
data~\citep{Chaisson:2012aa,Li:2013aa,Liu:2016ab,Sovic:2016aa,Liu:2017aa,Lin:2017aa,Sedlazeck169557}.
|
|
They are usually five times as slow as mainstream short-read
|
|
aligners~\citep{Langmead:2012fk,Li:2013aa}. We speculated there could be
|
|
substantial room for speedup on the thought that 10kb long sequences should be
|
|
easier to map than 100bp reads because we can more effectively skip repetitive
|
|
regions, which are often the bottleneck of short-read alignment. We confirmed
|
|
our speculation by achieving approximate mapping 50 times faster than
|
|
BWA-MEM~\citep{Li:2016aa}. \citet{Suzuki:2016} extended our work with a fast
|
|
and novel algorithm on generating detailed alignment, which in turn inspired us
|
|
to develop minimap2 towards higher accuracy and more practical functionality.
|
|
|
|
\begin{methods}
|
|
\section{Methods}
|
|
|
|
Minimap2 is the successor of minimap~\citep{Li:2016aa}. It uses similar
|
|
indexing and seeding algorithms, and furthers it with more accurate chaining
|
|
and the ability to produce detailed alignment.
|
|
|
|
\subsection{Chaining}
|
|
|
|
\subsubsection{Chaining}
|
|
An \emph{anchor} is a 3-tuple $(x,y,w)$, indicating interval $[x-w+1,x]$ on the
|
|
reference matching interval $[y-w+1,y]$ on the query. Given a list of anchors
|
|
sorted by ending reference position $x$, let $f(i)$ be the maximal chaining
|
|
score up to the $i$-th anchor in the list. $f(i)$ can be calculated with
|
|
dynamic programming (DP):
|
|
\begin{equation}\label{eq:chain}
|
|
f(i)=\max\big\{\max_{i>j\ge 1} \{ f(j)+\alpha(j,i)-\beta(j,i) \},w_i\big\}
|
|
\end{equation}
|
|
where $\alpha(j,i)=\min\big\{\min\{y_i-y_j,x_i-x_j\},w_i\big\}$ is the number of
|
|
matching bases between the two anchors. $\beta(j,i)>0$ is the gap cost. It
|
|
equals $\infty$ if $y_j\ge y_i$ or $\max\{y_i-y_j,x_i-x_j\}>G$ (i.e. the
|
|
distance between two anchors is too large); otherwise
|
|
\begin{equation}\label{eq:chain-gap}
|
|
\beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big)
|
|
\end{equation}
|
|
In implementation, a gap of length $l$ costs $\gamma_c(l)=0.01\cdot \bar{w}\cdot
|
|
|l|+0.5\log_2|l|$, where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
|
|
Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster
|
|
chaining algorithms exist~\citep{Abouelhoda:2005aa}, they
|
|
are inapplicable to generic gap cost, complex to implement and usually
|
|
associated with a large constant. We introduced a simple heuristic to
|
|
accelerate chaining.
|
|
|
|
We note that if anchor $i$ is chained to $j$, chaining $i$ to a predecessor
|
|
of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}),
|
|
we start from anchor $i-1$ and stop the evaluation if we cannot find a better
|
|
score after up to $h$ iterations. This approach reduces the average time to
|
|
$O(h\cdot m)$. In practice, we can almost always find the optimal chain with
|
|
$h=50$; even if the heuristic fails, the optimal chain is often close.
|
|
|
|
\subsubsection{Backtracking}
|
|
Let $P(i)$ be the index of the best predecessor of anchor $i$. It equals 0 if
|
|
$f(i)=w_i$ or $\argmax_j\{f(j)+\eta(j,i)-\gamma(j,i)\}$ otherwise. For each
|
|
anchor $i$ in the descending order of $f(i)$, we apply $P(\cdot)$ repeatedly to
|
|
find its predecessor and mark each visited $i$ as `used', until $P(i)=0$ or we
|
|
reach an already `used' $i$. This way we find all chains with no anchors used
|
|
in more than one chains.
|
|
|
|
\subsubsection{Identifying primary chains}
|
|
In the absence of copy number changes, each query segment should not be mapped
|
|
to two places in the reference. However, chains found at the previous step may
|
|
have significant or complete overlaps due to repeats in the reference.
|
|
Minimap2 used the following procedure to identify \emph{primary chains} that do
|
|
not greatly overlap on the query. Let $Q$ be an empty set initially. For each
|
|
chain from the best to the worst according to their chaining scores: if on the
|
|
query, the chain overlaps with a chain in $Q$ by 50\% or higher percentage of
|
|
the shorter chain, mark the chain as secondary to the chain in $Q$; otherwise,
|
|
add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did
|
|
not choose a more sophisticated data structure (e.g. range tree or k-d tree)
|
|
because this step is not the performance bottleneck.
|
|
|
|
\subsection{Aligning genomic DNA}
|
|
|
|
\subsubsection{Alignment with 2-piece affine gap cost}
|
|
|
|
Minimap2 performs DP-based global alignment between adjacent anchors in a
|
|
chain. It uses a 2-piece affine gap cost~\citep{Gotoh:1990aa}:
|
|
\begin{equation}\label{eq:2-piece}
|
|
\gamma_a(l)=\min\{q+|l|\cdot e,\tilde{q}+|l|\cdot\tilde{e}\}
|
|
\end{equation}
|
|
Without losing generality, we always assume $q+e<\tilde{q}+\tilde{e}$.
|
|
On the condition that $e>\tilde{e}$, it applies cost $q+|l|\cdot e$ to gaps
|
|
shorter than $\lceil(\tilde{q}-q)/(e-\tilde{e})\rceil$ and applies
|
|
$\tilde{q}+|l|\cdot\tilde{e}$ to longer gaps. This scheme helps to recover
|
|
longer insertions and deletions~(INDELs).
|
|
|
|
The equation to compute the optimal alignment under $\gamma_a(\cdot)$ is
|
|
\begin{equation}\label{eq:ae86}
|
|
\left\{\begin{array}{l}
|
|
H_{ij} = \max\{H_{i-1,j-1}+s(i,j),E_{ij},F_{ij},\tilde{E}_{ij},\tilde{F}_{ij}\}\\
|
|
E_{i+1,j}= \max\{H_{ij}-q,E_{ij}\}-e\\
|
|
F_{i,j+1}= \max\{H_{ij}-q,F_{ij}\}-e\\
|
|
\tilde{E}_{i+1,j}= \max\{H_{ij}-\tilde{q},\tilde{E}_{ij}\}-\tilde{e}\\
|
|
\tilde{F}_{i,j+1}= \max\{H_{ij}-\tilde{q},\tilde{F}_{ij}\}-\tilde{e}
|
|
\end{array}\right.
|
|
\end{equation}
|
|
where $s(i,j)$ is the score between the $i$-th reference base and $j$-th query
|
|
base. It is a natural extension to the algorithm under affine gap
|
|
cost~\citep{Gotoh:1982aa,Altschul:1986aa}.
|
|
|
|
\subsubsection{Suzuki's formulation}
|
|
|
|
To efficiently align long sequences, minimap2 did not directly use
|
|
Eq.~(\ref{eq:ae86}) for alignment. It instead adopted a difference-based
|
|
formulation first proposed by \citet{Wu:1996aa} and later adapted by
|
|
\citet{Suzuki:2016} for affine gap cost. In case of 2-piece affine gap cost in
|
|
Eq.~(\ref{eq:2-piece}), define
|
|
\[
|
|
\left\{\begin{array}{ll}
|
|
u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
|
|
x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-\tilde{H}_{ij} \\
|
|
y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-\tilde{H}_{ij}
|
|
\end{array}\right.
|
|
\]
|
|
We can transform Eq.~(\ref{eq:ae86}) to
|
|
\begin{equation}\label{eq:suzuki}
|
|
\left\{\begin{array}{lll}
|
|
z_{ij}&=&\max\{s(i,j),x_{i-1,j}+v_{i-1,j},y_{i,j-1}+u_{i,j-1},\\
|
|
&&\tilde{x}_{i-1,j}+v_{i-1,j},\tilde{y}_{i,j-1}+u_{i,j-1}\}\\
|
|
u_{ij}&=&z_{ij}-v_{i-1,j}\\
|
|
v_{ij}&=&z_{ij}-u_{i,j-1}\\
|
|
x_{ij}&=&\max\{0,x_{i-1,j}+v_{i-1,j}-z_{ij}+q\}-q-e\\
|
|
y_{ij}&=&\max\{0,y_{i,j-1}+u_{i,j-1}-z_{ij}+q\}-q-e\\
|
|
\tilde{x}_{ij}&=&\max\{0,\tilde{x}_{i-1,j}+v_{i-1,j}-z_{ij}+\tilde{q}\}-\tilde{q}-\tilde{e}\\
|
|
\tilde{y}_{ij}&=&\max\{0,\tilde{y}_{i,j-1}+u_{i,j-1}-z_{ij}+\tilde{q}\}-\tilde{q}-\tilde{e}
|
|
\end{array}\right.
|
|
\end{equation}
|
|
where $z_{ij}$ is a temporary variable that does not need to be stored.
|
|
This is Suzuki's formulation for 2-piece affine gap cost. An important property
|
|
of this formulation is that all values are bounded. To see that,
|
|
\[
|
|
x_{ij}=E_{i+1,j}-H_{ij}=\max\{-q,E_{ij}-H_{ij}\}-e
|
|
\]
|
|
With $E_{ij}\le H_{ij}$, we have
|
|
\[
|
|
-q-e\le x_{ij}\le\max\{-q,0\}-e=-e
|
|
\]
|
|
and similar inequations for $y_{ij}$, $\tilde{x}_{ij}$ and $\tilde{y}_{ij}$.
|
|
In addition,
|
|
\[
|
|
u_{ij}=z_{ij}-v_{i-1,j}\ge\max\{x_{i-1,j},\tilde{x}_{i-1,j}\}\ge-q-e
|
|
\]
|
|
As the maximum value of $z_{ij}=H_{ij}-H_{i-1,j-1}$ is $M$, the maximal
|
|
matching score, we can derive
|
|
\[
|
|
u_{ij}\le M-v_{i-1,j}\le M+q+e
|
|
\]
|
|
In conclusion, in Eq.~(\ref{eq:suzuki}), $x$ and $y$ are bounded by $[-q-e,-e]$,
|
|
$\tilde{x}$ and $\tilde{y}$ by $[-\tilde{q}-\tilde{e},-\tilde{e}]$, and $u$ and
|
|
$v$ by $[-q-e,M+q+e]$. When $-128\le-q-e<M+q+e\le127$, each of them can be stored as
|
|
a 8-bit integer. This enables 16-way SSE vectorization regardless of the peak
|
|
score of the alignment.
|
|
|
|
For a more efficient SSE implementation, we transform the row-column coordinate
|
|
to diagonal-anti-diagonal coordinate by letting $r\gets i+j$ and $t\gets i$.
|
|
Eq.~(\ref{eq:suzuki}) becomes:
|
|
\begin{equation*}
|
|
\left\{\begin{array}{lll}
|
|
z_{rt}&=&\max\{s(t,r-t),x_{r-1,t-1}+v_{r-1,t-1},y_{r-1,t}\\
|
|
&&+u_{r-1,t},\tilde{x}_{r-1,t-1}+v_{r-1,t-1},\tilde{y}_{r-1,t}+u_{r-1,t}\}\\
|
|
u_{rt}&=&z_{rt}-v_{r-1,t-1}\\
|
|
v_{rt}&=&z_{rt}-u_{r-1,t}\\
|
|
x_{rt}&=&\max\{0,x_{r-1,t-1}+v_{r-1,t-1}-z_{rt}+q\}-q-e\\
|
|
y_{rt}&=&\max\{0,y_{r-1,t}+u_{r-1,t}-z_{rt}+q\}-q-e\\
|
|
\tilde{x}_{rt}&=&\max\{0,\tilde{x}_{r-1,t-1}+v_{r-1,t-1}-z_{rt}+\tilde{q}\}-\tilde{q}-\tilde{e}\\
|
|
\tilde{y}_{rt}&=&\max\{0,\tilde{y}_{r-1,t}+u_{r-1,t}-z_{rt}+\tilde{q}\}-\tilde{q}-\tilde{e}
|
|
\end{array}\right.
|
|
\end{equation*}
|
|
In this formulation, cells with the same diagonal index $r$ are independent of
|
|
each other. This allows us to fully vectorize the computation of all cells on
|
|
the same anti-diagonal in one inner loop.
|
|
|
|
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
|
|
condition of this equation in the diagonal-anti-diagonal coordinate is
|
|
\[
|
|
\left\{\begin{array}{l}
|
|
x_{r-1,-1}=y_{r-1,r}=-q-e\\
|
|
\tilde{x}_{r-1,-1}=\tilde{y}_{r-1,r}=-\tilde{q}-\tilde{e}\\
|
|
u_{r-1,r}=v_{r-1,-1}=\eta(r)\\
|
|
\end{array}\right.
|
|
\]
|
|
where
|
|
\[
|
|
\eta(r)=\left\{\begin{array}{ll}
|
|
-q-e & (r=0) \\
|
|
-e & (r<\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil) \\
|
|
r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil) \\
|
|
-\tilde{e} & (r>\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil)
|
|
\end{array}\right.
|
|
\]
|
|
|
|
In practice, our 16-way vectorized implementation of global alignment is three
|
|
times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without
|
|
banding, our implementation is slower than Edlib~\citep{Sosic:2017aa}, but with
|
|
a 1000bp band, it is considerably faster. When performing global alignment
|
|
between anchors, we expect the alignment to stay close to the diagonal of the
|
|
DP matrix. Banding is applicable most of time.
|
|
|
|
\subsubsection{The Z-drop heuristic}
|
|
|
|
With global alignment, minimap2 may force to align unrelated sequences between
|
|
two adjacent anchors. To avoid such an artifact, we compute accumulative
|
|
alignment score along the alignment path and break the alignment where the
|
|
score drops too fast in the diagonal direction. More precisely, let $S(i,j)$ be
|
|
the alignment score along the alignment path ending at cell $(i,j)$ in the DP
|
|
matrix. We break the alignment if there exist $(i',j')$ and $(i,j)$, $i'<i$ and
|
|
$j'<j$, such that
|
|
\[
|
|
S(i',j')-S(i,j)>Z+e\cdot|(i-i')-(j-j')|
|
|
\]
|
|
where $e$ is the gap extension cost and $Z$ is an arbitrary threshold.
|
|
This strategy is similar to X-drop employed in BLAST~\citep{Altschul:1997vn}.
|
|
However, unlike X-drop, it would not break the alignment in the presence of a
|
|
single long gap.
|
|
|
|
When minimap2 breaks a global alignment between two anchors, it performs local
|
|
alignment between the two subsequences involved in the global alignment, but
|
|
this time with the one subsequence reverse complemented. This additional
|
|
alignment step may identify short inversions that are missed during chaining.
|
|
|
|
\subsection{Aligning spliced sequences}
|
|
|
|
The algorithm described above can be adapted to spliced alignment. In this
|
|
mode, the chaining gap cost distinguishes insertions to and deletions from the
|
|
reference: $\gamma_c(l)$ in Eq.~(\ref{eq:chain-gap}) takes the form of
|
|
\[
|
|
\gamma_c(l)=\left\{\begin{array}{ll}
|
|
0.01\cdot\bar{w}\cdot l+0.5\log_2 l & (l>0) \\
|
|
\min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & (l<0)
|
|
\end{array}\right.
|
|
\]
|
|
Similarly, the gap cost function used for DP-based alignment is changed to
|
|
\[
|
|
\gamma_a(l)=\left\{\begin{array}{ll}
|
|
q+l\cdot e & (l>0) \\
|
|
\min\{q+|l|\cdot e,\tilde{q}\} & (l<0)
|
|
\end{array}\right.
|
|
\]
|
|
In alignment, a deletion no shorter than $\lceil(\tilde{q}-q)/e\rceil$ is
|
|
regarded as an intron, which pays no cost to gap extensions.
|
|
|
|
To pinpoint precise exon boundaries, minimap2 penalizes non-canonical splicing
|
|
with the following equation
|
|
\begin{equation}\label{eq:splice}
|
|
\left\{\begin{array}{l}
|
|
H_{ij} = \max\{H_{i-1,j-1}+s(i,j),E_{ij},F_{ij},\tilde{E}_{ij}-a_i\}\\
|
|
E_{i+1,j}= \max\{H_{ij}-q,E_{ij}\}-e\\
|
|
F_{i,j+1}= \max\{H_{ij}-q,F_{ij}\}-e\\
|
|
\tilde{E}_{i+1,j}= \max\{H_{ij}-d_i-\tilde{q},\tilde{E}_{ij}\}\\
|
|
\end{array}\right.
|
|
\end{equation}
|
|
Let $T$ be the reference sequence. $d_i$ is the cost of a non-canonical donor
|
|
site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a postive number $p$
|
|
otherwise. Similarly, $a_i$ is the cost of a non-canonical acceptor site, which
|
|
takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. When the read strand relative
|
|
to the underlying transcript is unknown, minimap2 aligns each chain twice, first
|
|
assuming ${\tt GT}$--${\tt AG}$ as the splice signal and then assuming ${\tt
|
|
CT}$--${\tt AC}$, the reverse complement of ${\tt GT}$--${\tt AG}$, as the
|
|
splice signal. The alignment with a higher score is taken as the final
|
|
alignment. This procedure also infers the relative strand of reads spanning
|
|
canonical splicing sites.
|
|
|
|
In the spliced alignment mode, minimap2 further increases the density of
|
|
minimizers and disables banded alignment. Together with the two-round DP-based
|
|
alignment, spliced alignment is several times slower than DNA sequence
|
|
alignment.
|
|
|
|
\end{methods}
|
|
|
|
\section{Results}
|
|
\begin{figure}[!tb]
|
|
\centering
|
|
\includegraphics[width=.5\textwidth]{roc-color.pdf}
|
|
\caption{Evaluation on simulated SMRT reads aligned against human genome
|
|
GRCh38. (a) ROC-like curve. (b) Accumulative mapping error rate as a function
|
|
of mapping quality. 33,088 $\ge$1000bp reads were simulated using
|
|
pbsim~\citep{Ono:2013aa} with error profile sampled from file
|
|
`m131017\_060208\_42213\_*.1.*' downloaded at
|
|
\href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length is
|
|
11,628. A read is considered correctly mapped if the true position overlaps
|
|
with the best mapping position by 10\% of the read length. All aligners were
|
|
run under the default setting for SMRT reads.}\label{fig:eval}
|
|
\end{figure}
|
|
|
|
As a sanity check, we evaluated minimap2 on simulated human reads along with
|
|
BLASR~\citep{Chaisson:2012aa},
|
|
BWA-MEM~\citep{Li:2013aa},
|
|
GraphMap~\citep{Sovic:2016aa},
|
|
minialign~\citep{Suzuki:2016} and
|
|
NGMLR~\citep{Sedlazeck169557}. We excluded rHAT~\citep{Liu:2016ab},
|
|
LAMSA~\citep{Liu:2017aa} and Kart~\citep{Lin:2017aa} because they either
|
|
crashed or produced malformatted output. In this evaluation, Minimap2 has
|
|
higher power to distinguish unique and repetitive hits, and achieves overall
|
|
higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
|
|
even if we skip DP-based alignment (data not shown), suggesting chaining alone
|
|
is sufficient to achieve high accuracy for approximate mapping. Minimap2 and
|
|
NGMLR provide better mapping quality estimate: they rarely give repetitive hits
|
|
high mapping quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may
|
|
occasionally miss close suboptimal hits and be overconfident in wrong mappings.
|
|
On run time, minialign is slightly faster than minimap2. They are over 30 times
|
|
faster than the rest. Minimap2 consumed 6.1GB memory at the peak, more than
|
|
BWA-MEM but less than others.
|
|
|
|
On real human SMRT reads, the relative performance and sensitivity of
|
|
these aligners are broadly similar to those on simulated data. We are unable to
|
|
provide a good estimate of mapping error rate due to the lack of the truth. On
|
|
ONT ultra-long human reads~\citep{Jain128835}, BWA-MEM failed. Minialign and
|
|
minimap2 are over 70 times faster than others. We have also examined tens of
|
|
$\ge$100bp INDELs in IGV~\citep{Robinson:2011aa} and can confirm the
|
|
observation by~\citet{Sedlazeck169557} that BWA-MEM often breaks them into
|
|
shorter gaps. Minimap2 does not have this issue.
|
|
|
|
\section{Discussions}
|
|
|
|
Minialign and minimap2 are fast because a) with chaining, they can quickly
|
|
filter out most false seed hits~\citep{Li:2016aa} and reduce unsuccessful but
|
|
costly DP-based alignments; b) they implemented so far the fastest DP-based
|
|
alignment algorithm for long sequences~\citep{Suzuki:2016}. It is possible to
|
|
further accelerate minimap2 with a few other tweaks such as adaptive
|
|
banding~\citep{Suzuki130633} or incremental banding.
|
|
|
|
In addition to reference-based read mapping, minimap2 inherits minimap's
|
|
ability to search against huge multi-species databases and to find read
|
|
overlaps. On a few test data sets, minimap2 appears to yield slightly better
|
|
miniasm assembly. Minimap2 can also align closely related genomes, though it
|
|
would benefit from more thorough evaluations. Genome alignment is an intricate
|
|
topic.
|
|
|
|
\section*{Acknowledgements}
|
|
We owe a debt of gratitude to Hajime Suzuki for releasing his masterpiece and
|
|
insightful notes before formal publication. We thank M. Schatz, P. Rescheneder
|
|
and F. Sedlazeck for pointing out the limitation of BWA-MEM. We are also
|
|
grateful to early minimap2 testers who have greatly helped to fix various
|
|
issues.
|
|
|
|
\bibliography{minimap2}
|
|
|
|
\end{document}
|