first round of revision
This commit is contained in:
parent
a8d476c6ad
commit
39535565ee
|
|
@ -313,3 +313,19 @@
|
|||
Title = {Assembling large genomes with single-molecule sequencing and locality-sensitive hashing},
|
||||
Volume = {33},
|
||||
Year = {2015}}
|
||||
|
||||
@article{Gurevich:2013aa,
|
||||
Author = {Gurevich, Alexey and others},
|
||||
Journal = {Bioinformatics},
|
||||
Pages = {1072-5},
|
||||
Title = {{QUAST}: quality assessment tool for genome assemblies},
|
||||
Volume = {29},
|
||||
Year = {2013}}
|
||||
|
||||
@article{Li:2010fk,
|
||||
Author = {Li, Heng and Durbin, Richard},
|
||||
Journal = {Bioinformatics},
|
||||
Pages = {589-95},
|
||||
Title = {Fast and accurate long-read alignment with {Burrows-Wheeler} transform},
|
||||
Volume = {26},
|
||||
Year = {2010}}
|
||||
|
|
|
|||
107
tex/minimap2.tex
107
tex/minimap2.tex
|
|
@ -40,9 +40,10 @@ full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
|
|||
related full chromosomes of hundreds of megabases in length. Minimap2 does
|
||||
split-read alignment, employs concave gap cost for long insertions and
|
||||
deletions (INDELs) and introduces new heuristics to reduce spurious alignments.
|
||||
It is 3--4 times faster than mainstream short-read mappers at comparable
|
||||
accuracy and $\ge$30 times faster at higher accuracy for both genomic and mRNA
|
||||
reads, surpassing most aligners specialized in one type of alignment.
|
||||
It is 3--4 times as fast as mainstream short-read mappers at comparable
|
||||
accuracy, and is $\ge$30 times faster than long-read genomic or cDNA
|
||||
mappers at higher accuracy, surpassing most aligners specialized in one type of
|
||||
alignment.
|
||||
|
||||
\section{Availability and implementation:}
|
||||
\href{https://github.com/lh3/minimap2}{https://github.com/lh3/minimap2}
|
||||
|
|
@ -117,12 +118,12 @@ distance between two anchors is too large); otherwise
|
|||
\begin{equation}\label{eq:chain-gap}
|
||||
\beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big)
|
||||
\end{equation}
|
||||
In implementation, a gap of length $l$ costs
|
||||
In implementation, a gap of length $l\not=0$ costs
|
||||
\[
|
||||
\gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l|
|
||||
\]
|
||||
where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
|
||||
Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster
|
||||
where $\bar{w}$ is the average seed length. For $n$ anchors, directly computing all $f(\cdot)$ with
|
||||
Eq.~(\ref{eq:chain}) takes $O(n^2)$ time. Although theoretically faster
|
||||
chaining algorithms exist~\citep{Abouelhoda:2005aa}, they
|
||||
are inapplicable to generic gap cost, complex to implement and usually
|
||||
associated with a large constant. We introduced a simple heuristic to
|
||||
|
|
@ -132,7 +133,7 @@ We note that if anchor $i$ is chained to $j$, chaining $i$ to a predecessor
|
|||
of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}),
|
||||
we start from anchor $i-1$ and stop the process if we cannot find a better
|
||||
score after up to $h$ iterations. This approach reduces the average time to
|
||||
$O(h\cdot m)$. In practice, we can almost always find the optimal chain with
|
||||
$O(h\cdot n)$. In practice, we can almost always find the optimal chain with
|
||||
$h=50$; even if the heuristic fails, the optimal chain is often close.
|
||||
|
||||
\subsubsection{Backtracking}
|
||||
|
|
@ -146,9 +147,11 @@ in more than one chains.
|
|||
\subsubsection{Identifying primary chains}\label{sec:primary}
|
||||
In the absence of copy number changes, each query segment should not be mapped
|
||||
to two places in the reference. However, chains found at the previous step may
|
||||
have significant or complete overlaps due to repeats in the reference.
|
||||
have significant or complete overlaps due to repeats in the reference~\citep{Li:2010fk}.
|
||||
Minimap2 used the following procedure to identify \emph{primary chains} that do
|
||||
not greatly overlap on the query. Let $Q$ be an empty set initially. For each
|
||||
not greatly overlap on the query.
|
||||
|
||||
Let $Q$ be an empty set initially. For each
|
||||
chain from the best to the worst according to their chaining scores: if on the
|
||||
query, the chain overlaps with a chain in $Q$ by 50\% or higher percentage of
|
||||
the shorter chain, mark the chain as secondary to the chain in $Q$; otherwise,
|
||||
|
|
@ -156,6 +159,16 @@ add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did
|
|||
not choose a more sophisticated data structure (e.g. range tree or k-d tree)
|
||||
because this step is not the performance bottleneck.
|
||||
|
||||
For each primary chain, minimap2 estimates its mapping quality with an
|
||||
empirical formula:
|
||||
\[
|
||||
{\rm mapQ}=40\cdot (1-f_2/f_1)\cdot\min\{1,m/10\}\cdot\log f_1
|
||||
\]
|
||||
where $m$ is the number of anchors on the primary chain, $f_1$ is the chaining
|
||||
score, and $f_2\le f_1$ is the score of the best chain that is secondary to the
|
||||
primary chain. Intuitively, a chain is assigned to a higher mapping quality if
|
||||
it is long and its best secondary chain is weak.
|
||||
|
||||
\subsubsection{Estimating per-base sequence divergence}
|
||||
Suppose a query sequence harbors $n$ seeds of length $k$, $m$ of which are
|
||||
present in a chain. We want to estimate the sequence divergence $\epsilon$
|
||||
|
|
@ -186,7 +199,7 @@ $0.9$.
|
|||
\subsubsection{Indexing with homopolymer compressed $k$-mers}
|
||||
SmartDenovo
|
||||
(\href{https://github.com/ruanjue/smartdenovo}{https://github.com/ruanjue/smartdenovo};
|
||||
J Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
|
||||
J. Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
|
||||
$k$-mers and finds the strategy improves overlap sensitivity for SMRT reads.
|
||||
Minimap2 adopts the same heuristic.
|
||||
|
||||
|
|
@ -199,9 +212,9 @@ To demonstrate the effectiveness of HPC $k$-mers, we performed read overlapping
|
|||
for the example {\it E. coli} SMRT reads from PBcR~\citep{Berlin:2015xy}, using
|
||||
different types of $k$-mers. With normal 15bp minimizers per 5bp window,
|
||||
minimap2 finds 90.9\% of $\ge$2kb overlaps inferred from the read-to-reference
|
||||
alignment. With HPC 19-mers, minimap2 finds 97.4\% of overlaps. It achieves this
|
||||
alignment. With HPC 19-mers per 5bp window, minimap2 finds 97.4\% of overlaps. It achieves this
|
||||
higher sensitivity by indexing 1/3 fewer minimizers, which further helps
|
||||
performance. HPC-based indexing reduces the sensitivity for ONT reads, though.
|
||||
performance. HPC-based indexing reduces the sensitivity for current ONT reads, though.
|
||||
|
||||
\subsection{Aligning genomic DNA}\label{sec:genomic}
|
||||
|
||||
|
|
@ -246,8 +259,8 @@ In case of 2-piece gap cost, define
|
|||
\[
|
||||
\left\{\begin{array}{ll}
|
||||
u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
|
||||
x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-\tilde{H}_{ij} \\
|
||||
y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-\tilde{H}_{ij}
|
||||
x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-H_{ij} \\
|
||||
y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-H_{ij}
|
||||
\end{array}\right.
|
||||
\]
|
||||
We can transform Eq.~(\ref{eq:ae86}) to
|
||||
|
|
@ -311,7 +324,7 @@ the same anti-diagonal in one inner loop. It also simplifies banded alignment,
|
|||
which would be difficult with striped vectorization~\citep{Farrar:2007hs}.
|
||||
|
||||
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the initial
|
||||
values in the diagonal-antidiagonal formuation is
|
||||
values in the diagonal-antidiagonal formuation are
|
||||
\[
|
||||
\left\{\begin{array}{l}
|
||||
x_{r-1,-1}=y_{r-1,r}=-q-e\\
|
||||
|
|
@ -330,6 +343,13 @@ r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\til
|
|||
\]
|
||||
These can be derived from the initial values for Eq.~(\ref{eq:ae86}).
|
||||
|
||||
When performing global alignment, we do not need to compute $H_{rt}$ in each cell.
|
||||
We use 16-way vectorization throughout the alignment process. When extending
|
||||
alignments from ends of chains, we need to find the cell $(r,t)$ where $H_{rt}$
|
||||
reaches the maximum. We resort to 4-way vectorization to compute
|
||||
$H_{rt}=H_{r-1,t}+u_{rt}$. Because this computation is simple,
|
||||
Eq.~(\ref{eq:suzuki}) is still the dominant performance bottleneck.
|
||||
|
||||
In practice, our 16-way vectorized implementation of global alignment is three
|
||||
times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without
|
||||
banding, our implementation is slower than Edlib~\citep{Sosic:2017aa}, but with
|
||||
|
|
@ -397,7 +417,7 @@ p/2 & \mbox{if $T[i+1,i+3]$ is ${\tt GTC}$ or ${\tt GTT}$} \\
|
|||
p & \mbox{otherwise}
|
||||
\end{array}\right.\]
|
||||
where $T[i,j]$ extracts a substring of $T$ between $i$ and $j$ inclusively.
|
||||
$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukayotic
|
||||
$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukaryotic
|
||||
splicing signal ${\tt GT[C/T]}$ with $p/2$~\citep{Irimia:2008aa}. Similarly,
|
||||
\[a(i)=\left\{\begin{array}{ll}
|
||||
0 & \mbox{if $T[i-2,i]$ is ${\tt CAG}$ or ${\tt TAG}$} \\
|
||||
|
|
@ -424,13 +444,13 @@ alignment.
|
|||
|
||||
\subsection{Aligning short paired-end reads}
|
||||
|
||||
During chainging, minimap2 takes a pair of reads as one fragment with a gap of
|
||||
During chaining, minimap2 takes a pair of reads as one fragment with a gap of
|
||||
unknown length in the middle. It applies a normal gap cost between seeds on the
|
||||
same read but is a more permissive gap cost between seeds on different reads.
|
||||
More precisely, the gap cost during chaining is:
|
||||
More precisely, the gap cost during chaining is ($l\not=0$):
|
||||
\[
|
||||
\gamma_c(l)=\left\{\begin{array}{ll}
|
||||
0.01\cdot\bar{w}\cdot l+0.5\log_2 l & \mbox{if two seeds on the same read} \\
|
||||
0.01\cdot\bar{w}\cdot |l|+0.5\log_2 |l| & \mbox{if two seeds on the same read} \\
|
||||
\min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & \mbox{otherwise}
|
||||
\end{array}\right.
|
||||
\]
|
||||
|
|
@ -443,6 +463,14 @@ consistent paired-end alignments.
|
|||
|
||||
\section{Results}
|
||||
|
||||
Minimap2 is implemented in the C programming language and comes with APIs in
|
||||
both C and Python. It is distributed under the MIT license, free to both
|
||||
commercial and academic uses. Minimap2 uses the same base algorithm for all
|
||||
applications, but it has to apply different sets of parameters depending on
|
||||
input data types. Similar to BWA-MEM, minimap2 introduces `presets' that
|
||||
modify multiple parameters with a simple invokation. Detailed settings
|
||||
and command-line options can be found in the minimap2 manpage.
|
||||
|
||||
\subsection{Aligning long genomic reads}\label{sec:long-genomic}
|
||||
|
||||
\begin{figure}[!tb]
|
||||
|
|
@ -450,19 +478,22 @@ consistent paired-end alignments.
|
|||
\includegraphics[width=.5\textwidth]{roc-color.pdf}
|
||||
\caption{Evaluation on aligning simulated reads. Simulated reads were mapped
|
||||
to the primary assembly of human genome GRCh38. A read is considered correctly
|
||||
mapped if the true position overlaps with the best mapping position by 10\% of
|
||||
the read length. Read alignments are sorted by mapping quality in the
|
||||
descending order. For each mapping quality threshold, the fraction of
|
||||
alignments with mapping quality above the threshold and their error rate are
|
||||
mapped if its longest alignment overlaps with the true interval, and the
|
||||
overlap length is $\ge$10\% of the true interval length. Read alignments are
|
||||
sorted by mapping quality in the descending order. For each mapping quality
|
||||
threshold, the fraction of alignments (out of the number of input reads) with
|
||||
mapping quality above the threshold and their error rate are
|
||||
plotted along the curve. (a) long-read alignment evaluation. 33,088 $\ge$1000bp
|
||||
reads were simulated using pbsim~\citep{Ono:2013aa} with error profile sampled
|
||||
from file `m131017\_060208\_42213\_*.1.*' downloaded at
|
||||
\href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length is
|
||||
11,628. Aligners were run under the default setting for SMRT reads.
|
||||
(b) short-read alignment evaluation. 10 million pairs of 150bp reads were
|
||||
simulated using mason2~\citep{Holtgrewe:2010aa} with option
|
||||
`\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under the
|
||||
default setting except for changing the maximum fragment length to
|
||||
Kart outputted all alignments at mapping quality 60, so is not shown in the
|
||||
figure. It mapped nearly all reads with 4.1\% of alignments being wrong, less
|
||||
accurate than others. (b) short-read alignment evaluation. 10 million pairs of
|
||||
150bp reads were simulated using mason2~\citep{Holtgrewe:2010aa} with option
|
||||
`\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under
|
||||
the default setting except for changing the maximum fragment length to
|
||||
800bp.}\label{fig:eval}
|
||||
\end{figure}
|
||||
|
||||
|
|
@ -480,11 +511,11 @@ higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and
|
|||
NGMLR provide better mapping quality estimate: they rarely give repetitive hits
|
||||
high mapping quality. Apparently, other aligners may
|
||||
occasionally miss close suboptimal hits and be overconfident in wrong mappings.
|
||||
On run time, minialign is slightly faster than minimap2 and Kart. They are over
|
||||
30 times faster than the rest. Minimap2 consumed 6.1GB memory at the peak,
|
||||
more than BWA-MEM but less than others.
|
||||
On run time, minimap2 took 200 CPU seconds, comparable to minialign and Kart, and is over
|
||||
30 times faster than the rest. Minimap2 consumed 6.8GB memory at the peak,
|
||||
more than BWA-MEM (5.4GB), similar to NGMLR and less than others.
|
||||
|
||||
On real human SMRT reads, the relative performance and sensitivity of
|
||||
On real human SMRT reads, the relative performance and fraction of mapped reads reported by
|
||||
these aligners are broadly similar to the metrics on simulated data. We are
|
||||
unable to provide a good estimate of mapping error rate due to the lack of the
|
||||
truth. On ONT $\sim$100kb human reads~\citep{Jain128835}, BWA-MEM failed.
|
||||
|
|
@ -526,7 +557,7 @@ Peak RAM (GByte) & 8.9 & 14.5 & 3.2 & 29.2\vspace{1em}\\
|
|||
\% approx. introns & 91.8\% & 96.9\% & 92.5\% & 82.4\% \\
|
||||
\botrule
|
||||
\end{tabular}
|
||||
}{Mouse reads (AC:SRR5286960) were mapped to the primary assembly of mouse
|
||||
}{Mouse reads (AC:SRR5286960; R9.4 chemistry) were mapped to the primary assembly of mouse
|
||||
genome GRCm38 with the following tools and command options: minimap2 (`-ax
|
||||
splice'); GMAP (`-n 0 --min-intronlength 30 --cross-species'); SpAln (`-Q7 -LS
|
||||
-S3'); STARlong (according to
|
||||
|
|
@ -614,8 +645,9 @@ Minimap2 retains minimap's functionality to find overlaps between long reads
|
|||
and to search against large multi-species databases such as \emph{nt} from
|
||||
NCBI. Minimap2 can also align similar genomes or different assemblies of the
|
||||
same species. It took 7 wall-clock minutes over 8 CPU cores to align a human
|
||||
SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times faster
|
||||
MUMmer4~\citep{Kurtz:2004zr}.
|
||||
SMRT assembly (AC:GCA\_001297185.1) to GRCh38. QUAST~\citep{Gurevich:2013aa}
|
||||
v5.0 will use minimap2 to evaluate the quality of assemblies against a
|
||||
reference genome.
|
||||
|
||||
\section{Discussions}
|
||||
|
||||
|
|
@ -635,7 +667,7 @@ chaining alone is more accurate than all the other long-read mappers in
|
|||
Fig.~\ref{fig:eval}a (data not shown). This accuracy helps to reduce downstream
|
||||
base-level alignment of candidate chains, which is still times slower than
|
||||
chaining even with the Suzuki-Kasahara improvement. In addition, taking a
|
||||
general form, minimap2 chaining can be adapted to non-typical data types such
|
||||
general form, minimap2 chaining can be adapted to non-typical data types such as
|
||||
spliced reads and multiple reads per fragment. This gives us the opportunity to
|
||||
extend the same base algorithm to a variety of use cases.
|
||||
|
||||
|
|
@ -647,8 +679,9 @@ k-mers with a hash table instead. Such fixed-length seeds are inferior to
|
|||
variable-length seeds in theory, but can be computed much more efficiently in
|
||||
practice. When a query sequence has multiple seed hits, we can afford to skip
|
||||
highly repetitive seeds without affecting the final accuracy. This further
|
||||
alleviates the concern with the uniqueness of seeds. Hash table is the ideal
|
||||
data structure for mapping long query sequences.
|
||||
alleviates the concern with the seeding uniqueness. At the same time, at low
|
||||
sequence identity, it is rare to see long seeds anyway. Hash table is the ideal
|
||||
data structure for mapping long noisy sequences.
|
||||
|
||||
\section*{Acknowledgements}
|
||||
We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their
|
||||
|
|
|
|||
Loading…
Reference in New Issue