first round of revision

2018-02-06 16:19:22 -05:00 · 2018-02-06 16:19:22 -05:00 · 39535565ee
parent a8d476c6ad
commit 39535565ee
2 changed files with 86 additions and 37 deletions
--- a/tex/minimap2.bib
+++ b/tex/minimap2.bib
@ -313,3 +313,19 @@
 	Title = {Assembling large genomes with single-molecule sequencing and locality-sensitive hashing},
 	Volume = {33},
 	Year = {2015}}
+
+@article{Gurevich:2013aa,
+	Author = {Gurevich, Alexey and others},
+	Journal = {Bioinformatics},
+	Pages = {1072-5},
+	Title = {{QUAST}: quality assessment tool for genome assemblies},
+	Volume = {29},
+	Year = {2013}}
+
+@article{Li:2010fk,
+	Author = {Li, Heng and Durbin, Richard},
+	Journal = {Bioinformatics},
+	Pages = {589-95},
+	Title = {Fast and accurate long-read alignment with {Burrows-Wheeler} transform},
+	Volume = {26},
+	Year = {2010}}
--- a/tex/minimap2.tex
+++ b/tex/minimap2.tex
@ -40,9 +40,10 @@ full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
 related full chromosomes of hundreds of megabases in length. Minimap2 does
 split-read alignment, employs concave gap cost for long insertions and
 deletions (INDELs) and introduces new heuristics to reduce spurious alignments.
-It is 3--4 times faster than mainstream short-read mappers at comparable
-accuracy and $\ge$30 times faster at higher accuracy for both genomic and mRNA
-reads, surpassing most aligners specialized in one type of alignment.
+It is 3--4 times as fast as mainstream short-read mappers at comparable
+accuracy, and is $\ge$30 times faster than long-read genomic or cDNA
+mappers at higher accuracy, surpassing most aligners specialized in one type of
+alignment.

 \section{Availability and implementation:}
 \href{https://github.com/lh3/minimap2}{https://github.com/lh3/minimap2}
@ -117,12 +118,12 @@ distance between two anchors is too large); otherwise
 \begin{equation}\label{eq:chain-gap}
 \beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big)
 \end{equation}
-In implementation, a gap of length $l$ costs
+In implementation, a gap of length $l\not=0$ costs
 \[
 \gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l|
 \]
-where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
-Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster
+where $\bar{w}$ is the average seed length. For $n$ anchors, directly computing all $f(\cdot)$ with
+Eq.~(\ref{eq:chain}) takes $O(n^2)$ time. Although theoretically faster
 chaining algorithms exist~\citep{Abouelhoda:2005aa}, they
 are inapplicable to generic gap cost, complex to implement and usually
 associated with a large constant. We introduced a simple heuristic to
@ -132,7 +133,7 @@ We note that if anchor $i$ is chained to $j$, chaining $i$ to a predecessor
 of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}),
 we start from anchor $i-1$ and stop the process if we cannot find a better
 score after up to $h$ iterations. This approach reduces the average time to
-$O(h\cdot m)$. In practice, we can almost always find the optimal chain with
+$O(h\cdot n)$. In practice, we can almost always find the optimal chain with
 $h=50$; even if the heuristic fails, the optimal chain is often close.

 \subsubsection{Backtracking}
@ -146,9 +147,11 @@ in more than one chains.
 \subsubsection{Identifying primary chains}\label{sec:primary}
 In the absence of copy number changes, each query segment should not be mapped
 to two places in the reference. However, chains found at the previous step may
-have significant or complete overlaps due to repeats in the reference.
+have significant or complete overlaps due to repeats in the reference~\citep{Li:2010fk}.
 Minimap2 used the following procedure to identify \emph{primary chains} that do
-not greatly overlap on the query. Let $Q$ be an empty set initially. For each
+not greatly overlap on the query.
+
+Let $Q$ be an empty set initially. For each
 chain from the best to the worst according to their chaining scores: if on the
 query, the chain overlaps with a chain in $Q$ by 50\% or higher percentage of
 the shorter chain, mark the chain as secondary to the chain in $Q$; otherwise,
@ -156,6 +159,16 @@ add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did
 not choose a more sophisticated data structure (e.g. range tree or k-d tree)
 because this step is not the performance bottleneck.

+For each primary chain, minimap2 estimates its mapping quality with an
+empirical formula:
+\[
+{\rm mapQ}=40\cdot (1-f_2/f_1)\cdot\min\{1,m/10\}\cdot\log f_1
+\]
+where $m$ is the number of anchors on the primary chain, $f_1$ is the chaining
+score, and $f_2\le f_1$ is the score of the best chain that is secondary to the
+primary chain. Intuitively, a chain is assigned to a higher mapping quality if
+it is long and its best secondary chain is weak.
+
 \subsubsection{Estimating per-base sequence divergence}
 Suppose a query sequence harbors $n$ seeds of length $k$, $m$ of which are
 present in a chain. We want to estimate the sequence divergence $\epsilon$
@ -186,7 +199,7 @@ $0.9$.
 \subsubsection{Indexing with homopolymer compressed $k$-mers}
 SmartDenovo
 (\href{https://github.com/ruanjue/smartdenovo}{https://github.com/ruanjue/smartdenovo};
-J Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
+J. Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
 $k$-mers and finds the strategy improves overlap sensitivity for SMRT reads.
 Minimap2 adopts the same heuristic.

@ -199,9 +212,9 @@ To demonstrate the effectiveness of HPC $k$-mers, we performed read overlapping
 for the example {\it E. coli} SMRT reads from PBcR~\citep{Berlin:2015xy}, using
 different types of $k$-mers. With normal 15bp minimizers per 5bp window,
 minimap2 finds 90.9\% of $\ge$2kb overlaps inferred from the read-to-reference
-alignment. With HPC 19-mers, minimap2 finds 97.4\% of overlaps. It achieves this
+alignment. With HPC 19-mers per 5bp window, minimap2 finds 97.4\% of overlaps. It achieves this
 higher sensitivity by indexing 1/3 fewer minimizers, which further helps
-performance. HPC-based indexing reduces the sensitivity for ONT reads, though.
+performance. HPC-based indexing reduces the sensitivity for current ONT reads, though.

 \subsection{Aligning genomic DNA}\label{sec:genomic}

@ -246,8 +259,8 @@ In case of 2-piece gap cost, define
 \[
 \left\{\begin{array}{ll}
 u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
-x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-\tilde{H}_{ij} \\
-y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-\tilde{H}_{ij}
+x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-H_{ij} \\
+y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-H_{ij}
 \end{array}\right.
 \]
 We can transform Eq.~(\ref{eq:ae86}) to
@ -311,7 +324,7 @@ the same anti-diagonal in one inner loop. It also simplifies banded alignment,
 which would be difficult with striped vectorization~\citep{Farrar:2007hs}.

 On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the initial
-values in the diagonal-antidiagonal formuation is
+values in the diagonal-antidiagonal formuation are
 \[
 \left\{\begin{array}{l}
 x_{r-1,-1}=y_{r-1,r}=-q-e\\
@ -330,6 +343,13 @@ r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\til
 \]
 These can be derived from the initial values for Eq.~(\ref{eq:ae86}).

+When performing global alignment, we do not need to compute $H_{rt}$ in each cell.
+We use 16-way vectorization throughout the alignment process. When extending
+alignments from ends of chains, we need to find the cell $(r,t)$ where $H_{rt}$
+reaches the maximum. We resort to 4-way vectorization to compute
+$H_{rt}=H_{r-1,t}+u_{rt}$. Because this computation is simple,
+Eq.~(\ref{eq:suzuki}) is still the dominant performance bottleneck.
+
 In practice, our 16-way vectorized implementation of global alignment is three
 times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}.  Without
 banding, our implementation is slower than Edlib~\citep{Sosic:2017aa}, but with
@ -397,7 +417,7 @@ p/2 & \mbox{if $T[i+1,i+3]$ is ${\tt GTC}$ or ${\tt GTT}$} \\
 p & \mbox{otherwise}
 \end{array}\right.\]
 where $T[i,j]$ extracts a substring of $T$ between $i$ and $j$ inclusively.
-$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukayotic
+$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukaryotic
 splicing signal ${\tt GT[C/T]}$ with $p/2$~\citep{Irimia:2008aa}. Similarly,
 \[a(i)=\left\{\begin{array}{ll}
 0 & \mbox{if $T[i-2,i]$ is ${\tt CAG}$ or ${\tt TAG}$} \\
@ -424,13 +444,13 @@ alignment.

 \subsection{Aligning short paired-end reads}

-During chainging, minimap2 takes a pair of reads as one fragment with a gap of
+During chaining, minimap2 takes a pair of reads as one fragment with a gap of
 unknown length in the middle. It applies a normal gap cost between seeds on the
 same read but is a more permissive gap cost between seeds on different reads.
-More precisely, the gap cost during chaining is:
+More precisely, the gap cost during chaining is ($l\not=0$):
 \[
 \gamma_c(l)=\left\{\begin{array}{ll}
-0.01\cdot\bar{w}\cdot l+0.5\log_2 l & \mbox{if two seeds on the same read} \\
+0.01\cdot\bar{w}\cdot |l|+0.5\log_2 |l| & \mbox{if two seeds on the same read} \\
 \min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & \mbox{otherwise}
 \end{array}\right.
 \]
@ -443,6 +463,14 @@ consistent paired-end alignments.

 \section{Results}

+Minimap2 is implemented in the C programming language and comes with APIs in
+both C and Python. It is distributed under the MIT license, free to both
+commercial and academic uses. Minimap2 uses the same base algorithm for all
+applications, but it has to apply different sets of parameters depending on
+input data types. Similar to BWA-MEM, minimap2 introduces `presets' that
+modify multiple parameters with a simple invokation. Detailed settings
+and command-line options can be found in the minimap2 manpage.
+
 \subsection{Aligning long genomic reads}\label{sec:long-genomic}

 \begin{figure}[!tb]
@ -450,19 +478,22 @@ consistent paired-end alignments.
 \includegraphics[width=.5\textwidth]{roc-color.pdf}
 \caption{Evaluation on aligning simulated reads. Simulated reads were mapped 
 to the primary assembly of human genome GRCh38. A read is considered correctly
-mapped if the true position overlaps with the best mapping position by 10\% of
-the read length. Read alignments are sorted by mapping quality in the
-descending order. For each mapping quality threshold, the fraction of
-alignments with mapping quality above the threshold and their error rate are
+mapped if its longest alignment overlaps with the true interval, and the
+overlap length is $\ge$10\% of the true interval length.  Read alignments are
+sorted by mapping quality in the descending order. For each mapping quality
+threshold, the fraction of alignments (out of the number of input reads) with
+mapping quality above the threshold and their error rate are
 plotted along the curve. (a) long-read alignment evaluation. 33,088 $\ge$1000bp
 reads were simulated using pbsim~\citep{Ono:2013aa} with error profile sampled
 from file `m131017\_060208\_42213\_*.1.*' downloaded at
 \href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length is
 11,628. Aligners were run under the default setting for SMRT reads.
-(b) short-read alignment evaluation. 10 million pairs of 150bp reads were
-simulated using mason2~\citep{Holtgrewe:2010aa} with option
-`\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under the
-default setting except for changing the maximum fragment length to
+Kart outputted all alignments at mapping quality 60, so is not shown in the
+figure. It mapped nearly all reads with 4.1\% of alignments being wrong, less
+accurate than others.  (b) short-read alignment evaluation. 10 million pairs of
+150bp reads were simulated using mason2~\citep{Holtgrewe:2010aa} with option
+`\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under
+the default setting except for changing the maximum fragment length to
 800bp.}\label{fig:eval}
 \end{figure}

@ -480,11 +511,11 @@ higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and
 NGMLR provide better mapping quality estimate: they rarely give repetitive hits
 high mapping quality.  Apparently, other aligners may
 occasionally miss close suboptimal hits and be overconfident in wrong mappings.
-On run time, minialign is slightly faster than minimap2 and Kart. They are over
-30 times faster than the rest.  Minimap2 consumed 6.1GB memory at the peak,
-more than BWA-MEM but less than others.
+On run time, minimap2 took 200 CPU seconds, comparable to minialign and Kart, and is over
+30 times faster than the rest.  Minimap2 consumed 6.8GB memory at the peak,
+more than BWA-MEM (5.4GB), similar to NGMLR and less than others.

-On real human SMRT reads, the relative performance and sensitivity of
+On real human SMRT reads, the relative performance and fraction of mapped reads reported by
 these aligners are broadly similar to the metrics on simulated data. We are
 unable to provide a good estimate of mapping error rate due to the lack of the
 truth.  On ONT $\sim$100kb human reads~\citep{Jain128835}, BWA-MEM failed.
@ -526,7 +557,7 @@ Peak RAM (GByte)          & 8.9      & 14.5     & 3.2      & 29.2\vspace{1em}\\
 \% approx. introns        & 91.8\%   & 96.9\%   & 92.5\%   & 82.4\% \\
 \botrule
 \end{tabular}
-}{Mouse reads (AC:SRR5286960) were mapped to the primary assembly of mouse
+}{Mouse reads (AC:SRR5286960; R9.4 chemistry) were mapped to the primary assembly of mouse
 genome GRCm38 with the following tools and command options: minimap2 (`-ax
 splice'); GMAP (`-n 0 --min-intronlength 30 --cross-species'); SpAln (`-Q7 -LS
 -S3'); STARlong (according to
@ -614,8 +645,9 @@ Minimap2 retains minimap's functionality to find overlaps between long reads
 and to search against large multi-species databases such as \emph{nt} from
 NCBI. Minimap2 can also align similar genomes or different assemblies of the
 same species. It took 7 wall-clock minutes over 8 CPU cores to align a human
-SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times faster
-MUMmer4~\citep{Kurtz:2004zr}.
+SMRT assembly (AC:GCA\_001297185.1) to GRCh38. QUAST~\citep{Gurevich:2013aa}
+v5.0 will use minimap2 to evaluate the quality of assemblies against a
+reference genome.

 \section{Discussions}

@ -635,7 +667,7 @@ chaining alone is more accurate than all the other long-read mappers in
 Fig.~\ref{fig:eval}a (data not shown). This accuracy helps to reduce downstream
 base-level alignment of candidate chains, which is still times slower than
 chaining even with the Suzuki-Kasahara improvement. In addition, taking a
-general form, minimap2 chaining can be adapted to non-typical data types such
+general form, minimap2 chaining can be adapted to non-typical data types such as
 spliced reads and multiple reads per fragment. This gives us the opportunity to
 extend the same base algorithm to a variety of use cases.

@ -647,8 +679,9 @@ k-mers with a hash table instead. Such fixed-length seeds are inferior to
 variable-length seeds in theory, but can be computed much more efficiently in
 practice. When a query sequence has multiple seed hits, we can afford to skip
 highly repetitive seeds without affecting the final accuracy. This further
-alleviates the concern with the uniqueness of seeds. Hash table is the ideal
-data structure for mapping long query sequences.
+alleviates the concern with the seeding uniqueness. At the same time, at low
+sequence identity, it is rare to see long seeds anyway. Hash table is the ideal
+data structure for mapping long noisy sequences.

 \section*{Acknowledgements}
 We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their