first round of revision

2018-02-06 16:19:22 -05:00 · 2018-02-06 16:19:22 -05:00 · 39535565ee
parent a8d476c6ad
commit 39535565ee
2 changed files with 86 additions and 37 deletions
--- a/tex/minimap2.bib
+++ b/tex/minimap2.bib
@ -313,3 +313,19 @@
 	Title = {Assembling large genomes with single-molecule sequencing and locality-sensitive hashing},
 	Volume = {33},
 	Year = {2015}}
@article{Gurevich:2013aa,
 	Author = {Gurevich, Alexey and others},
 	Journal = {Bioinformatics},
 	Pages = {1072-5},
 	Title = {{QUAST}: quality assessment tool for genome assemblies},
 	Volume = {29},
 	Year = {2013}}
@article{Li:2010fk,
 	Author = {Li, Heng and Durbin, Richard},
 	Journal = {Bioinformatics},
 	Pages = {589-95},
 	Title = {Fast and accurate long-read alignment with {Burrows-Wheeler} transform},
 	Volume = {26},
 	Year = {2010}}
--- a/tex/minimap2.tex
+++ b/tex/minimap2.tex
@ -40,9 +40,10 @@ full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
 related full chromosomes of hundreds of megabases in length. Minimap2 does
 split-read alignment, employs concave gap cost for long insertions and
 deletions (INDELs) and introduces new heuristics to reduce spurious alignments.
-It is 3--4 times faster than mainstream short-read mappers at comparable
+It is 3--4 times as fast as mainstream short-read mappers at comparable
-accuracy and $\ge$30 times faster at higher accuracy for both genomic and mRNA
+accuracy, and is $\ge$30 times faster than long-read genomic or cDNA
-reads, surpassing most aligners specialized in one type of alignment.
+mappers at higher accuracy, surpassing most aligners specialized in one type of
 alignment.
 \section{Availability and implementation:}
 \href{https://github.com/lh3/minimap2}{https://github.com/lh3/minimap2}
@ -117,12 +118,12 @@ distance between two anchors is too large); otherwise
 \begin{equation}\label{eq:chain-gap}
 \beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big)
 \end{equation}
-In implementation, a gap of length $l$ costs
+In implementation, a gap of length $l\not=0$ costs
 \[
 \gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l|
 \]
-where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with
+where $\bar{w}$ is the average seed length. For $n$ anchors, directly computing all $f(\cdot)$ with
-Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster
+Eq.~(\ref{eq:chain}) takes $O(n^2)$ time. Although theoretically faster
 chaining algorithms exist~\citep{Abouelhoda:2005aa}, they
 are inapplicable to generic gap cost, complex to implement and usually
 associated with a large constant. We introduced a simple heuristic to
@ -132,7 +133,7 @@ We note that if anchor $i$ is chained to $j$, chaining $i$ to a predecessor
 of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}),
 we start from anchor $i-1$ and stop the process if we cannot find a better
 score after up to $h$ iterations. This approach reduces the average time to
-$O(h\cdot m)$. In practice, we can almost always find the optimal chain with
+$O(h\cdot n)$. In practice, we can almost always find the optimal chain with
 $h=50$; even if the heuristic fails, the optimal chain is often close.
 \subsubsection{Backtracking}
@ -146,9 +147,11 @@ in more than one chains.
 \subsubsection{Identifying primary chains}\label{sec:primary}
 In the absence of copy number changes, each query segment should not be mapped
 to two places in the reference. However, chains found at the previous step may
-have significant or complete overlaps due to repeats in the reference.
+have significant or complete overlaps due to repeats in the reference~\citep{Li:2010fk}.
 Minimap2 used the following procedure to identify \emph{primary chains} that do
-not greatly overlap on the query. Let $Q$ be an empty set initially. For each
+not greatly overlap on the query.
 Let $Q$ be an empty set initially. For each
 chain from the best to the worst according to their chaining scores: if on the
 query, the chain overlaps with a chain in $Q$ by 50\% or higher percentage of
 the shorter chain, mark the chain as secondary to the chain in $Q$; otherwise,
@ -156,6 +159,16 @@ add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did
 not choose a more sophisticated data structure (e.g. range tree or k-d tree)
 because this step is not the performance bottleneck.
 For each primary chain, minimap2 estimates its mapping quality with an
 empirical formula:
 \[
 {\rm mapQ}=40\cdot (1-f_2/f_1)\cdot\min\{1,m/10\}\cdot\log f_1
 \]
 where $m$ is the number of anchors on the primary chain, $f_1$ is the chaining
 score, and $f_2\le f_1$ is the score of the best chain that is secondary to the
 primary chain. Intuitively, a chain is assigned to a higher mapping quality if
 it is long and its best secondary chain is weak.
 \subsubsection{Estimating per-base sequence divergence}
 Suppose a query sequence harbors $n$ seeds of length $k$, $m$ of which are
 present in a chain. We want to estimate the sequence divergence $\epsilon$
@ -186,7 +199,7 @@ $0.9$.
 \subsubsection{Indexing with homopolymer compressed $k$-mers}
 SmartDenovo
 (\href{https://github.com/ruanjue/smartdenovo}{https://github.com/ruanjue/smartdenovo};
-J Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
+J. Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
 $k$-mers and finds the strategy improves overlap sensitivity for SMRT reads.
 Minimap2 adopts the same heuristic.
@ -199,9 +212,9 @@ To demonstrate the effectiveness of HPC $k$-mers, we performed read overlapping
 for the example {\it E. coli} SMRT reads from PBcR~\citep{Berlin:2015xy}, using
 different types of $k$-mers. With normal 15bp minimizers per 5bp window,
 minimap2 finds 90.9\% of $\ge$2kb overlaps inferred from the read-to-reference
-alignment. With HPC 19-mers, minimap2 finds 97.4\% of overlaps. It achieves this
+alignment. With HPC 19-mers per 5bp window, minimap2 finds 97.4\% of overlaps. It achieves this
 higher sensitivity by indexing 1/3 fewer minimizers, which further helps
-performance. HPC-based indexing reduces the sensitivity for ONT reads, though.
+performance. HPC-based indexing reduces the sensitivity for current ONT reads, though.
 \subsection{Aligning genomic DNA}\label{sec:genomic}
@ -246,8 +259,8 @@ In case of 2-piece gap cost, define
 \[
 \left\{\begin{array}{ll}
 u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
-x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-\tilde{H}_{ij} \\
+x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-H_{ij} \\
-y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-\tilde{H}_{ij}
+y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-H_{ij}
 \end{array}\right.
 \]
 We can transform Eq.~(\ref{eq:ae86}) to
@ -311,7 +324,7 @@ the same anti-diagonal in one inner loop. It also simplifies banded alignment,
 which would be difficult with striped vectorization~\citep{Farrar:2007hs}.
 On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the initial
-values in the diagonal-antidiagonal formuation is
+values in the diagonal-antidiagonal formuation are
 \[
 \left\{\begin{array}{l}
 x_{r-1,-1}=y_{r-1,r}=-q-e\\
@ -330,6 +343,13 @@ r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\til
 \]
 These can be derived from the initial values for Eq.~(\ref{eq:ae86}).
 When performing global alignment, we do not need to compute $H_{rt}$ in each cell.
 We use 16-way vectorization throughout the alignment process. When extending
 alignments from ends of chains, we need to find the cell $(r,t)$ where $H_{rt}$
 reaches the maximum. We resort to 4-way vectorization to compute
 $H_{rt}=H_{r-1,t}+u_{rt}$. Because this computation is simple,
 Eq.~(\ref{eq:suzuki}) is still the dominant performance bottleneck.
 In practice, our 16-way vectorized implementation of global alignment is three
 times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}.  Without
 banding, our implementation is slower than Edlib~\citep{Sosic:2017aa}, but with
@ -397,7 +417,7 @@ p/2 & \mbox{if $T[i+1,i+3]$ is ${\tt GTC}$ or ${\tt GTT}$} \\
 p & \mbox{otherwise}
 \end{array}\right.\]
 where $T[i,j]$ extracts a substring of $T$ between $i$ and $j$ inclusively.
-$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukayotic
+$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukaryotic
 splicing signal ${\tt GT[C/T]}$ with $p/2$~\citep{Irimia:2008aa}. Similarly,
 \[a(i)=\left\{\begin{array}{ll}
 0 & \mbox{if $T[i-2,i]$ is ${\tt CAG}$ or ${\tt TAG}$} \\
@ -424,13 +444,13 @@ alignment.
 \subsection{Aligning short paired-end reads}
-During chainging, minimap2 takes a pair of reads as one fragment with a gap of
+During chaining, minimap2 takes a pair of reads as one fragment with a gap of
 unknown length in the middle. It applies a normal gap cost between seeds on the
 same read but is a more permissive gap cost between seeds on different reads.
-More precisely, the gap cost during chaining is:
+More precisely, the gap cost during chaining is ($l\not=0$):
 \[
 \gamma_c(l)=\left\{\begin{array}{ll}
-0.01\cdot\bar{w}\cdot l+0.5\log_2 l & \mbox{if two seeds on the same read} \\
+0.01\cdot\bar{w}\cdot |l|+0.5\log_2 |l| & \mbox{if two seeds on the same read} \\
 \min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & \mbox{otherwise}
 \end{array}\right.
 \]
@ -443,6 +463,14 @@ consistent paired-end alignments.
 \section{Results}
 Minimap2 is implemented in the C programming language and comes with APIs in
 both C and Python. It is distributed under the MIT license, free to both
 commercial and academic uses. Minimap2 uses the same base algorithm for all
 applications, but it has to apply different sets of parameters depending on
 input data types. Similar to BWA-MEM, minimap2 introduces `presets' that
 modify multiple parameters with a simple invokation. Detailed settings
 and command-line options can be found in the minimap2 manpage.
 \subsection{Aligning long genomic reads}\label{sec:long-genomic}
 \begin{figure}[!tb]
@ -450,19 +478,22 @@ consistent paired-end alignments.
 \includegraphics[width=.5\textwidth]{roc-color.pdf}
 \caption{Evaluation on aligning simulated reads. Simulated reads were mapped 
 to the primary assembly of human genome GRCh38. A read is considered correctly
-mapped if the true position overlaps with the best mapping position by 10\% of
+mapped if its longest alignment overlaps with the true interval, and the
-the read length. Read alignments are sorted by mapping quality in the
+overlap length is $\ge$10\% of the true interval length.  Read alignments are
-descending order. For each mapping quality threshold, the fraction of
+sorted by mapping quality in the descending order. For each mapping quality
-alignments with mapping quality above the threshold and their error rate are
+threshold, the fraction of alignments (out of the number of input reads) with
 mapping quality above the threshold and their error rate are
 plotted along the curve. (a) long-read alignment evaluation. 33,088 $\ge$1000bp
 reads were simulated using pbsim~\citep{Ono:2013aa} with error profile sampled
 from file `m131017\_060208\_42213\_*.1.*' downloaded at
 \href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length is
 11,628. Aligners were run under the default setting for SMRT reads.
-(b) short-read alignment evaluation. 10 million pairs of 150bp reads were
+Kart outputted all alignments at mapping quality 60, so is not shown in the
-simulated using mason2~\citep{Holtgrewe:2010aa} with option
+figure. It mapped nearly all reads with 4.1\% of alignments being wrong, less
-`\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under the
+accurate than others.  (b) short-read alignment evaluation. 10 million pairs of
-default setting except for changing the maximum fragment length to
+150bp reads were simulated using mason2~\citep{Holtgrewe:2010aa} with option
 `\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under
 the default setting except for changing the maximum fragment length to
 800bp.}\label{fig:eval}
 \end{figure}
@ -480,11 +511,11 @@ higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and
 NGMLR provide better mapping quality estimate: they rarely give repetitive hits
 high mapping quality.  Apparently, other aligners may
 occasionally miss close suboptimal hits and be overconfident in wrong mappings.
-On run time, minialign is slightly faster than minimap2 and Kart. They are over
+On run time, minimap2 took 200 CPU seconds, comparable to minialign and Kart, and is over
-30 times faster than the rest.  Minimap2 consumed 6.1GB memory at the peak,
+30 times faster than the rest.  Minimap2 consumed 6.8GB memory at the peak,
-more than BWA-MEM but less than others.
+more than BWA-MEM (5.4GB), similar to NGMLR and less than others.
-On real human SMRT reads, the relative performance and sensitivity of
+On real human SMRT reads, the relative performance and fraction of mapped reads reported by
 these aligners are broadly similar to the metrics on simulated data. We are
 unable to provide a good estimate of mapping error rate due to the lack of the
 truth.  On ONT $\sim$100kb human reads~\citep{Jain128835}, BWA-MEM failed.
@ -526,7 +557,7 @@ Peak RAM (GByte)          & 8.9      & 14.5     & 3.2      & 29.2\vspace{1em}\\
 \% approx. introns        & 91.8\%   & 96.9\%   & 92.5\%   & 82.4\% \\
 \botrule
 \end{tabular}
-}{Mouse reads (AC:SRR5286960) were mapped to the primary assembly of mouse
+}{Mouse reads (AC:SRR5286960; R9.4 chemistry) were mapped to the primary assembly of mouse
 genome GRCm38 with the following tools and command options: minimap2 (`-ax
 splice'); GMAP (`-n 0 --min-intronlength 30 --cross-species'); SpAln (`-Q7 -LS
 -S3'); STARlong (according to
@ -614,8 +645,9 @@ Minimap2 retains minimap's functionality to find overlaps between long reads
 and to search against large multi-species databases such as \emph{nt} from
 NCBI. Minimap2 can also align similar genomes or different assemblies of the
 same species. It took 7 wall-clock minutes over 8 CPU cores to align a human
-SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times faster
+SMRT assembly (AC:GCA\_001297185.1) to GRCh38. QUAST~\citep{Gurevich:2013aa}
-MUMmer4~\citep{Kurtz:2004zr}.
+v5.0 will use minimap2 to evaluate the quality of assemblies against a
 reference genome.
 \section{Discussions}
@ -635,7 +667,7 @@ chaining alone is more accurate than all the other long-read mappers in
 Fig.~\ref{fig:eval}a (data not shown). This accuracy helps to reduce downstream
 base-level alignment of candidate chains, which is still times slower than
 chaining even with the Suzuki-Kasahara improvement. In addition, taking a
-general form, minimap2 chaining can be adapted to non-typical data types such
+general form, minimap2 chaining can be adapted to non-typical data types such as
 spliced reads and multiple reads per fragment. This gives us the opportunity to
 extend the same base algorithm to a variety of use cases.
@ -647,8 +679,9 @@ k-mers with a hash table instead. Such fixed-length seeds are inferior to
 variable-length seeds in theory, but can be computed much more efficiently in
 practice. When a query sequence has multiple seed hits, we can afford to skip
 highly repetitive seeds without affecting the final accuracy. This further
-alleviates the concern with the uniqueness of seeds. Hash table is the ideal
+alleviates the concern with the seeding uniqueness. At the same time, at low
-data structure for mapping long query sequences.
+sequence identity, it is rare to see long seeds anyway. Hash table is the ideal
 data structure for mapping long noisy sequences.
 \section*{Acknowledgements}
 We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their