From 39535565ee3f936142f4662a2f8e4bcb04534e4c Mon Sep 17 00:00:00 2001 From: Heng Li Date: Tue, 6 Feb 2018 16:19:22 -0500 Subject: [PATCH] first round of revision --- tex/minimap2.bib | 16 +++++++ tex/minimap2.tex | 107 +++++++++++++++++++++++++++++++---------------- 2 files changed, 86 insertions(+), 37 deletions(-) diff --git a/tex/minimap2.bib b/tex/minimap2.bib index b12d4e9..4fed63c 100644 --- a/tex/minimap2.bib +++ b/tex/minimap2.bib @@ -313,3 +313,19 @@ Title = {Assembling large genomes with single-molecule sequencing and locality-sensitive hashing}, Volume = {33}, Year = {2015}} + +@article{Gurevich:2013aa, + Author = {Gurevich, Alexey and others}, + Journal = {Bioinformatics}, + Pages = {1072-5}, + Title = {{QUAST}: quality assessment tool for genome assemblies}, + Volume = {29}, + Year = {2013}} + +@article{Li:2010fk, + Author = {Li, Heng and Durbin, Richard}, + Journal = {Bioinformatics}, + Pages = {589-95}, + Title = {Fast and accurate long-read alignment with {Burrows-Wheeler} transform}, + Volume = {26}, + Year = {2010}} diff --git a/tex/minimap2.tex b/tex/minimap2.tex index d23760f..5b4514b 100644 --- a/tex/minimap2.tex +++ b/tex/minimap2.tex @@ -40,9 +40,10 @@ full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions (INDELs) and introduces new heuristics to reduce spurious alignments. -It is 3--4 times faster than mainstream short-read mappers at comparable -accuracy and $\ge$30 times faster at higher accuracy for both genomic and mRNA -reads, surpassing most aligners specialized in one type of alignment. +It is 3--4 times as fast as mainstream short-read mappers at comparable +accuracy, and is $\ge$30 times faster than long-read genomic or cDNA +mappers at higher accuracy, surpassing most aligners specialized in one type of +alignment. \section{Availability and implementation:} \href{https://github.com/lh3/minimap2}{https://github.com/lh3/minimap2} @@ -117,12 +118,12 @@ distance between two anchors is too large); otherwise \begin{equation}\label{eq:chain-gap} \beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big) \end{equation} -In implementation, a gap of length $l$ costs +In implementation, a gap of length $l\not=0$ costs \[ \gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l| \] -where $\bar{w}$ is the average seed length. For $m$ anchors, directly computing all $f(\cdot)$ with -Eq.~(\ref{eq:chain}) takes $O(m^2)$ time. Although theoretically faster +where $\bar{w}$ is the average seed length. For $n$ anchors, directly computing all $f(\cdot)$ with +Eq.~(\ref{eq:chain}) takes $O(n^2)$ time. Although theoretically faster chaining algorithms exist~\citep{Abouelhoda:2005aa}, they are inapplicable to generic gap cost, complex to implement and usually associated with a large constant. We introduced a simple heuristic to @@ -132,7 +133,7 @@ We note that if anchor $i$ is chained to $j$, chaining $i$ to a predecessor of $j$ is likely to yield a lower score. When evaluating Eq.~(\ref{eq:chain}), we start from anchor $i-1$ and stop the process if we cannot find a better score after up to $h$ iterations. This approach reduces the average time to -$O(h\cdot m)$. In practice, we can almost always find the optimal chain with +$O(h\cdot n)$. In practice, we can almost always find the optimal chain with $h=50$; even if the heuristic fails, the optimal chain is often close. \subsubsection{Backtracking} @@ -146,9 +147,11 @@ in more than one chains. \subsubsection{Identifying primary chains}\label{sec:primary} In the absence of copy number changes, each query segment should not be mapped to two places in the reference. However, chains found at the previous step may -have significant or complete overlaps due to repeats in the reference. +have significant or complete overlaps due to repeats in the reference~\citep{Li:2010fk}. Minimap2 used the following procedure to identify \emph{primary chains} that do -not greatly overlap on the query. Let $Q$ be an empty set initially. For each +not greatly overlap on the query. + +Let $Q$ be an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in $Q$ by 50\% or higher percentage of the shorter chain, mark the chain as secondary to the chain in $Q$; otherwise, @@ -156,6 +159,16 @@ add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did not choose a more sophisticated data structure (e.g. range tree or k-d tree) because this step is not the performance bottleneck. +For each primary chain, minimap2 estimates its mapping quality with an +empirical formula: +\[ +{\rm mapQ}=40\cdot (1-f_2/f_1)\cdot\min\{1,m/10\}\cdot\log f_1 +\] +where $m$ is the number of anchors on the primary chain, $f_1$ is the chaining +score, and $f_2\le f_1$ is the score of the best chain that is secondary to the +primary chain. Intuitively, a chain is assigned to a higher mapping quality if +it is long and its best secondary chain is weak. + \subsubsection{Estimating per-base sequence divergence} Suppose a query sequence harbors $n$ seeds of length $k$, $m$ of which are present in a chain. We want to estimate the sequence divergence $\epsilon$ @@ -186,7 +199,7 @@ $0.9$. \subsubsection{Indexing with homopolymer compressed $k$-mers} SmartDenovo (\href{https://github.com/ruanjue/smartdenovo}{https://github.com/ruanjue/smartdenovo}; -J Ruan, personal communication) indexes reads with homopolymer-compressed (HPC) +J. Ruan, personal communication) indexes reads with homopolymer-compressed (HPC) $k$-mers and finds the strategy improves overlap sensitivity for SMRT reads. Minimap2 adopts the same heuristic. @@ -199,9 +212,9 @@ To demonstrate the effectiveness of HPC $k$-mers, we performed read overlapping for the example {\it E. coli} SMRT reads from PBcR~\citep{Berlin:2015xy}, using different types of $k$-mers. With normal 15bp minimizers per 5bp window, minimap2 finds 90.9\% of $\ge$2kb overlaps inferred from the read-to-reference -alignment. With HPC 19-mers, minimap2 finds 97.4\% of overlaps. It achieves this +alignment. With HPC 19-mers per 5bp window, minimap2 finds 97.4\% of overlaps. It achieves this higher sensitivity by indexing 1/3 fewer minimizers, which further helps -performance. HPC-based indexing reduces the sensitivity for ONT reads, though. +performance. HPC-based indexing reduces the sensitivity for current ONT reads, though. \subsection{Aligning genomic DNA}\label{sec:genomic} @@ -246,8 +259,8 @@ In case of 2-piece gap cost, define \[ \left\{\begin{array}{ll} u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\ -x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-\tilde{H}_{ij} \\ -y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-\tilde{H}_{ij} +x_{ij}\triangleq E_{i+1,j}-H_{ij} & \tilde{x}_{ij}\triangleq \tilde{E}_{i+1,j}-H_{ij} \\ +y_{ij}\triangleq F_{i,j+1}-H_{ij} & \tilde{y}_{ij}\triangleq \tilde{F}_{i,j+1}-H_{ij} \end{array}\right. \] We can transform Eq.~(\ref{eq:ae86}) to @@ -311,7 +324,7 @@ the same anti-diagonal in one inner loop. It also simplifies banded alignment, which would be difficult with striped vectorization~\citep{Farrar:2007hs}. On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the initial -values in the diagonal-antidiagonal formuation is +values in the diagonal-antidiagonal formuation are \[ \left\{\begin{array}{l} x_{r-1,-1}=y_{r-1,r}=-q-e\\ @@ -330,6 +343,13 @@ r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\til \] These can be derived from the initial values for Eq.~(\ref{eq:ae86}). +When performing global alignment, we do not need to compute $H_{rt}$ in each cell. +We use 16-way vectorization throughout the alignment process. When extending +alignments from ends of chains, we need to find the cell $(r,t)$ where $H_{rt}$ +reaches the maximum. We resort to 4-way vectorization to compute +$H_{rt}=H_{r-1,t}+u_{rt}$. Because this computation is simple, +Eq.~(\ref{eq:suzuki}) is still the dominant performance bottleneck. + In practice, our 16-way vectorized implementation of global alignment is three times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without banding, our implementation is slower than Edlib~\citep{Sosic:2017aa}, but with @@ -397,7 +417,7 @@ p/2 & \mbox{if $T[i+1,i+3]$ is ${\tt GTC}$ or ${\tt GTT}$} \\ p & \mbox{otherwise} \end{array}\right.\] where $T[i,j]$ extracts a substring of $T$ between $i$ and $j$ inclusively. -$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukayotic +$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent Eukaryotic splicing signal ${\tt GT[C/T]}$ with $p/2$~\citep{Irimia:2008aa}. Similarly, \[a(i)=\left\{\begin{array}{ll} 0 & \mbox{if $T[i-2,i]$ is ${\tt CAG}$ or ${\tt TAG}$} \\ @@ -424,13 +444,13 @@ alignment. \subsection{Aligning short paired-end reads} -During chainging, minimap2 takes a pair of reads as one fragment with a gap of +During chaining, minimap2 takes a pair of reads as one fragment with a gap of unknown length in the middle. It applies a normal gap cost between seeds on the same read but is a more permissive gap cost between seeds on different reads. -More precisely, the gap cost during chaining is: +More precisely, the gap cost during chaining is ($l\not=0$): \[ \gamma_c(l)=\left\{\begin{array}{ll} -0.01\cdot\bar{w}\cdot l+0.5\log_2 l & \mbox{if two seeds on the same read} \\ +0.01\cdot\bar{w}\cdot |l|+0.5\log_2 |l| & \mbox{if two seeds on the same read} \\ \min\{0.01\cdot\bar{w}\cdot|l|,\log_2|l|\} & \mbox{otherwise} \end{array}\right. \] @@ -443,6 +463,14 @@ consistent paired-end alignments. \section{Results} +Minimap2 is implemented in the C programming language and comes with APIs in +both C and Python. It is distributed under the MIT license, free to both +commercial and academic uses. Minimap2 uses the same base algorithm for all +applications, but it has to apply different sets of parameters depending on +input data types. Similar to BWA-MEM, minimap2 introduces `presets' that +modify multiple parameters with a simple invokation. Detailed settings +and command-line options can be found in the minimap2 manpage. + \subsection{Aligning long genomic reads}\label{sec:long-genomic} \begin{figure}[!tb] @@ -450,19 +478,22 @@ consistent paired-end alignments. \includegraphics[width=.5\textwidth]{roc-color.pdf} \caption{Evaluation on aligning simulated reads. Simulated reads were mapped to the primary assembly of human genome GRCh38. A read is considered correctly -mapped if the true position overlaps with the best mapping position by 10\% of -the read length. Read alignments are sorted by mapping quality in the -descending order. For each mapping quality threshold, the fraction of -alignments with mapping quality above the threshold and their error rate are +mapped if its longest alignment overlaps with the true interval, and the +overlap length is $\ge$10\% of the true interval length. Read alignments are +sorted by mapping quality in the descending order. For each mapping quality +threshold, the fraction of alignments (out of the number of input reads) with +mapping quality above the threshold and their error rate are plotted along the curve. (a) long-read alignment evaluation. 33,088 $\ge$1000bp reads were simulated using pbsim~\citep{Ono:2013aa} with error profile sampled from file `m131017\_060208\_42213\_*.1.*' downloaded at \href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length is 11,628. Aligners were run under the default setting for SMRT reads. -(b) short-read alignment evaluation. 10 million pairs of 150bp reads were -simulated using mason2~\citep{Holtgrewe:2010aa} with option -`\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under the -default setting except for changing the maximum fragment length to +Kart outputted all alignments at mapping quality 60, so is not shown in the +figure. It mapped nearly all reads with 4.1\% of alignments being wrong, less +accurate than others. (b) short-read alignment evaluation. 10 million pairs of +150bp reads were simulated using mason2~\citep{Holtgrewe:2010aa} with option +`\mbox{--illumina-prob-mismatch-scale 2.5}'. Short-read aligners were run under +the default setting except for changing the maximum fragment length to 800bp.}\label{fig:eval} \end{figure} @@ -480,11 +511,11 @@ higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and NGMLR provide better mapping quality estimate: they rarely give repetitive hits high mapping quality. Apparently, other aligners may occasionally miss close suboptimal hits and be overconfident in wrong mappings. -On run time, minialign is slightly faster than minimap2 and Kart. They are over -30 times faster than the rest. Minimap2 consumed 6.1GB memory at the peak, -more than BWA-MEM but less than others. +On run time, minimap2 took 200 CPU seconds, comparable to minialign and Kart, and is over +30 times faster than the rest. Minimap2 consumed 6.8GB memory at the peak, +more than BWA-MEM (5.4GB), similar to NGMLR and less than others. -On real human SMRT reads, the relative performance and sensitivity of +On real human SMRT reads, the relative performance and fraction of mapped reads reported by these aligners are broadly similar to the metrics on simulated data. We are unable to provide a good estimate of mapping error rate due to the lack of the truth. On ONT $\sim$100kb human reads~\citep{Jain128835}, BWA-MEM failed. @@ -526,7 +557,7 @@ Peak RAM (GByte) & 8.9 & 14.5 & 3.2 & 29.2\vspace{1em}\\ \% approx. introns & 91.8\% & 96.9\% & 92.5\% & 82.4\% \\ \botrule \end{tabular} -}{Mouse reads (AC:SRR5286960) were mapped to the primary assembly of mouse +}{Mouse reads (AC:SRR5286960; R9.4 chemistry) were mapped to the primary assembly of mouse genome GRCm38 with the following tools and command options: minimap2 (`-ax splice'); GMAP (`-n 0 --min-intronlength 30 --cross-species'); SpAln (`-Q7 -LS -S3'); STARlong (according to @@ -614,8 +645,9 @@ Minimap2 retains minimap's functionality to find overlaps between long reads and to search against large multi-species databases such as \emph{nt} from NCBI. Minimap2 can also align similar genomes or different assemblies of the same species. It took 7 wall-clock minutes over 8 CPU cores to align a human -SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times faster -MUMmer4~\citep{Kurtz:2004zr}. +SMRT assembly (AC:GCA\_001297185.1) to GRCh38. QUAST~\citep{Gurevich:2013aa} +v5.0 will use minimap2 to evaluate the quality of assemblies against a +reference genome. \section{Discussions} @@ -635,7 +667,7 @@ chaining alone is more accurate than all the other long-read mappers in Fig.~\ref{fig:eval}a (data not shown). This accuracy helps to reduce downstream base-level alignment of candidate chains, which is still times slower than chaining even with the Suzuki-Kasahara improvement. In addition, taking a -general form, minimap2 chaining can be adapted to non-typical data types such +general form, minimap2 chaining can be adapted to non-typical data types such as spliced reads and multiple reads per fragment. This gives us the opportunity to extend the same base algorithm to a variety of use cases. @@ -647,8 +679,9 @@ k-mers with a hash table instead. Such fixed-length seeds are inferior to variable-length seeds in theory, but can be computed much more efficiently in practice. When a query sequence has multiple seed hits, we can afford to skip highly repetitive seeds without affecting the final accuracy. This further -alleviates the concern with the uniqueness of seeds. Hash table is the ideal -data structure for mapping long query sequences. +alleviates the concern with the seeding uniqueness. At the same time, at low +sequence identity, it is rare to see long seeds anyway. Hash table is the ideal +data structure for mapping long noisy sequences. \section*{Acknowledgements} We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their