backup manuscript

This commit is contained in:
Heng Li 2017-08-24 22:05:14 +08:00
parent b8bdac7e64
commit 240f6caaff
3 changed files with 129 additions and 80 deletions

View File

@ -94,7 +94,7 @@ while (file.readline(buf) >= 0) {
ori_qlen = parseInt(t[1]); ori_qlen = parseInt(t[1]);
} else { // SAM } else { // SAM
var flag = parseInt(t[1]); var flag = parseInt(t[1]);
if (flag & 4) continue; if ((flag & 4) || t[2] == '*' || t[5] == '*') continue;
if (flag & 0x100) { if (flag & 0x100) {
++n_2nd; ++n_2nd;
continue; continue;

View File

@ -251,3 +251,11 @@
Title = {A cross-species alignment tool {(CAT)}}, Title = {A cross-species alignment tool {(CAT)}},
Volume = {8}, Volume = {8},
Year = {2007}} Year = {2007}}
@article{Farrar:2007hs,
Author = {Farrar, Michael},
Journal = {Bioinformatics},
Pages = {156-61},
Title = {{Striped Smith-Waterman speeds database searches six times over other SIMD implementations}},
Volume = {23},
Year = {2007}}

View File

@ -49,16 +49,24 @@ Single Molecule Real-Time (SMRT) sequencing technology and Oxford Nanopore
technologies (ONT) produce reads over 10kbp in length at an error rate technologies (ONT) produce reads over 10kbp in length at an error rate
$\sim$15\%. Several aligners have been developed for such $\sim$15\%. Several aligners have been developed for such
data~\citep{Chaisson:2012aa,Li:2013aa,Liu:2016ab,Sovic:2016aa,Liu:2017aa,Lin:2017aa,Sedlazeck169557}. data~\citep{Chaisson:2012aa,Li:2013aa,Liu:2016ab,Sovic:2016aa,Liu:2017aa,Lin:2017aa,Sedlazeck169557}.
They are usually five times as slow as mainstream short-read Most of them were five times as slow as mainstream short-read
aligners~\citep{Langmead:2012fk,Li:2013aa}. We speculated there could be aligners~\citep{Langmead:2012fk,Li:2013aa} in terms of the number of bases
substantial room for speedup on the thought that 10kb long sequences should be mapped per second. We speculated there could be substantial room for speedup on
easier to map than 100bp reads because we can more effectively skip repetitive the thought that 10kb long sequences should be easier to map than 100bp reads
regions, which are often the bottleneck of short-read alignment. We confirmed because we can more effectively skip repetitive regions, which are often the
our speculation by achieving approximate mapping 50 times faster than bottleneck of short-read alignment. We confirmed our speculation by achieving
BWA-MEM~\citep{Li:2016aa}. \citet{Suzuki:2016} extended our work with a fast approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
and novel algorithm on generating base-level alignment, which in turn inspired \citet{Suzuki:2016} extended our work with a fast and novel algorithm on
us to develop minimap2 towards higher accuracy and more practical generating base-level alignment, which in turn inspired us to develop minimap2
functionality. towards higher accuracy and more practical functionality.
Both SMRT and ONT have been applied to sequence spliced mRNAs (RNA-seq). While
traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
optimized for long noisy sequence reads and are tens of times slower than
dedicated long-read aligners. When developing minimap2 initially for aligning
genomic DNA only, we realized minor modifications could make it competitive for
aligning mRNAs as well. Minimap2 is a first RNA-seq aligner specifically
designed for long noisy reads.
\begin{methods} \begin{methods}
\section{Methods} \section{Methods}
@ -83,7 +91,8 @@ spliced alignment.
An \emph{anchor} is a 3-tuple $(x,y,w)$, indicating interval $[x-w+1,x]$ on the An \emph{anchor} is a 3-tuple $(x,y,w)$, indicating interval $[x-w+1,x]$ on the
reference matching interval $[y-w+1,y]$ on the query. Given a list of anchors reference matching interval $[y-w+1,y]$ on the query. Given a list of anchors
sorted by ending reference position $x$, let $f(i)$ be the maximal chaining sorted by ending reference position $x$, let $f(i)$ be the maximal chaining
score up to the $i$-th anchor in the list. $f(i)$ can be calculated with DP: score up to the $i$-th anchor in the list. $f(i)$ can be calculated with
dynamic programming:
\begin{equation}\label{eq:chain} \begin{equation}\label{eq:chain}
f(i)=\max\big\{\max_{i>j\ge 1} \{ f(j)+\alpha(j,i)-\beta(j,i) \},w_i\big\} f(i)=\max\big\{\max_{i>j\ge 1} \{ f(j)+\alpha(j,i)-\beta(j,i) \},w_i\big\}
\end{equation} \end{equation}
@ -161,11 +170,15 @@ gap cost~\citep{Gotoh:1982aa,Altschul:1986aa}.
\subsubsection{Suzuki's formulation} \subsubsection{Suzuki's formulation}
To efficiently align long sequences, minimap2 did not directly implement When we allow gaps longer than several hundred base pairs, nucleotide-level
Eq.~(\ref{eq:ae86}). It instead adopted a difference-based alignment is much slower than chaining. SSE acceleration is critical to the
formulation first proposed by \citet{Wu:1996aa} and later adapted by performance of minimap2. Traditional SSE implementations~\citep{Farrar:2007hs}
\citet{Suzuki:2016} for affine gap cost. In case of 2-piece affine gap cost in based on Eq.~(\ref{eq:ae86}) can achieve 16-way parallelization for short
Eq.~(\ref{eq:2-piece}), define sequences, but only 4-way parallelization when the peak alignment score reaches
32767. Long sequence alignment may exceed this threshold. Inspired by
\citet{Wu:1996aa} and the following work, \citet{Suzuki:2016} proposed a
difference-based formulation that lifted this limitation. In case of 2-piece
gap cost, define
\[ \[
\left\{\begin{array}{ll} \left\{\begin{array}{ll}
u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\ u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
@ -186,9 +199,9 @@ y_{ij}&=&\max\{0,y_{i,j-1}+u_{i,j-1}-z_{ij}+q\}-q-e\\
\tilde{y}_{ij}&=&\max\{0,\tilde{y}_{i,j-1}+u_{i,j-1}-z_{ij}+\tilde{q}\}-\tilde{q}-\tilde{e} \tilde{y}_{ij}&=&\max\{0,\tilde{y}_{i,j-1}+u_{i,j-1}-z_{ij}+\tilde{q}\}-\tilde{q}-\tilde{e}
\end{array}\right. \end{array}\right.
\end{equation} \end{equation}
where $z_{ij}$ is a temporary variable that does not need to be stored. where $z_{ij}$ is a temporary variable that does not need to be stored. An
This is Suzuki's formulation for 2-piece affine gap cost. An important property important property of Eq.~(\ref{eq:suzuki}) is that all values are bounded. To
of this formulation is that all values are bounded. To see that, see that,
\[ \[
x_{ij}=E_{i+1,j}-H_{ij}=\max\{-q,E_{ij}-H_{ij}\}-e x_{ij}=E_{i+1,j}-H_{ij}=\max\{-q,E_{ij}-H_{ij}\}-e
\] \]
@ -229,10 +242,11 @@ y_{rt}&=&\max\{0,y_{r-1,t}+u_{r-1,t}-z_{rt}+q\}-q-e\\
\end{equation*} \end{equation*}
In this formulation, cells with the same diagonal index $r$ are independent of In this formulation, cells with the same diagonal index $r$ are independent of
each other. This allows us to fully vectorize the computation of all cells on each other. This allows us to fully vectorize the computation of all cells on
the same anti-diagonal in one inner loop. the same anti-diagonal in one inner loop. It also simplifies banded alignment,
which would be difficult with striped vectorization~\citep{Farrar:2007hs}.
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
condition of this equation in the diagonal-antidiagonal coordinate is condition of the equation above is
\[ \[
\left\{\begin{array}{l} \left\{\begin{array}{l}
x_{r-1,-1}=y_{r-1,r}=-q-e\\ x_{r-1,-1}=y_{r-1,r}=-q-e\\
@ -249,6 +263,7 @@ r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\til
-\tilde{e} & (r>\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil) -\tilde{e} & (r>\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil)
\end{array}\right. \end{array}\right.
\] \]
These can be derived from the initial conditions of Eq.~(\ref{eq:ae86}).
In practice, our 16-way vectorized implementation of global alignment is three In practice, our 16-way vectorized implementation of global alignment is three
times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without
@ -298,8 +313,10 @@ q+l\cdot e & (l>0) \\
\end{array}\right. \end{array}\right.
\] \]
In alignment, a deletion no shorter than $\lceil(\tilde{q}-q)/e\rceil$ is In alignment, a deletion no shorter than $\lceil(\tilde{q}-q)/e\rceil$ is
regarded as an intron, which pays no cost to gap extensions. Minimap2 further regarded as an intron, which pays no cost to gap extensions.
introduces reference-dependent cost to penalize non-canonical splicing:
To pinpoint precise splicing junctions, minimap2 introduces reference-dependent
cost to penalize non-canonical splicing:
\begin{equation}\label{eq:splice} \begin{equation}\label{eq:splice}
\left\{\begin{array}{l} \left\{\begin{array}{l}
H_{ij} = \max\{H_{i-1,j-1}+s(i,j),E_{ij},F_{ij},\tilde{E}_{ij}-a(i)\}\\ H_{ij} = \max\{H_{i-1,j-1}+s(i,j),E_{ij},F_{ij},\tilde{E}_{ij}-a(i)\}\\
@ -312,10 +329,9 @@ Let $T$ be the reference sequence. $d(i)$ is the cost of a non-canonical donor
site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a postive number $p$ site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a postive number $p$
otherwise. Similarly, $a(i)$ is the cost of a non-canonical acceptor site, which otherwise. Similarly, $a(i)$ is the cost of a non-canonical acceptor site, which
takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. Eq.~(\ref{eq:splice}) is takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. Eq.~(\ref{eq:splice}) is
almost the same as the equation used by EXALIN~\citep{Zhang:2006aa} and almost equivalent to the equation used by EXALIN~\citep{Zhang:2006aa} except
CAT~\citep{Li:2007aa} except that we allow insertions immediately followed by that we allow insertions immediately followed by deletions and vice versa; in
deletions and vice versa; in addition, we use Suzuki's diagonal formulation in addition, we use Suzuki's diagonal formulation in actual implementation.
actual implementation.
%Given that $d_i$ and $a_i$ %Given that $d_i$ and $a_i$
%are a function of the reference sequence, it is possible to incorporate %are a function of the reference sequence, it is possible to incorporate
@ -345,60 +361,66 @@ alignment.
\centering \centering
\includegraphics[width=.5\textwidth]{roc-color.pdf} \includegraphics[width=.5\textwidth]{roc-color.pdf}
\caption{Evaluation on simulated SMRT reads aligned against human genome \caption{Evaluation on simulated SMRT reads aligned against human genome
GRCh38. (a) ROC-like curve. (b) Accumulative mapping error rate as a function GRCh38. (a) ROC-like curve. Alignments are sorted by mapping quality in the
of mapping quality. 33,088 $\ge$1000bp reads were simulated using descending order. For each mapping quality threshold, the fraction of
pbsim~\citep{Ono:2013aa} with error profile sampled from file alignments with mapping quality above the threshold and their error rate
`m131017\_060208\_42213\_*.1.*' downloaded at are plotted. (b) Accumulative mapping error rate as a function of mapping
\href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length is quality. 33,088 $\ge$1000bp reads were simulated using pbsim~\citep{Ono:2013aa}
11,628. A read is considered correctly mapped if the true position overlaps with error profile sampled from file `m131017\_060208\_42213\_*.1.*' downloaded
at \href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length
is 11,628. A read is considered correctly mapped if the true position overlaps
with the best mapping position by 10\% of the read length. All aligners were with the best mapping position by 10\% of the read length. All aligners were
run under the default setting for SMRT reads.}\label{fig:eval} run under the default setting for SMRT reads. Kart outputted all alignments at
mapping quality 60, so is not shown in the figure. It mapped nearly all reads
with 4.1\% of alignments being wrong, less accurate than others.}\label{fig:eval}
\end{figure} \end{figure}
As a sanity check, we evaluated minimap2 on simulated human reads along with As a sanity check, we evaluated minimap2 on simulated human reads along with
BLASR~\citep{Chaisson:2012aa}, BLASR~(v1.MC.rc64; \citealp{Chaisson:2012aa}),
BWA-MEM~\citep{Li:2013aa}, BWA-MEM~(v0.7.15; \citealp{Li:2013aa}),
GraphMap~\citep{Sovic:2016aa}, GraphMap~(v0.5.2; \citealp{Sovic:2016aa}),
minialign~\citep{Suzuki:2016} and Kart~(v2.2.5; \citealp{Lin:2017aa}),
NGMLR~\citep{Sedlazeck169557}. We excluded rHAT~\citep{Liu:2016ab}, minialign~(v0.5.3; \citealp{Suzuki:2016}) and
LAMSA~\citep{Liu:2017aa} and Kart~\citep{Lin:2017aa} because they either NGMLR~(v0.2.5; \citealp{Sedlazeck169557}). We excluded rHAT~\citep{Liu:2016ab}
and LAMSA~\citep{Liu:2017aa} because they either
crashed or produced malformatted output. In this evaluation, Minimap2 has crashed or produced malformatted output. In this evaluation, Minimap2 has
higher power to distinguish unique and repetitive hits, and achieves overall higher power to distinguish unique and repetitive hits, and achieves overall
higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
even if we skip DP-based alignment (data not shown), suggesting chaining alone even if we skip DP-based alignment (data not shown), confirming chaining alone
is sufficient to achieve high accuracy for approximate mapping. Minimap2 and is sufficient to achieve high accuracy for approximate mapping. Minimap2 and
NGMLR provide better mapping quality estimate: they rarely give repetitive hits NGMLR provide better mapping quality estimate: they rarely give repetitive hits
high mapping quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may high mapping quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may
occasionally miss close suboptimal hits and be overconfident in wrong mappings. occasionally miss close suboptimal hits and be overconfident in wrong mappings.
On run time, minialign is slightly faster than minimap2. They are over 30 times On run time, minialign is slightly faster than minimap2 and Kart. They are over
faster than the rest. Minimap2 consumed 6.1GB memory at the peak, more than 30 times faster than the rest. Minimap2 consumed 6.1GB memory at the peak,
BWA-MEM but less than others. more than BWA-MEM but less than others.
On real human SMRT reads, the relative performance and sensitivity of On real human SMRT reads, the relative performance and sensitivity of
these aligners are broadly similar to those on simulated data. We are unable to these aligners are broadly similar to the metrics on simulated data. We are
provide a good estimate of mapping error rate due to the lack of the truth. On unable to provide a good estimate of mapping error rate due to the lack of the
ONT ultra-long human reads~\citep{Jain128835}, BWA-MEM failed. Minialign and truth. On ONT $\sim$100kb human reads~\citep{Jain128835}, BWA-MEM failed.
minimap2 are over 70 times faster than others. We have also examined tens of Kart, minialign and minimap2 are over 70 times faster than others. We have also
$\ge$100bp INDELs in IGV~\citep{Robinson:2011aa} and can confirm the examined tens of $\ge$100bp INDELs in IGV~\citep{Robinson:2011aa} and can
observation by~\citet{Sedlazeck169557} that BWA-MEM often breaks them into confirm the observation by~\citet{Sedlazeck169557} that BWA-MEM often breaks
shorter gaps. Minimap2 does not have this issue. them into shorter gaps. The issue is much alleviated with minimap2, thanks
to the 2-piece affine gap cost.
\subsection{Aligning spliced reads} \subsection{Aligning spliced reads}
We first evaluated minimap2 on SIRV control data~(AC:SRR5286959; We evaluated minimap2 on SIRV control data~(AC:SRR5286959;
\citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,916 \citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,916
introns, 93.0\% of which are precise. We examined wrongly predicted introns and introns from 11\,017 reads. 93.0\% of splice juctions are precise. We examined
found the majority were caused by clustered splicing signals (e.g. two adjacent wrongly predicted junctions and found the majority were caused by clustered
${\tt GT}$ sites). When INDEL sequencing errors are frequent, it is difficult splicing signals (e.g. two adjacent ${\tt GT}$ sites). When INDEL sequencing
to found precise splicing sites in this case. If we allow up to 10bp distance errors are frequent, it is difficult to find precise splicing sites in this
from true splicing sites, 98.4\% of aligned introns are approximately correct. case. If we allow up to 10bp distance from true splicing sites, 98.4\% of
Given this observation, we might be able to improve boundary detection by aligned introns are approximately correct. Given this observation, we might be
initializing $d(\cdot)$ and $a(\cdot)$ in Eq.~(\ref{eq:splice}) with able to improve boundary detection by initializing $d(\cdot)$ and $a(\cdot)$ in
position-specific scoring matrices or more sophisticated models. We have Eq.~(\ref{eq:splice}) with position-specific scoring matrices or more
not tried this approach. sophisticated models. We have not tried this approach.
\begin{table}[!tb] \begin{table}[!tb]
\processtable{Evaluation of splicing accuracy on 2D ONT reads} \processtable{Evaluation of junction accuracy on 2D ONT reads}
{\footnotesize\label{tab:intron} {\footnotesize\label{tab:intron}
\begin{tabular}{p{3.1cm}rrrr} \begin{tabular}{p{3.1cm}rrrr}
\toprule \toprule
@ -431,26 +453,45 @@ We next aligned real mouse reads~\citep{Byrne:2017aa} with GMAP~(v2017-06-20;
\citealp{Wu:2005vn}), minimap2, SpAln~(v2.3.1; \citealp{Iwata:2012aa}) and \citealp{Wu:2005vn}), minimap2, SpAln~(v2.3.1; \citealp{Iwata:2012aa}) and
STAR~(v2.5.3a; \citealp{Dobin:2013kx}). In general, minimap2 is more STAR~(v2.5.3a; \citealp{Dobin:2013kx}). In general, minimap2 is more
consistent with existing annotations (Table~\ref{tab:intron}): it finds consistent with existing annotations (Table~\ref{tab:intron}): it finds
more splicing with a higher percentage being exactly or approximately correct. more junctions with a higher percentage being exactly or approximately correct.
We noted that GMAP and SpAln have not been optimized for noisy reads. We have Minimap2 is over 40 times faster than GMAP and SpAln. While STAR is close to
tried different settings, but their developers should be able to improve the minimap2 in speed, it does not work well with noisy reads. We have also
accuracy further. On run time, minimap2 is over 40 times faster than GMAP and evaluated spliced aligners on public Iso-Seq data (human Alzheimer brain
SpAln. While STAR is close to minimap2 in speed, it does not work well with from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The observation
noisy reads. is similar: minimap2 is faster at higher junction accuracy.
\section{Discussions} We noted that GMAP and SpAln have not been optimized for noisy reads. We are
showing the best setting we have experimented, but their developers should be
able to improve their accuracy further.
Minialign and minimap2 are fast because a) with chaining, they can quickly %\begin{table}[!tb]
filter out most false seed hits~\citep{Li:2016aa} and reduce unsuccessful but %\processtable{Evaluation of junction accuracy on SMRT Iso-Seq reads}
costly DP-based alignments; b) they implemented so far the fastest DP-based %{\footnotesize
alignment algorithm for long sequences~\citep{Suzuki:2016}. It is possible to %\begin{tabular}{lrrrr}
further accelerate minimap2 with a few other tweaks such as adaptive %\toprule
banding~\citep{Suzuki130633} or incremental banding. %& GMAP & minimap2 & SpAln & STAR\\
%\midrule
%Run time (CPU min) & & 243 & 2\,352 & 1\,647 \\
%\# aligned reads & & 1\,123\,025 & 1\,094\,092 & 682\,452\\
%\# chimeric alignments & & 33\,091 & 0 & 0\\
%\# non-spliced alignments & & 339\,081 & 291\,447 & 272\,536\vspace{1em}\\
%\# aligned introns & & 9\,071\,755 & 9\,208\,564 & 3\,029\,121 \\
%\# novel introns & & 42\,773 & 82\,230 & 17\,791 \\
%\% exact introns & & 94.9\% & 91.7\% & 84.7\% \\
%\% approx. introns&& 96.9\% & 93.4\% & 93.8\% \\
%\botrule
%\end{tabular}
%}{}
%\end{table}
In addition to reference-based read mapping, minimap2 inherits minimap's
functionality to search against huge multi-species databases and to find read \section{Conclusion}
overlaps. On a few test data sets, minimap2 appears to yield slightly better
miniasm assembly~\citep{Li:2016aa}. Minimap2 can also align closely related Minimap2 is a fast, accurate and versatile aligner for long nucleotide
sequences. In addition to reference-based read mapping, minimap2 inherits
minimap's functionality to search against huge multi-species databases and to
find read overlaps. On a few test data sets, minimap2 appears to yield slightly
better miniasm assembly~\citep{Li:2016aa}. Minimap2 can also align similar
genomes or different assemblies of the same species. However, full-genome genomes or different assemblies of the same species. However, full-genome
alignment is an intricate research topic. More thorough evaluations would be alignment is an intricate research topic. More thorough evaluations would be
necessary to justify the use of minimap2 for such applications. necessary to justify the use of minimap2 for such applications.