backup manuscript
This commit is contained in:
parent
b8bdac7e64
commit
240f6caaff
|
|
@ -94,7 +94,7 @@ while (file.readline(buf) >= 0) {
|
||||||
ori_qlen = parseInt(t[1]);
|
ori_qlen = parseInt(t[1]);
|
||||||
} else { // SAM
|
} else { // SAM
|
||||||
var flag = parseInt(t[1]);
|
var flag = parseInt(t[1]);
|
||||||
if (flag & 4) continue;
|
if ((flag & 4) || t[2] == '*' || t[5] == '*') continue;
|
||||||
if (flag & 0x100) {
|
if (flag & 0x100) {
|
||||||
++n_2nd;
|
++n_2nd;
|
||||||
continue;
|
continue;
|
||||||
|
|
|
||||||
|
|
@ -251,3 +251,11 @@
|
||||||
Title = {A cross-species alignment tool {(CAT)}},
|
Title = {A cross-species alignment tool {(CAT)}},
|
||||||
Volume = {8},
|
Volume = {8},
|
||||||
Year = {2007}}
|
Year = {2007}}
|
||||||
|
|
||||||
|
@article{Farrar:2007hs,
|
||||||
|
Author = {Farrar, Michael},
|
||||||
|
Journal = {Bioinformatics},
|
||||||
|
Pages = {156-61},
|
||||||
|
Title = {{Striped Smith-Waterman speeds database searches six times over other SIMD implementations}},
|
||||||
|
Volume = {23},
|
||||||
|
Year = {2007}}
|
||||||
|
|
|
||||||
199
tex/minimap2.tex
199
tex/minimap2.tex
|
|
@ -49,16 +49,24 @@ Single Molecule Real-Time (SMRT) sequencing technology and Oxford Nanopore
|
||||||
technologies (ONT) produce reads over 10kbp in length at an error rate
|
technologies (ONT) produce reads over 10kbp in length at an error rate
|
||||||
$\sim$15\%. Several aligners have been developed for such
|
$\sim$15\%. Several aligners have been developed for such
|
||||||
data~\citep{Chaisson:2012aa,Li:2013aa,Liu:2016ab,Sovic:2016aa,Liu:2017aa,Lin:2017aa,Sedlazeck169557}.
|
data~\citep{Chaisson:2012aa,Li:2013aa,Liu:2016ab,Sovic:2016aa,Liu:2017aa,Lin:2017aa,Sedlazeck169557}.
|
||||||
They are usually five times as slow as mainstream short-read
|
Most of them were five times as slow as mainstream short-read
|
||||||
aligners~\citep{Langmead:2012fk,Li:2013aa}. We speculated there could be
|
aligners~\citep{Langmead:2012fk,Li:2013aa} in terms of the number of bases
|
||||||
substantial room for speedup on the thought that 10kb long sequences should be
|
mapped per second. We speculated there could be substantial room for speedup on
|
||||||
easier to map than 100bp reads because we can more effectively skip repetitive
|
the thought that 10kb long sequences should be easier to map than 100bp reads
|
||||||
regions, which are often the bottleneck of short-read alignment. We confirmed
|
because we can more effectively skip repetitive regions, which are often the
|
||||||
our speculation by achieving approximate mapping 50 times faster than
|
bottleneck of short-read alignment. We confirmed our speculation by achieving
|
||||||
BWA-MEM~\citep{Li:2016aa}. \citet{Suzuki:2016} extended our work with a fast
|
approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
|
||||||
and novel algorithm on generating base-level alignment, which in turn inspired
|
\citet{Suzuki:2016} extended our work with a fast and novel algorithm on
|
||||||
us to develop minimap2 towards higher accuracy and more practical
|
generating base-level alignment, which in turn inspired us to develop minimap2
|
||||||
functionality.
|
towards higher accuracy and more practical functionality.
|
||||||
|
|
||||||
|
Both SMRT and ONT have been applied to sequence spliced mRNAs (RNA-seq). While
|
||||||
|
traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
|
||||||
|
optimized for long noisy sequence reads and are tens of times slower than
|
||||||
|
dedicated long-read aligners. When developing minimap2 initially for aligning
|
||||||
|
genomic DNA only, we realized minor modifications could make it competitive for
|
||||||
|
aligning mRNAs as well. Minimap2 is a first RNA-seq aligner specifically
|
||||||
|
designed for long noisy reads.
|
||||||
|
|
||||||
\begin{methods}
|
\begin{methods}
|
||||||
\section{Methods}
|
\section{Methods}
|
||||||
|
|
@ -83,7 +91,8 @@ spliced alignment.
|
||||||
An \emph{anchor} is a 3-tuple $(x,y,w)$, indicating interval $[x-w+1,x]$ on the
|
An \emph{anchor} is a 3-tuple $(x,y,w)$, indicating interval $[x-w+1,x]$ on the
|
||||||
reference matching interval $[y-w+1,y]$ on the query. Given a list of anchors
|
reference matching interval $[y-w+1,y]$ on the query. Given a list of anchors
|
||||||
sorted by ending reference position $x$, let $f(i)$ be the maximal chaining
|
sorted by ending reference position $x$, let $f(i)$ be the maximal chaining
|
||||||
score up to the $i$-th anchor in the list. $f(i)$ can be calculated with DP:
|
score up to the $i$-th anchor in the list. $f(i)$ can be calculated with
|
||||||
|
dynamic programming:
|
||||||
\begin{equation}\label{eq:chain}
|
\begin{equation}\label{eq:chain}
|
||||||
f(i)=\max\big\{\max_{i>j\ge 1} \{ f(j)+\alpha(j,i)-\beta(j,i) \},w_i\big\}
|
f(i)=\max\big\{\max_{i>j\ge 1} \{ f(j)+\alpha(j,i)-\beta(j,i) \},w_i\big\}
|
||||||
\end{equation}
|
\end{equation}
|
||||||
|
|
@ -161,11 +170,15 @@ gap cost~\citep{Gotoh:1982aa,Altschul:1986aa}.
|
||||||
|
|
||||||
\subsubsection{Suzuki's formulation}
|
\subsubsection{Suzuki's formulation}
|
||||||
|
|
||||||
To efficiently align long sequences, minimap2 did not directly implement
|
When we allow gaps longer than several hundred base pairs, nucleotide-level
|
||||||
Eq.~(\ref{eq:ae86}). It instead adopted a difference-based
|
alignment is much slower than chaining. SSE acceleration is critical to the
|
||||||
formulation first proposed by \citet{Wu:1996aa} and later adapted by
|
performance of minimap2. Traditional SSE implementations~\citep{Farrar:2007hs}
|
||||||
\citet{Suzuki:2016} for affine gap cost. In case of 2-piece affine gap cost in
|
based on Eq.~(\ref{eq:ae86}) can achieve 16-way parallelization for short
|
||||||
Eq.~(\ref{eq:2-piece}), define
|
sequences, but only 4-way parallelization when the peak alignment score reaches
|
||||||
|
32767. Long sequence alignment may exceed this threshold. Inspired by
|
||||||
|
\citet{Wu:1996aa} and the following work, \citet{Suzuki:2016} proposed a
|
||||||
|
difference-based formulation that lifted this limitation. In case of 2-piece
|
||||||
|
gap cost, define
|
||||||
\[
|
\[
|
||||||
\left\{\begin{array}{ll}
|
\left\{\begin{array}{ll}
|
||||||
u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
|
u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
|
||||||
|
|
@ -186,9 +199,9 @@ y_{ij}&=&\max\{0,y_{i,j-1}+u_{i,j-1}-z_{ij}+q\}-q-e\\
|
||||||
\tilde{y}_{ij}&=&\max\{0,\tilde{y}_{i,j-1}+u_{i,j-1}-z_{ij}+\tilde{q}\}-\tilde{q}-\tilde{e}
|
\tilde{y}_{ij}&=&\max\{0,\tilde{y}_{i,j-1}+u_{i,j-1}-z_{ij}+\tilde{q}\}-\tilde{q}-\tilde{e}
|
||||||
\end{array}\right.
|
\end{array}\right.
|
||||||
\end{equation}
|
\end{equation}
|
||||||
where $z_{ij}$ is a temporary variable that does not need to be stored.
|
where $z_{ij}$ is a temporary variable that does not need to be stored. An
|
||||||
This is Suzuki's formulation for 2-piece affine gap cost. An important property
|
important property of Eq.~(\ref{eq:suzuki}) is that all values are bounded. To
|
||||||
of this formulation is that all values are bounded. To see that,
|
see that,
|
||||||
\[
|
\[
|
||||||
x_{ij}=E_{i+1,j}-H_{ij}=\max\{-q,E_{ij}-H_{ij}\}-e
|
x_{ij}=E_{i+1,j}-H_{ij}=\max\{-q,E_{ij}-H_{ij}\}-e
|
||||||
\]
|
\]
|
||||||
|
|
@ -229,10 +242,11 @@ y_{rt}&=&\max\{0,y_{r-1,t}+u_{r-1,t}-z_{rt}+q\}-q-e\\
|
||||||
\end{equation*}
|
\end{equation*}
|
||||||
In this formulation, cells with the same diagonal index $r$ are independent of
|
In this formulation, cells with the same diagonal index $r$ are independent of
|
||||||
each other. This allows us to fully vectorize the computation of all cells on
|
each other. This allows us to fully vectorize the computation of all cells on
|
||||||
the same anti-diagonal in one inner loop.
|
the same anti-diagonal in one inner loop. It also simplifies banded alignment,
|
||||||
|
which would be difficult with striped vectorization~\citep{Farrar:2007hs}.
|
||||||
|
|
||||||
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
|
On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
|
||||||
condition of this equation in the diagonal-antidiagonal coordinate is
|
condition of the equation above is
|
||||||
\[
|
\[
|
||||||
\left\{\begin{array}{l}
|
\left\{\begin{array}{l}
|
||||||
x_{r-1,-1}=y_{r-1,r}=-q-e\\
|
x_{r-1,-1}=y_{r-1,r}=-q-e\\
|
||||||
|
|
@ -249,6 +263,7 @@ r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\til
|
||||||
-\tilde{e} & (r>\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil)
|
-\tilde{e} & (r>\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil)
|
||||||
\end{array}\right.
|
\end{array}\right.
|
||||||
\]
|
\]
|
||||||
|
These can be derived from the initial conditions of Eq.~(\ref{eq:ae86}).
|
||||||
|
|
||||||
In practice, our 16-way vectorized implementation of global alignment is three
|
In practice, our 16-way vectorized implementation of global alignment is three
|
||||||
times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without
|
times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without
|
||||||
|
|
@ -298,8 +313,10 @@ q+l\cdot e & (l>0) \\
|
||||||
\end{array}\right.
|
\end{array}\right.
|
||||||
\]
|
\]
|
||||||
In alignment, a deletion no shorter than $\lceil(\tilde{q}-q)/e\rceil$ is
|
In alignment, a deletion no shorter than $\lceil(\tilde{q}-q)/e\rceil$ is
|
||||||
regarded as an intron, which pays no cost to gap extensions. Minimap2 further
|
regarded as an intron, which pays no cost to gap extensions.
|
||||||
introduces reference-dependent cost to penalize non-canonical splicing:
|
|
||||||
|
To pinpoint precise splicing junctions, minimap2 introduces reference-dependent
|
||||||
|
cost to penalize non-canonical splicing:
|
||||||
\begin{equation}\label{eq:splice}
|
\begin{equation}\label{eq:splice}
|
||||||
\left\{\begin{array}{l}
|
\left\{\begin{array}{l}
|
||||||
H_{ij} = \max\{H_{i-1,j-1}+s(i,j),E_{ij},F_{ij},\tilde{E}_{ij}-a(i)\}\\
|
H_{ij} = \max\{H_{i-1,j-1}+s(i,j),E_{ij},F_{ij},\tilde{E}_{ij}-a(i)\}\\
|
||||||
|
|
@ -312,10 +329,9 @@ Let $T$ be the reference sequence. $d(i)$ is the cost of a non-canonical donor
|
||||||
site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a postive number $p$
|
site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a postive number $p$
|
||||||
otherwise. Similarly, $a(i)$ is the cost of a non-canonical acceptor site, which
|
otherwise. Similarly, $a(i)$ is the cost of a non-canonical acceptor site, which
|
||||||
takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. Eq.~(\ref{eq:splice}) is
|
takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. Eq.~(\ref{eq:splice}) is
|
||||||
almost the same as the equation used by EXALIN~\citep{Zhang:2006aa} and
|
almost equivalent to the equation used by EXALIN~\citep{Zhang:2006aa} except
|
||||||
CAT~\citep{Li:2007aa} except that we allow insertions immediately followed by
|
that we allow insertions immediately followed by deletions and vice versa; in
|
||||||
deletions and vice versa; in addition, we use Suzuki's diagonal formulation in
|
addition, we use Suzuki's diagonal formulation in actual implementation.
|
||||||
actual implementation.
|
|
||||||
|
|
||||||
%Given that $d_i$ and $a_i$
|
%Given that $d_i$ and $a_i$
|
||||||
%are a function of the reference sequence, it is possible to incorporate
|
%are a function of the reference sequence, it is possible to incorporate
|
||||||
|
|
@ -345,60 +361,66 @@ alignment.
|
||||||
\centering
|
\centering
|
||||||
\includegraphics[width=.5\textwidth]{roc-color.pdf}
|
\includegraphics[width=.5\textwidth]{roc-color.pdf}
|
||||||
\caption{Evaluation on simulated SMRT reads aligned against human genome
|
\caption{Evaluation on simulated SMRT reads aligned against human genome
|
||||||
GRCh38. (a) ROC-like curve. (b) Accumulative mapping error rate as a function
|
GRCh38. (a) ROC-like curve. Alignments are sorted by mapping quality in the
|
||||||
of mapping quality. 33,088 $\ge$1000bp reads were simulated using
|
descending order. For each mapping quality threshold, the fraction of
|
||||||
pbsim~\citep{Ono:2013aa} with error profile sampled from file
|
alignments with mapping quality above the threshold and their error rate
|
||||||
`m131017\_060208\_42213\_*.1.*' downloaded at
|
are plotted. (b) Accumulative mapping error rate as a function of mapping
|
||||||
\href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length is
|
quality. 33,088 $\ge$1000bp reads were simulated using pbsim~\citep{Ono:2013aa}
|
||||||
11,628. A read is considered correctly mapped if the true position overlaps
|
with error profile sampled from file `m131017\_060208\_42213\_*.1.*' downloaded
|
||||||
|
at \href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length
|
||||||
|
is 11,628. A read is considered correctly mapped if the true position overlaps
|
||||||
with the best mapping position by 10\% of the read length. All aligners were
|
with the best mapping position by 10\% of the read length. All aligners were
|
||||||
run under the default setting for SMRT reads.}\label{fig:eval}
|
run under the default setting for SMRT reads. Kart outputted all alignments at
|
||||||
|
mapping quality 60, so is not shown in the figure. It mapped nearly all reads
|
||||||
|
with 4.1\% of alignments being wrong, less accurate than others.}\label{fig:eval}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
As a sanity check, we evaluated minimap2 on simulated human reads along with
|
As a sanity check, we evaluated minimap2 on simulated human reads along with
|
||||||
BLASR~\citep{Chaisson:2012aa},
|
BLASR~(v1.MC.rc64; \citealp{Chaisson:2012aa}),
|
||||||
BWA-MEM~\citep{Li:2013aa},
|
BWA-MEM~(v0.7.15; \citealp{Li:2013aa}),
|
||||||
GraphMap~\citep{Sovic:2016aa},
|
GraphMap~(v0.5.2; \citealp{Sovic:2016aa}),
|
||||||
minialign~\citep{Suzuki:2016} and
|
Kart~(v2.2.5; \citealp{Lin:2017aa}),
|
||||||
NGMLR~\citep{Sedlazeck169557}. We excluded rHAT~\citep{Liu:2016ab},
|
minialign~(v0.5.3; \citealp{Suzuki:2016}) and
|
||||||
LAMSA~\citep{Liu:2017aa} and Kart~\citep{Lin:2017aa} because they either
|
NGMLR~(v0.2.5; \citealp{Sedlazeck169557}). We excluded rHAT~\citep{Liu:2016ab}
|
||||||
|
and LAMSA~\citep{Liu:2017aa} because they either
|
||||||
crashed or produced malformatted output. In this evaluation, Minimap2 has
|
crashed or produced malformatted output. In this evaluation, Minimap2 has
|
||||||
higher power to distinguish unique and repetitive hits, and achieves overall
|
higher power to distinguish unique and repetitive hits, and achieves overall
|
||||||
higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
|
higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
|
||||||
even if we skip DP-based alignment (data not shown), suggesting chaining alone
|
even if we skip DP-based alignment (data not shown), confirming chaining alone
|
||||||
is sufficient to achieve high accuracy for approximate mapping. Minimap2 and
|
is sufficient to achieve high accuracy for approximate mapping. Minimap2 and
|
||||||
NGMLR provide better mapping quality estimate: they rarely give repetitive hits
|
NGMLR provide better mapping quality estimate: they rarely give repetitive hits
|
||||||
high mapping quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may
|
high mapping quality (Fig.~\ref{fig:eval}b). Apparently, other aligners may
|
||||||
occasionally miss close suboptimal hits and be overconfident in wrong mappings.
|
occasionally miss close suboptimal hits and be overconfident in wrong mappings.
|
||||||
On run time, minialign is slightly faster than minimap2. They are over 30 times
|
On run time, minialign is slightly faster than minimap2 and Kart. They are over
|
||||||
faster than the rest. Minimap2 consumed 6.1GB memory at the peak, more than
|
30 times faster than the rest. Minimap2 consumed 6.1GB memory at the peak,
|
||||||
BWA-MEM but less than others.
|
more than BWA-MEM but less than others.
|
||||||
|
|
||||||
On real human SMRT reads, the relative performance and sensitivity of
|
On real human SMRT reads, the relative performance and sensitivity of
|
||||||
these aligners are broadly similar to those on simulated data. We are unable to
|
these aligners are broadly similar to the metrics on simulated data. We are
|
||||||
provide a good estimate of mapping error rate due to the lack of the truth. On
|
unable to provide a good estimate of mapping error rate due to the lack of the
|
||||||
ONT ultra-long human reads~\citep{Jain128835}, BWA-MEM failed. Minialign and
|
truth. On ONT $\sim$100kb human reads~\citep{Jain128835}, BWA-MEM failed.
|
||||||
minimap2 are over 70 times faster than others. We have also examined tens of
|
Kart, minialign and minimap2 are over 70 times faster than others. We have also
|
||||||
$\ge$100bp INDELs in IGV~\citep{Robinson:2011aa} and can confirm the
|
examined tens of $\ge$100bp INDELs in IGV~\citep{Robinson:2011aa} and can
|
||||||
observation by~\citet{Sedlazeck169557} that BWA-MEM often breaks them into
|
confirm the observation by~\citet{Sedlazeck169557} that BWA-MEM often breaks
|
||||||
shorter gaps. Minimap2 does not have this issue.
|
them into shorter gaps. The issue is much alleviated with minimap2, thanks
|
||||||
|
to the 2-piece affine gap cost.
|
||||||
|
|
||||||
\subsection{Aligning spliced reads}
|
\subsection{Aligning spliced reads}
|
||||||
|
|
||||||
We first evaluated minimap2 on SIRV control data~(AC:SRR5286959;
|
We evaluated minimap2 on SIRV control data~(AC:SRR5286959;
|
||||||
\citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,916
|
\citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,916
|
||||||
introns, 93.0\% of which are precise. We examined wrongly predicted introns and
|
introns from 11\,017 reads. 93.0\% of splice juctions are precise. We examined
|
||||||
found the majority were caused by clustered splicing signals (e.g. two adjacent
|
wrongly predicted junctions and found the majority were caused by clustered
|
||||||
${\tt GT}$ sites). When INDEL sequencing errors are frequent, it is difficult
|
splicing signals (e.g. two adjacent ${\tt GT}$ sites). When INDEL sequencing
|
||||||
to found precise splicing sites in this case. If we allow up to 10bp distance
|
errors are frequent, it is difficult to find precise splicing sites in this
|
||||||
from true splicing sites, 98.4\% of aligned introns are approximately correct.
|
case. If we allow up to 10bp distance from true splicing sites, 98.4\% of
|
||||||
Given this observation, we might be able to improve boundary detection by
|
aligned introns are approximately correct. Given this observation, we might be
|
||||||
initializing $d(\cdot)$ and $a(\cdot)$ in Eq.~(\ref{eq:splice}) with
|
able to improve boundary detection by initializing $d(\cdot)$ and $a(\cdot)$ in
|
||||||
position-specific scoring matrices or more sophisticated models. We have
|
Eq.~(\ref{eq:splice}) with position-specific scoring matrices or more
|
||||||
not tried this approach.
|
sophisticated models. We have not tried this approach.
|
||||||
|
|
||||||
\begin{table}[!tb]
|
\begin{table}[!tb]
|
||||||
\processtable{Evaluation of splicing accuracy on 2D ONT reads}
|
\processtable{Evaluation of junction accuracy on 2D ONT reads}
|
||||||
{\footnotesize\label{tab:intron}
|
{\footnotesize\label{tab:intron}
|
||||||
\begin{tabular}{p{3.1cm}rrrr}
|
\begin{tabular}{p{3.1cm}rrrr}
|
||||||
\toprule
|
\toprule
|
||||||
|
|
@ -431,26 +453,45 @@ We next aligned real mouse reads~\citep{Byrne:2017aa} with GMAP~(v2017-06-20;
|
||||||
\citealp{Wu:2005vn}), minimap2, SpAln~(v2.3.1; \citealp{Iwata:2012aa}) and
|
\citealp{Wu:2005vn}), minimap2, SpAln~(v2.3.1; \citealp{Iwata:2012aa}) and
|
||||||
STAR~(v2.5.3a; \citealp{Dobin:2013kx}). In general, minimap2 is more
|
STAR~(v2.5.3a; \citealp{Dobin:2013kx}). In general, minimap2 is more
|
||||||
consistent with existing annotations (Table~\ref{tab:intron}): it finds
|
consistent with existing annotations (Table~\ref{tab:intron}): it finds
|
||||||
more splicing with a higher percentage being exactly or approximately correct.
|
more junctions with a higher percentage being exactly or approximately correct.
|
||||||
We noted that GMAP and SpAln have not been optimized for noisy reads. We have
|
Minimap2 is over 40 times faster than GMAP and SpAln. While STAR is close to
|
||||||
tried different settings, but their developers should be able to improve the
|
minimap2 in speed, it does not work well with noisy reads. We have also
|
||||||
accuracy further. On run time, minimap2 is over 40 times faster than GMAP and
|
evaluated spliced aligners on public Iso-Seq data (human Alzheimer brain
|
||||||
SpAln. While STAR is close to minimap2 in speed, it does not work well with
|
from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The observation
|
||||||
noisy reads.
|
is similar: minimap2 is faster at higher junction accuracy.
|
||||||
|
|
||||||
\section{Discussions}
|
We noted that GMAP and SpAln have not been optimized for noisy reads. We are
|
||||||
|
showing the best setting we have experimented, but their developers should be
|
||||||
|
able to improve their accuracy further.
|
||||||
|
|
||||||
Minialign and minimap2 are fast because a) with chaining, they can quickly
|
%\begin{table}[!tb]
|
||||||
filter out most false seed hits~\citep{Li:2016aa} and reduce unsuccessful but
|
%\processtable{Evaluation of junction accuracy on SMRT Iso-Seq reads}
|
||||||
costly DP-based alignments; b) they implemented so far the fastest DP-based
|
%{\footnotesize
|
||||||
alignment algorithm for long sequences~\citep{Suzuki:2016}. It is possible to
|
%\begin{tabular}{lrrrr}
|
||||||
further accelerate minimap2 with a few other tweaks such as adaptive
|
%\toprule
|
||||||
banding~\citep{Suzuki130633} or incremental banding.
|
%& GMAP & minimap2 & SpAln & STAR\\
|
||||||
|
%\midrule
|
||||||
|
%Run time (CPU min) & & 243 & 2\,352 & 1\,647 \\
|
||||||
|
%\# aligned reads & & 1\,123\,025 & 1\,094\,092 & 682\,452\\
|
||||||
|
%\# chimeric alignments & & 33\,091 & 0 & 0\\
|
||||||
|
%\# non-spliced alignments & & 339\,081 & 291\,447 & 272\,536\vspace{1em}\\
|
||||||
|
%\# aligned introns & & 9\,071\,755 & 9\,208\,564 & 3\,029\,121 \\
|
||||||
|
%\# novel introns & & 42\,773 & 82\,230 & 17\,791 \\
|
||||||
|
%\% exact introns & & 94.9\% & 91.7\% & 84.7\% \\
|
||||||
|
%\% approx. introns&& 96.9\% & 93.4\% & 93.8\% \\
|
||||||
|
%\botrule
|
||||||
|
%\end{tabular}
|
||||||
|
%}{}
|
||||||
|
%\end{table}
|
||||||
|
|
||||||
In addition to reference-based read mapping, minimap2 inherits minimap's
|
|
||||||
functionality to search against huge multi-species databases and to find read
|
\section{Conclusion}
|
||||||
overlaps. On a few test data sets, minimap2 appears to yield slightly better
|
|
||||||
miniasm assembly~\citep{Li:2016aa}. Minimap2 can also align closely related
|
Minimap2 is a fast, accurate and versatile aligner for long nucleotide
|
||||||
|
sequences. In addition to reference-based read mapping, minimap2 inherits
|
||||||
|
minimap's functionality to search against huge multi-species databases and to
|
||||||
|
find read overlaps. On a few test data sets, minimap2 appears to yield slightly
|
||||||
|
better miniasm assembly~\citep{Li:2016aa}. Minimap2 can also align similar
|
||||||
genomes or different assemblies of the same species. However, full-genome
|
genomes or different assemblies of the same species. However, full-genome
|
||||||
alignment is an intricate research topic. More thorough evaluations would be
|
alignment is an intricate research topic. More thorough evaluations would be
|
||||||
necessary to justify the use of minimap2 for such applications.
|
necessary to justify the use of minimap2 for such applications.
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue