this will become version 3 at arXiv

Heng Li 2017-11-06 10:57:12 -05:00
parent 2191ac58ad
commit 1617b87ee1
1 changed files with 34 additions and 19 deletions

@@ -34,7 +34,7 @@ cDNA reads in high throughput and genomic contigs over 100 mega bases (Mb) in
 length. Existing alignment programs are unable or inefficient to process such data
 at scale, which presses for the development of new alignment algorithms.
 
-\section{Results:} Minimap2 is a general-purpose mapper to align DNA or long
+\section{Results:} Minimap2 is a general-purpose alignment program to map DNA or long
 mRNA sequences against a large reference database. It works with accurate short
 reads of $\ge$100bp in length, $\ge$1kb genomic reads at error rate $\sim$15\%,
 full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
@@ -66,15 +66,22 @@ bottleneck of short-read alignment. We confirmed our speculation by achieving
 approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
 \citet{Suzuki130633} extended our work with a fast and novel algorithm on
 generating base-level alignment, which in turn inspired us to develop minimap2
-towards higher accuracy and more practical functionality.
+with added functionality.
 
 Both SMRT and ONT have been applied to the sequencing of spliced mRNAs (RNA-seq). While
 traditional mRNA aligners work~\citep{Wu:2005vn,Iwata:2012aa}, they are not
 optimized for long noisy sequence reads and are tens of times slower than
 dedicated long-read aligners. When developing minimap2 initially for aligning
-genomic DNA only, we realized minor modifications could make it competitive for
-aligning mRNAs as well. Minimap2 is a first RNA-seq aligner specifically
-designed for long noisy reads.
+genomic DNA only, we realized minor modifications could enable the base
+algorithm to map mRNAs as well. Minimap2 becomes a first RNA-seq aligner
+specifically designed for long noisy reads. We have also extended the original
+algorithm to map short reads at a speed faster than several mainstream
+short-read mappers.
+
+In this article, we will describe the minimap2 algorithm and its applications
+to different types of input sequences. We will evaluate the performance and
+accuracy of minimap2 on several simulated and real data sets and demonstrate
+the versatility of minimap2.
 
 \begin{methods}
 \section{Methods}
@@ -366,12 +373,12 @@ reads that span canonical splicing sites.
 
 In the spliced alignment mode, minimap2 further increases the density of
 minimizers and disables banded alignment. Together with the two-round DP-based
-alignment, spliced alignment is several times slower than DNA sequence
+alignment, spliced alignment is several times slower than genomic DNA
 alignment.
 
 \subsection{Aligning short paired-end reads}
 
-During chainging, minimap2 takes a pair of reads as one read with a gap of
+During chainging, minimap2 takes a pair of reads as one fragment with a gap of
 unknown length in the middle. It applies a normal gap cost between seeds on the
 same read but is a more permissive gap cost between seeds on different reads.
 More precisely, the gap cost during chaining is:
@@ -423,9 +430,7 @@ NGMLR~(v0.2.5; \citealp{Sedlazeck169557}). We excluded rHAT~\citep{Liu:2016ab}
 and LAMSA~\citep{Liu:2017aa} because they either
 crashed or produced malformatted output. In this evaluation, minimap2 has
 higher power to distinguish unique and repetitive hits, and achieves overall
-higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
-even if we skip DP-based alignment (data not shown), confirming chaining alone
-is sufficient to achieve high accuracy for approximate mapping. Minimap2 and
+higher mapping accuracy (Fig.~\ref{fig:eval}a). Minimap2 and
 NGMLR provide better mapping quality estimate: they rarely give repetitive hits
 high mapping quality. Apparently, other aligners may
 occasionally miss close suboptimal hits and be overconfident in wrong mappings.
@@ -498,10 +503,10 @@ minimap2 in speed, it does not work well with noisy reads.
 We have also evaluated spliced aligners on public Iso-Seq data (human Alzheimer
 brain from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The
 observation is similar: minimap2 is faster at higher junction accuracy.
-On a private Nanopore Direct RNA data set with $>$20\% sequencing error rate
-(M\"{u}ller et al, personal communication), minimap2 aligned 940,346 introns
-from 239,976 mapped reads with 88.5\% of them consistent with human gene
-annotations. In comparison, only 40.3\% of GMAP introns found in known gene
+On a private Nanopore Direct RNA data set with $\sim$17\% sequencing error rate
+(N. Loman, personal communication), minimap2 aligned 96\,467 introns
+from 37\,068 mapped reads with 95.4\% of them consistent with human gene
+annotations. In comparison, only 74.8\% of GMAP introns found in known gene
 annotations.
 
 We noted that GMAP and SpAln have not been optimized for noisy reads. We are
@@ -551,9 +556,8 @@ data set
 ERR1341796. In this evaluation, minimap2 has higher SNP false negative rate
 (FNR; 2.5\% of minimap2 vs 2.2\% of BWA-MEM), but fewer false positive SNPs per
 million bases (FPPM; 3.0 vs 3.9), lower 2--50bp INDEL FNR (7.3\% vs 7.5\%) and
-similar INDEL FPPM (both 1.0). In comparison, Bowtie2 has a SNP FNR of 4.7\%
-and INDEL FNR of 10.4\%. Minimap2 is broadly similar to BWA-MEM in the context
-of small variant calling.
+similar INDEL FPPM (both 1.0). Minimap2 is broadly similar to BWA-MEM in the
+context of small variant calling.
 
 \subsection{Other applications}
 
@@ -561,14 +565,14 @@ Minimap2 retains minimap's functionality to find overlaps between long reads
 and to search against large multi-species databases such as \emph{nt} from
 NCBI. Minimap2 can also align similar genomes or different assemblies of the
 same species. It took 7 wall-clock minutes over 8 CPU cores to align a human
-SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times as fast as
+SMRT assembly (AC:GCA\_001297185.1) to GRCh38, over 20 times faster
 MUMmer4~\citep{Kurtz:2004zr}.
 
 \section{Discussions}
 
 Minimap2 is a versatile mapper and pairwise aligner for nucleotide sequences.
 It works with short reads, assembly contigs and long noisy genomic and RNA-seq
-reads. It can be used as a read mapper, long-read overlapper or a full-genome
+reads, and can be used as a read mapper, long-read overlapper or a full-genome
 aligner. Minimap2 is also accurate and efficient, often outperforming other
 domain-specific alignment tools in terms of both speed and accuracy.
 
@@ -586,6 +590,17 @@ general form, minimap2 chaining can be adapted to non-typical data types such
 spliced reads and multiple reads per fragment. This gives us the opportunity to
 extend the same base algorithm to a variety of use cases.
 
+Modern mainstream aligners often use a full-text index, such as suffix array or
+FM-index, to index reference sequences. An advantage of this approach is that
+we can use exact seeds of arbitrary lengths, which helps to increase seed
+uniqueness and reduce unsuccessful extensions. Minimap2 indexes reference
+k-mers with a hash table instead. Such fixed-length seeds are inferior to
+variable-length seeds in theory, but can be computed much more efficiently in
+practice. When a query sequence has multiple seed hits, we can afford to skip
+some highly repetitive seeds without affecting the final accuracy. This further
+alleviates the concern with the uniqueness of seeds. Hash table is the ideal
+data structure for mapping long query sequences.
+
 \section*{Acknowledgements}
 We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their
 masterpiece and insightful notes before formal publication. We thank M.