updated the tech note

Heng Li 2017-11-02 15:37:24 -04:00
parent 22290db3e4
commit d0ac78ac08
4 changed files with 153 additions and 105 deletions

@ -281,3 +281,19 @@
Journal = {arXiv:1111.5572}, Journal = {arXiv:1111.5572},
Title = {Faster and More Accurate Sequence Alignment with SNAP}, Title = {Faster and More Accurate Sequence Alignment with SNAP},
Year = {2011}} Year = {2011}}
@article{Irimia:2008aa,
Author = {Irimia, Manuel and Roy, Scott William},
Journal = {PLoS Genet},
Pages = {e1000148},
Title = {Evolutionary convergence on highly-conserved 3' intron structures in intron-poor eukaryotes and insights into the ancestral eukaryotic genome},
Volume = {4},
Year = {2008}}
@article{Depristo:2011vn,
Author = {Depristo, Mark A and others},
Journal = {Nat Genet},
Pages = {491-8},
Title = {A framework for variation discovery and genotyping using next-generation {DNA} sequencing data},
Volume = {43},
Year = {2011}}

@ -31,10 +31,10 @@
\section{Motivation:} Recent advances in sequencing technologies promise \section{Motivation:} Recent advances in sequencing technologies promise
ultra-long reads of $\sim$100 kilo bases (kb) on average, full-length mRNA or ultra-long reads of $\sim$100 kilo bases (kb) on average, full-length mRNA or
cDNA reads in high throughput and genomic contigs over 100 mega bases (Mb) in cDNA reads in high throughput and genomic contigs over 100 mega bases (Mb) in
length. Existing alignment tools are unable or inefficient to process such data length. Existing alignment programs are unable or inefficient to process such data
at scale, which presses for the development of new alignment algorithms. at scale, which presses for the development of new alignment algorithms.
\section{Results:} Minimap2 is a general-purpose aligner to map DNA or long \section{Results:} Minimap2 is a general-purpose mapper to align DNA or long
mRNA sequences against a large reference database. It works with accurate short mRNA sequences against a large reference database. It works with accurate short
reads of $\ge$100bp in length, $\ge$1kb genomic reads at error rate $\sim$15\%, reads of $\ge$100bp in length, $\ge$1kb genomic reads at error rate $\sim$15\%,
full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
@ -64,7 +64,7 @@ the thought that 10kb long sequences should be easier to map than 100bp reads
because we can more effectively skip repetitive regions, which are often the because we can more effectively skip repetitive regions, which are often the
bottleneck of short-read alignment. We confirmed our speculation by achieving bottleneck of short-read alignment. We confirmed our speculation by achieving
approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}. approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
\citet{Suzuki:2016} extended our work with a fast and novel algorithm on \citet{Suzuki130633} extended our work with a fast and novel algorithm on
generating base-level alignment, which in turn inspired us to develop minimap2 generating base-level alignment, which in turn inspired us to develop minimap2
towards higher accuracy and more practical functionality. towards higher accuracy and more practical functionality.
@ -179,7 +179,7 @@ where $s(i,j)$ is the score between the $i$-th reference base and $j$-th query
base. Eq.~(\ref{eq:ae86}) is a natural extension to the equation under affine base. Eq.~(\ref{eq:ae86}) is a natural extension to the equation under affine
gap cost~\citep{Gotoh:1982aa,Altschul:1986aa}. gap cost~\citep{Gotoh:1982aa,Altschul:1986aa}.
\subsubsection{Suzuki's formulation} \subsubsection{The Suzuki-Kasahara formulation}
When we allow gaps longer than several hundred base pairs, nucleotide-level When we allow gaps longer than several hundred base pairs, nucleotide-level
alignment is much slower than chaining. SSE acceleration is critical to the alignment is much slower than chaining. SSE acceleration is critical to the
@ -187,7 +187,7 @@ performance of minimap2. Traditional SSE implementations~\citep{Farrar:2007hs}
based on Eq.~(\ref{eq:ae86}) can achieve 16-way parallelization for short based on Eq.~(\ref{eq:ae86}) can achieve 16-way parallelization for short
sequences, but only 4-way parallelization when the peak alignment score reaches sequences, but only 4-way parallelization when the peak alignment score reaches
32767. Long sequence alignment may exceed this threshold. Inspired by 32767. Long sequence alignment may exceed this threshold. Inspired by
\citet{Wu:1996aa} and the following work, \citet{Suzuki:2016} proposed a \citet{Wu:1996aa} and the following work, \citet{Suzuki130633} proposed a
difference-based formulation that lifted this limitation. difference-based formulation that lifted this limitation.
In case of 2-piece gap cost, define In case of 2-piece gap cost, define
\[ \[
@ -337,18 +337,24 @@ F_{i,j+1}= \max\{H_{ij}-q,F_{ij}\}-e\\
\tilde{E}_{i+1,j}= \max\{H_{ij}-d(i)-\tilde{q},\tilde{E}_{ij}\}\\ \tilde{E}_{i+1,j}= \max\{H_{ij}-d(i)-\tilde{q},\tilde{E}_{ij}\}\\
\end{array}\right. \end{array}\right.
\end{equation} \end{equation}
Let $T$ be the reference sequence. $d(i)$ is the cost of a non-canonical donor Let $T$ be the reference sequence. $d(i)$ is computed as
site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a positive number $p$ \[d(i)=\left\{\begin{array}{ll}
otherwise. Similarly, $a(i)$ is the cost of a non-canonical acceptor site, which 0 & \mbox{if $T[i+1,i+3]$ is ${\tt GTA}$ or ${\tt GTG}$} \\
takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. Eq.~(\ref{eq:splice}) is p/2 & \mbox{if $T[i+1,i+3]$ is ${\tt GTC}$ or ${\tt GTT}$} \\
almost equivalent to the equation used by EXALIN~\citep{Zhang:2006aa} except p & \mbox{otherwise}
that we allow insertions immediately followed by deletions and vice versa; in \end{array}\right.\]
addition, we use Suzuki's diagonal formulation in actual implementation. where $T[i,j]$ extracts a substring of $T$ between $i$ and $j$ inclusively.
$d(i)$ penalizes non-canonical donor sites with $p$ and less frequent eukaryotic
%Given that $d_i$ and $a_i$ splicing signal ${\tt GT[C/T]}$ with $p/2$~\citep{Irimia:2008aa}. Similarly,
%are a function of the reference sequence, it is possible to incorporate \[a(i)=\left\{\begin{array}{ll}
%splicing signals with more sophisticated models, such as positional weight 0 & \mbox{if $T[i-2,i]$ is ${\tt CAG}$ or ${\tt TAG}$} \\
%matrices. We have not tried this approach. p/2 & \mbox{if $T[i-2,i]$ is ${\tt AAG}$ or ${\tt GAG}$} \\
p & \mbox{otherwise}
\end{array}\right.\]
models the acceptor signal. Eq.~(\ref{eq:splice}) is close to an equation in
\citet{Zhang:2006aa} except that we allow insertions immediately followed by
deletions and vice versa; in addition, we use the Suzuki-Kasahara diagonal
formulation in actual implementation.
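
The two piecewise definitions of $d(i)$ and $a(i)$ in the revised text can be sketched as follows. This is only an illustration of the case analysis, not minimap2's implementation: the function names, the 0-based Python string `t` standing in for $T$, and the value chosen for `p` are all assumptions. With 1-based inclusive coordinates, $T[i+1,i+3]$ corresponds to `t[i:i+3]` and $T[i-2,i]$ to `t[i-3:i]`.

```python
def donor_cost(t: str, i: int, p: float) -> float:
    """Sketch of d(i): penalty for the donor signal T[i+1, i+3] (1-based, inclusive)."""
    if i < 0:
        return p  # no valid window: treat as non-canonical
    signal = t[i:i + 3].upper()  # 1-based T[i+1..i+3] is 0-based t[i:i+3]
    if signal in ("GTA", "GTG"):
        return 0.0
    if signal in ("GTC", "GTT"):
        return p / 2
    return p  # also covers windows truncated at the end of the sequence


def acceptor_cost(t: str, i: int, p: float) -> float:
    """Sketch of a(i): penalty for the acceptor signal T[i-2, i] (1-based, inclusive)."""
    if i < 3:
        return p  # fewer than three bases available before position i
    signal = t[i - 3:i].upper()  # 1-based T[i-2..i] is 0-based t[i-3:i]
    if signal in ("CAG", "TAG"):
        return 0.0
    if signal in ("AAG", "GAG"):
        return p / 2
    return p


# Hypothetical reference fragment and penalty, chosen only for illustration.
T = "ACGTAAGTTTTTCAGAC"
P = 8.0
```

In this toy sequence, `donor_cost(T, 2, P)` sees `GTA`, `donor_cost(T, 6, P)` sees `GTT`, and `acceptor_cost(T, 15, P)` sees `CAG`.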
If RNA-seq reads are not sequenced from stranded libraries, the read strand If RNA-seq reads are not sequenced from stranded libraries, the read strand
relative to the underlying transcript is unknown. By default, minimap2 aligns relative to the underlying transcript is unknown. By default, minimap2 aligns
@ -440,16 +446,16 @@ to the 2-piece affine gap cost.
\subsection{Aligning long spliced reads} \subsection{Aligning long spliced reads}
We evaluated minimap2 on SIRV control data~(AC:SRR5286959; We evaluated minimap2 on SIRV control data~(AC:SRR5286959;
\citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,916 \citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,918
introns from 11\,017 reads. 93.0\% of splice junctions are precise. We examined introns from 11\,018 reads. 93.8\% of splice junctions are precise. We examined
wrongly predicted junctions and found the majority were caused by clustered wrongly predicted junctions and found the majority were caused by clustered
splicing signals (e.g. two adjacent ${\tt GT}$ sites). When INDEL sequencing splicing signals (e.g. two adjacent ${\tt GT}$ sites). When INDEL sequencing
errors are frequent, it is difficult to find precise splicing sites in this errors are frequent, it is difficult to find precise splicing sites in this
case. If we allow up to 10bp distance from true splicing sites, 98.4\% of case. If we allow up to 10bp distance from true splicing sites, 98.4\% of
aligned introns are approximately correct. Given this observation, we might be aligned introns are approximately correct. It is worth noting that for SIRV, we
able to improve boundary detection by initializing $d(\cdot)$ and $a(\cdot)$ in asked minimap2 to model the ${\tt GT..AG}$ splicing signal only without extra
Eq.~(\ref{eq:splice}) with position-specific scoring matrices or more bases. This is because SIRV does not honor the evolutionarily prevalent signal
sophisticated models. We have not tried this approach. ${\tt GT[A/G]..[C/T]AG}$~\citep{Irimia:2008aa}.
\begin{table}[!tb] \begin{table}[!tb]
\processtable{Evaluation of junction accuracy on 2D ONT reads} \processtable{Evaluation of junction accuracy on 2D ONT reads}
@ -460,13 +466,13 @@ sophisticated models. We have not tried this approach.
\midrule \midrule
Run time (CPU min) & 631 & 15.9 & 2\,076 & 33.9 \\ Run time (CPU min) & 631 & 15.9 & 2\,076 & 33.9 \\
Peak RAM (GByte) & 8.9 & 14.5 & 3.2 & 29.2\vspace{1em}\\ Peak RAM (GByte) & 8.9 & 14.5 & 3.2 & 29.2\vspace{1em}\\
\# aligned reads & 103\,669 & 104\,200 & 103\,711 & 26\,479 \\ \# aligned reads & 103\,669 & 104\,199 & 103\,711 & 26\,479 \\
\# chimeric alignments & 1\,904 & 1\,488 & 0 & 0 \\ \# chimeric alignments & 1\,904 & 1\,488 & 0 & 0 \\
\# non-spliced alignments & 15\,854 & 14\,639 & 17\,033 & 10\,545\vspace{1em}\\ \# non-spliced alignments & 15\,854 & 14\,798 & 17\,033 & 10\,545\vspace{1em}\\
\# aligned introns & 692\,275 & 694\,103 & 692\,945 & 78\,603 \\ \# aligned introns & 692\,275 & 693\,553 & 692\,945 & 78\,603 \\
\# novel introns & 11\,239 & 3\,207 & 8\,550 & 1\,214 \\ \# novel introns & 11\,239 & 3\,113 & 8\,550 & 1\,214 \\
\% exact introns & 83.8\% & 91.7\% & 87.9\% & 55.2\% \\ \% exact introns & 83.8\% & 94.0\% & 87.9\% & 55.2\% \\
\% approx. introns & 91.8\% & 96.5\% & 92.5\% & 82.4\% \\ \% approx. introns & 91.8\% & 96.9\% & 92.5\% & 82.4\% \\
\botrule \botrule
\end{tabular} \end{tabular}
}{Mouse reads (AC:SRR5286960) were mapped to the primary assembly of mouse }{Mouse reads (AC:SRR5286960) were mapped to the primary assembly of mouse
@ -487,10 +493,16 @@ STAR~(v2.5.3a; \citealp{Dobin:2013kx}). In general, minimap2 is more
consistent with existing annotations (Table~\ref{tab:intron}): it finds consistent with existing annotations (Table~\ref{tab:intron}): it finds
more junctions with a higher percentage being exactly or approximately correct. more junctions with a higher percentage being exactly or approximately correct.
Minimap2 is over 40 times faster than GMAP and SpAln. While STAR is close to Minimap2 is over 40 times faster than GMAP and SpAln. While STAR is close to
minimap2 in speed, it does not work well with noisy reads. We have also minimap2 in speed, it does not work well with noisy reads.
evaluated spliced aligners on public Iso-Seq data (human Alzheimer brain
from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The observation We have also evaluated spliced aligners on public Iso-Seq data (human Alzheimer
is similar: minimap2 is faster at higher junction accuracy. brain from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The
observation is similar: minimap2 is faster at higher junction accuracy.
On a private Nanopore Direct RNA data set with $>$20\% sequencing error rate
(M\"{u}ller et al, personal communication), minimap2 aligned 940,346 introns
from 239,976 mapped reads with 88.5\% of them consistent with human gene
annotations. In comparison, only 40.3\% of GMAP introns are found in known gene
annotations.
We noted that GMAP and SpAln have not been optimized for noisy reads. We are We noted that GMAP and SpAln have not been optimized for noisy reads. We are
showing the best setting we have experimented with, but their developers should be showing the best setting we have experimented with, but their developers should be
@ -528,6 +540,20 @@ region close to its mate. If we disable this feature, BWA-MEM becomes slightly
less accurate than minimap2. We might consider implementing a similar heuristic less accurate than minimap2. We might consider implementing a similar heuristic
in minimap2 in future. in minimap2 in future.
To evaluate the accuracy of minimap2 on real data, we aligned human reads
(AC:ERR1341796) with BWA-MEM and minimap2, and called SNPs and small INDELs
with GATK HaplotypeCaller v3.5~\citep{Depristo:2011vn}. This run was sequenced
from experimentally mixed CHM1 and CHM13 cell lines. Both of them are homozygous
across the whole genome and have been \emph{de novo} assembled with SMRT reads
to high quality. This allowed us to construct an independent truth variant
data set
(\href{https://github.com/lh3/CHM-eval}{https://github.com/lh3/CHM-eval}) for
ERR1341796. In this evaluation, minimap2 has a higher SNP false negative rate
(FNR; 2.5\% of minimap2 vs 2.2\% of BWA-MEM), but fewer false positive SNPs per
million bases (FPPM; 3.0 vs 3.9), lower INDEL FNR (7.3\% vs 7.5\%) and similar
INDEL FPPM (both 1.0). The difference between the two mappers is much smaller
than between BWA-MEM and Bowtie2.
\section{Conclusion} \section{Conclusion}
Minimap2 is a fast, accurate and versatile aligner for long nucleotide Minimap2 is a fast, accurate and versatile aligner for long nucleotide
@ -540,11 +566,11 @@ alignment is an intricate research topic. More thorough evaluations would be
necessary to justify the use of minimap2 for such applications. necessary to justify the use of minimap2 for such applications.
\section*{Acknowledgements} \section*{Acknowledgements}
We owe a debt of gratitude to Hajime Suzuki for releasing his masterpiece and We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their
insightful notes before formal publication. We thank M. Schatz, P. Rescheneder masterpiece and insightful notes before formal publication. We thank M.
and F. Sedlazeck for pointing out the limitation of BWA-MEM. We are also Schatz, P. Rescheneder and F. Sedlazeck for pointing out the limitation of
grateful to early minimap2 testers who have greatly helped to suggest features BWA-MEM. We are also grateful to early minimap2 testers who have greatly helped
and to fix various issues. to suggest features and to fix various issues.
\bibliography{minimap2} \bibliography{minimap2}

@ -1,60 +1,62 @@
Q 60 18345673 8 0.000000436 18345673 Q 60 18579866 27 0.000001453 18579866
Q 59 33966 4 0.000000653 18379639 Q 59 27087 4 0.000001666 18606953
Q 58 34178 1 0.000000706 18413817 Q 58 21435 1 0.000001718 18628388
Q 56 49138 1 0.000000758 18462955 Q 57 45663 3 0.000001874 18674051
Q 54 22442 4 0.000000974 18485397 Q 56 36031 2 0.000001978 18710082
Q 53 19070 2 0.000001081 18504467 Q 55 18499 2 0.000002082 18728581
Q 52 14169 3 0.000001242 18518636 Q 54 14754 2 0.000002187 18743335
Q 51 13233 4 0.000001457 18531869 Q 53 25541 2 0.000002291 18768876
Q 50 12133 2 0.000001564 18544002 Q 52 26397 5 0.000002554 18795273
Q 49 11138 4 0.000001778 18555140 Q 51 15090 3 0.000002711 18810363
Q 48 11174 8 0.000002208 18566314 Q 50 13425 11 0.000003294 18823788
Q 47 17139 4 0.000002422 18583453 Q 49 15175 2 0.000003397 18838963
Q 46 20428 10 0.000002956 18603881 Q 48 19407 4 0.000003606 18858370
Q 45 16503 3 0.000003115 18620384 Q 47 11538 16 0.000004452 18869908
Q 44 11933 6 0.000003435 18632317 Q 46 12558 17 0.000005349 18882466
Q 43 25392 11 0.000004020 18657709 Q 45 40362 28 0.000006817 18922828
Q 42 16734 9 0.000004498 18674443 Q 44 10465 13 0.000007500 18933293
Q 41 13826 10 0.000005030 18688269 Q 43 10098 20 0.000008552 18943391
Q 40 13023 10 0.000005561 18701292 Q 42 10682 19 0.000009549 18954073
Q 39 12686 10 0.000006092 18713978 Q 41 9823 11 0.000010125 18963896
Q 38 17275 4 0.000006300 18731253 Q 40 9685 16 0.000010963 18973581
Q 37 17241 2 0.000006401 18748494 Q 39 10273 18 0.000011905 18983854
Q 36 12458 12 0.000007036 18760952 Q 38 9515 18 0.000012847 18993369
Q 35 11981 5 0.000007298 18772933 Q 37 9474 27 0.000014261 19002843
Q 34 12004 11 0.000007879 18784937 Q 36 10430 25 0.000015568 19013273
Q 33 12111 7 0.000008246 18797048 Q 35 9241 34 0.000017348 19022514
Q 32 11782 9 0.000008719 18808830 Q 34 9162 31 0.000018968 19031676
Q 31 11811 7 0.000009086 18820641 Q 33 10164 49 0.000021532 19041840
Q 30 33507 32 0.000010767 18854148 Q 32 9152 55 0.000024408 19050992
Q 29 11243 21 0.000011874 18865391 Q 31 9252 35 0.000026233 19060244
Q 28 10779 17 0.000012767 18876170 Q 30 9872 55 0.000029103 19070116
Q 27 15733 24 0.000014027 18891903 Q 29 8938 65 0.000032496 19079054
Q 26 16762 40 0.000016130 18908665 Q 28 8951 73 0.000036306 19088005
Q 25 13811 49 0.000018708 18922476 Q 27 9949 95 0.000041261 19097954
Q 24 14141 46 0.000021123 18936617 Q 26 9784 97 0.000046316 19107738
Q 23 13429 55 0.000024010 18950046 Q 25 10126 97 0.000051366 19117864
Q 22 13116 26 0.000025365 18963162 Q 24 11260 123 0.000057765 19129124
Q 21 13436 46 0.000027771 18976598 Q 23 10047 114 0.000063691 19139171
Q 20 13441 55 0.000030648 18990039 Q 22 9661 123 0.000070083 19148832
Q 19 12988 53 0.000033416 19003027 Q 21 10339 168 0.000078813 19159171
Q 18 13353 51 0.000036074 19016380 Q 20 17928 193 0.000088804 19177099
Q 17 13782 77 0.000040094 19030162 Q 19 9842 193 0.000098817 19186941
Q 16 14065 94 0.000045001 19044227 Q 18 14737 247 0.000111605 19201678
Q 15 14044 124 0.000051474 19058271 Q 17 10218 238 0.000123934 19211896
Q 14 14714 140 0.000058774 19072985 Q 16 10271 242 0.000136457 19222167
Q 13 17459 197 0.000069040 19090444 Q 15 12241 333 0.000153683 19234408
Q 12 17339 259 0.000082532 19107783 Q 14 9189 336 0.000171070 19243597
Q 11 17381 280 0.000097097 19125164 Q 13 9493 515 0.000197734 19253090
Q 10 17732 295 0.000112418 19142896 Q 12 11502 743 0.000236185 19264592
Q 9 17959 416 0.000134023 19160855 Q 11 8211 507 0.000262390 19272803
Q 8 18234 530 0.000161530 19179089 Q 10 9133 606 0.000293695 19281936
Q 7 19048 514 0.000188143 19198137 Q 9 10014 931 0.000341801 19291950
Q 6 19722 656 0.000222085 19217859 Q 8 8436 698 0.000377816 19300386
Q 5 19753 775 0.000262143 19237612 Q 7 8443 705 0.000414163 19308829
Q 4 19818 1030 0.000315359 19257430 Q 6 10203 944 0.000462808 19319032
Q 3 17088 1100 0.000372149 19274518 Q 5 6936 756 0.000501760 19325968
Q 2 43045 6708 0.000718569 19317563 Q 4 6732 843 0.000545190 19332700
Q 1 126377 25255 0.002012761 19443940 Q 3 8215 1104 0.000602040 19340915
Q 0 554357 372087 0.020562901 19998297 Q 2 21201 5440 0.000882342 19362116
Q 1 82328 22186 0.002019600 19444444
Q 0 553853 371953 0.020562901 19998297
U 1703 U 1703

@ -1,9 +1,13 @@
Q 60 32226 0 0.000000000 32226 Q 60 32477 0 0.000000000 32477
Q 20 267 1 0.000030776 32493 Q 22 16 1 0.000030776 32493
Q 10 34 1 0.000061487 32527 Q 21 44 1 0.000061468 32537
Q 9 118 1 0.000091898 32645 Q 19 73 1 0.000091996 32610
Q 5 27 2 0.000153036 32672 Q 14 66 1 0.000122414 32676
Q 4 68 2 0.000213806 32740 Q 10 26 3 0.000214054 32702
Q 1 314 101 0.003267381 33054 Q 8 14 1 0.000244529 32716
Q 7 13 2 0.000305539 32729
Q 6 47 1 0.000335611 32776
Q 3 10 1 0.000366010 32786
Q 2 20 2 0.000426751 32806
Q 1 248 94 0.003267381 33054
Q 0 31 17 0.003778147 33085 Q 0 31 17 0.003778147 33085
U 3
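
The `Q` rows in the two evaluation files above appear to follow a cumulative layout: mapping quality, number of reads at that quality, number of those reads mapped wrongly, cumulative error rate, and cumulative number of reads. That column interpretation is an assumption inferred from the numbers themselves (for example, 1/32493 ≈ 0.000030776), not something the commit states. Under that assumption, the last two columns can be recomputed from the first three with a short sketch; the function name is hypothetical.

```python
from typing import Iterable, List, Tuple


def cumulative_error(
    rows: Iterable[Tuple[int, int, int]]
) -> List[Tuple[int, float, int]]:
    """Recompute (mapq, cumulative error rate, cumulative reads).

    `rows` holds (mapq, n_reads, n_wrong) tuples, assumed to be sorted by
    descending mapping quality as in the files above.
    """
    out = []
    total_reads = 0
    total_wrong = 0
    for mapq, n_reads, n_wrong in rows:
        total_reads += n_reads
        total_wrong += n_wrong
        rate = total_wrong / total_reads if total_reads else 0.0
        out.append((mapq, rate, total_reads))
    return out


# First three rows of the old side of the second file.
rows = [(60, 32226, 0), (20, 267, 1), (10, 34, 1)]
result = cumulative_error(rows)
```

Comparing `result` with the fourth and fifth columns of the corresponding rows is a quick consistency check on a regenerated evaluation file.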