updated the tech note

Heng Li 2017-11-02 15:37:24 -04:00
parent 22290db3e4
commit d0ac78ac08
4 changed files with 153 additions and 105 deletions


@ -281,3 +281,19 @@
Journal = {arXiv:1111.5572},
Title = {Faster and More Accurate Sequence Alignment with SNAP},
Year = {2011}}
@article{Irimia:2008aa,
Author = {Irimia, Manuel and Roy, Scott William},
Journal = {PLoS Genet},
Pages = {e1000148},
Title = {Evolutionary convergence on highly-conserved 3' intron structures in intron-poor eukaryotes and insights into the ancestral eukaryotic genome},
Volume = {4},
Year = {2008}}
@article{Depristo:2011vn,
Author = {DePristo, Mark A and others},
Journal = {Nat Genet},
Pages = {491-8},
Title = {A framework for variation discovery and genotyping using next-generation {DNA} sequencing data},
Volume = {43},
Year = {2011}}


@ -31,10 +31,10 @@
\section{Motivation:} Recent advances in sequencing technologies promise
ultra-long reads of $\sim$100 kilo bases (kb) on average, full-length mRNA or
cDNA reads in high throughput and genomic contigs over 100 mega bases (Mb) in
length. Existing alignment tools are unable or inefficient to process such data
length. Existing alignment programs either cannot process such data or do so
inefficiently at scale, which calls for the development of new alignment algorithms.
\section{Results:} Minimap2 is a general-purpose aligner to map DNA or long
\section{Results:} Minimap2 is a general-purpose mapper to align DNA or long
mRNA sequences against a large reference database. It works with accurate short
reads of $\ge$100bp in length, $\ge$1kb genomic reads at error rate $\sim$15\%,
full-length noisy Direct RNA or cDNA reads, and assembly contigs or closely
@ -64,7 +64,7 @@ the thought that 10kb long sequences should be easier to map than 100bp reads
because we can more effectively skip repetitive regions, which are often the
bottleneck of short-read alignment. We confirmed our speculation by achieving
approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}.
\citet{Suzuki:2016} extended our work with a fast and novel algorithm on
\citet{Suzuki130633} extended our work with a fast and novel algorithm on
generating base-level alignment, which in turn inspired us to develop minimap2
towards higher accuracy and more practical functionality.
@ -179,7 +179,7 @@ where $s(i,j)$ is the score between the $i$-th reference base and $j$-th query
base. Eq.~(\ref{eq:ae86}) is a natural extension to the equation under affine
gap cost~\citep{Gotoh:1982aa,Altschul:1986aa}.
\subsubsection{Suzuki's formulation}
\subsubsection{The Suzuki-Kasahara formulation}
When we allow gaps longer than several hundred base pairs, nucleotide-level
alignment is much slower than chaining. SSE acceleration is critical to the
@ -187,7 +187,7 @@ performance of minimap2. Traditional SSE implementations~\citep{Farrar:2007hs}
based on Eq.~(\ref{eq:ae86}) can achieve 16-way parallelization for short
sequences, but only 4-way parallelization when the peak alignment score reaches
32767. Long sequence alignment may exceed this threshold. Inspired by
\citet{Wu:1996aa} and the following work, \citet{Suzuki:2016} proposed a
\citet{Wu:1996aa} and the following work, \citet{Suzuki130633} proposed a
difference-based formulation that lifted this limitation.
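The lane counts above can be checked with a short calculation. As a sketch, assume 128-bit SSE registers and signed 16-bit score lanes (both are our assumptions; the text states neither):
\[
128/8 = 16,\qquad 128/16 = 8,\qquad 128/32 = 4,\qquad 2^{15}-1 = 32767.
\]
Under these assumptions, 8-bit lanes give the 16-way case, and a score above 32767 no longer fits a signed 16-bit lane, so 32-bit lanes are needed and only 4 fit in one register.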
In case of 2-piece gap cost, define
\[
@ -337,18 +337,24 @@ F_{i,j+1}= \max\{H_{ij}-q,F_{ij}\}-e\\
\tilde{E}_{i+1,j}= \max\{H_{ij}-d(i)-\tilde{q},\tilde{E}_{ij}\}\\
\end{array}\right.
\end{equation}
Let $T$ be the reference sequence. $d(i)$ is the cost of a non-canonical donor
site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a positive number $p$
otherwise. Similarly, $a(i)$ is the cost of a non-canonical acceptor site, which
takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. Eq.~(\ref{eq:splice}) is
almost equivalent to the equation used by EXALIN~\citep{Zhang:2006aa} except
that we allow insertions immediately followed by deletions and vice versa; in
addition, we use Suzuki's diagonal formulation in actual implementation.
%Given that $d_i$ and $a_i$
%are a function of the reference sequence, it is possible to incorporate
%splicing signals with more sophisticated models, such as positional weight
%matrices. We have not tried this approach.
Let $T$ be the reference sequence. $d(i)$ is computed as
\[d(i)=\left\{\begin{array}{ll}
0 & \mbox{if $T[i+1,i+3]$ is ${\tt GTA}$ or ${\tt GTG}$} \\
p/2 & \mbox{if $T[i+1,i+3]$ is ${\tt GTC}$ or ${\tt GTT}$} \\
p & \mbox{otherwise}
\end{array}\right.\]
where $T[i,j]$ extracts a substring of $T$ between $i$ and $j$ inclusively.
$d(i)$ penalizes non-canonical donor sites with $p$ and the less frequent eukaryotic
splicing signal ${\tt GT[C/T]}$ with $p/2$~\citep{Irimia:2008aa}. Similarly,
\[a(i)=\left\{\begin{array}{ll}
0 & \mbox{if $T[i-2,i]$ is ${\tt CAG}$ or ${\tt TAG}$} \\
p/2 & \mbox{if $T[i-2,i]$ is ${\tt AAG}$ or ${\tt GAG}$} \\
p & \mbox{otherwise}
\end{array}\right.\]
models the acceptor signal. Eq.~(\ref{eq:splice}) is close to an equation in
\citet{Zhang:2006aa} except that we allow insertions immediately followed by
deletions and vice versa; in addition, we use the Suzuki-Kasahara diagonal
formulation in actual implementation.
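The two piecewise definitions of $d(i)$ and $a(i)$ can be sketched in a few lines of Python. This is only an illustration of the formulas above, not the minimap2 implementation: the function names `donor_cost` and `acceptor_cost`, the handling of positions too close to the sequence ends, and the value of `p` in the usage lines are all hypothetical.

```python
def donor_cost(t: str, i: int, p: float) -> float:
    """d(i): cost of the donor signal T[i+1, i+3] (1-based, inclusive)."""
    if i < 0:
        return p  # assumption: positions outside the sequence are non-canonical
    # 1-based T[i+1..i+3] is the 0-based Python slice t[i:i+3].
    signal = t[i:i + 3]
    if signal in ("GTA", "GTG"):
        return 0.0
    if signal in ("GTC", "GTT"):
        return p / 2
    return p  # also reached when fewer than 3 bases remain


def acceptor_cost(t: str, i: int, p: float) -> float:
    """a(i): cost of the acceptor signal T[i-2, i] (1-based, inclusive)."""
    if i < 3:
        return p  # assumption: fewer than 3 bases available, treat as non-canonical
    # 1-based T[i-2..i] is the 0-based Python slice t[i-3:i].
    signal = t[i - 3:i]
    if signal in ("CAG", "TAG"):
        return 0.0
    if signal in ("AAG", "GAG"):
        return p / 2
    return p


# Usage with a hypothetical penalty p = 4.0
print(donor_cost("AAGTAC", 2, 4.0))     # signal GTA
print(acceptor_cost("TTAAGG", 5, 4.0))  # signal AAG
```

Here `i` follows the 1-based convention of the text, so the only translation needed is from the inclusive interval $T[i,j]$ to Python's half-open slices.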
If RNA-seq reads are not sequenced from stranded libraries, the read strand
relative to the underlying transcript is unknown. By default, minimap2 aligns
@ -440,16 +446,16 @@ to the 2-piece affine gap cost.
\subsection{Aligning long spliced reads}
We evaluated minimap2 on SIRV control data~(AC:SRR5286959;
\citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,916
introns from 11\,017 reads. 93.0\% of splice junctions are precise. We examined
\citealp{Byrne:2017aa}) where the truth is known. Minimap2 predicted 59\,918
introns from 11\,018 reads. 93.8\% of splice junctions are precise. We examined
wrongly predicted junctions and found the majority were caused by clustered
splicing signals (e.g. two adjacent ${\tt GT}$ sites). When INDEL sequencing
errors are frequent, it is difficult to find precise splicing sites in this
case. If we allow up to 10bp distance from true splicing sites, 98.4\% of
aligned introns are approximately correct. Given this observation, we might be
able to improve boundary detection by initializing $d(\cdot)$ and $a(\cdot)$ in
Eq.~(\ref{eq:splice}) with position-specific scoring matrices or more
sophisticated models. We have not tried this approach.
aligned introns are approximately correct. It is worth noting that for SIRV, we
asked minimap2 to model the ${\tt GT..AG}$ splicing signal only without extra
bases. This is because SIRV does not honor the evolutionarily prevalent signal
${\tt GT[A/G]..[C/T]AG}$~\citep{Irimia:2008aa}.
\begin{table}[!tb]
\processtable{Evaluation of junction accuracy on 2D ONT reads}
@ -460,13 +466,13 @@ sophisticated models. We have not tried this approach.
\midrule
Run time (CPU min) & 631 & 15.9 & 2\,076 & 33.9 \\
Peak RAM (GByte) & 8.9 & 14.5 & 3.2 & 29.2\vspace{1em}\\
\# aligned reads & 103\,669 & 104\,200 & 103\,711 & 26\,479 \\
\# aligned reads & 103\,669 & 104\,199 & 103\,711 & 26\,479 \\
\# chimeric alignments & 1\,904 & 1\,488 & 0 & 0 \\
\# non-spliced alignments & 15\,854 & 14\,639 & 17\,033 & 10\,545\vspace{1em}\\
\# aligned introns & 692\,275 & 694\,103 & 692\,945 & 78\,603 \\
\# novel introns & 11\,239 & 3\,207 & 8\,550 & 1\,214 \\
\% exact introns & 83.8\% & 91.7\% & 87.9\% & 55.2\% \\
\% approx. introns & 91.8\% & 96.5\% & 92.5\% & 82.4\% \\
\# non-spliced alignments & 15\,854 & 14\,798 & 17\,033 & 10\,545\vspace{1em}\\
\# aligned introns & 692\,275 & 693\,553 & 692\,945 & 78\,603 \\
\# novel introns & 11\,239 & 3\,113 & 8\,550 & 1\,214 \\
\% exact introns & 83.8\% & 94.0\% & 87.9\% & 55.2\% \\
\% approx. introns & 91.8\% & 96.9\% & 92.5\% & 82.4\% \\
\botrule
\end{tabular}
}{Mouse reads (AC:SRR5286960) were mapped to the primary assembly of mouse
@ -487,10 +493,16 @@ STAR~(v2.5.3a; \citealp{Dobin:2013kx}). In general, minimap2 is more
consistent with existing annotations (Table~\ref{tab:intron}): it finds
more junctions with a higher percentage being exactly or approximately correct.
Minimap2 is over 40 times faster than GMAP and SpAln. While STAR is close to
minimap2 in speed, it does not work well with noisy reads. We have also
evaluated spliced aligners on public Iso-Seq data (human Alzheimer brain
from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The observation
is similar: minimap2 is faster at higher junction accuracy.
minimap2 in speed, it does not work well with noisy reads.
We have also evaluated spliced aligners on public Iso-Seq data (human Alzheimer
brain from \href{http://bit.ly/isoseqpub}{http://bit.ly/isoseqpub}). The
observation is similar: minimap2 is faster at higher junction accuracy.
On a private Nanopore Direct RNA data set with $>$20\% sequencing error rate
(M\"{u}ller et al., personal communication), minimap2 aligned 940\,346 introns
from 239\,976 mapped reads with 88.5\% of them consistent with human gene
annotations. In comparison, only 40.3\% of GMAP introns are found in known gene
annotations.
We noted that GMAP and SpAln have not been optimized for noisy reads. We are
showing the best setting we have experimented with, but their developers should be
@ -528,6 +540,20 @@ region close to its mate. If we disable this feature, BWA-MEM becomes slightly
less accurate than minimap2. We may consider implementing a similar heuristic
in minimap2 in the future.
To evaluate the accuracy of minimap2 on real data, we aligned human reads
(AC:ERR1341796) with BWA-MEM and minimap2, and called SNPs and small INDELs
with GATK HaplotypeCaller v3.5~\citep{Depristo:2011vn}. This run was sequenced
from experimentally mixed CHM1 and CHM13 cell lines. Both of them are homozygous
across the whole genome and have been \emph{de novo} assembled with SMRT reads
to high quality. This allowed us to construct an independent truth variant
data set
(\href{https://github.com/lh3/CHM-eval}{https://github.com/lh3/CHM-eval}) for
ERR1341796. In this evaluation, minimap2 has a higher SNP false negative rate
(FNR; 2.5\% for minimap2 vs 2.2\% for BWA-MEM), but fewer false positive SNPs per
million bases (FPPM; 3.0 vs 3.9), a lower INDEL FNR (7.3\% vs 7.5\%) and a similar
INDEL FPPM (both 1.0). The difference between the two mappers is much smaller
than that between BWA-MEM and Bowtie2.
\section{Conclusion}
Minimap2 is a fast, accurate and versatile aligner for long nucleotide
@ -540,11 +566,11 @@ alignment is an intricate research topic. More thorough evaluations would be
necessary to justify the use of minimap2 for such applications.
\section*{Acknowledgements}
We owe a debt of gratitude to Hajime Suzuki for releasing his masterpiece and
insightful notes before formal publication. We thank M. Schatz, P. Rescheneder
and F. Sedlazeck for pointing out the limitation of BWA-MEM. We are also
grateful to early minimap2 testers who have greatly helped to suggest features
and to fix various issues.
We owe a debt of gratitude to H. Suzuki and M. Kasahara for releasing their
masterpiece and insightful notes before formal publication. We thank M.
Schatz, P. Rescheneder and F. Sedlazeck for pointing out the limitation of
BWA-MEM. We are also grateful to early minimap2 testers who have greatly helped
to suggest features and to fix various issues.
\bibliography{minimap2}


@ -1,60 +1,62 @@
Q 60 18345673 8 0.000000436 18345673
Q 59 33966 4 0.000000653 18379639
Q 58 34178 1 0.000000706 18413817
Q 56 49138 1 0.000000758 18462955
Q 54 22442 4 0.000000974 18485397
Q 53 19070 2 0.000001081 18504467
Q 52 14169 3 0.000001242 18518636
Q 51 13233 4 0.000001457 18531869
Q 50 12133 2 0.000001564 18544002
Q 49 11138 4 0.000001778 18555140
Q 48 11174 8 0.000002208 18566314
Q 47 17139 4 0.000002422 18583453
Q 46 20428 10 0.000002956 18603881
Q 45 16503 3 0.000003115 18620384
Q 44 11933 6 0.000003435 18632317
Q 43 25392 11 0.000004020 18657709
Q 42 16734 9 0.000004498 18674443
Q 41 13826 10 0.000005030 18688269
Q 40 13023 10 0.000005561 18701292
Q 39 12686 10 0.000006092 18713978
Q 38 17275 4 0.000006300 18731253
Q 37 17241 2 0.000006401 18748494
Q 36 12458 12 0.000007036 18760952
Q 35 11981 5 0.000007298 18772933
Q 34 12004 11 0.000007879 18784937
Q 33 12111 7 0.000008246 18797048
Q 32 11782 9 0.000008719 18808830
Q 31 11811 7 0.000009086 18820641
Q 30 33507 32 0.000010767 18854148
Q 29 11243 21 0.000011874 18865391
Q 28 10779 17 0.000012767 18876170
Q 27 15733 24 0.000014027 18891903
Q 26 16762 40 0.000016130 18908665
Q 25 13811 49 0.000018708 18922476
Q 24 14141 46 0.000021123 18936617
Q 23 13429 55 0.000024010 18950046
Q 22 13116 26 0.000025365 18963162
Q 21 13436 46 0.000027771 18976598
Q 20 13441 55 0.000030648 18990039
Q 19 12988 53 0.000033416 19003027
Q 18 13353 51 0.000036074 19016380
Q 17 13782 77 0.000040094 19030162
Q 16 14065 94 0.000045001 19044227
Q 15 14044 124 0.000051474 19058271
Q 14 14714 140 0.000058774 19072985
Q 13 17459 197 0.000069040 19090444
Q 12 17339 259 0.000082532 19107783
Q 11 17381 280 0.000097097 19125164
Q 10 17732 295 0.000112418 19142896
Q 9 17959 416 0.000134023 19160855
Q 8 18234 530 0.000161530 19179089
Q 7 19048 514 0.000188143 19198137
Q 6 19722 656 0.000222085 19217859
Q 5 19753 775 0.000262143 19237612
Q 4 19818 1030 0.000315359 19257430
Q 3 17088 1100 0.000372149 19274518
Q 2 43045 6708 0.000718569 19317563
Q 1 126377 25255 0.002012761 19443940
Q 0 554357 372087 0.020562901 19998297
Q 60 18579866 27 0.000001453 18579866
Q 59 27087 4 0.000001666 18606953
Q 58 21435 1 0.000001718 18628388
Q 57 45663 3 0.000001874 18674051
Q 56 36031 2 0.000001978 18710082
Q 55 18499 2 0.000002082 18728581
Q 54 14754 2 0.000002187 18743335
Q 53 25541 2 0.000002291 18768876
Q 52 26397 5 0.000002554 18795273
Q 51 15090 3 0.000002711 18810363
Q 50 13425 11 0.000003294 18823788
Q 49 15175 2 0.000003397 18838963
Q 48 19407 4 0.000003606 18858370
Q 47 11538 16 0.000004452 18869908
Q 46 12558 17 0.000005349 18882466
Q 45 40362 28 0.000006817 18922828
Q 44 10465 13 0.000007500 18933293
Q 43 10098 20 0.000008552 18943391
Q 42 10682 19 0.000009549 18954073
Q 41 9823 11 0.000010125 18963896
Q 40 9685 16 0.000010963 18973581
Q 39 10273 18 0.000011905 18983854
Q 38 9515 18 0.000012847 18993369
Q 37 9474 27 0.000014261 19002843
Q 36 10430 25 0.000015568 19013273
Q 35 9241 34 0.000017348 19022514
Q 34 9162 31 0.000018968 19031676
Q 33 10164 49 0.000021532 19041840
Q 32 9152 55 0.000024408 19050992
Q 31 9252 35 0.000026233 19060244
Q 30 9872 55 0.000029103 19070116
Q 29 8938 65 0.000032496 19079054
Q 28 8951 73 0.000036306 19088005
Q 27 9949 95 0.000041261 19097954
Q 26 9784 97 0.000046316 19107738
Q 25 10126 97 0.000051366 19117864
Q 24 11260 123 0.000057765 19129124
Q 23 10047 114 0.000063691 19139171
Q 22 9661 123 0.000070083 19148832
Q 21 10339 168 0.000078813 19159171
Q 20 17928 193 0.000088804 19177099
Q 19 9842 193 0.000098817 19186941
Q 18 14737 247 0.000111605 19201678
Q 17 10218 238 0.000123934 19211896
Q 16 10271 242 0.000136457 19222167
Q 15 12241 333 0.000153683 19234408
Q 14 9189 336 0.000171070 19243597
Q 13 9493 515 0.000197734 19253090
Q 12 11502 743 0.000236185 19264592
Q 11 8211 507 0.000262390 19272803
Q 10 9133 606 0.000293695 19281936
Q 9 10014 931 0.000341801 19291950
Q 8 8436 698 0.000377816 19300386
Q 7 8443 705 0.000414163 19308829
Q 6 10203 944 0.000462808 19319032
Q 5 6936 756 0.000501760 19325968
Q 4 6732 843 0.000545190 19332700
Q 3 8215 1104 0.000602040 19340915
Q 2 21201 5440 0.000882342 19362116
Q 1 82328 22186 0.002019600 19444444
Q 0 553853 371953 0.020562901 19998297
U 1703


@ -1,9 +1,13 @@
Q 60 32226 0 0.000000000 32226
Q 20 267 1 0.000030776 32493
Q 10 34 1 0.000061487 32527
Q 9 118 1 0.000091898 32645
Q 5 27 2 0.000153036 32672
Q 4 68 2 0.000213806 32740
Q 1 314 101 0.003267381 33054
Q 60 32477 0 0.000000000 32477
Q 22 16 1 0.000030776 32493
Q 21 44 1 0.000061468 32537
Q 19 73 1 0.000091996 32610
Q 14 66 1 0.000122414 32676
Q 10 26 3 0.000214054 32702
Q 8 14 1 0.000244529 32716
Q 7 13 2 0.000305539 32729
Q 6 47 1 0.000335611 32776
Q 3 10 1 0.000366010 32786
Q 2 20 2 0.000426751 32806
Q 1 248 94 0.003267381 33054
Q 0 31 17 0.003778147 33085
U 3