diff --git a/minimap2.1 b/minimap2.1 index f0c040a..1a64410 100644 --- a/minimap2.1 +++ b/minimap2.1 @@ -1,4 +1,4 @@ -.TH minimap2 1 "24 February 2018" "minimap2-2.9 (r720)" "Bioinformatics tools" +.TH minimap2 1 "15 March 2018" "minimap2-2.9-dirty (r746)" "Bioinformatics tools" .SH NAME .PP minimap2 - mapping and alignment between collections of DNA sequences @@ -123,10 +123,27 @@ provided as the target sequences, options will be effectively overridden by the options stored in the index file. .SS Mapping options .TP 10 -.BI -f \ FLOAT -Ignore top +.BI -f \ FLOAT | INT1 [, INT2 ] +If fraction, ignore top .I FLOAT -fraction of most frequent minimizers [0.0002] +fraction of most frequent minimizers [0.0002]. If integer, +ignore minimizers occuring more than +.I INT1 +times. +.I INT2 +is only effective in the +.B --sr +or +.B -xsr +mode, which sets the threshold for a second round of seeding. +.TP +.BI --min-occ-floor \ INT +Force minimap2 to always use k-mers occurring +.I INT +times or less [0]. In effect, the max occurrence threshold is set to +the +.RI max{ INT , +.BR -f }. .TP .BI -g \ INT Stop chain enlongation if there are no minimizers within @@ -217,14 +234,6 @@ on chains. Set .I INT to a large number to switch off this heurstics. .TP -.BI --min-occ-floor \ INT -Force minimap2 to always use k-mers occurring -.I INT -times or less [0]. In effect, the max occurrence threshold is set to -the -.RI max{ INT , -.BR -f }. -.TP .B --no-long-join Disable the long gap patching heuristic. When this option is applied, the maximum alignment gap is mostly controlled by diff --git a/tex/minimap2.bib b/tex/minimap2.bib index 948cb4b..0fb22a7 100644 --- a/tex/minimap2.bib +++ b/tex/minimap2.bib @@ -61,13 +61,6 @@ Volume = {32}, Year = {2016}} -@misc{Suzuki:2016, - title = {Fast and accurate alignment tool for PacBio and Nanopore long reads}, - author = {Hajime Suzuki}, - journal = {Unpublished}, - howpublished = {\href{https://github.com/ocxtal/minialign}{https://github.com/ocxtal/minialign}}, - year = {2016}} - @misc{Ruan:2016, title = {Ultra-fast de novo assembler using long noisy reads}, author = {Jue Ruan}, @@ -172,14 +165,6 @@ Volume = {29}, Year = {2011}} -@article {Suzuki130633, - author = {Suzuki, Hajime and Kasahara, Masahiro}, - title = {Acceleration Of Nucleotide Semi-Global Alignment With Adaptive Banded Dynamic Programming}, - year = {2017}, - note = {doi:10.1101/130633}, - publisher = {Cold Spring Harbor Labs Journals}, - journal = {bioRxiv}} - @article{Gotoh:1982aa, Author = {Gotoh, O}, Journal = {J Mol Biol}, @@ -337,3 +322,19 @@ Title = {{MUMmer4}: A fast and versatile genome alignment system}, Volume = {14}, Year = {2018}} + +@article{Li:2009ys, + Author = {Li, Heng and others}, + Journal = {Bioinformatics}, + Pages = {2078-9}, + Title = {The {Sequence Alignment/Map format and SAMtools}}, + Volume = {25}, + Year = {2009}} + +@article{Suzuki:2018aa, + Author = {Suzuki, Hajime and Kasahara, Masahiro}, + Journal = {BMC Bioinformatics}, + Pages = {45}, + Title = {Introducing difference recurrence relations for faster semi-global alignment of long sequences}, + Volume = {19}, + Year = {2018}} diff --git a/tex/minimap2.tex b/tex/minimap2.tex index 57dfcf7..d1013b3 100644 --- a/tex/minimap2.tex +++ b/tex/minimap2.tex @@ -19,7 +19,7 @@ \begin{document} \firstpage{1} -\title[Aligning nucleotide sequences with minimap2]{Minimap2: versatile pairwise alignment for nucleotide sequences} +\title[Aligning nucleotide sequences with minimap2]{Minimap2: pairwise alignment for nucleotide sequences} \author[Li]{Heng Li} \address{Broad Institute, 415 Main Street, Cambridge, MA 02142, USA} @@ -64,7 +64,7 @@ the thought that 10kb long sequences should be easier to map than 100bp reads because we can more effectively skip repetitive regions, which are often the bottleneck of short-read alignment. We confirmed our speculation by achieving approximate mapping 50 times faster than BWA-MEM~\citep{Li:2016aa}. -\citet{Suzuki130633} extended our work with a fast and novel algorithm on +\citet{Suzuki:2018aa} extended our work with a fast and novel algorithm on generating base-level alignment, which in turn inspired us to develop minimap2 with added functionality. @@ -88,7 +88,9 @@ the versatility of minimap2. Minimap2 follows a typical seed-chain-align procedure as is used by most full-genome aligners. It collects minimizers~\citep{Roberts:2004fv} of the -reference sequences and indexes them in a hash table. Then for each query +reference sequences and indexes them in a hash table, with the key being the +hash of a minimizer and the value being a list of locations of the minimizer +copies. Then for each query sequence, minimap2 takes query minimizers as \emph{seeds}, finds exact matches (i.e. \emph{anchors}) to the reference, and identifies sets of colinear anchors as \emph{chains}. If base-level alignment is requested, minimap2 applies dynamic @@ -118,9 +120,12 @@ distance between two anchors is too large); otherwise \begin{equation}\label{eq:chain-gap} \beta(j,i)=\gamma_c\big((y_i-y_j)-(x_i-x_j)\big) \end{equation} -In implementation, a gap of length $l\not=0$ costs +In implementation, a gap of length $l$ costs \[ -\gamma_c(l)=0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l| +\gamma_c(l)=\left\{\begin{array}{ll} +0.01\cdot \bar{w}\cdot|l|+0.5\log_2|l| & (l\not=0) \\ +0 & (l=0) +\end{array}\right. \] where $\bar{w}$ is the average seed length. For $N$ anchors, directly computing all $f(\cdot)$ with Eq.~(\ref{eq:chain}) takes $O(N^2)$ time. Although theoretically faster @@ -164,7 +169,7 @@ empirical formula: \[ {\rm mapQ}=40\cdot (1-f_2/f_1)\cdot\min\{1,m/10\}\cdot\log f_1 \] -where $m$ is the number of anchors on the primary chain, $f_1$ is the chaining +where $\log$ denotes natural logarithm, $m$ is the number of anchors on the primary chain, $f_1$ is the chaining score, and $f_2\le f_1$ is the score of the best chain that is secondary to the primary chain. Intuitively, a chain is assigned to a higher mapping quality if it is long and its best secondary chain is weak. @@ -253,7 +258,7 @@ performance of minimap2. Traditional SSE implementations~\citep{Farrar:2007hs} based on Eq.~(\ref{eq:ae86}) can achieve 16-way parallelization for short sequences, but only 4-way parallelization when the peak alignment score reaches 32767. Long sequence alignment may exceed this threshold. Inspired by -\citet{Wu:1996aa} and the following work, \citet{Suzuki130633} proposed a +\citet{Wu:1996aa} and the following work, \citet{Suzuki:2018aa} proposed a difference-based formulation that lifted this limitation. In case of 2-piece gap cost, define \[ @@ -320,7 +325,7 @@ y_{rt}&=&\max\{0,y_{r-1,t}+u_{r-1,t}-z_{rt}+q\}-q-e\\ \end{equation*} In this formulation, cells with the same diagonal index $r$ are independent of each other. This allows us to fully vectorize the computation of all cells on -the same anti-diagonal in one inner loop. It also simplifies banded alignment, +the same anti-diagonal in one inner loop. It also simplifies banded alignment (500bp band width by default), which would be difficult with striped vectorization~\citep{Farrar:2007hs}. On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the initial @@ -355,7 +360,7 @@ times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}. Without banding, our implementation is slower than Edlib~\citep{Sosic:2017aa}, but with a 1000bp band, it is considerably faster. When performing global alignment between anchors, we expect the alignment to stay close to the diagonal of the -DP matrix. Banding is applicable most of time. +DP matrix. Banding is applicable most of the time. \subsubsection{The Z-drop heuristic} @@ -478,7 +483,7 @@ both C and Python. It is distributed under the MIT license, free to both commercial and academic uses. Minimap2 uses the same base algorithm for all applications, but it has to apply different sets of parameters depending on input data types. Similar to BWA-MEM, minimap2 introduces `presets' that -modify multiple parameters with a simple invokation. Detailed settings +modify multiple parameters with a simple invocation. Detailed settings and command-line options can be found in the minimap2 manpage. In addition to the applications evaluated in the following sections, minimap2 also retains minimap's functionality to find overlaps between long reads and to search @@ -570,7 +575,7 @@ Peak RAM (GByte) & 8.9 & 14.5 & 3.2 & 29.2\vspace{1em}\\ \% approx. introns & 91.8\% & 96.9\% & 92.5\% & 82.4\% \\ \botrule \end{tabular} -}{Mouse reads (AC:SRR5286960; R9.4 chemistry) were mapped to the primary assembly of mouse +}{Mouse cDNA reads (AC:SRR5286960; R9.4 chemistry) were mapped to the primary assembly of mouse genome GRCm38 with the following tools and command options: minimap2 (`-ax splice'); GMAP (`-n 0 --min-intronlength 30 --cross-species'); SpAln (`-Q7 -LS -S3'); STARlong (according to @@ -579,7 +584,7 @@ compared to the EnsEMBL gene annotation, release 89. A predicted intron is \emph{novel} if it has no overlaps with any annotated introns. An intron is \emph{exact} if it is identical to an annotated intron. An intron is \emph{approximate} if both its 5'- and 3'-end are within 10bp around the ends -of an annotated intron.} +of an annotated intron. Chimeric alignments are defined in the SAM spec~\citep{Li:2009ys}.} \end{table} We next aligned real mouse reads~\citep{Byrne:2017aa} with GMAP~(v2017-06-20; @@ -687,7 +692,7 @@ involving $>$100kb introns, which was impractically slow ten years ago. The minimap2 chaining algorithm is fast and highly accurate by itself. In fact, chaining alone is more accurate than all the other long-read mappers in Fig.~\ref{fig:eval}a (data not shown). This accuracy helps to reduce downstream -base-level alignment of candidate chains, which is still times slower than +base-level alignment of candidate chains, which is still several times slower than chaining even with the Suzuki-Kasahara improvement. In addition, taking a general form, minimap2 chaining can be adapted to non-typical data types such as spliced reads and multiple reads per fragment. This gives us the opportunity to