.TH minimap2 1 "22 October 2017" "minimap2-2.2-dirty (r531)" "Bioinformatics tools"
.SH NAME
.PP
minimap2 - mapping and alignment between collections of DNA sequences
.SH SYNOPSIS
* Indexing the target sequences (optional):
.RS 4
minimap2
.RB [ -x
.IR preset ]
.B -d
.I target.mmi
.I target.fa
.br
minimap2
.RB [ -H ]
.RB [ -k
.IR kmer ]
.RB [ -w
.IR miniWinSize ]
.RB [ -I
.IR batchSize ]
.B -d
.I target.mmi
.I target.fa
.RE

* Long-read alignment with CIGAR:
.RS 4
minimap2
.B -a
.RB [ -x
.IR preset ]
.I target.mmi
.I query.fa
>
.I output.sam
.br
minimap2
.B -c
.RB [ -H ]
.RB [ -k
.IR kmer ]
.RB [ -w
.IR miniWinSize ]
.RB [ ... ]
.I target.fa
.I query.fa
>
.I output.paf
.RE

* Long-read overlap without CIGAR:
.RS 4
minimap2
.B -x
ava-ont
.RB [ -t
.IR nThreads ]
.I target.fa
.I query.fa
>
.I output.paf
.RE
.SH DESCRIPTION
.PP
Minimap2 is a fast sequence mapping and alignment program that can find
overlaps between long noisy reads, or map long reads or their assemblies to a
reference genome optionally with detailed alignment (i.e. CIGAR). At present,
it works efficiently with query sequences from a few kilobases to ~100
megabases in length at an error rate of ~15%. Minimap2 outputs in the PAF or the
SAM format.
.SH OPTIONS
.SS Indexing options
.TP 10
.BI -k \ INT
Minimizer k-mer length [15]
.TP
.BI -w \ INT
Minimizer window size [2/3 of k-mer length]. A minimizer is the smallest k-mer
in a window of w consecutive k-mers.
.TP
.B -H
Use homopolymer-compressed (HPC) minimizers. An HPC sequence is constructed by
contracting homopolymer runs to a single base. An HPC minimizer is a minimizer
on the HPC sequence.
.TP
.BI -I \ NUM
Load at most
.I NUM
target bases into RAM for indexing [4G]. If there are more than
.I NUM
bases in
.IR target.fa ,
minimap2 needs to read
.I query.fa
multiple times to map it against each batch of target sequences.
.I NUM
may end with k/K/m/M/g/G. NB: mapping quality is incorrect given a
multi-part index.
.TP
.BI -d \ FILE
Save the minimizer index of
.I target.fa
to
.I FILE
[no dump]. Minimap2 indexing is fast. It can index the human genome in a couple
of minutes. If even shorter startup time is desired, use this option to save
the index. Indexing options are fixed in the index file. When an index file is
provided as the target sequences, options
.BR -H ,
.BR -k ,
.BR -w ,
.B -I
will be effectively overridden by the options stored in the index file.
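.PP
The definitions of a minimizer and of an HPC sequence given above can be
illustrated with a minimal Python sketch. It is not minimap2's implementation:
the function names are hypothetical, and ranking k-mers lexicographically with
ties broken toward the leftmost k-mer is an assumption made for readability
(the real program may rank k-mers differently, for example by a hash).
.nf
```python
def hpc_compress(seq):
    """Contract each homopolymer run to a single base."""
    out = []
    for base in seq:
        if not out or out[-1] != base:
            out.append(base)
    return "".join(out)


def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs, one minimizer per window.

    Assumption: k-mers are compared lexicographically and ties go to the
    leftmost k-mer. This ordering is chosen only for illustration.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


print(hpc_compress("GATTTACCA"))
print(minimizers("GATTACA", 3, 2))
```
.fi
.PP
With these toy inputs the sketch prints GATACA, followed by the three
minimizers ATT, TAC and ACA at positions 1, 3 and 4. An HPC minimizer in the
sense of
.B -H
would be obtained by calling minimizers() on the output of hpc_compress().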
.SS Mapping options
.TP 10
.BI -f \ FLOAT
Ignore top
.I FLOAT
fraction of most frequent minimizers [0.0002]
.TP
.BI -g \ INT
Stop chain elongation if there are no minimizers in
.IR INT -bp
[10000].
.TP
.BI -r \ INT
Bandwidth used in chaining and DP-based alignment [500]. This option
approximately controls the maximum gap size.
.TP
.BI -n \ INT
Discard chains consisting of
.RI < INT
number of minimizers [3]
.TP
.BI -m \ INT
Discard chains with chaining score
.RI < INT
[40]. Chaining score equals the approximate number of matching bases minus a
concave gap penalty. It is computed with dynamic programming.
.TP
.B -X
Perform all-vs-all mapping. In this mode, if the query sequence name is
lexicographically larger than the target sequence name, the hits between them
will be suppressed; if the query sequence name is the same as the target name,
diagonal minimizer hits will also be suppressed.
.TP
.BI -p \ FLOAT
Minimal secondary-to-primary score ratio to output secondary mappings [0.8].
Between two chains overlapping over half of the shorter chain (controlled by
.BR --mask-level ),
the chain with a lower score is secondary to the chain with a higher score.
If the ratio of the scores is below
.IR FLOAT ,
the secondary chain will not be output or extended with DP alignment later.
.TP
.BI -N \ INT
Output at most
.I INT
secondary alignments [5]. This option has no effect when
.B -X
is applied.
.TP
.BI -G \ NUM
Maximum gap on the reference (effective with
.BR -xsplice / --splice ).
This option also changes the chaining and alignment band width to
.IR NUM .
Increasing this option slows down spliced alignment. [200k]
.TP
.BI -F \ NUM
Maximum fragment length (aka insert size; effective with
.BR -xsr / --frag )
[800]
.TP
.BI --max-chain-skip \ INT
A heuristic that stops chaining early [50]. Minimap2 uses dynamic programming
for chaining. The time complexity is quadratic in the number of seeds. This
option makes minimap2 exit the inner loop if it repeatedly sees seeds already
on chains. Set
.I INT
to a large number to switch off this heuristic.
.TP
.B --no-long-join
Disable the long gap patching heuristic. When this option is applied, the
maximum alignment gap is mostly controlled by
.BR -r .
.TP
.B --splice
Enable the splice alignment mode.
.TP
.B --sr
Enable short-read alignment heuristics. In the short-read mode, minimap2
applies a second round of chaining with a higher minimizer occurrence threshold
if no good chain is found. In addition, minimap2 attempts to patch gaps between
seeds with ungapped alignment.
.TP
.BR --frag [= no | yes ]
Whether to enable the fragment mode [no]
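.PP
The combined effect of
.B -p
and
.B -N
as described above can be sketched in Python as follows. This is only an
illustration under simplifying assumptions: keep_secondary() is a hypothetical
helper, chains are reduced to their scores, all secondary chains are assumed
to overlap the same primary chain, and the scores are made up.
.nf
```python
def keep_secondary(primary_score, secondary_scores, min_ratio=0.8, max_secondary=5):
    """Filter secondary chain scores following the text of -p and -N.

    Hypothetical helper for illustration; the defaults mirror the
    documented defaults [0.8] and [5].
    """
    # -p: drop chains whose secondary-to-primary score ratio is below min_ratio.
    passing = [s for s in secondary_scores if s / primary_score >= min_ratio]
    # -N: keep at most max_secondary chains, best scores first (an assumption).
    passing.sort(reverse=True)
    return passing[:max_secondary]


print(keep_secondary(100, [95, 79, 85, 60]))
```
.fi
.PP
With a primary score of 100, the chains scoring 79 and 60 fall below the
default ratio of 0.8 and are dropped, so the sketch prints [95, 85].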
.SS Alignment options
.TP 10
.BI -A \ INT
Matching score [2]
.TP
.BI -B \ INT
Mismatching penalty [4]
.TP
.BI -O \ INT1[,INT2]
Gap open penalty [4,24]. If
.I INT2
is not specified, it is set to
.IR INT1 .
.TP
.BI -E \ INT1[,INT2]
Gap extension penalty [2,1]. A gap of length
.I k
costs
.RI min{ O1 + k * E1 , O2 + k * E2 }.
In the splice mode, the second gap penalties are not used.
.TP
.BI -z \ INT
Break an alignment if the running score drops too quickly along the diagonal of
the DP matrix (diagonal X-drop, or Z-drop) [400]. Increasing the value improves
the contiguity of the alignment at the cost of poor alignment in the middle
(e.g. caused by a long inversion).
.TP
.BI -s \ INT
Minimal peak DP alignment score to output [40]. The peak score is computed from
the final CIGAR. It is the score of the max scoring segment in the alignment
and may be different from the total alignment score.
.TP
.BI -u \ CHAR
How to find canonical splicing sites GT-AG -
.BR f :
transcript strand;
.BR b :
both strands;
.BR n :
no attempt to match GT-AG [n]
.TP
.BI --cost-non-gt-ag \ INT
Cost of non-canonical splicing sites [0].
.TP
.BI --end-bonus \ INT
Score bonus when alignment extends to the end of the query sequence [10].
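.PP
The two-piece gap cost given under
.B -E
can be evaluated with a short Python sketch. The function name gap_cost() is
hypothetical; the default arguments are the documented defaults of
.B -O
[4,24] and
.B -E
[2,1].
.nf
```python
def gap_cost(k, gap_open=(4, 24), gap_ext=(2, 1)):
    """Cost of a gap of length k: min{O1 + k*E1, O2 + k*E2}."""
    o1, o2 = gap_open
    e1, e2 = gap_ext
    return min(o1 + k * e1, o2 + k * e2)


for k in (1, 10, 20, 30):
    print(k, gap_cost(k))
```
.fi
.PP
With the default penalties this prints costs of 6, 24, 44 and 54. The first
pair (O1, E1) is cheaper for gaps shorter than 20 bases, the two pairs tie at
exactly 20, and the second pair (O2, E2) is cheaper for longer gaps.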
.SS Input/output options
.TP 10
.B -a
Generate CIGAR and output alignments in the SAM format. Minimap2 outputs in PAF
by default.
.TP
.B -Q
Ignore base quality in the input file.
.TP
.B -L
Write CIGAR with >65535 operators at the CG tag. Older tools are unable to
convert alignments with >65535 CIGAR ops to BAM. This option makes minimap2 SAM
compatible with older tools. Newer tools recognize this tag and reconstruct
the real CIGAR in memory.
.TP
.BI -R \ STR
SAM read group line in a format like
.RB @RG\\\\tID:foo\\\\tSM:bar
[].
.TP
.B -c
Generate CIGAR. In PAF, the CIGAR is written to the `cg' custom tag.
.TP
.BI --cs[= STR ]
Output the
.B cs
tag.
.I STR
can be either
.I short
or
.IR long .
If no
.I STR
is given,
.I short
is assumed. [none]
.TP
.BI --seed \ INT
Integer seed for randomizing equally best hits. Minimap2 hashes
.I INT
and read name when choosing between equally best hits. [11]
.TP
.BI -t \ INT
Number of threads [3]. Minimap2 uses at most three threads when indexing target
sequences, and uses up to
.IR INT +1
threads when mapping (the extra thread is for I/O, which is frequently idle and
takes little CPU time).
.TP
.B -2
Use two I/O threads during mapping. By default, minimap2 uses one I/O thread.
When I/O is slow (e.g. piping to gzip, or reading from a slow pipe), the I/O
thread may become the bottleneck. Apply this option to use one thread for input
and another thread for output, at the cost of increased peak RAM.
.TP
.BI -K \ NUM
Number of bases loaded into memory to process in a mini-batch [500M].
Similar to option
.BR -I ,
K/M/G/k/m/g suffix is accepted. A large
.I NUM
helps load balancing in the multi-threading mode, at the cost of increased
memory.
.TP
.BR --secondary [= yes | no ]
Whether to output secondary alignments [yes]
.TP
.B --version
Print version number to stdout
.SS Preset options
.TP 10
.BI -x \ STR
Preset []. This option applies multiple options at the same time. It should be
applied before other options because options applied later will overwrite the
values set by
.BR -x .
Available
.I STR
are:
.RS
.TP 8
.B map-pb
PacBio/Oxford Nanopore read to reference mapping
.RB ( -Hk19 )
.TP
.B map-ont
Slightly more sensitive for Oxford Nanopore to reference mapping
.RB ( -k15 ).
For PacBio reads, HPC minimizers consistently lead to faster performance and
more sensitive results in comparison to normal minimizers. For Oxford Nanopore
data, normal minimizers are better, though not much. The effectiveness of HPC
is determined by the sequencing error mode.
.TP
.B asm5
Long assembly to reference mapping
.RB ( -k19
.B -w19 -A1 -B19 -O39,81 -E3,1 -s200
.BR -z200 ).
Typically, the alignment will not extend to regions with 5% or higher sequence
divergence. Only use this preset if the average divergence is far below 5%.
.TP
.B asm10
Long assembly to reference mapping
.RB ( -k19
.B -w19 -A1 -B9 -O16,41 -E2,1 -s200
.BR -z200 ).
Up to 10% sequence divergence.
.TP
.B ava-pb
PacBio all-vs-all overlap mapping
.RB ( -Hk19
.B -w5 -Xp0 -m100 -g10000 --max-chain-skip
.BR 25 ).
.TP
.B ava-ont
Oxford Nanopore all-vs-all overlap mapping
.RB ( -k15
.B -w5 -Xp0 -m100 -g10000 --max-chain-skip
.BR 25 ).
Similarly, the major difference from
.B ava-pb
is that this preset does not use HPC minimizers.
.TP
.B splice
Long-read spliced alignment
.RB ( -k15
.B -w5 --splice -g2000 -G200k -A1 -B2 -O2,32 -E1,0 -z200 -ub --cost-non-gt-ag
.BR 5 ).
In the splice mode, 1) long deletions are taken as introns and represented as
the
.RB ` N '
CIGAR operator; 2) long insertions are disabled; 3) deletion and insertion gap
costs are different during chaining; 4) the computation of the
.RB ` ms '
tag ignores introns to demote hits to pseudogenes.
.TP
.B sr
Short single-end reads without splicing
.RB ( -k21
.B -w11 --sr --frag -A2 -B8 -O12,32 -E2,1 -r50 -p.5 -N20 -f1000,5000 -n2 -m20
.B -s40 -g200 -2K50m
.BR --secondary=no ).
.RE
.SS Miscellaneous options
.TP 10
.B --no-kalloc
Use the libc default allocator instead of the kalloc thread-local allocator.
This debugging option is mostly used with Valgrind to detect invalid memory
accesses. Minimap2 runs slower with this option, especially in the
multi-threading mode.
.TP
.B --print-qname
Print query names to stderr, mostly to see which query is crashing minimap2.
.TP
.B --print-seeds
Print seed positions to stderr, for debugging only.
.SH OUTPUT FORMAT
.PP
Minimap2 outputs mapping positions in the Pairwise mApping Format (PAF) by
default. PAF is a TAB-delimited text format with each line consisting of at
least 12 fields as described in the following table:
.TS
center box tab(@);
cb | cb | cb
r | c | l .
Col@Type@Description
_
1@string@Query sequence name
2@int@Query sequence length
3@int@Query start coordinate (0-based)
4@int@Query end coordinate (0-based)
5@char@`+' if query/target on the same strand; `-' if opposite
6@string@Target sequence name
7@int@Target sequence length
8@int@Target start coordinate on the original strand
9@int@Target end coordinate on the original strand
10@int@Number of matching bases in the mapping
11@int@Number of bases, including gaps, in the mapping
12@int@Mapping quality (0-255 with 255 for missing)
.TE

.PP
When alignment is available, column 11 gives the total number of sequence
matches, mismatches and gaps in the alignment; column 10 divided by column 11
gives the BLAST-like alignment identity. When alignment is unavailable,
these two columns are approximate. PAF may optionally have additional fields in
the SAM-like typed key-value format. Minimap2 may output the following tags:
.TS
center box tab(@);
cb | cb | cb
r | c | l .
Tag@Type@Description
_
tp@A@Type of aln: P/primary, S/secondary and I/inversion
cm@i@Number of minimizers on the chain
s1@i@Chaining score
s2@i@Chaining score of the best secondary chain
NM@i@Total number of mismatches and gaps in the alignment
AS@i@DP alignment score
ms@i@DP score of the max scoring segment in the alignment
nn@i@Number of ambiguous bases in the alignment
ts@A@Transcript strand (splice mode only)
cg@Z@CIGAR string (only in PAF)
cs@Z@Difference string
.TE
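.PP
The column layout and the typed tags described above can be read with a few
lines of Python. The sketch below is an illustration, not part of minimap2:
parse_paf_line() and its dictionary keys are hypothetical names, and the
example record is made up rather than taken from real output.
.nf
```python
TAB = chr(9)  # the TAB character, spelled without an escape sequence


def parse_paf_line(line):
    """Split one PAF line into the 12 mandatory columns plus typed tags."""
    f = line.strip().split(TAB)
    rec = {
        "qname": f[0], "qlen": int(f[1]), "qstart": int(f[2]), "qend": int(f[3]),
        "strand": f[4],
        "tname": f[5], "tlen": int(f[6]), "tstart": int(f[7]), "tend": int(f[8]),
        "matches": int(f[9]), "block_len": int(f[10]), "mapq": int(f[11]),
    }
    tags = {}
    for field in f[12:]:
        # Optional fields look like KEY:TYPE:VALUE; only type i is converted.
        key, typ, value = field.split(":", 2)
        tags[key] = int(value) if typ == "i" else value
    rec["tags"] = tags
    # Column 10 divided by column 11: the BLAST-like identity.
    rec["identity"] = rec["matches"] / rec["block_len"]
    return rec


example = TAB.join([
    "read1", "1000", "10", "990", "+", "chr1", "50000", "2000", "2985",
    "900", "1000", "60", "tp:A:P", "cm:i:85", "NM:i:100",
])
rec = parse_paf_line(example)
print(rec["identity"], rec["tags"]["tp"], rec["tags"]["NM"])
```
.fi
.PP
For the made-up record the identity is 900/1000 = 0.9. As noted above, this
value is only approximate when the line was produced without base-level
alignment.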

.PP
The
.B cs
tag encodes difference sequences in the short form or the entire query
.I AND
reference sequences in the long form. It consists of a series of operations:
.TS
center box tab(@);
cb | cb | cb
r | l | l .
Op@Regex@Description
_
\&=@[ACGTN]+@Identical sequence (long form)
:@[0-9]+@Identical sequence length
*@[acgtn][acgtn]@Substitution: ref to query
+@[acgtn]+@Insertion to the reference
-@[acgtn]+@Deletion from the reference
~@[acgtn]{2}[0-9]+[acgtn]{2}@Intron length and splice signal
.TE
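.PP
The operations in the table above can be tokenized with one regular expression
built from the listed patterns. The following Python sketch is a hedged
illustration: parse_cs() and reference_span() are hypothetical helpers, the
example string is made up, and counting intron length toward the reference
span is an assumption based on introns being long deletions from the
reference.
.nf
```python
import re

# One alternative per operation in the table; no backslash escapes are needed.
CS_OP = re.compile(
    "(=[ACGTN]+|:[0-9]+|[*][acgtn][acgtn]|[+][acgtn]+|-[acgtn]+"
    "|~[acgtn]{2}[0-9]+[acgtn]{2})"
)


def parse_cs(cs):
    """Split a cs string into (operator, argument) pairs."""
    ops = []
    pos = 0
    while pos < len(cs):
        m = CS_OP.match(cs, pos)
        if m is None:
            raise ValueError("unexpected text in cs string at offset %d" % pos)
        ops.append((m.group(0)[0], m.group(0)[1:]))
        pos = m.end()
    return ops


def reference_span(ops):
    """Number of reference bases covered by the parsed operations."""
    span = 0
    for op, arg in ops:
        if op == ":":
            span += int(arg)
        elif op == "=" or op == "-":
            span += len(arg)
        elif op == "*":
            span += 1
        elif op == "~":
            span += int(arg[2:-2])  # digits between the two splice signals
        # "+" is an insertion to the reference and covers no reference bases
    return span


ops = parse_cs(":10*ag:5+tt:3-c:2")
print(ops)
print(reference_span(ops))
```
.fi
.PP
For the example string the sketch yields seven operations and a reference span
of 22 bases (10 + 1 + 5 + 0 + 3 + 1 + 2).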

.SH LIMITATIONS
.TP 2
*
Minimap2 may produce suboptimal alignments through long low-complexity regions
where seed positions may be suboptimal. This should not be a big concern
because even the optimal alignment may be wrong in such regions.
.TP
*
Minimap2 requires SSE2 instructions to compile. It is possible to add
non-SSE2 support, but it would make minimap2 several times slower.
.SH SEE ALSO
.PP
miniasm(1), minimap(1), bwa(1).