|
|
||
|---|---|---|
| misc | ||
| test | ||
| tex | ||
| .gitignore | ||
| LICENSE.txt | ||
| Makefile | ||
| NEWS.md | ||
| README.md | ||
| align.c | ||
| bseq.c | ||
| bseq.h | ||
| chain.c | ||
| example.c | ||
| format.c | ||
| hit.c | ||
| index.c | ||
| kalloc.c | ||
| kalloc.h | ||
| kdq.h | ||
| khash.h | ||
| kseq.h | ||
| ksort.h | ||
| ksw2.h | ||
| ksw2_extd2_sse.c | ||
| ksw2_extz2_sse.c | ||
| ksw2_ll_sse.c | ||
| kthread.c | ||
| kthread.h | ||
| kvec.h | ||
| main.c | ||
| map.c | ||
| minimap.h | ||
| minimap2.1 | ||
| misc.c | ||
| mmpriv.h | ||
| sdust.c | ||
| sdust.h | ||
| sketch.c | ||
README.md
Getting Started
git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long reads against a reference genome
./minimap2 -ax map10k test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -x map10k -d MT-human.mmi test/MT-human.fa
./minimap2 -ax map10k MT-human.mmi test/MT-orang.fa > test.sam
# long-read overlap (no test data)
./minimap2 -x ava-pb your-reads.fa your-reads.fa > overlaps.paf
# man page
man ./minimap2.1
Introduction
Minimap2 is a fast sequence mapping and alignment program that can find overlaps between long noisy reads, or map long reads or their assemblies to a reference genome optionally with detailed alignment (i.e. CIGAR). At present, it works efficiently with query sequences from a few kilobases to ~100 megabases in length at an error rate ~15%. Minimap2 outputs in the PAF or the SAM format. On limited test data sets, minimap2 is over 20 times faster than most other long-read aligners. It will replace BWA-MEM for long reads and contig alignment.
Minimap2 is the successor of minimap. It uses a similar minimizer-based indexing and seeding algorithm, and improves the original minimap with homopolyer-compressed k-mers (see also SMARTdenovo and longISLND), better chaining and the ability to produce CIGAR with fast extension alignment (see also libgaba and ksw2) and piece-wise affine gap cost.
Installation
For modern x86-64 CPUs, just type make in the source code directory. This
will compile a binary minimap2 which you can copy to your desired location.
If you see compilation errors, try make sse2only=1 to disable SSE4. Minimap2
will run a little slower. At present, minimap2 does not work with non-x86 CPUs
or ancient CPUs that do not support SSE2. SSE2 is critical to the performance
of minimap2.
Algorithm Overview
In the following, minimap2 command line options have a dash ahead and are highlighted in bold.
-
Read -I [=4G] reference bases, extract (-k,-w)-minimizers and index them in a hash table.
-
Read -K [=200M] query bases. For each query sequence, do step 3 through 7:
-
For each (-k,-w)-minimizer on the query, check against the reference index. If a reference minimizer is not among the top -f [=2e-4] most frequent, collect its the occurrences in the reference, which are called seeds.
-
Sort seeds by position in the reference. Chain them with dynamic programming. Each chain represents a potential mapping. For read overlapping, report all chains and then go to step 8. For reference mapping, do step 5 through 7:
-
Let P be the set of primary mappings, which is an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in P by --mask-level [=0.5] or higher fraction of the shorter chain, mark the chain as secondary to the chain in P; otherwise, add the chain to P.
-
Retain all primary mappings. Also retain up to -N [=5] top secondary mappings if their chaining scores are higher than -p [=0.8] of their corresponding primary mappings.
-
If alignment is requested, filter out an internal seed if it potentially leads to both a long insertion and a long deletion. Extend from the left-most seed. Perform global alignments between internal seeds. Split the chain if the accumulative score along the global alignment drops by -z [=400], disregarding long gaps. Extend from the right-most seed. Output chains and their alignments.
-
If there are more query sequences in the input, go to step 2 until no more queries are left.
-
If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop.
Limitations
-
Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions.
-
Minimap2 does not work well with Illumina short reads as of now.
-
Minimap2 requires SSE2 instructions to compile. It is possible to add non-SSE2 support, but it would make minimap2 slower by several times.
In general, minimap2 is a young project with most code written since June, 2017. It may have bugs and room for improvements. Bug reports and suggestions are warmly welcomed.