minimap2

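An alignment tool for third-generation (long-read) sequencing, with some parallelization optimizations added on top of the original program.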


Heng Li cf55c84056 r460: added option --no-long-join		2017-10-04 12:08:44 -04:00
misc	eval script works with /[12] in SAM	2017-09-27 23:33:59 -04:00
python	Release minimap2-2.2 (r409)	2017-09-17 20:08:47 -04:00
test	test data for spliced alignment	2017-08-25 13:17:56 +08:00
tex	added GMAP iso-seq numbers	2017-09-20 23:54:02 -04:00
.gitignore	renamed mm2-lite.py to minimap2.py	2017-09-17 09:41:37 -04:00
.travis.yml	don't build for python-3.0 (unavailable in travis)	2017-09-17 17:07:42 -04:00
LICENSE.txt	added license	2017-07-01 11:39:19 -04:00
MANIFEST.in	this is embarrassing: rename again to mappy	2017-09-17 00:05:30 -04:00
Makefile	r440: better chain filtering for PE reads	2017-09-26 11:03:36 -04:00
NEWS.md	Release minimap2-2.2 (r409)	2017-09-17 20:08:47 -04:00
README.md	added github downloads counts	2017-09-20 10:11:58 -04:00
align.c	Merge branch 'master' into sr	2017-10-04 11:42:44 -04:00
bseq.c	multi-seg working on toy examples	2017-09-25 13:42:04 -04:00
bseq.h	multi-seg working on toy examples	2017-09-25 13:42:04 -04:00
chain.c	r439: use splice-like chain gap cost between segs	2017-09-25 16:04:38 -04:00
example.c	r369: updated example with the latest API	2017-09-14 22:44:10 -04:00
format.c	multi-seg working on toy examples	2017-09-25 13:42:04 -04:00
getopt.c	r368: API documentation	2017-09-14 22:23:04 -04:00
getopt.h	r315: added getopt from musl	2017-09-01 20:20:34 +08:00
hit.c	r451: changed rep_len mapq heuristic	2017-09-28 14:23:14 -04:00
index.c	two arrays should be freed with kfree(0,)	2017-09-23 10:43:22 -04:00
kalloc.c	r411: refactored kalloc for clarity	2017-09-18 19:49:15 -04:00
kalloc.h	r411: refactored kalloc for clarity	2017-09-18 19:49:15 -04:00
kdq.h	Better MSVC support	2017-09-03 11:05:55 -04:00
khash.h	index can be compiled; not tested yet	2017-04-07 15:30:30 -04:00
kseq.h	Homopolymer-compressed k-mer sketch	2017-04-06 15:37:34 -04:00
ksort.h	Better MSVC support	2017-09-03 11:05:55 -04:00
ksw2.h	r337: support CPU dispatch for gcc-4.8+	2017-09-03 14:29:49 -04:00
ksw2_dispatch.c	r339: improved SIMD detection	2017-09-05 13:10:30 -04:00
ksw2_extd2_sse.c	r338: portable CPU dispatch, which is the default	2017-09-03 20:29:24 -04:00
ksw2_exts2_sse.c	r338: portable CPU dispatch, which is the default	2017-09-03 20:29:24 -04:00
ksw2_extz2_sse.c	r338: portable CPU dispatch, which is the default	2017-09-03 20:29:24 -04:00
ksw2_ll_sse.c	r230: code formatting changes only	2017-07-30 12:31:40 -04:00
kthread.c	Better MSVC support	2017-09-03 11:05:55 -04:00
kthread.h	index can be compiled; not tested yet	2017-04-07 15:30:30 -04:00
kvec.h	Homopolymer-compressed k-mer sketch	2017-04-06 15:37:34 -04:00
main.c	r460: added option --no-long-join	2017-10-04 12:08:44 -04:00
map.c	r460: added option --no-long-join	2017-10-04 12:08:44 -04:00
minimap.h	r460: added option --no-long-join	2017-10-04 12:08:44 -04:00
minimap2.1	replaced --approx-ext with --sr	2017-09-20 10:51:18 -04:00
misc.c	minor tweaks to python	2017-09-16 18:11:43 -04:00
mmpriv.h	r447: paired-end mapping quality	2017-09-27 15:39:25 -04:00
pe.c	r450: differentiate exact repeats via mapq	2017-09-27 23:51:05 -04:00
sdust.c	for better windows compatibility	2017-09-02 17:52:33 -04:00
sdust.h	r188: renamed bseq* to mm_bseq*	2017-07-19 09:26:46 -04:00
setup.py	Release minimap2-2.2 (r409)	2017-09-17 20:08:47 -04:00
sketch.c	Better MSVC support	2017-09-03 11:05:55 -04:00

README.md

Getting Started

git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long reads against a reference genome
./minimap2 -ax map10k test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -x map10k -d MT-human.mmi test/MT-human.fa
./minimap2 -ax map10k MT-human.mmi test/MT-orang.fa > test.sam
# long-read overlap (no test data)
./minimap2 -x ava-pb your-reads.fa your-reads.fa > overlaps.paf
# spliced alignment (no test data)
./minimap2 -ax splice ref.fa rna-seq-reads.fa > spliced.sam
# man page
man ./minimap2.1

Introduction

Minimap2 is a fast sequence mapping and alignment program that can find overlaps between long noisy reads, or map long reads or their assemblies to a reference genome, optionally with detailed alignment (i.e. CIGAR). At present, it works efficiently with query sequences from a few kilobases to ~100 megabases in length at an error rate of ~15%. Minimap2 outputs in the PAF or the SAM format. On limited test data sets, minimap2 is over 20 times faster than most other long-read aligners. It will replace BWA-MEM for long reads and contig alignment.
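As a rough illustration of consuming the PAF output mentioned above, the following sketch parses one PAF line into named fields. It assumes the standard 12 mandatory tab-separated PAF columns (query name, length, start, end; strand; target name, length, start, end; number of matching bases; alignment block length; mapping quality). The names PafRecord, parse_paf_line and match_fraction are hypothetical helpers, not part of minimap2, and the example line is made up.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (optional tags are ignored here)."""
    query_name: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_len: int
    target_start: int
    target_end: int
    n_match: int      # number of matching bases
    block_len: int    # alignment block length
    mapq: int         # mapping quality


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-separated PAF line; raise ValueError if it is too short."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("expected at least 12 PAF columns, got %d" % len(f))
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


def match_fraction(rec: PafRecord) -> float:
    """Matching bases divided by alignment block length (0.0 if the block is empty)."""
    return rec.n_match / rec.block_len if rec.block_len else 0.0


# Made-up record for illustration only; not real minimap2 output.
example = "read1\t1000\t10\t990\t+\tchrM\t16569\t200\t1180\t900\t1000\t60\ttp:A:P"
rec = parse_paf_line(example)
print(rec.target_name, rec.strand, rec.mapq, match_fraction(rec))
```

To process a whole file such as overlaps.paf, the same function can be applied to each line of the opened file.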

Minimap2 is the successor of minimap. It uses a similar minimizer-based indexing and seeding algorithm, and improves the original minimap with homopolymer-compressed k-mers (see also SMARTdenovo and longISLND), better chaining and the ability to produce CIGAR with fast extension alignment (see also libgaba and ksw2) and piece-wise affine gap cost.
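To make the two seeding ideas named above concrete, here is a minimal sketch of homopolymer compression and of picking (k,w)-minimizers. It is a simplification under stated assumptions, not minimap2's implementation: k-mers are compared lexicographically rather than by a hash, the reverse strand is ignored, and the function names hpc and minimizers are hypothetical.

```python
from typing import List, Tuple


def hpc(seq: str) -> str:
    """Homopolymer-compress: collapse each run of identical bases to one base."""
    out: List[str] = []
    for base in seq:
        if not out or out[-1] != base:
            out.append(base)
    return "".join(out)


def minimizers(seq: str, k: int, w: int) -> List[Tuple[str, int]]:
    """Return (k-mer, position) of the smallest k-mer in every window of
    w consecutive k-mers. Ties go to the leftmost k-mer; a minimizer shared
    by adjacent windows is reported once.

    Assumption: plain lexicographic order on the forward strand only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: List[Tuple[str, int]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the leftmost smallest k-mer inside this window.
        offset = min(range(w), key=lambda j: (window[j], j))
        hit = (window[offset], start + offset)
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


print(hpc("GATTTACCA"))                  # GATACA
print(minimizers("GTCAGTCA", k=3, w=3))  # [('CAG', 2), ('AGT', 3)]
```

In the second call, four windows yield only two distinct minimizers, which is what keeps a minimizer index much smaller than an index of all k-mers.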

If you use minimap2 in your work, please consider citing:

Li, H. (2017). Minimap2: fast pairwise alignment for long DNA sequences. arXiv:1708.01492.

Installation

For modern x86-64 CPUs, just type make in the source code directory. This will compile a binary minimap2 which you can copy to your desired location. If you see compilation errors, try make sse2only=1 to disable SSE4. Minimap2 will run a little slower. At present, minimap2 does not work with non-x86 CPUs or ancient CPUs that do not support SSE2. SSE2 is critical to the performance of minimap2.

Algorithm Overview

In the following, minimap2 command-line options are written with a leading dash, with their default values given in square brackets.

1. Read -I [=4G] reference bases, extract (-k,-w)-minimizers and index them in a hash table.
2. Read -K [=200M] query bases. For each query sequence, do steps 3 through 7:
3. For each (-k,-w)-minimizer on the query, check against the reference index. If a reference minimizer is not among the top -f [=2e-4] most frequent, collect its occurrences in the reference, which are called seeds.
4. Sort seeds by position in the reference. Chain them with dynamic programming. Each chain represents a potential mapping. For read overlapping, report all chains and then go to step 8. For reference mapping, do steps 5 through 7:
5. Let P be the set of primary mappings, which is an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in P by --mask-level [=0.5] or higher fraction of the shorter chain, mark the chain as secondary to the chain in P; otherwise, add the chain to P.
6. Retain all primary mappings. Also retain up to -N [=5] top secondary mappings if their chaining scores are higher than -p [=0.8] of their corresponding primary mappings.
7. If alignment is requested, filter out an internal seed if it potentially leads to both a long insertion and a long deletion. Extend from the left-most seed. Perform global alignments between internal seeds. Split the chain if the accumulative score along the global alignment drops by -z [=400], disregarding long gaps. Extend from the right-most seed. Output chains and their alignments.
8. If there are more query sequences in the input, go to step 2 until no more queries are left.
9. If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop.
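The primary/secondary selection in the list above (the steps governed by --mask-level, -N and -p) can be sketched as follows. This is an illustrative reading of the text, not minimap2's code: the Chain type and function names are hypothetical, chains carry only their query interval and chaining score, and the -N limit is interpreted as a cap on the total number of secondary mappings kept per query.

```python
from typing import List, NamedTuple, Tuple


class Chain(NamedTuple):
    qstart: int  # start on the query (0-based, inclusive)
    qend: int    # end on the query (exclusive)
    score: int   # chaining score


def query_overlap_fraction(a: Chain, b: Chain) -> float:
    """Overlap on the query as a fraction of the shorter of the two chains."""
    overlap = min(a.qend, b.qend) - max(a.qstart, b.qstart)
    if overlap <= 0:
        return 0.0
    shorter = min(a.qend - a.qstart, b.qend - b.qstart)
    return overlap / shorter


def select_mappings(
    chains: List[Chain],
    mask_level: float = 0.5,   # plays the role of --mask-level
    max_secondary: int = 5,    # plays the role of -N (assumed to be a total cap)
    pri_ratio: float = 0.8,    # plays the role of -p
) -> Tuple[List[Chain], List[Tuple[Chain, Chain]]]:
    """Return (primaries, kept secondaries as (chain, parent primary) pairs)."""
    ordered = sorted(chains, key=lambda c: c.score, reverse=True)
    primaries: List[Chain] = []
    secondaries: List[Tuple[Chain, Chain]] = []
    for chain in ordered:
        # First primary that this chain overlaps by at least mask_level.
        parent = next(
            (p for p in primaries
             if query_overlap_fraction(chain, p) >= mask_level),
            None,
        )
        if parent is None:
            primaries.append(chain)
        else:
            secondaries.append((chain, parent))
    kept: List[Tuple[Chain, Chain]] = []
    for chain, parent in secondaries:  # already in descending score order
        if len(kept) >= max_secondary:
            break
        if chain.score > pri_ratio * parent.score:
            kept.append((chain, parent))
    return primaries, kept


chains = [Chain(0, 1000, 900), Chain(100, 900, 800),
          Chain(50, 950, 500), Chain(2000, 3000, 700)]
primaries, kept = select_mappings(chains)
print(primaries)
print(kept)
```

With these made-up chains, the chain scoring 800 overlaps the best chain and is kept as its secondary because 800 exceeds 0.8 × 900, the chain scoring 500 is dropped, and the non-overlapping chain scoring 700 becomes a second primary.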

Limitations

Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions.
Minimap2 does not work well with Illumina short reads as of now.
Minimap2 requires SSE2 instructions to compile. It is possible to add non-SSE2 support, but it would make minimap2 slower by several times.

In general, minimap2 is a young project with most code written since June 2017. It may have bugs and room for improvement. Bug reports and suggestions are warmly welcome.