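An alignment tool for third-generation sequencing data, with some parallelization optimizations added on top of the original program.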
 
 
 
 
 
 
Heng Li cf55c84056 r460: added option --no-long-join 2017-10-04 12:08:44 -04:00
misc eval script works with /[12] in SAM 2017-09-27 23:33:59 -04:00
python Release minimap2-2.2 (r409) 2017-09-17 20:08:47 -04:00
test test data for spliced alignment 2017-08-25 13:17:56 +08:00
tex added GMAP iso-seq numbers 2017-09-20 23:54:02 -04:00
.gitignore renamed mm2-lite.py to minimap2.py 2017-09-17 09:41:37 -04:00
.travis.yml don't build for python-3.0 (unavailable in travis) 2017-09-17 17:07:42 -04:00
LICENSE.txt added license 2017-07-01 11:39:19 -04:00
MANIFEST.in this is embarrassing: rename again to mappy 2017-09-17 00:05:30 -04:00
Makefile r440: better chain filtering for PE reads 2017-09-26 11:03:36 -04:00
NEWS.md Release minimap2-2.2 (r409) 2017-09-17 20:08:47 -04:00
README.md added github downloads counts 2017-09-20 10:11:58 -04:00
align.c Merge branch 'master' into sr 2017-10-04 11:42:44 -04:00
bseq.c multi-seg working on toy examples 2017-09-25 13:42:04 -04:00
bseq.h multi-seg working on toy examples 2017-09-25 13:42:04 -04:00
chain.c r439: use splice-like chain gap cost between segs 2017-09-25 16:04:38 -04:00
example.c r369: updated example with the latest API 2017-09-14 22:44:10 -04:00
format.c multi-seg working on toy examples 2017-09-25 13:42:04 -04:00
getopt.c r368: API documentation 2017-09-14 22:23:04 -04:00
getopt.h r315: added getopt from musl 2017-09-01 20:20:34 +08:00
hit.c r451: changed rep_len mapq heuristic 2017-09-28 14:23:14 -04:00
index.c two arrays should be freed with kfree(0,) 2017-09-23 10:43:22 -04:00
kalloc.c r411: refactored kalloc for clarity 2017-09-18 19:49:15 -04:00
kalloc.h r411: refactored kalloc for clarity 2017-09-18 19:49:15 -04:00
kdq.h Better MSVC support 2017-09-03 11:05:55 -04:00
khash.h index can be compiled; not tested yet 2017-04-07 15:30:30 -04:00
kseq.h Homopolymer-compressed k-mer sketch 2017-04-06 15:37:34 -04:00
ksort.h Better MSVC support 2017-09-03 11:05:55 -04:00
ksw2.h r337: support CPU dispatch for gcc-4.8+ 2017-09-03 14:29:49 -04:00
ksw2_dispatch.c r339: improved SIMD detection 2017-09-05 13:10:30 -04:00
ksw2_extd2_sse.c r338: portable CPU dispatch, which is the default 2017-09-03 20:29:24 -04:00
ksw2_exts2_sse.c r338: portable CPU dispatch, which is the default 2017-09-03 20:29:24 -04:00
ksw2_extz2_sse.c r338: portable CPU dispatch, which is the default 2017-09-03 20:29:24 -04:00
ksw2_ll_sse.c r230: code formatting changes only 2017-07-30 12:31:40 -04:00
kthread.c Better MSVC support 2017-09-03 11:05:55 -04:00
kthread.h index can be compiled; not tested yet 2017-04-07 15:30:30 -04:00
kvec.h Homopolymer-compressed k-mer sketch 2017-04-06 15:37:34 -04:00
main.c r460: added option --no-long-join 2017-10-04 12:08:44 -04:00
map.c r460: added option --no-long-join 2017-10-04 12:08:44 -04:00
minimap.h r460: added option --no-long-join 2017-10-04 12:08:44 -04:00
minimap2.1 replaced --approx-ext with --sr 2017-09-20 10:51:18 -04:00
misc.c minor tweaks to python 2017-09-16 18:11:43 -04:00
mmpriv.h r447: paired-end mapping quality 2017-09-27 15:39:25 -04:00
pe.c r450: differentiate exact repeats via mapq 2017-09-27 23:51:05 -04:00
sdust.c for better windows compatibility 2017-09-02 17:52:33 -04:00
sdust.h r188: renamed bseq* to mm_bseq* 2017-07-19 09:26:46 -04:00
setup.py Release minimap2-2.2 (r409) 2017-09-17 20:08:47 -04:00
sketch.c Better MSVC support 2017-09-03 11:05:55 -04:00

README.md


Getting Started

git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long reads against a reference genome
./minimap2 -ax map10k test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -x map10k -d MT-human.mmi test/MT-human.fa
./minimap2 -ax map10k MT-human.mmi test/MT-orang.fa > test.sam
# long-read overlap (no test data)
./minimap2 -x ava-pb your-reads.fa your-reads.fa > overlaps.paf
# spliced alignment (no test data)
./minimap2 -ax splice ref.fa rna-seq-reads.fa > spliced.sam
# man page
man ./minimap2.1

Introduction

Minimap2 is a fast sequence mapping and alignment program that can find overlaps between long noisy reads, or map long reads or their assemblies to a reference genome, optionally with detailed alignment (i.e. CIGAR). At present, it works efficiently with query sequences from a few kilobases to ~100 megabases in length at an error rate of ~15%. Minimap2 outputs in the PAF or the SAM format. On limited test data sets, minimap2 is over 20 times faster than most other long-read aligners. It will replace BWA-MEM for long reads and contig alignment.
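
For readers who want to post-process the PAF output mentioned above, the following is a minimal Python sketch that splits one PAF line into its 12 mandatory columns. The class name, field names, and the example record are illustrative assumptions and are not part of minimap2; optional SAM-like tags after column 12 are ignored.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (field names are illustrative)."""
    query_name: str
    query_len: int
    query_start: int   # 0-based, inclusive
    query_end: int     # 0-based, exclusive
    strand: str        # "+" or "-"
    target_name: str
    target_len: int
    target_start: int  # 0-based, inclusive
    target_end: int    # 0-based, exclusive
    n_match: int       # number of matching bases in the mapping
    block_len: int     # number of bases, including gaps, in the mapping
    mapq: int          # mapping quality (0-255; 255 means missing)


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-separated PAF line, ignoring any optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    return PafRecord(
        fields[0], int(fields[1]), int(fields[2]), int(fields[3]),
        fields[4],
        fields[5], int(fields[6]), int(fields[7]), int(fields[8]),
        int(fields[9]), int(fields[10]), int(fields[11]),
    )


# A made-up record, for illustration only.
example = "read1\t1000\t10\t990\t+\tchrM\t16569\t100\t1080\t900\t990\t60\ttp:A:P"
record = parse_paf_line(example)
```

The ratio `record.n_match / record.block_len` gives a rough per-mapping identity estimate.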

Minimap2 is the successor of minimap. It uses a similar minimizer-based indexing and seeding algorithm, and improves on the original minimap with homopolymer-compressed k-mers (see also SMARTdenovo and longISLND), better chaining, and the ability to produce CIGAR with fast extension alignment (see also libgaba and ksw2) and piece-wise affine gap cost.
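
To make the two seeding ideas above concrete, here is a small Python sketch of homopolymer compression and of (k,w)-minimizer selection. It is a simplification under stated assumptions, not minimap2's implementation: k-mers are compared lexicographically rather than by a hash, only the forward strand is considered, and the function names are hypothetical.

```python
from typing import List, Tuple


def hpc(seq: str) -> str:
    """Homopolymer-compress a sequence: collapse each run of one base."""
    out: List[str] = []
    for base in seq:
        if not out or out[-1] != base:
            out.append(base)
    return "".join(out)


def minimizers(seq: str, k: int, w: int) -> List[Tuple[str, int]]:
    """Return (k-mer, position) for the smallest k-mer in every window
    of w consecutive k-mers.

    Assumptions: lexicographic order, leftmost k-mer wins ties,
    forward strand only.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: List[Tuple[str, int]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: window[j])
        hit = (window[offset], start + offset)
        # Adjacent windows often select the same k-mer; record it once.
        if not picked or picked[-1] != hit:
            picked.append(hit)
    return picked


compressed = hpc("GATTACCA")            # runs "TT" and "CC" are collapsed
seeds = minimizers(compressed, k=3, w=2)
```

Because only one k-mer per window is kept, the index stores a fraction of all k-mers while any two sequences sharing a long enough exact match still share a minimizer.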

If you use minimap2 in your work, please consider citing:

Li, H. (2017). Minimap2: fast pairwise alignment for long DNA sequences. arXiv:1708.01492.

Installation

For modern x86-64 CPUs, just type make in the source code directory. This will compile a binary named minimap2, which you can copy to your desired location. If you see compilation errors, try make sse2only=1 to disable SSE4; minimap2 will then run a little slower. At present, minimap2 does not work with non-x86 CPUs or with ancient CPUs that do not support SSE2. SSE2 is critical to the performance of minimap2.

Algorithm Overview

In the following, minimap2 command-line options are prefixed with a dash, and their default values are given in square brackets.

  1. Read -I [=4G] reference bases, extract (-k,-w)-minimizers and index them in a hash table.

  2. Read -K [=200M] query bases. For each query sequence, do steps 3 through 7:

  3. For each (-k,-w)-minimizer on the query, check against the reference index. If a reference minimizer is not among the top -f [=2e-4] most frequent, collect its occurrences in the reference, which are called seeds.

  4. Sort seeds by position in the reference. Chain them with dynamic programming. Each chain represents a potential mapping. For read overlapping, report all chains and then go to step 8. For reference mapping, do steps 5 through 7:

  5. Let P be the set of primary mappings, which is an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in P by --mask-level [=0.5] or higher fraction of the shorter chain, mark the chain as secondary to the chain in P; otherwise, add the chain to P.

  6. Retain all primary mappings. Also retain up to -N [=5] top secondary mappings if their chaining scores are higher than -p [=0.8] of their corresponding primary mappings.

  7. If alignment is requested, filter out an internal seed if it potentially leads to both a long insertion and a long deletion. Extend from the left-most seed. Perform global alignments between internal seeds. Split the chain if the cumulative score along the global alignment drops by -z [=400], disregarding long gaps. Extend from the right-most seed. Output chains and their alignments.

  8. If there are more query sequences in the input, go to step 2 until no more queries are left.

  9. If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop.
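
Steps 5 and 6 above can be sketched in Python as follows. This is an illustration under assumptions rather than minimap2's code: a chain is reduced to its query interval and chaining score, the `Chain` type and function names are hypothetical, and the -N limit is assumed to apply to the read as a whole.

```python
from typing import List, NamedTuple


class Chain(NamedTuple):
    qstart: int  # start on the query, 0-based inclusive
    qend: int    # end on the query, exclusive
    score: int   # chaining score


def assign_parents(chains: List[Chain], mask_level: float = 0.5) -> List[int]:
    """Step 5: parent[i] == i marks chain i as primary; otherwise
    parent[i] is the index of the primary chain it is secondary to."""
    order = sorted(range(len(chains)),
                   key=lambda i: chains[i].score, reverse=True)
    primaries: List[int] = []
    parent = [-1] * len(chains)
    for i in order:
        c = chains[i]
        parent[i] = i
        for p in primaries:
            pc = chains[p]
            overlap = min(c.qend, pc.qend) - max(c.qstart, pc.qstart)
            shorter = min(c.qend - c.qstart, pc.qend - pc.qstart)
            if shorter > 0 and overlap >= mask_level * shorter:
                parent[i] = p  # secondary to an existing primary
                break
        if parent[i] == i:
            primaries.append(i)
    return parent


def select_chains(chains: List[Chain], parent: List[int],
                  best_n: int = 5, pri_ratio: float = 0.8) -> List[int]:
    """Step 6: keep all primaries plus up to best_n secondaries whose
    score exceeds pri_ratio times the score of their primary."""
    order = sorted(range(len(chains)),
                   key=lambda i: chains[i].score, reverse=True)
    kept: List[int] = []
    n_secondary = 0
    for i in order:
        if parent[i] == i:
            kept.append(i)
        elif (n_secondary < best_n
              and chains[i].score > pri_ratio * chains[parent[i]].score):
            kept.append(i)
            n_secondary += 1
    return kept


# Made-up chains for illustration.
chains = [Chain(0, 1000, 900), Chain(100, 900, 800),
          Chain(1200, 2000, 500), Chain(0, 1000, 300)]
parent = assign_parents(chains)
kept = select_chains(chains, parent)
```

In this made-up example the first and third chains cover different parts of the query, so both become primary; the second chain is kept as a secondary mapping, and the fourth is dropped because its score falls below 0.8 of its primary's score.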

Limitations

  • Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions.

  • Minimap2 does not work well with Illumina short reads as of now.

  • Minimap2 requires SSE2 instructions to compile. It is possible to add non-SSE2 support, but it would make minimap2 several times slower.

In general, minimap2 is a young project, with most of its code written since June 2017. It may have bugs and room for improvement. Bug reports and suggestions are warmly welcome.