## Getting Started
```sh
git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long reads against a reference genome
./minimap2 -ax map10k test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -x map10k -d MT-human.mmi test/MT-human.fa
./minimap2 -ax map10k MT-human.mmi test/MT-orang.fa > test.sam
# long-read overlap (no test data)
./minimap2 -x ava-pb your-reads.fa your-reads.fa > overlaps.paf
# man page
man ./minimap2.1
```
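The overlap command above writes its result in PAF, a tab-separated text format with twelve mandatory columns. As a rough illustration of how such output might be consumed downstream, here is a minimal Python sketch that parses one PAF line. The `PafRecord` type, the `parse_paf_line` helper and the example line are hypothetical names and made-up data for illustration, not part of minimap2.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The twelve mandatory PAF columns (hypothetical helper type)."""
    qname: str      # query sequence name
    qlen: int       # query sequence length
    qstart: int     # query start, 0-based
    qend: int       # query end
    strand: str     # "+" or "-"
    tname: str      # target sequence name
    tlen: int       # target sequence length
    tstart: int     # target start, 0-based
    tend: int       # target end
    matches: int    # number of matching bases
    block_len: int  # alignment block length
    mapq: int       # mapping quality


def parse_paf_line(line: str) -> PafRecord:
    """Parse the mandatory columns of one PAF line; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


# Made-up example line, not real minimap2 output.
example = "read1\t1000\t10\t990\t+\tread2\t1200\t100\t1080\t850\t980\t255"
rec = parse_paf_line(example)
print(rec.qname, rec.tname, rec.qend - rec.qstart, rec.mapq)
```

To process a whole file, the same helper could be applied to each line of `overlaps.paf` in a loop.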
## Introduction
Minimap2 is a fast sequence mapping and alignment program that can find
overlaps between long noisy reads, or map long reads or their assemblies to a
reference genome, optionally with detailed alignment (i.e. CIGAR). At present,
it works efficiently with query sequences from a few kilobases to ~100
megabases in length at an error rate of ~15%. Minimap2 outputs in the [PAF][paf] or
the [SAM format][sam]. On limited test data sets, minimap2 is over 20 times
faster than most other long-read aligners. It will replace BWA-MEM for
long-read and contig alignment.

Minimap2 is the successor of [minimap][minimap]. It uses a similar
minimizer-based indexing and seeding algorithm, and improves the original
minimap with homopolymer-compressed k-mers (see also [SMARTdenovo][smartdenovo]
and [longISLND][longislnd]), better chaining and the ability to produce CIGAR
with fast extension alignment (see also [libgaba][gaba] and [ksw2][ksw2]) and
piece-wise affine gap cost.
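To make the idea of minimizer-based indexing and seeding more concrete, the following Python sketch picks the smallest k-mer in every window of `w` consecutive k-mers, stores the picks in a hash table, and looks up query minimizers to collect seeds. It is a toy under stated assumptions: k-mers are ordered lexicographically, only the forward strand is considered, and all function names (`hpc`, `minimizers`, `build_index`, `find_seeds`) are hypothetical. This README does not describe minimap2's own ordering or strand handling, so none of this should be read as the actual implementation.

```python
from collections import defaultdict


def hpc(seq):
    """Homopolymer-compress: collapse each run of identical bases to one base."""
    out = []
    for base in seq:
        if not out or out[-1] != base:
            out.append(base)
    return "".join(out)


def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs: the smallest k-mer of each window.

    Assumption: lexicographic order, ties broken by the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def build_index(ref, k, w):
    """Map each reference minimizer to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, kmer in minimizers(ref, k, w):
        index[kmer].append(pos)
    return index


def find_seeds(index, query, k, w):
    """Collect (reference position, query position) pairs for shared minimizers."""
    return [(rpos, qpos)
            for qpos, kmer in minimizers(query, k, w)
            for rpos in index.get(kmer, [])]


ref, query = "ACGTTGCA", "CGTTGC"
seeds = find_seeds(build_index(ref, 3, 2), query, 3, 2)
print(hpc("GATTTACAAA"), seeds)
```

In this toy input every seed lies on the same diagonal (reference position minus query position), which is the kind of colinear signal a chaining step can exploit.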
## Installation
For modern x86-64 CPUs, just type `make` in the source code directory. This
will compile a binary `minimap2` which you can copy to your desired location.
If you see compilation errors, try `make sse2only=1` to disable SSE4; minimap2
will then run a little slower. At present, minimap2 does not work with non-x86
CPUs or ancient CPUs that do not support SSE2. SSE2 is critical to the
performance of minimap2.
## Algorithm Overview
In the following, minimap2 command-line options are prefixed with a dash and
highlighted in bold, with default values shown in brackets.
1. Read **-I** [=*4G*] reference bases, extract (**-k**,**-w**)-minimizers and
index them in a hash table.
2. Read **-K** [=*200M*] query bases. For each query sequence, do steps 3
through 7:
3. For each (**-k**,**-w**)-minimizer on the query, check against the reference
index. If a reference minimizer is not among the top **-f** [=*2e-4*] most
frequent, collect its occurrences in the reference, which are called
*seeds*.
4. Sort seeds by position in the reference. Chain them with dynamic
programming. Each chain represents a potential mapping. For read
overlapping, report all chains and then go to step 8. For reference mapping,
do steps 5 through 7:
5. Let *P* be the set of primary mappings, which is an empty set initially. For
each chain from the best to the worst according to their chaining scores: if
on the query, the chain overlaps with a chain in *P* by **--mask-level**
[=*0.5*] or higher fraction of the shorter chain, mark the chain as
*secondary* to the chain in *P*; otherwise, add the chain to *P*.
6. Retain all primary mappings. Also retain up to **-N** [=*5*] top secondary
mappings if their chaining scores are higher than **-p** [=*0.8*] of their
corresponding primary mappings.
7. If alignment is requested, filter out an internal seed if it potentially
leads to both a long insertion and a long deletion. Extend from the
left-most seed. Perform global alignments between internal seeds. Split the
chain if the accumulative score along the global alignment drops by **-z**
[=*400*], disregarding long gaps. Extend from the right-most seed. Output
chains and their alignments.
8. If there are more query sequences in the input, go to step 2 until no more
queries are left.
9. If there are more reference sequences, reopen the query file from the start
and go to step 1; otherwise stop.
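Steps 5 and 6 above can be sketched in a few lines of Python. This is a minimal illustration, not minimap2's code: the `Chain` type and `select_mappings` function are hypothetical, a chain is reduced to its query interval and chaining score, and the sketch assumes that the **-N** cap applies to the query as a whole and that "higher than" means a strict comparison.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chain:
    qstart: int  # start of the chain on the query
    qend: int    # end of the chain on the query
    score: int   # chaining score


def select_mappings(chains: List[Chain], mask_level: float = 0.5,
                    max_secondary: int = 5,
                    min_ratio: float = 0.8) -> Tuple[List[int], List[int]]:
    """Return (primary indices, retained secondary indices), best score first."""
    order = sorted(range(len(chains)), key=lambda i: -chains[i].score)
    primaries: List[int] = []
    parent: Dict[int, int] = {}  # secondary index -> index of its primary

    # Step 5: a chain is secondary if it overlaps a primary chain by at least
    # mask_level of the shorter of the two chains; otherwise it becomes primary.
    for i in order:
        c = chains[i]
        for p in primaries:
            pc = chains[p]
            overlap = min(c.qend, pc.qend) - max(c.qstart, pc.qstart)
            shorter = min(c.qend - c.qstart, pc.qend - pc.qstart)
            if shorter > 0 and overlap >= mask_level * shorter:
                parent[i] = p
                break
        else:
            primaries.append(i)

    # Step 6: keep up to max_secondary secondaries whose score exceeds
    # min_ratio times the score of their primary (assumed strict comparison).
    kept: List[int] = []
    for i in order:
        if i in parent and len(kept) < max_secondary:
            if chains[i].score > min_ratio * chains[parent[i]].score:
                kept.append(i)
    return primaries, kept


# Made-up chains for illustration only.
chains = [Chain(0, 1000, 900), Chain(100, 900, 800),
          Chain(1200, 2000, 500), Chain(0, 1000, 300)]
primaries, secondaries = select_mappings(chains)
print(primaries, secondaries)
```

With these made-up chains, the second chain is kept as a secondary of the first, the third is a separate primary because it does not overlap the first on the query, and the fourth is dropped because its score is too low relative to its primary.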
## Limitations
* At the alignment phase, minimap2 performs global alignments between minimizer
hits. If the positions of these minimizer hits are incorrect, the final
alignment may be suboptimal or unnecessarily fragmented. This should happen
rarely with the latest version.
* Minimap2 may produce poor alignments that may need post-filtering. We are
still exploring a reliable and consistent way to report good alignments.
* Minimap2 does not work well with Illumina short reads as of now.
* Minimap2 requires SSE2 instructions to compile. It is possible to add
non-SSE2 support, but it would make minimap2 several times slower.

In general, minimap2 is a young project with most code written since June 2017.
It may have bugs and room for improvement. Bug reports and suggestions are
warmly welcome.

[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
[sam]: https://samtools.github.io/hts-specs/SAMv1.pdf
[minimap]: https://github.com/lh3/minimap
[smartdenovo]: https://github.com/ruanjue/smartdenovo
[longislnd]: https://www.ncbi.nlm.nih.gov/pubmed/27667791
[gaba]: https://github.com/ocxtal/libgaba
[ksw2]: https://github.com/lh3/ksw2