revamped README; not finished yet
parent c6384ed2c8
commit 4f6244bd4a
143
README.md
@ -5,7 +5,7 @@
|
|||
[](LICENSE.txt)
|
||||
[](https://travis-ci.org/lh3/minimap2)
|
||||
[](https://github.com/lh3/minimap2/releases)
|
||||
## Getting Started
|
||||
## <a name="started"></a>Getting Started
|
||||
```sh
|
||||
git clone https://github.com/lh3/minimap2
|
||||
cd minimap2 && make
|
||||
|
|
@ -21,39 +21,118 @@ cd minimap2 && make
|
|||
# man page
|
||||
man ./minimap2.1
|
||||
```
|
||||
## Table of Contents
|
||||
|
||||
## Introduction
|
||||
- [Getting Started](#started)
|
||||
- [Users' Guide](#uguide)
|
||||
- [Installation](#install)
|
||||
- [General usage](#general)
|
||||
- [Use cases](#cases)
|
||||
- [Map long noisy genomic reads](#map-long-genomic)
|
||||
- [Map long mRNA/cDNA reads](#map-long-splice)
|
||||
- [Algorithm overview](#algo)
|
||||
- [Cite minimap2](#cite)
|
||||
- [Limitations](#limit)
|
||||
|
||||
Minimap2 is a fast sequence mapping and alignment program that can find
overlaps between long noisy reads, or map long reads or their assemblies to a
reference genome optionally with detailed alignment (i.e. CIGAR). At present,
it works efficiently with query sequences from a few kilobases to ~100
megabases in length at an error rate of ~15%. Minimap2 outputs in the [PAF][paf] or
the [SAM format][sam]. On limited test data sets, minimap2 is over 20 times
faster than most other long-read aligners. It will replace BWA-MEM for long
reads and contig alignment.
|
||||
## <a name="uguide"></a>Users' Guide
|
||||
|
||||
Minimap2 is the successor of [minimap][minimap]. It uses a similar
minimizer-based indexing and seeding algorithm, and improves the original
minimap with homopolymer-compressed k-mers (see also [SMARTdenovo][smartdenovo]
and [longISLND][longislnd]), better chaining and the ability to produce CIGAR
with fast extension alignment (see also [libgaba][gaba] and [ksw2][ksw2]) and
piece-wise affine gap cost.
|
||||
Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA
sequences against a large reference database. Typical use cases include: (1)
mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2)
finding overlaps between long reads with error rate up to ~15%; (3)
splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads
against a reference genome; (4) aligning Illumina single- or paired-end reads;
(5) assembly-to-assembly alignment; (6) full-genome alignment between two
closely related species with divergence below ~15%.
|
||||
|
||||
If you use minimap2 in your work, please consider citing:
|
||||
For ~10kb noisy read sequences, minimap2 is tens of times faster than
mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more
accurate on simulated long reads and produces biologically meaningful alignments
ready for downstream analyses. For >100bp Illumina short reads, minimap2 is
three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data.
Detailed evaluations are available from the [minimap2 preprint][preprint].
|
||||
|
||||
> Li, H. (2017). Minimap2: fast pairwise alignment for long DNA sequences. [arXiv:1708.01492](https://arxiv.org/abs/1708.01492).
|
||||
### <a name="install"></a>Installation
|
||||
|
||||
## Installation
|
||||
Minimap2 only works on x86-64 CPUs. You can acquire precompiled binaries from
|
||||
the [release page][release]. For example, with:
|
||||
```sh
|
||||
wget --no-check-certificate -O- https://github.com/lh3/minimap2/releases/download/v2.2/minimap2-2.2_x64-linux.tar.bz2 \
|
||||
| tar -jxvf -
|
||||
./minimap2-2.2_x64-linux/minimap2
|
||||
```
|
||||
If you want to compile from the source, you need to have a C compiler, GNU make
and zlib development files installed. Just type `make` in the source code
directory to compile. If you see compilation errors, try `make sse2only=1`
to disable SSE4 code, which will make minimap2 slightly slower.
|
||||
|
||||
For modern x86-64 CPUs, just type `make` in the source code directory. This
will compile a binary `minimap2` which you can copy to your desired location.
If you see compilation errors, try `make sse2only=1` to disable SSE4. Minimap2
will run a little slower. At present, minimap2 does not work with non-x86 CPUs
or ancient CPUs that do not support SSE2. SSE2 is critical to the performance
of minimap2.
|
||||
### <a name="general"></a>General usage
|
||||
|
||||
## Algorithm Overview
|
||||
In the simplest form, minimap2 takes a reference database and a query sequence
file as input and produces an approximate mapping, without base-level alignment
(i.e. no CIGAR), in the [PAF format][paf]:
|
||||
```sh
|
||||
minimap2 ref.fa reads.fq > approx-mapping.paf
|
||||
```
|
||||
You can ask minimap2 to generate CIGAR at the `cg` tag of PAF with:
|
||||
```sh
|
||||
minimap2 -c ref.fa reads.fq > alignment.paf
|
||||
```
|
||||
or to output alignments in the [SAM format][sam]:
|
||||
```sh
|
||||
minimap2 -a ref.fa reads.fq > alignment.sam
|
||||
```
|
||||
Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You
|
||||
don't need to convert between FASTA and FASTQ or decompress gzip'd files first.
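
As a minimal sketch of how the PAF output could be consumed downstream, the Python below splits one PAF line into its 12 mandatory columns and collects any trailing `TAG:TYPE:VALUE` fields such as `cg`. The names `PafRecord` and `parse_paf_line` and the sample line are hypothetical illustrations and are not part of minimap2.

```python
from typing import Dict, NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns plus optional tags (hypothetical container)."""
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    n_match: int
    block_len: int
    mapq: int
    tags: Dict[str, str]


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-separated PAF line into a PafRecord."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    tags = {}
    for field in f[12:]:
        # Optional fields look like TAG:TYPE:VALUE, e.g. cg:Z:980M
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]), tags)


# Made-up example line, for illustration only.
sample = "read1\t1000\t10\t990\t+\tchr1\t50000\t2000\t2980\t900\t990\t60\tcg:Z:980M"
rec = parse_paf_line(sample)
```

Without **-c** the `cg` tag is simply absent, so code using `rec.tags` should treat it as optional.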
|
||||
|
||||
For the human reference genome, minimap2 takes a few minutes to generate a
|
||||
minimizer index for the reference before mapping. To reduce indexing time, you
|
||||
can optionally save the index with option **-d** and replace the reference
|
||||
sequence file with the index file on the minimap2 command line:
|
||||
```sh
|
||||
minimap2 -d ref.mmi ref.fa # indexing
|
||||
minimap2 -a ref.mmi reads.fq > alignment.sam # alignment
|
||||
```
|
||||
***Importantly***, once you build the index, indexing
parameters such as **-k**, **-w**, **-H** and **-I** can't be changed during
mapping. If you are running minimap2 for different data types, you will
probably need to keep multiple indexes generated with different parameters.
This makes minimap2 different from BWA, which always uses the same index regardless
of query data types.
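
One way to keep several indexes apart is to encode the indexing parameters in the index file name. The sketch below only builds the `minimap2 -d` command line as an argument list. The helper `index_command`, the file naming scheme and the k/w values are assumptions made for illustration, not recommended settings.

```python
from typing import List


def index_command(ref: str, k: int, w: int, hpc: bool) -> List[str]:
    """Build a `minimap2 -d` command whose index file name records k, w and HPC."""
    suffix = f"k{k}w{w}" + ("hpc" if hpc else "")
    index_path = f"{ref}.{suffix}.mmi"  # hypothetical naming scheme
    cmd = ["minimap2", "-k", str(k), "-w", str(w)]
    if hpc:
        cmd.append("-H")  # homopolymer-compressed minimizers
    cmd += ["-d", index_path, ref]
    return cmd


# The k/w values below are placeholders, not recommended settings.
cmd_plain = index_command("ref.fa", 15, 10, hpc=False)
cmd_hpc = index_command("ref.fa", 19, 10, hpc=True)
```

Each list could be passed to `subprocess.run` on a machine where `minimap2` is installed.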
|
||||
|
||||
### <a name="cases"></a>Use cases
|
||||
|
||||
Minimap2 uses the same base algorithm for all applications. However, due to the
dramatically different data types (e.g. short vs long reads; DNA vs mRNA reads) it
supports, minimap2 needs to be tuned for optimal performance and accuracy.
You should usually choose a preset with option **-x**, which sets multiple
parameters at the same time.
|
||||
|
||||
#### <a name="map-long-genomic"></a>Map long noisy genomic reads
|
||||
|
||||
```sh
|
||||
minimap2 -ax map-pb ref.fa pacbio-reads.fq > aln.sam # for PacBio subreads
|
||||
minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam # for Oxford Nanopore reads
|
||||
```
|
||||
The difference between `map-pb` and `map-ont` is that `map-pb` uses
homopolymer-compressed (HPC) minimizers as seeds, while `map-ont` uses normal
minimizers as seeds. Empirical evaluation shows that HPC minimizers improve
performance and sensitivity when aligning PacBio reads, but hurt when aligning
Nanopore reads.
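
To illustrate what homopolymer compression means, the Python sketch below collapses every run of identical bases into one base. It shows the idea only and is not minimap2's implementation. The function name `hpc_compress` and the example sequences are hypothetical.

```python
from itertools import groupby


def hpc_compress(seq: str) -> str:
    """Collapse every run of identical bases into a single base."""
    return "".join(base for base, _run in groupby(seq))


a = "GATTTACCA"    # three Ts
b = "GATTTTTACCA"  # five Ts: the same sequence with a homopolymer length error

compressed_a = hpc_compress(a)
compressed_b = hpc_compress(b)
```

Both inputs compress to the same string, so seeds drawn from the compressed sequence are unaffected by errors in homopolymer run length.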
|
||||
|
||||
#### <a name="map-long-splice"></a>Map long mRNA/cDNA reads
|
||||
|
||||
```sh
|
||||
minimap2 -ax splice ref.fa spliced.fq > aln.sam # strand unknown
|
||||
minimap2 -ax splice -uf ref.fa spliced.fq > aln.sam # assuming transcript strand
|
||||
```
|
||||
This command line has been tested on PacBio Iso-Seq reads and Nanopore 2D cDNA
reads, and has been shown by others to work with Nanopore 1D Direct RNA reads. Like
typical RNA-seq mappers, minimap2 represents an intron with the `N` CIGAR
operator. For spliced reads, minimap2 will try to infer the strand relative to
the transcript and may write the strand to the `ts` SAM/PAF tag.
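
Because introns appear as `N` operations, their reference coordinates can be recovered by walking the CIGAR string. The following is a minimal Python sketch under the standard SAM convention that `M`, `D`, `N`, `=` and `X` advance the reference position. The function name `introns_from_cigar` and the example CIGAR are hypothetical.

```python
import re
from typing import List, Tuple

# Operators that advance the position on the reference, per the SAM specification.
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def introns_from_cigar(ref_start: int, cigar: str) -> List[Tuple[int, int]]:
    """Return 0-based half-open reference intervals covered by `N` operations."""
    introns = []
    pos = ref_start
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "N":
            introns.append((pos, pos + n))
        if op in REF_CONSUMING:
            pos += n
    return introns


# Made-up alignment starting at reference position 1000 (0-based).
found = introns_from_cigar(1000, "50M200N30M2I20M1000N40M")
```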
|
||||
|
||||
### <a name="algo"></a>Algorithm overview
|
||||
|
||||
In the following, minimap2 command line options have a dash ahead and are
|
||||
highlighted in bold.
|
||||
|
|
@ -97,14 +176,18 @@ highlighted in bold.
|
|||
9. If there are more reference sequences, reopen the query file from the start
|
||||
and go to step 1; otherwise stop.
|
||||
|
||||
## Limitations
|
||||
### <a name="cite"></a>Cite minimap2
|
||||
|
||||
If you use minimap2 in your work, please consider citing:
|
||||
|
||||
> Li, H. (2017). Minimap2: fast pairwise alignment for long nucleotide sequences. [arXiv:1708.01492][preprint]
|
||||
|
||||
## <a name="limit"></a>Limitations
|
||||
|
||||
* Minimap2 may produce suboptimal alignments through long low-complexity
|
||||
regions where seed positions may be suboptimal. This should not be a big
|
||||
concern because even the optimal alignment may be wrong in such regions.
|
||||
|
||||
* Minimap2 does not work well with Illumina short reads as of now.
|
||||
|
||||
* Minimap2 requires SSE2 instructions to compile. It is possible to add
|
||||
non-SSE2 support, but it would make minimap2 slower by several times.
|
||||
|
||||
|
|
@ -121,3 +204,5 @@ warmly welcomed.
|
|||
[longislnd]: https://www.ncbi.nlm.nih.gov/pubmed/27667791
|
||||
[gaba]: https://github.com/ocxtal/libgaba
|
||||
[ksw2]: https://github.com/lh3/ksw2
|
||||
[preprint]: https://arxiv.org/abs/1708.01492
|
||||
[release]: https://github.com/lh3/minimap2/releases
|
||||
|
|
|
|||
2
index.c
@ -478,7 +478,7 @@ mm_idx_t *mm_idx_reader_read(mm_idx_reader_t *r, int n_threads)
|
|||
if (r->is_idx) {
|
||||
mi = mm_idx_load(r->fp.idx);
|
||||
if (mi && mm_verbose >= 2 && (mi->k != r->opt.k || mi->w != r->opt.w || mi->is_hpc != r->opt.is_hpc))
|
||||
fprintf(stderr, "[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.\n");
|
||||
fprintf(stderr, "[WARNING]\033[1;31m Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.\033[0m\n");
|
||||
} else
|
||||
mi = mm_idx_gen(r->fp.seq, r->opt.w, r->opt.k, r->opt.bucket_bits, r->opt.is_hpc, r->opt.mini_batch_size, n_threads, r->opt.batch_size, 1);
|
||||
if (mi) {
|
||||