diff --git a/README.md b/README.md index e5b459e..7c64212 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ [![License](https://img.shields.io/badge/License-MIT-blue.svg?style=flat)](LICENSE.txt) [![Build Status](https://travis-ci.org/lh3/minimap2.svg?branch=master)](https://travis-ci.org/lh3/minimap2) [![Downloads](https://img.shields.io/github/downloads/lh3/minimap2/total.svg?style=flat)](https://github.com/lh3/minimap2/releases) -## Getting Started +## Getting Started ```sh git clone https://github.com/lh3/minimap2 cd minimap2 && make @@ -21,39 +21,118 @@ cd minimap2 && make # man page man ./minimap2.1 ``` +## Table of Contents -## Introduction +- [Getting Started](#started) +- [Users' Guide](#uguide) + - [Installation](#install) + - [General usage](#general) + - [Use cases](#cases) + - [Map long noisy genomic reads](#map-long-genomic) + - [Map long mRNA/cDNA reads](#map-long-splice) + - [Algorithm overview](#algo) + - [Cite minimap2](#cite) +- [Limitations](#limit) -Minimap2 is a fast sequence mapping and alignment program that can find -overlaps between long noisy reads, or map long reads or their assemblies to a -reference genome optionally with detailed alignment (i.e. CIGAR). At present, -it works efficiently with query sequences from a few kilobases to ~100 -megabases in length at an error rate ~15%. Minimap2 outputs in the [PAF][paf] or -the [SAM format][sam]. On limited test data sets, minimap2 is over 20 times -faster than most other long-read aligners. It will replace BWA-MEM for long -reads and contig alignment. +## Users' Guide -Minimap2 is the successor of [minimap][minimap]. It uses a similar -minimizer-based indexing and seeding algorithm, and improves the original -minimap with homopolyer-compressed k-mers (see also [SMARTdenovo][smartdenovo] -and [longISLND][longislnd]), better chaining and the ability to produce CIGAR -with fast extension alignment (see also [libgaba][gaba] and [ksw2][ksw2]) and -piece-wise affine gap cost. +Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA +sequences against a large reference database. Typical use cases include: (1) +mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) +finding overlaps between long reads with error rate up to ~15%; (3) +splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads +against a reference genome; (4) aligning Illumina single- or paired-end reads; +(5) assembly-to-assembly alignment; (6) full-genome alignment between two +closely related species with divergence below ~15%. -If you use minimap2 in your work, please consider to cite: +For ~10kb noisy reads sequences, minimap2 is tens of times faster than +mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more +accurate on simulated long reads and produces biologically meaningful alignment +ready for downstream analyses. For >100bp Illumina short reads, minimap2 is +three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. +Detailed evaluations are available from the [minimap2 preprint][preprint]. -> Li, H. (2017). Minimap2: fast pairwise alignment for long DNA sequences. [arXiv:1708.01492](https://arxiv.org/abs/1708.01492). +### Installation -## Installation +Minimap2 only works on x86-64 CPUs. You can acquire precompiled binaries from +the [release page][release]. For example, with: +```sh +wget --no-check-certificate -O- https://github.com/lh3/minimap2/releases/download/v2.2/minimap2-2.2_x64-linux.tar.bz2 \ + | tar -jxvf - +./minimap2-2.2_x64-linux/minimap2 +``` +If you want to compile from the source, you need to have a C compiler, GNU make +and zlib development files installed. Just type `make` in the source code +directory to compile. If you see compilation errors, try `make sse2only=1` +to disable SSE4 code, which will make minimap2 slightly slower at a cost. -For modern x86-64 CPUs, just type `make` in the source code directory. This -will compile a binary `minimap2` which you can copy to your desired location. -If you see compilation errors, try `make sse2only=1` to disable SSE4. Minimap2 -will run a little slower. At present, minimap2 does not work with non-x86 CPUs -or ancient CPUs that do not support SSE2. SSE2 is critical to the performance -of minimap2. +### General usage -## Algorithm Overview +In the simplest form, minimap2 takes a reference database and a query sequence +file as input and produce approximate mapping, without base-level alignment +(i.e. no CIGAR), in the [PAF format][paf]: +```sh +minimap2 ref.fa reads.fq > approx-mapping.paf +``` +You ask minimap2 to generate CIGAR at the `cg` tag of PAF with: +```sh +minimap2 -c ref.fa reads.fq > alignment.paf +``` +or to output alignments in the [SAM format][sam]: +```sh +minimap2 -a ref.fa reads.fq > alignment.sam +``` +Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You +don't need to convert between FASTA and FASTQ or decompress gzip'd files first. + +For the human reference genome, minimap2 takes a few minutes to generate a +minimizer index for the reference before mapping. To reduce indexing time, you +can optionally save the index with option **-d** and replace the reference +sequence file with the index file on the minimap2 command line: +```sh +minimap2 -d ref.mmi ref.fa # indexing +minimap2 -a ref.mmi reads.fq > alignment.sam # alignment +``` +***Importantly***, it should be noted that once you build the index, indexing +parameters such as **-k**, **-w**, **-H** and **-I** can't be changed during +mapping. If you are running minimap2 for different data types, you will +probably need to keep multiple indexes generated with different parameters. +This makes minimap2 different BWA which always uses the same index regardless +of query data types. + +### Use cases + +Minimap2 uses the same base algorithm for all applications. However, due to the +dramatic different data types (e.g. short vs long reads; DNA vs mRNA reads) it +supports, minimap2 needs to be tuned for optimal performance and accuracy. +You should usually choose a preset with option **-x**, which sets multiple +parameters at the same time. + +#### Map long noisy genomic reads + +```sh +minimap2 -ax map-pb ref.fa pacbio-reads.fq > aln.sam # for PacBio subreads +minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam # for Oxford Nanopore reads +``` +The difference between `map-pb` and `map-ont` is that `map-pb` uses +homopolymer-compressed (HPC) minimizers as seeds, while `map-ont` uses normal +minimizers as seeds. Emperical evaluation shows that HPC minimizers improve +performance and sensitivity when aligning PacBio reads, but hurt when aligning +Nanopore reads. + +#### Map long mRNA/cDNA reads + +```sh +minimap2 -ax splice ref.fa spliced.fq > aln.sam # strand unknown +minimap2 -ax splice -uf ref.fa spliced.fq > aln.sam # assuming transcript strand +``` +This command line has been tested on PacBio Iso-Seq reads and Nanopore 2D cDNA +reads, and been shown to work with Nanopore 1D Direct RNA reads by others. Like +typical RNA-seq mappers, minimap2 represents an intron with the `N` CIGAR +operator. For spliced reads, minimap2 will try to infer the strand relative to +transcript and may write the strand to the `ts` SAM/PAF tag. + +### Algorithm overview In the following, minimap2 command line options have a dash ahead and are highlighted in bold. @@ -97,14 +176,18 @@ highlighted in bold. 9. If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop. -## Limitations +### Cite minimap2 + +If you use minimap2 in your work, please consider to cite: + +> Li, H. (2017). Minimap2: fast pairwise alignment for long nucleotide sequences. [arXiv:1708.01492][preprint] + +## Limitations * Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions. -* Minimap2 does not work well with Illumina short reads as of now. - * Minimap2 requires SSE2 instructions to compile. It is possible to add non-SSE2 support, but it would make minimap2 slower by several times. @@ -121,3 +204,5 @@ warmly welcomed. [longislnd]: https://www.ncbi.nlm.nih.gov/pubmed/27667791 [gaba]: https://github.com/ocxtal/libgaba [ksw2]: https://github.com/lh3/ksw2 +[preprint]: https://arxiv.org/abs/1708.01492 +[release]: https://github.com/lh3/minimap2/releases diff --git a/index.c b/index.c index 00f212c..9d069d4 100644 --- a/index.c +++ b/index.c @@ -478,7 +478,7 @@ mm_idx_t *mm_idx_reader_read(mm_idx_reader_t *r, int n_threads) if (r->is_idx) { mi = mm_idx_load(r->fp.idx); if (mi && mm_verbose >= 2 && (mi->k != r->opt.k || mi->w != r->opt.w || mi->is_hpc != r->opt.is_hpc)) - fprintf(stderr, "[WARNING] Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.\n"); + fprintf(stderr, "[WARNING]\033[1;31m Indexing parameters (-k, -w or -H) overridden by parameters used in the prebuilt index.\033[0m\n"); } else mi = mm_idx_gen(r->fp.seq, r->opt.w, r->opt.k, r->opt.bucket_bits, r->opt.is_hpc, r->opt.mini_batch_size, n_threads, r->opt.batch_size, 1); if (mi) {