2018-03-13 00:00:44 +08:00
|
|
|
## Table of Contents
|
|
|
|
|
|
|
|
|
|
- [Installation](#install)
|
|
|
|
|
- [Mapping Genomic Reads](#map-reads)
|
2018-03-13 00:50:21 +08:00
|
|
|
* [Mapping long reads](#map-pb)
|
|
|
|
|
* [Mapping Illumina paired-end reads](#map-sr)
|
|
|
|
|
* [Evaluating mapping accuracy with simulated reads (for developers)](#mapeval)
|
2018-03-13 04:12:18 +08:00
|
|
|
- [Full-Genome Alignment](#genome-aln)
|
|
|
|
|
* [Intra-species assembly alignment](#asm-to-ref)
|
|
|
|
|
* [Calling variants from assembly-to-reference alignment](#asm-var)
|
2018-03-13 10:24:22 +08:00
|
|
|
* [Lift Over](#liftover)
|
|
|
|
|
* [Cross-species alignment](#x-species)
|
|
|
|
|
* [Print alignment](#view-aln)
|
2018-03-13 01:05:51 +08:00
|
|
|
- [Read Overlap](#read-overlap)
|
|
|
|
|
* [Long-read overlap](#long-read-overlap)
|
2018-03-13 01:20:25 +08:00
|
|
|
* [Evaluating overlap sensitivity (for developers)](#ov-eval)
|
2018-03-13 00:00:44 +08:00
|
|
|
|
|
|
|
|
## <a name="install"></a>Installation
|
|
|
|
|
|
|
|
|
|
```sh
|
2018-03-13 00:39:57 +08:00
|
|
|
# install minimap2 executables
|
2018-03-13 00:00:44 +08:00
|
|
|
curl -L https://github.com/lh3/minimap2/releases/download/v2.9/minimap2-2.9_x64-linux.tar.bz2 | tar jxf -
|
|
|
|
|
cp minimap2-2.9_x64-linux/{minimap2,k8,paftools.js} . # copy executables
|
2018-03-13 00:39:57 +08:00
|
|
|
export PATH="$PATH:"`pwd` # put the current directory on PATH
|
|
|
|
|
# download example datasets
|
|
|
|
|
curl -L https://github.com/lh3/minimap2/releases/download/v2.0/cookbook-data.tgz | tar zxf -
|
2018-03-13 00:00:44 +08:00
|
|
|
```
|
|
|
|
|
|
|
|
|
|
## <a name="map-reads"></a>Mapping Genomic Reads
|
|
|
|
|
|
2018-03-13 00:50:21 +08:00
|
|
|
### <a name="map-pb"></a>Mapping long reads
|
2018-03-13 00:42:13 +08:00
|
|
|
```sh
|
|
|
|
|
minimap2 -ax map-pb -t4 ecoli_ref.fa ecoli_p6_25x_canu.fa > mapped.sam
|
|
|
|
|
```
|
|
|
|
|
Alternatively, you can create a minimap2 index first and then map:
|
|
|
|
|
```sh
|
|
|
|
|
minimap2 -x map-pb -d ecoli-pb.mmi ecoli_ref.fa # create an index
|
|
|
|
|
minimap2 -ax map-pb ecoli-pb.mmi ecoli_p6_25x_canu.fa > mapped.sam
|
|
|
|
|
```
|
|
|
|
|
This will save you a couple of minutes when you map against the human genome.
|
|
|
|
|
**HOWEVER**, key algorithm parameters such as the k-mer length and window
|
|
|
|
|
size can't be changed after indexing. Minimap2 will give you a warning if
|
|
|
|
|
parameters used in a pre-built index doesn't match parameters on the command
|
2018-03-13 00:50:21 +08:00
|
|
|
line. **Please always make sure you are using an intended pre-built index.**
|
2018-03-13 00:39:57 +08:00
|
|
|
|
2018-03-13 00:50:21 +08:00
|
|
|
### <a name="map-sr"></a>Mapping Illumina paired-end reads:
|
2018-03-13 00:42:13 +08:00
|
|
|
```sh
|
2018-03-13 04:12:18 +08:00
|
|
|
minimap2 -ax sr -t4 ecoli_ref.fa ecoli_mason_1.fq ecoli_mason_2.fq > mapped-sr.sam
|
2018-03-13 00:42:13 +08:00
|
|
|
```
|
2018-03-13 00:39:57 +08:00
|
|
|
|
2018-03-13 00:50:21 +08:00
|
|
|
### <a name="mapeval"></a>Evaluating mapping accuracy with simulated reads (for developers)
|
2018-03-13 00:42:13 +08:00
|
|
|
```sh
|
|
|
|
|
minimap2 -ax sr ecoli_ref.fa ecoli_mason_1.fq ecoli_mason_2.fq | paftools.js mapeval -
|
|
|
|
|
```
|
|
|
|
|
The output is:
|
|
|
|
|
```
|
|
|
|
|
Q 60 19712 0 0.000000000 19712
|
|
|
|
|
Q 0 282 219 0.010953286 19994
|
|
|
|
|
U 6
|
|
|
|
|
```
|
2018-03-13 00:50:21 +08:00
|
|
|
where a `U`-line gives the number of unmapped reads (for SAM input only); a
|
|
|
|
|
`Q`-line gives:
|
|
|
|
|
|
2018-03-13 00:42:13 +08:00
|
|
|
1. Mapping quality (mapQ) threshold
|
|
|
|
|
2. Number of mapped reads between this threshold and the previous mapQ threshold.
|
|
|
|
|
3. Number of wrong mappings in the same mapQ interval
|
|
|
|
|
4. Accumulative mapping error rate
|
|
|
|
|
5. Accumulative number of mappings
|
2018-03-13 00:50:21 +08:00
|
|
|
|
|
|
|
|
For `paftools.js mapeval` to work, you need to encode the true read positions
|
|
|
|
|
in read names in the right format. For [PBSIM][pbsim] and [mason2][mason2], we
|
|
|
|
|
provide scripts to generate the right format. Simulated reads in this cookbook
|
|
|
|
|
were created with the following command lines:
|
|
|
|
|
```sh
|
|
|
|
|
# in PBSIM source code directory:
|
|
|
|
|
src/pbsim ../ecoli_ref.fa --depth 1 --sample-fastq sample/sample.fastq
|
|
|
|
|
paftools.js pbsim2fq ../ecoli_ref.fa.fai sd_0001.maf > ../ecoli_pbsim.fa
|
|
|
|
|
|
|
|
|
|
# mason2 simulation
|
|
|
|
|
mason_simulator --illumina-prob-mismatch-scale 2.5 -ir ecoli_ref.fa -n 10000 -o tmp-l.fq -or tmp-r.fq -oa tmp.sam
|
|
|
|
|
paftools.js mason2fq tmp.sam | seqtk seq -1 > ecoli_mason_1.fq
|
|
|
|
|
paftools.js mason2fq tmp.sam | seqtk seq -1 > ecoli_mason_2.fq
|
|
|
|
|
```
|
|
|
|
|
|
2018-03-13 01:20:25 +08:00
|
|
|
|
|
|
|
|
|
2018-03-13 04:12:18 +08:00
|
|
|
## <a name="genome-aln"></a>Full-Genome Alignment
|
|
|
|
|
|
|
|
|
|
### <a name="asm-to-ref"></a>Intra-species assembly alignment
|
|
|
|
|
```sh
|
2018-03-13 10:24:22 +08:00
|
|
|
# option "--cs" is recommended as paftools.js may need it
|
2018-03-13 04:12:18 +08:00
|
|
|
minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa > ecoli_canu.paf
|
|
|
|
|
```
|
|
|
|
|
Here `ecoli_canu.fa` is the Canu assembly of `ecoli_p6_25x_canu.fa`. This
|
|
|
|
|
command line outputs alignments in the [PAF format][paf]. Use `-a` instead of
|
|
|
|
|
`-c` to get output in the SAM format.
|
|
|
|
|
|
|
|
|
|
### <a name="asm-var"></a>Calling variants from assembly-to-reference alignment
|
|
|
|
|
```sh
|
|
|
|
|
# don't forget the "--cs" option; otherwise it doesn't work
|
|
|
|
|
minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa \
|
|
|
|
|
| sort -k6,6 -k8,8n \
|
2018-03-13 10:24:22 +08:00
|
|
|
| paftools.js call -f ecoli_ref.fa - > out.vcf
|
2018-03-13 04:12:18 +08:00
|
|
|
```
|
2018-03-13 10:24:22 +08:00
|
|
|
Without option `-f`, `paftools.js call` outputs in a custom format. In this
|
|
|
|
|
format, lines starting with `R` give the regions covered by one contig only.
|
|
|
|
|
This information is not available in the VCF output.
|
|
|
|
|
|
|
|
|
|
### <a name="liftover"></a>Lift over
|
|
|
|
|
```sh
|
|
|
|
|
echo -e 'tig00000001\t200000\t300000' | paftools.js liftover ecoli_canu.paf -
|
2018-03-13 04:12:18 +08:00
|
|
|
```
|
2018-03-13 10:24:22 +08:00
|
|
|
This lifts over a region on query sequences to one or multiple regions on
|
|
|
|
|
reference sequences. Note that this paftools.js command may not be efficient
|
|
|
|
|
enough to lift millions of regions.
|
|
|
|
|
|
|
|
|
|
### <a name="x-species"></a>Cross-species alignment
|
|
|
|
|
```sh
|
|
|
|
|
minimap2 -cx asm20 --cs ecoli_ref.fa ecoli_O104:H4.fa > ecoli_O104:H4.paf
|
|
|
|
|
sort -k6,6 -k8,8n ecoli_O104:H4.paf | paftools.js call -f ecoli_ref.fa -L10000 -l1000 - > out.vcf
|
|
|
|
|
```
|
|
|
|
|
Minimap2 only works when the sequence divergence is no more than ~15%.
|
|
|
|
|
|
|
|
|
|
### <a name="view-aln"></a>Print alignment
|
|
|
|
|
```sh
|
|
|
|
|
# option "--cs" required; minimap2-r741 or higher required for the "asm20" preset
|
|
|
|
|
minimap2 -cx asm20 --cs ecoli_ref.fa ecoli_O104:H4.fa | paftools.js view - | less -S
|
2018-03-13 04:12:18 +08:00
|
|
|
```
|
2018-03-13 10:24:22 +08:00
|
|
|
This prints the alignment in a BLAST-like format.
|
2018-03-13 04:12:18 +08:00
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2018-03-13 01:05:51 +08:00
|
|
|
## <a name="read-overlap"></a>Read Overlap
|
|
|
|
|
|
|
|
|
|
### <a name="long-read-overlap"></a>Long read overlap
|
|
|
|
|
```sh
|
|
|
|
|
# For pacbio reads:
|
|
|
|
|
minimap2 -x ava-pb ecoli_p6_25x_canu.fa ecoli_p6_25x_canu.fa > overlap.paf
|
|
|
|
|
# For Nanopore reads (ava-ont also works with PacBio but not as good):
|
|
|
|
|
minimap2 -x ava-ont -r 10000 ecoli_p6_25x_canu.fa ecoli_p6_25x_canu.fa > overlap.paf
|
2018-03-13 04:12:18 +08:00
|
|
|
# If you have miniasm installed:
|
|
|
|
|
miniasm -f ecoli_p6_25x_canu.fa overlap.paf > asm.gfa
|
2018-03-13 01:05:51 +08:00
|
|
|
```
|
|
|
|
|
Here we explicitly applied `-r 10000`. We are considering to set this as the
|
|
|
|
|
default for the `ava-ont` mode as this seems to improve the contiguity for
|
|
|
|
|
nanopore read assembly (Loman, personal communication).
|
|
|
|
|
|
2018-03-13 01:20:25 +08:00
|
|
|
*Minimap2 doesn't work well with short-read overlap.*
|
2018-03-13 01:05:51 +08:00
|
|
|
|
2018-03-13 01:20:25 +08:00
|
|
|
### <a name="ov-eval"></a>Evaluating overlap sensitivity (for developers)
|
2018-03-13 01:05:51 +08:00
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
# read to reference mapping
|
|
|
|
|
minimap2 -cx map-pb ecoli_ref.fa ecoli_p6_25x_canu.fa > to-ref.paf
|
|
|
|
|
# evaluate overlap sensitivity
|
|
|
|
|
sort -k6,6 -k8,8n to-ref.paf | paftools.js ov-eval - overlap.paf
|
|
|
|
|
```
|
|
|
|
|
You can see that for PacBio reads, minimap2 achieves higher overlap sensitivity
|
|
|
|
|
with `-x ava-pb` (99% vs 93% with `-x ava-ont`).
|
|
|
|
|
|
2018-03-13 01:20:25 +08:00
|
|
|
|
|
|
|
|
|
|
|
|
|
## <a name="map-rna"></a>Mapping Long RNA-seq Reads
|
|
|
|
|
|
|
|
|
|
(Unfinished section. Data to come later...)
|
|
|
|
|
|
|
|
|
|
### <a name="map-direct-rna"></a>Mapping Nanopore direct-RNA reads
|
|
|
|
|
|
|
|
|
|
```sh
|
|
|
|
|
minimap2 -ax splice -k14 -uf ref.fa reads.fa > aln.sam
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2018-03-13 00:50:21 +08:00
|
|
|
[pbsim]: https://github.com/pfaucon/PBSIM-PacBio-Simulator
|
|
|
|
|
[mason2]: https://github.com/seqan/seqan/tree/master/apps/mason2
|
2018-03-13 04:12:18 +08:00
|
|
|
[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md
|