From 1776311a9b4cc3f846f7c3541bf9678c414491df Mon Sep 17 00:00:00 2001 From: Heng Li Date: Mon, 26 Mar 2018 14:37:02 -0400 Subject: [PATCH] updated cookbook --- README.md | 2 +- cookbook.md | 101 +++++++++++++++++++++++++++++++++++++++++----------- 2 files changed, 82 insertions(+), 21 deletions(-) diff --git a/README.md b/README.md index 22d5c9c..9cea95a 100644 --- a/README.md +++ b/README.md @@ -137,7 +137,7 @@ Nanopore reads. #### Map long mRNA/cDNA reads ```sh -minimap2 -ax splice -uf ref.fa iso-seq.fq > aln.sam # PacBio Iso-seq/traditional cDNA +minimap2 -ax splice -uf -C5 ref.fa iso-seq.fq > aln.sam # PacBio Iso-seq/traditional cDNA minimap2 -ax splice ref.fa nanopore-cdna.fa > aln.sam # Nanopore 2D cDNA-seq minimap2 -ax splice -uf -k14 ref.fa direct-rna.fq > aln.sam # Nanopore Direct RNA-seq minimap2 -ax splice --splice-flank=no SIRV.fa SIRV-seq.fa # mapping against SIRV control diff --git a/cookbook.md b/cookbook.md index 8c2ee82..05990bc 100644 --- a/cookbook.md +++ b/cookbook.md @@ -1,29 +1,41 @@ ## Table of Contents -- [Installation](#install) +- [Introduction & Installation](#intro) - [Mapping Genomic Reads](#map-reads) * [Mapping long reads](#map-pb) * [Mapping Illumina paired-end reads](#map-sr) * [Evaluating mapping accuracy with simulated reads (for developers)](#mapeval) +- [Mapping Long RNA-seq Reads](#map-rna) + * [Mapping Nanopore 2D cDNA reads](#map-ont-cdna-2d) + * [Mapping Nanopore direct-RNA reads](#map-direct-rna) + * [Mapping PacBio Iso-seq reads](#map-iso-seq) - [Full-Genome Alignment](#genome-aln) * [Intra-species assembly alignment](#asm-to-ref) * [Cross-species full-genome alignment](#x-species) * [Eyeballing alignment](#view-aln) * [Calling variants from assembly-to-reference alignment](#asm-var) - * [Lift Over](#liftover) + * [Constructing self-homology map](#hom-map) + * [Lift Over (for developers)](#liftover) - [Read Overlap](#read-overlap) * [Long-read overlap](#long-read-overlap) * [Evaluating overlap sensitivity (for developers)](#ov-eval) -## Installation +## Introduction & Installation +This cookbook walks you through a variety of applications of minimap2 and its +companion script `paftools.js`. All data here are freely available from the +minimap2 release page at version tag [v2.10][v2.10]. Some examples only work +with v2.10 or later. + +To acquire the data used in this cookbook and to install minimap2 and paftools, +please follow the command lines below: ```sh # install minimap2 executables -curl -L https://github.com/lh3/minimap2/releases/download/v2.9/minimap2-2.9_x64-linux.tar.bz2 | tar jxf - -cp minimap2-2.9_x64-linux/{minimap2,k8,paftools.js} . # copy executables -export PATH="$PATH:"`pwd` # put the current directory on PATH +curl -L https://github.com/lh3/minimap2/releases/download/v2.10/minimap2-2.10_x64-linux.tar.bz2 | tar jxf - +cp minimap2-2.10_x64-linux/{minimap2,k8,paftools.js} . # copy executables +export PATH="$PATH:"`pwd` # put the current directory on PATH # download example datasets -curl -L https://github.com/lh3/minimap2/releases/download/v2.0/cookbook-data.tgz | tar zxf - +curl -L https://github.com/lh3/minimap2/releases/download/v2.10/cookbook-data.tgz | tar zxf - ``` ## Mapping Genomic Reads @@ -84,6 +96,56 @@ paftools.js mason2fq tmp.sam | seqtk seq -2 > ecoli_mason_2.fq +## Mapping Long RNA-seq Reads + +### Mapping Nanopore 2D cDNA reads +```sh +minimap2 -ax splice SIRV_E2.fa SIRV_ont-cdna.fa > aln.sam +``` +You can compare the alignment to the true annotations with: +```sh +paftools.js junceval SIRV_E2C.gtf aln.sam +``` +It gives the percentage of introns found in the annotation. For SIRV data, it +is possible to achieve higher junction accuracy with +```sh +minimap2 -ax splice --splice-flank=no SIRV_E2.fa SIRV_ont-cdna.fa | paftools.js junceval SIRV_E2C.gtf +``` +This is because minimap2 models one additional evolutionarily conserved base +around a canonical junction, but SIRV doesn't honor this signal. Option +`--splice-flank=no` asks minimap2 no to model this additional base. + +In the output a tag `ts:A:+` indicates that the read strand is the same as the +transcript strand; `ts:A:-` indicates the read strand is opposite to the +transcript strand. This tag is inferred from the GT-AG signal and is thus only +available to spliced reads. + +### Mapping Nanopore direct-RNA reads +```sh +minimap2 -ax splice -k14 -uf SIRV_E2.fa SIRV_ont-drna.fa > aln.sam +``` +Direct-RNA reads are noisier, so we use a shorter k-mer for improved +sensitivity. Here, option `-uf` forces minimap2 to map reads to the forward +transcript strand only because direct-RNA reads are stranded. Again, applying +`--splice-flank=no` helps junction accuracy for SIRV data. + +### Mapping PacBio Iso-seq reads +```sh +minimap2 -ax splice -uf -C5 SIRV_E2.fa SIRV_iso-seq.fq > aln.sam +``` +Option `-C5` reduces the penalty on non-canonical splicing sites. It helps +to align such sites correctly for data with low error rate such as Iso-seq +reads and traditional cDNAs. On this example, minimap2 makes one junction +error. Applying `--splice-flank=no` fixes this alignment error. + +Note that the command line above is optimized for the final Iso-seq reads. +PacBio's Iso-seq pipeline produces intermediate sequences at varying quality. +For example, some intermediate reads are not stranded. For these reads, option +`-uf` will lead to more errors. Please revise the minimap2 command line +accordingly. + + + ## Full-Genome Alignment ### Intra-species assembly alignment @@ -124,7 +186,17 @@ Without option `-f`, `paftools.js call` outputs in a custom format. In this format, lines starting with `R` give the regions covered by one contig only. This information is not available in the VCF output. -### Lift over +### Constructing self-homology map +```sh +minimap2 -DP -k19 -w19 -m200 ecoli_ref.fa ecoli_ref.fa > out.paf +``` +Option `-D` asks minimap2 to ignore anchors from perfect self match and `-P` +outputs all chains. For large nomes, we don't recommend to perform base-level +alignment (with `-c`, `-a` or `--cs`) when `-P` is applied. This is because +base-alignment is slow and occasionally gives wrong alignments close to the +diagonal of a dotter plot. For E. coli, though, base-alignment is still fast. + +### Lift over (for developers) ```sh minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa > ecoli_canu.paf echo -e 'tig00000001\t200000\t300000' | paftools.js liftover ecoli_canu.paf - @@ -165,18 +237,7 @@ with `-x ava-pb` (99% vs 93% with `-x ava-ont`). -## Mapping Long RNA-seq Reads - -(Unfinished section. Data to come later...) - -### Mapping Nanopore direct-RNA reads - -```sh -minimap2 -ax splice -k14 -uf ref.fa reads.fa > aln.sam -``` - - - [pbsim]: https://github.com/pfaucon/PBSIM-PacBio-Simulator [mason2]: https://github.com/seqan/seqan/tree/master/apps/mason2 [paf]: https://github.com/lh3/miniasm/blob/master/PAF.md +[v2.10]: https://github.com/lh3/minimap2/releases/tag/v2.10