diff --git a/misc/README.md b/misc/README.md index 8c0048b..b310dc5 100644 --- a/misc/README.md +++ b/misc/README.md @@ -1,24 +1,28 @@ ## Getting Started ```sh +# install minimap2 +git clone https://github.com/lh3/minimap2 +cd minimap2 && make +# install the k8 javascript shell curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf - -cp k8-0.2.4/k8-`uname -s` $HOME/bin/k8 # assuming $HOME/bin in your $PATH -gff2bed.js anno.gtf | less -S # convert GTF/GFF3 to BED12 (if k8 installed to $PATH) -k8 gff2bed.js anno.gtf | less -S # convert GTF/GFF3 to BED12 (if k8 not installed) -sam2paf.js aln.sam.gz | less -S # convert SAM to PAF -minimap2 --cs test/MT-*.fa | paf2aln.js - | less # pretty print base alignment -minimap2 -cx splice ref.fa rna-seq.fq | splice2bed.js - # convert splice aln to BED12 +cp k8-0.2.4/k8-`uname -s` k8 # or copy it to a directory on your $PATH +# export PATH="$PATH:`pwd`:`pwd`/misc" # run this if k8, minimap2 or paftools.js not on your $PATH +minimap2 --cs test/MT-human.fa test/MT-orang.fa | paftools.js view - # view alignment +minimap2 -c test/MT-human.fa test/MT-orang.fa | paftools.js stat - # basic alignment statistics +minimap2 -c --cs test/MT-human.fa test/MT-orang.fa \ + | sort -k6,6 -k8,8n | paftools.js call -L15000 - # calling variants from asm-to-ref alignment +minimap2 -c test/MT-human.fa test/MT-orang.fa \ + | paftools.js liftover -l10000 - <(echo -e "MT_orang\t2000\t5000") # liftOver +# no test data for the following examples +paftools.js junceval -e anno.gtf splice.sam > out.txt # compare splice junctions to annotations +paftools.js splice2bed anno.gtf > anno.bed # convert GTF/GFF3 to BED12 ``` ## Table of Contents - [Getting Started](#started) - [Introduction](#intro) -- [Format Conversion](#conv) - - [Convert PAF to MAF or BLAST-like format](#paf2aln) - - [Convert SAM to PAF](#sam2paf) - - [Convert GTF/GFF3 to BED12](#gff2bed) - - [Convert spliced alignment to BED12](#splice2bed) - [Evaluation](#eval) - [Evaluating mapping accuracy with simulated reads](#mapeval) - [Evaluating read overlap sensitivity](#oveval) @@ -26,58 +30,43 @@ minimap2 -cx splice ref.fa rna-seq.fq | splice2bed.js - # convert splice aln t ## Introduction -This directory contains auxiliary scripts for format conversion, mapping -accuracy evaluation and miscellaneous purposes. These scripts *require* -the [k8 Javascript shell][k8] to run. On Linux or Mac, you can download -the precompiled k8 binary with: +paftools.js is a script that processes alignments in the [PAF format][paf], +such as converting between formats, evaluating mapping accuracy, lifting over +BED files based on alignment, and calling variants from assembly-to-assembly +alignment. This script *requires* the [k8 Javascript shell][k8] to run. On +Linux or Mac, you can download the precompiled k8 binary with: ```sh curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf - cp k8-0.2.4/k8-`uname -s` $HOME/bin/k8 # assuming $HOME/bin in your $PATH ``` - It is highly recommended to copy the executable `k8` to a directory on your -`$PATH` such as `/usr/bin/env` can find it. Like python or perl scripts, once -you install `k8`, you can launch these k8 scripts either with +`$PATH` such as `/usr/bin/env` can find it. Like python scripts, once you +install `k8`, you can launch these k8 scripts either with ```sh -path/to/gff2bed.js anno.gtf.gz +path/to/paftools.js ``` or with ```sh -k8 path/to/gff2bed.js anno.gtf +k8 path/to/paftools.js ``` -All k8 scripts seamlessly work with both plain text files and gzip'd text files. - -## Format Conversion - -* Script [paf2aln.js](paf2aln.js) converts PAF with the - [cs tag][cs] to [MAF][maf] or BLAST-like output. It only works with minimap2 - output generated using the `--cs` tag. - -* Script [sam2paf.js](sam2paf.js) converts alignments in - the SAM format to PAF. - -* Script [gff2bed.js](gff2bed.js) converts GFF format to - 12-column BED format. It seamlessly works with both GTF and GFF3. - -* Script [splice2bed.js](splice2bed.js) converts - spliced alignment in SAM or PAF to 12-column BED format. +paftools.js seamlessly reads both plain text files and gzip'd text files. ## Evaluation ### Evaluating mapping accuracy with simulated reads -Script [sim-pbsim.js](sim-pbsim.js) converts the MAF output of [pbsim][pbsim] +The **pbsim2fq** command of paftools.js converts the MAF output of [pbsim][pbsim] to FASTQ and encodes the true mapping position in the read name in a format like -`S1_33!chr1!225258409!225267761!-`. Similarly, script -[sim-mason2.js](sim-mason2.js) converts [mason2][mason2] simulated SAM to -FASTQ. +`S1_33!chr1!225258409!225267761!-`. Similarly, the **mason2fq** command +converts [mason2][mason2] simulated SAM to FASTQ. + +Command **mapeval** evaluates mapped SAM/PAF. Here is example output: -Script [sim-eval.js](sim-eval.js) evaluates mapped SAM/PAF. Here is example output: ``` Q 60 32478 0 0.000000000 32478 Q 22 16 1 0.000030775 32494 @@ -94,11 +83,12 @@ Q 1 248 94 0.003267381 33054 Q 0 31 17 0.003778147 33085 U 3 ``` + where each Q-line gives the quality threshold, the number of reads mapped with mapping quality equal to or greater than the threshold, number of wrong mappings, accumulative mapping error rate and the accumulative number of -mapped reads. The U-line gives the number of unmapped reads if they are present -in the SAM file. +mapped reads. The U-line, if present, gives the number of unmapped reads if +they are present in the SAM file. Suppose the reported mapping coordinate overlap with the true coordinate like the following: @@ -125,7 +115,7 @@ k8 ov-eval.js reads-to-ref.paf ovlp.paf ## Calling Variants from Haploid Assemblies -Command `paftools.js call` calls variants from coordinate-sorted +The **call** command of paftools.js calls variants from coordinate-sorted assembly-to-reference alignment. It calls variants from the [cs tag][cs] and identifies confident/callable regions as those covered by exactly one contig. Here are example command lines: @@ -139,34 +129,30 @@ k8 paftools.js call asm.srt.paf > asm.var.txt Here is sample output: ``` -V chr1 3181702 3181703 1 60 c t -V chr1 3181730 3181768 1 60 gtcttacacacggagtcttacacacggtcttacacaca - -R chr1 3181796 3260557 -V chr1 3181818 3181822 1 60 tgcg - -V chr1 3181831 3181832 1 60 a g -V chr1 3181832 3181833 1 60 t c -V chr1 3181833 3181834 1 60 t g -V chr1 3181874 3181874 1 60 - ca -V chr1 3181879 3181880 1 60 g a -V chr1 3181886 3181887 1 60 c g -V chr1 3181911 3181911 1 60 - agtcttacacatgcagtcttacacat -V chr1 3181924 3181925 1 60 t c -V chr1 3182079 3182080 1 60 g a -V chr1 3182150 3182151 1 60 t c -V chr1 3182336 3182337 1 60 t c +V chr1 2276040 2276041 1 60 c g LJII01000171.1 1217409 1217410 + +V chr1 2280409 2280410 1 60 a g LJII01000171.1 1221778 1221779 + +V chr1 2280504 2280505 1 60 a g LJII01000171.1 1221873 1221874 + +R chr1 2325140 2436340 +V chr1 2325287 2325287 1 60 - ct LJII01000171.1 1272894 1272896 + +V chr1 2325642 2325644 1 60 tt - LJII01000171.1 1273251 1273251 + +V chr1 2326051 2326052 1 60 c t LJII01000171.1 1273658 1273659 + +V chr1 2326287 2326288 1 60 c t LJII01000171.1 1273894 1273895 + ``` -where a line starting with `R` gives regions covered by one contig, and a -V-line encodes a variant in the following format: chr, start, end, contig -depth, mapping quality, REF allele and ALT allele. +where a line starting with `R` gives regions covered by one query contig, and a +V-line encodes a variant in the following format: chr, start, end, query depth, +mapping quality, REF allele, ALT allele, query name, query start, end and the +query orientation. Generally, you should only look at variants where column 5 +is one. -By default, when calling variants, this script ignores alignments 50kb or -shorter; when deriving callable regions, it ignores alignments 10kb or shorter. -It uses two thresholds to avoid edge effects. These defaults are designed for -long-read assemblies. For short reads, both should be reduced. +By default, when calling variants, "paftools.js call" ignores alignments 50kb +or shorter; when deriving callable regions, it ignores alignments 10kb or +shorter. It uses two thresholds to avoid edge effects. These defaults are +designed for long-read assemblies. For short reads, both should be reduced. +[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md [cs]: https://github.com/lh3/minimap2#cs [k8]: https://github.com/attractivechaos/k8 [maf]: https://genome.ucsc.edu/FAQ/FAQformat#format5 diff --git a/misc/paftools.js b/misc/paftools.js index 0290197..0fd154f 100755 --- a/misc/paftools.js +++ b/misc/paftools.js @@ -175,7 +175,7 @@ function paf_liftover(args) { if (fn == null) return null; if (typeof to_merge == 'undefined') to_merge = true; - var file = new File(fn); + var file = fn == '-'? new File() : new File(fn); var buf = new Bytes(); var bed = {}; while (file.readline(buf) >= 0) { @@ -214,7 +214,7 @@ function paf_liftover(args) } var bed = read_bed(args[getopt.ind+1], to_merge); - var file = new File(args[getopt.ind]); + var file = args[getopt.ind] == '-'? new File() : new File(args[getopt.ind]); var buf = new Bytes(); while (file.readline(buf) >= 0) { var t = buf.toString().split("\t"); @@ -485,7 +485,7 @@ function paf_stat(args) } var buf = new Bytes(); - var file = new File(args[getopt.ind]); + var file = args[getopt.ind] == '-'? new File() : new File(args[getopt.ind]); var re = /(\d+)([MIDSHNX=])/g; var lineno = 0, n_pri = 0, n_2nd = 0, n_seq = 0, n_cigar_64k = 0, l_tot = 0, l_cov = 0;