diff --git a/misc/README.md b/misc/README.md index d3eecd4..3f7e728 100644 --- a/misc/README.md +++ b/misc/README.md @@ -1,10 +1,12 @@ ## Getting Started + ```sh curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf - -cp k8-0.2.4/k8-`uname -s` k8 # or better copy to a directory on PATH -minimap2 --cs test/MT-*.fa | paf2aln.js - | less # pretty print base alignment -sam2paf.js aln.sam.gz | less -S # convert SAM to PAF -gff2bed.js anno.gtf | less -S # convert GTF/GFF3 to BED12 +cp k8-0.2.4/k8-`uname -s` $HOME/bin/k8 # assuming $HOME/bin in your $PATH +gff2bed.js anno.gtf | less -S # convert GTF/GFF3 to BED12 (if k8 installed to $PATH) +k8 gff2bed.js anno.gtf | less -S # convert GTF/GFF3 to BED12 (if k8 not installed) +sam2paf.js aln.sam.gz | less -S # convert SAM to PAF +minimap2 --cs test/MT-*.fa | paf2aln.js - | less # pretty print base alignment minimap2 -cx splice ref.fa rna-seq.fq | splice2bed.js - # convert splice aln to BED12 ``` @@ -12,15 +14,15 @@ minimap2 -cx splice ref.fa rna-seq.fq | splice2bed.js - # convert splice aln t - [Getting Started](#started) - [Introduction](#intro) -- [Calling Variants from Assemblies](#asmvar) - [Format Conversion](#conv) - - [Convert PAF to other formats](#paf2aln) + - [Convert PAF to MAF or BLAST-like format](#paf2aln) - [Convert SAM to PAF](#sam2paf) - - [Convert GTF/GFF3 to BED12 format](#gff2bed) + - [Convert GTF/GFF3 to BED12](#gff2bed) - [Convert spliced alignment to BED12](#splice2bed) - [Evaluation](#eval) - [Evaluating mapping accuracy with simulated reads](#mapeval) - [Evaluating read overlap sensitivity](#oveval) +- [Calling Variants from Assemblies](#asmvar) ## Introduction @@ -28,40 +30,42 @@ This directory contains auxiliary scripts for format conversion, mapping accuracy evaluation and miscellaneous purposes. These scripts *require* the [k8 Javascript shell][k8] to run. On Linux or Mac, you can download the precompiled k8 binary with: + ```sh curl -L https://github.com/attractivechaos/k8/releases/download/v0.2.4/k8-0.2.4.tar.bz2 | tar -jxf - -cp k8-0.2.4/k8-`uname -s` k8 +cp k8-0.2.4/k8-`uname -s` $HOME/bin/k8 # assuming $HOME/bin in your $PATH ``` + It is highly recommended to copy the executable `k8` to a directory on your -`PATH` such as `/usr/bin/env` can find it. +`$PATH` such as `/usr/bin/env` can find it. Like python or perl scripts, once +you install `k8`, you can launch these k8 scripts either with -## Calling Variants from Assemblies +```sh +path/to/gff2bed.js anno.gtf.gz +``` -Script [paf2diff.js](paf2diff.js) calls variants from coordinate-sorted -assembly-to-reference alignment having the [cs tag][cs] (requiring the `--cs` -minimap2 option). +or with + +```sh +k8 path/to/gff2bed.js anno.gtf +``` + +All k8 scripts seamlessly work with both plain text files and gzip'd text files. ## Format Conversion -### Convert PAF to other formats +* Script [paf2aln.js](paf2aln.js) converts PAF with the + [cs tag][cs] to [MAF][maf] or BLAST-like output. It only works with minimap2 + output generated using the `--cs` tag. -Script [paf2aln.js](paf2aln.js) converts PAF with the [cs tag][cs] to -[MAF][maf] or BLAST-like output. It only works with minimap2 output generated -using the `--cs` tag. +* Script [sam2paf.js](sam2paf.js) converts alignments in + the SAM format to PAF. -### Convert SAM to PAF +* Script [gff2bed.js](gff2bed.js) converts GFF format to + 12-column BED format. It seamlessly works with both GTF and GFF3. -Script [sam2paf.js](sam2paf.js) converts alignments in the SAM format to PAF. - -### Convert GTF/GFF3 to BED12 format - -Script [gff2bed.js](gff2bed.js) converts GFF format to 12-column BED format. It -seamlessly works with both GTF and GFF3. - -### Convert spliced alignment to BED12 - -Script [splice2bed.js](splice2bed.js) converts spliced alignment in SAM or PAF -to 12-column BED format. +* Script [splice2bed.js](splice2bed.js) converts + spliced alignment in SAM or PAF to 12-column BED format. ## Evaluation @@ -96,10 +100,72 @@ mappings, accumulative mapping error rate and the accumulative number of mapped reads. The U-line gives the number of unmapped reads if they are present in the SAM file. +Suppose the reported mapping coordinate overlap with the true coordinate like +the following: + +``` +truth: -------------------- +mapper: ---------------------- + |<- l1 ->|<-- o -->|<-- l2 -->| +``` + +Let `r=o/(l1+o+l2)`. The reported mapping is considered correct if `r>0.1` by +default. + ### Evaluating read overlap sensitivity -Script [ov-eval.js](ov-eval.js) takes read-to-reference alignment in PAF and -read overlaps in PAF and evaluates the sensitivity. +Script [ov-eval.js](ov-eval.js) takes sorted read-to-reference alignment and +read overlaps in PAF as input, and evaluates the sensitivity. For example: + +```sh +minimap2 -cx map-pb ref.fa reads.fq.gz | sort -k6,6 -k8,8n > reads-to-ref.paf +minimap2 -x ava-pb reads.fq.gz reads.fq.gz > ovlp.paf +k8 ov-eval.js reads-to-ref.paf ovlp.paf +``` + +## Calling Variants from Haploid Assemblies + +Script [paf2diff.js](paf2diff.js) calls variants from coordinate-sorted +assembly-to-reference alignment. It calls variants from the [cs tag][cs] and +identifies confident/callable regions as those covered by exactly one contig. +Here are example command lines: + +```sh +minimap2 -cx asm5 -t8 --cs ref.fa asm.fa > asm.paf # keeping this file is recommended; --cs required! +sort -k6,6 -k8,8n asm.paf > asm.srt.paf # sort by reference start coordinate +k8 paf2diff.js asm.srt.paf > asm.var.txt +``` + +Here is sample output: + +``` +V chr1 3181702 3181703 1 60 c t +V chr1 3181730 3181768 1 60 gtcttacacacggagtcttacacacggtcttacacaca - +R chr1 3181796 3260557 +V chr1 3181818 3181822 1 60 tgcg - +V chr1 3181831 3181832 1 60 a g +V chr1 3181832 3181833 1 60 t c +V chr1 3181833 3181834 1 60 t g +V chr1 3181874 3181874 1 60 - ca +V chr1 3181879 3181880 1 60 g a +V chr1 3181886 3181887 1 60 c g +V chr1 3181911 3181911 1 60 - agtcttacacatgcagtcttacacat +V chr1 3181924 3181925 1 60 t c +V chr1 3182079 3182080 1 60 g a +V chr1 3182150 3182151 1 60 t c +V chr1 3182336 3182337 1 60 t c +``` + +where a line starting with `R` gives regions covered by one contig, and a +V-line encodes a variant in the following format: chr, start, end, contig +depth, mapping quality, REF allele and ALT allele. + +By default, when calling variants, this script ignores alignments 50kb or +shorter; when deriving callable regions, it ignores alignments 10kb or shorter. +It uses two thresholds to avoid edge effects. These defaults are designed for +long-read assemblies. For short reads, both should be reduced. + + [cs]: https://github.com/lh3/minimap2#cs [k8]: https://github.com/attractivechaos/k8 diff --git a/misc/sim-eval.js b/misc/sim-eval.js index 88ff770..67e208b 100755 --- a/misc/sim-eval.js +++ b/misc/sim-eval.js @@ -46,7 +46,16 @@ while ((c = getopt(arguments, "Q:r:m:c")) != null) { else if (c == 'c') cap_short_mapq = true; } -var file = arguments.length == getopt.ind || arguments[getopt.ind] == '-'? new File() : new File(arguments[getopt.ind]); +if (arguments.length == getopt.ind) { + warn("Usage: k8 sim-eval.js [options] |"); + warn("Options:"); + warn(" -r FLOAT mapping correct if overlap_length/union_length>FLOAT [" + ovlp_ratio + "]"); + warn(" -Q INT print wrong mappings with mapQ>INT [don't print]"); + warn(" -m INT 0: eval the longest aln only; 1: first aln only; 2: all primary aln [0]"); + exit(1); +} + +var file = arguments[getopt.ind] == '-'? new File() : new File(arguments[getopt.ind]); var buf = new Bytes(); var tot = [], err = [];