working on section "Full-Genome Alignment"

found an apparent bug in paftools.js call. To debug...
This commit is contained in:
Heng Li 2018-03-12 16:12:18 -04:00
parent bdc615c1d4
commit e3f226a9d9
1 changed files with 51 additions and 1 deletions

View File

@ -5,6 +5,9 @@
* [Mapping long reads](#map-pb)
* [Mapping Illumina paired-end reads](#map-sr)
* [Evaluating mapping accuracy with simulated reads (for developers)](#mapeval)
- [Full-Genome Alignment](#genome-aln)
* [Intra-species assembly alignment](#asm-to-ref)
* [Calling variants from assembly-to-reference alignment](#asm-var)
- [Read Overlap](#read-overlap)
* [Long-read overlap](#long-read-overlap)
* [Evaluating overlap sensitivity (for developers)](#ov-eval)
@ -39,7 +42,7 @@ line. **Please always make sure you are using an intended pre-built index.**
### <a name="map-sr"></a>Mapping Illumina paired-end reads:
```sh
minimap2 -ax sr ecoli_ref.fa ecoli_mason_1.fq ecoli_mason_2.fq > mapped-sr.sam
minimap2 -ax sr -t4 ecoli_ref.fa ecoli_mason_1.fq ecoli_mason_2.fq > mapped-sr.sam
```
### <a name="mapeval"></a>Evaluating mapping accuracy with simulated reads (for developers)
@ -78,6 +81,50 @@ paftools.js mason2fq tmp.sam | seqtk seq -1 > ecoli_mason_2.fq
## <a name="genome-aln"></a>Full-Genome Alignment
### <a name="asm-to-ref"></a>Intra-species assembly alignment
```sh
minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa > ecoli_canu.paf
```
Here `ecoli_canu.fa` is the Canu assembly of `ecoli_p6_25x_canu.fa`. This
command line outputs alignments in the [PAF format][paf]. Use `-a` instead of
`-c` to get output in the SAM format.
### <a name="asm-var"></a>Calling variants from assembly-to-reference alignment
```sh
# don't forget the "--cs" option; otherwise it doesn't work
minimap2 -cx asm5 --cs ecoli_ref.fa ecoli_canu.fa \
| sort -k6,6 -k8,8n \
| paftools.js call - > out.txt
```
The first few lines in `out.txt` are
```
R NC_000913 0 257899
V NC_000913 668 669 1 60 g - tig00000001 3439623 3439623 +
V NC_000913 896 897 1 60 g - tig00000001 3439850 3439850 +
V NC_000913 1304 1305 1 60 g - tig00000001 3440257 3440257 +
```
where an `R`-line gives the regions coverged by one contig; a `V`-line consists of:
1. Reference sequence name
2. Reference start
3. Reference end (a 0-based, half-close-half-open interval)
4. Contig coverage (please ignore a `V`-line if this number is not one!)
5. Mean mapping quality of contigs covering the site
6. Reference base ("-" for an insertion)
7. Contig base ("-" for a deletion)
8. Contig name
9. Contig start
10. Contig end
11. Contig strand
`paftools.js call` does not support VCF output. This feature is planned but has
not been implemented yet.
## <a name="read-overlap"></a>Read Overlap
### <a name="long-read-overlap"></a>Long read overlap
@ -86,6 +133,8 @@ paftools.js mason2fq tmp.sam | seqtk seq -1 > ecoli_mason_2.fq
minimap2 -x ava-pb ecoli_p6_25x_canu.fa ecoli_p6_25x_canu.fa > overlap.paf
# For Nanopore reads (ava-ont also works with PacBio but not as good):
minimap2 -x ava-ont -r 10000 ecoli_p6_25x_canu.fa ecoli_p6_25x_canu.fa > overlap.paf
# If you have miniasm installed:
miniasm -f ecoli_p6_25x_canu.fa overlap.paf > asm.gfa
```
Here we explicitly applied `-r 10000`. We are considering to set this as the
default for the `ava-ont` mode as this seems to improve the contiguity for
@ -120,3 +169,4 @@ minimap2 -ax splice -k14 -uf ref.fa reads.fa > aln.sam
[pbsim]: https://github.com/pfaucon/PBSIM-PacBio-Simulator
[mason2]: https://github.com/seqan/seqan/tree/master/apps/mason2
[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md