note on long cigar in README

Heng Li 2017-10-21 22:28:06 -04:00
parent beeb806829
commit 37e627aa98
1 changed files with 21 additions and 0 deletions


@@ -34,6 +34,7 @@ man ./minimap2.1
- [Map short accurate genomic reads](#short-genomic)
- [Full genome/assembly alignment](#full-genome)
- [Advanced features](#advanced)
- [Working with CIGARs with >65535 operations in BAM](#long-cigar)
- [The cs optional tag](#cs)
- [Evaluation scripts](#eval)
- [Algorithm overview](#algo)
@@ -178,6 +179,26 @@ according to the sequence divergence.
### <a name="advanced"></a>Advanced features
#### <a name="long-cigar"></a>Working with CIGARs with >65535 operations in BAM
At present, BAM cannot store CIGAR strings with >65535 operations.
However, when aligning ultra-long nanopore reads, minimap2 may place ~1% of read
bases in alignments whose CIGARs exceed this limit. If you convert such SAM to
BAM, recent samtools will throw an error and abort. Older samtools and other
tools may even silently create corrupted and unreadable BAMs.
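
To check whether a given alignment is affected, you can count the operations in
its CIGAR string. The following is a minimal Python sketch, not part of
minimap2; the function names `count_cigar_ops` and `exceeds_bam_limit` are
made up for illustration. It relies on BAM storing the operation count in a
16-bit unsigned field, which is where the 65535 limit comes from.

```python
import re

# One CIGAR operation is a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# BAM keeps the number of CIGAR operations in a 16-bit unsigned field.
BAM_MAX_CIGAR_OPS = 65535


def count_cigar_ops(cigar):
    """Return the number of operations in a SAM CIGAR string ('*' counts as 0)."""
    if cigar == "*":
        return 0
    return len(CIGAR_OP.findall(cigar))


def exceeds_bam_limit(cigar):
    """Return True if the CIGAR has more operations than BAM can record."""
    return count_cigar_ops(cigar) > BAM_MAX_CIGAR_OPS
```

For example, `exceeds_bam_limit(fields[5])` could be applied to the sixth
column of each SAM line before converting to BAM.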
To avoid this issue, you can add option `-L` on the minimap2 command line.
This option moves a long CIGAR to the `CG` tag and leaves a fully clipped CIGAR
in the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and
sorting) still work with such BAM records; tools that read CIGAR will
effectively ignore these records. I have pull requests to the SAM spec, htslib,
htsjdk, bedtools2, Rsamtools and igv.js. If they are accepted, future versions
of these tools will seamlessly recognize long-cigar records generated by option
`-L`.
In summary, if you work with ultra-long reads and use tools that only process
BAM files, please add option `-L`.
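
The idea behind `-L` can be sketched in Python as follows. This only
illustrates the concept and is not minimap2's code: the function name
`move_long_cigar` is hypothetical, and the sketch assumes that "fully clipped"
means a single soft clip spanning the whole read. The exact placeholder CIGAR
and the on-disk encoding of the `CG` tag may differ from what is shown here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
BAM_MAX_CIGAR_OPS = 65535          # limit of the 16-bit operation count in BAM
QUERY_CONSUMING = set("MIS=X")     # operations that consume read bases


def move_long_cigar(cigar):
    """Return (cigar_column, cg_value) for one alignment.

    Short CIGARs are returned unchanged with cg_value None. Long CIGARs are
    replaced by an assumed placeholder, a soft clip covering the whole read,
    and the real CIGAR is returned separately for storage in a CG tag.
    """
    ops = CIGAR_OP.findall(cigar)
    if len(ops) <= BAM_MAX_CIGAR_OPS:
        return cigar, None
    read_len = sum(int(n) for n, op in ops if op in QUERY_CONSUMING)
    return f"{read_len}S", cigar
```

A tool that ignores CIGAR sees an ordinary record, while a tool that reads
CIGAR sees only a clipped read until it learns to look at the `CG` tag.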
#### <a name="cs"></a>The cs optional tag
The `cs` SAM/PAF tag encodes bases at mismatches and INDELs. It matches regular