diff --git a/README.md b/README.md
index 8e4a9f0..44f97fe 100644
--- a/README.md
+++ b/README.md
@@ -34,6 +34,7 @@ man ./minimap2.1
     - [Map short accurate genomic reads](#short-genomic)
     - [Full genome/assembly alignment](#full-genome)
   - [Advanced features](#advanced)
+    - [Working CIGARs with >65535 operations in BAM](#long-cigar)
     - [The cs optional tag](#cs)
   - [Evaluation scripts](#eval)
   - [Algorithm overview](#algo)
@@ -178,6 +179,26 @@ according to the sequence divergence.
 
 ### Advanced features
 
+#### Working CIGARs with >65535 operations in BAM
+
+At present, BAM does not work with CIGAR strings with >65535 operations.
+However, aligning ultra-long nanopore reads with minimap2 may align ~1% of read
+bases with long CIGARs beyond the capability of BAM. If you convert such SAM to
+BAM, recent samtools will throw an error and abort. Older samtools and other
+tools may even silently create corrupted and unreadable BAMs.
+
+To avoid this issue, you can add option `-L` at the minimap2 command line.
+This option moves a long CIGAR to the `CG` tag and leaves a fully clipped CIGAR
+at the SAM CIGAR column. Current tools that don't read CIGAR (e.g. merging and
+sorting) still work with such BAM records; tools that read CIGAR will
+effectively ignore these records. I have pull requests to the SAM spec, htslib,
+htsjdk, bedtools2, Rsamtools and igv.js. If they are accepted, future versions
+of these tools will seamlessly recognize long-cigar records generated by option
+`-L`.
+
+In summary, if you work with ultra-long reads and use tools that only process
+BAM files, please add option `-L`.
+
 #### The cs optional tag
 
 The `cs` SAM/PAF tag encodes bases at mismatches and INDELs. It matches regular
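
The 65535-operation limit described in the added section can be checked on a SAM file before converting it to BAM. The following is a minimal Python sketch under stated assumptions, not part of minimap2 or samtools: the names `count_cigar_ops` and `exceeds_bam_limit` are hypothetical, and the input is assumed to be ordinary tab-separated SAM text with the CIGAR in column 6.

```python
import re

# BAM stores the number of CIGAR operations in a 16-bit field,
# which is where the 65535 ceiling comes from.
BAM_MAX_CIGAR_OPS = 65535

# One CIGAR operation: a length followed by a SAM operation code.
_CIGAR_OP = re.compile(r"\d+[MIDNSHP=X]")


def count_cigar_ops(cigar: str) -> int:
    """Return the number of operations in a SAM CIGAR string ('*' counts as 0)."""
    if cigar == "*":
        return 0
    ops = _CIGAR_OP.findall(cigar)
    # Reject strings that contain anything besides well-formed operations.
    if "".join(ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar[:40]!r}")
    return len(ops)


def exceeds_bam_limit(sam_line: str) -> bool:
    """Return True if a SAM alignment line has more CIGAR operations than BAM can hold."""
    if sam_line.startswith("@"):  # header lines carry no CIGAR
        return False
    fields = sam_line.rstrip("\n").split("\t")
    return count_cigar_ops(fields[5]) > BAM_MAX_CIGAR_OPS
```

Applying `exceeds_bam_limit` to each line of a SAM file produced without `-L` would flag the records that a BAM conversion cannot represent; if any are found, rerunning minimap2 with `-L` is the remedy the section above recommends.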