# HTSlib 64 bit reference positions
HTSlib version 1.10 onwards uses 64 bit reference positions internally. This
is to support analysis of species like axolotl, tulip and marbled lungfish
which have, or are expected to have, chromosomes longer than two gigabases.
# File format support
Currently 64 bit positions can only be stored in SAM and VCF format files.
Binary BAM, CRAM and BCF cannot be used due to limitations in the formats
themselves. As SAM and VCF are text formats, they have no limit on the
size of numeric values. Note that while 64 bit positions are supported by
default for SAM, for VCF they must be enabled explicitly at compile time
by editing `Makefile` and adding `-DVCF_ALLOW_INT64=1` to `CFLAGS`.
# Compatibility issues to check
Various data structure members, function parameters, and return values have
been expanded from 32 to 64 bits. As a result, some changes may be needed to
code that uses the library, even if it does not support long references.
## Variadic functions taking format strings
The type of various structure members (e.g. `bam1_core_t::pos`) and return
values from some functions (e.g. `bam_cigar2rlen()`) have been changed to
`hts_pos_t`, which is a 64-bit signed integer. Using these in 32-bit
code will generally work (as long as the stored positions are within range),
however care needs to be taken when these values are passed directly
to functions like `printf()` which take a variable-length argument list and
a format string.
Header file `htslib/hts.h` defines macro `PRIhts_pos` which can be
used in `printf()` format strings to get the correct format specifier for
an `hts_pos_t` value. Code that needs to print positions should be
changed from:
```c
printf("Position is %d\n", bam->core.pos);
```
to:
```c
printf("Position is %"PRIhts_pos"\n", bam->core.pos);
```
If for some reason compatibility with older versions of HTSlib (which do
not have `hts_pos_t` or `PRIhts_pos`) is needed, the value can be cast to
`int64_t` and printed as an explicitly 64-bit value:
```c
#include <inttypes.h> // For PRId64 and int64_t
printf("Position is %" PRId64 "\n", (int64_t) bam->core.pos);
```
Passing incorrect types to variadic functions like `printf()` can lead
to incorrect behaviour and security risks, so it is important to track down
and fix all of the places where this may happen. Modern C compilers like
gcc (version 3.0 onwards) and clang can check `printf()` and `scanf()`
parameter types for compatibility against the format string. To
enable this, build code with `-Wall` or `-Wformat` and fix all the
reported warnings.
Where functions that take `printf`-style format strings are implemented,
they should use the appropriate gcc attributes to enable format string
checking. `htslib/hts_defs.h` includes macros `HTS_FORMAT` and
`HTS_PRINTF_FMT` which can be used to provide the attribute declaration
in a portable way. For example, `test/sam.c` uses them for a function
that prints error messages:
```c
void HTS_FORMAT(HTS_PRINTF_FMT, 1, 2) fail(const char *fmt, ...) { /* ... */ }
```
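For code that cannot include `htslib/hts_defs.h`, the same idea can be sketched with the compiler attribute directly. This is a minimal illustration, not HTSlib's implementation: the macro `MY_PRINTF_CHECK` and the function `format_msg()` are hypothetical names, and the attribute is only applied when a GCC-compatible compiler is detected.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for HTS_FORMAT(HTS_PRINTF_FMT, fmt, args):
 * expands to the gcc/clang format attribute, or to nothing elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define MY_PRINTF_CHECK(fmt_idx, arg_idx) \
    __attribute__((format(printf, fmt_idx, arg_idx)))
#else
#define MY_PRINTF_CHECK(fmt_idx, arg_idx)
#endif

/* Format a message into buf.  Parameter 3 is the format string and the
 * variadic arguments start at parameter 4, so the compiler can check
 * callers when -Wformat is enabled.  Returns the vsnprintf() result. */
static int MY_PRINTF_CHECK(3, 4)
format_msg(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vsnprintf(buf, size, fmt, args);
    va_end(args);
    return n;
}
```

With the attribute in place, a call that passes a 64-bit value to a `%d` conversion should be reported when building with `-Wall` or `-Wformat`, in the same way as a direct call to `printf()`.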
## Implicit type conversions
Conversion of signed `int` or `int32_t` to `hts_pos_t` will always work.
Conversion of `hts_pos_t` to `int` or `int32_t` will work as long as the value
converted is within the range that can be stored in the destination.
Code that converts unsigned `uint32_t` values to a signed type and expects
large values to come out negative will no longer work, because `hts_pos_t`
can store values over `UINT32_MAX`. Such code should be changed to use signed values.
Functions `hts_parse_region()` and `hts_parse_reg64()` return special value
`HTS_POS_MAX` for regions which extend to the end of the reference.
This value is slightly smaller than `INT64_MAX`, but should be larger than
any reference that is likely to be used. When cast to `int32_t` the
result should be `INT32_MAX`.
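Where a 64-bit position has to be handed to code that still expects 32 bits, an explicit range check avoids silent truncation. A minimal sketch follows. `pos_to_int32()` is a hypothetical helper, not part of HTSlib, and `int64_t` stands in for `hts_pos_t` so that the example does not depend on HTSlib headers.

```c
#include <stdint.h>

/* Hypothetical helper: narrow a 64-bit position (int64_t stands in for
 * hts_pos_t) to int32_t.  Returns 0 and stores the value on success,
 * or -1 if the position cannot be represented in 32 bits. */
static int pos_to_int32(int64_t pos, int32_t *out)
{
    if (pos < INT32_MIN || pos > INT32_MAX)
        return -1;
    *out = (int32_t) pos;
    return 0;
}
```

Callers can then decide how to handle the failure case, for example by reporting that the reference is too long for the 32-bit code path.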
# Upgrading code to work with 64 bit positions
Variables used to store reference positions should be changed to
type `hts_pos_t`. Use `PRIhts_pos` in format strings when printing them.
When converting positions stored in strings, use `strtoll()` in place of
`atoi()` or `strtol()` (which produces a 32-bit value on 64-bit Windows and
all 32-bit platforms).
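The string conversion might be wrapped along the following lines. This is a sketch only: `parse_pos()` is a hypothetical helper, `int64_t` stands in for `hts_pos_t`, and `long long` is assumed to be 64 bits wide.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a decimal position from str into *out
 * (int64_t stands in for hts_pos_t).  Returns 0 on success, or -1 if
 * str has no digits, has trailing characters, or is out of range. */
static int parse_pos(const char *str, int64_t *out)
{
    char *end;
    long long val;

    errno = 0;
    val = strtoll(str, &end, 10);
    if (end == str || *end != '\0')
        return -1;              /* no digits, or trailing characters */
    if (errno == ERANGE)
        return -1;              /* value does not fit in long long */
    *out = (int64_t) val;
    return 0;
}
```

Unlike `atoi()`, this reports malformed or out-of-range input instead of returning an unspecified value.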
Programs which need to look up a reference sequence length from a `sam_hdr_t`
structure should use `sam_hdr_tid2len()` instead of the old
`sam_hdr_t::target_len` array (which is left as 32-bit for reasons of
compatibility). `sam_hdr_tid2len()` returns `hts_pos_t`, so works correctly
for large references.
Various functions which take pointer arguments have new versions which
support `hts_pos_t *` arguments. Code supporting 64-bit positions should
use the new versions. These are:

| Original function    | 64-bit version                               |
| -------------------- | -------------------------------------------- |
| `fai_fetch()`        | `fai_fetch64()`                              |
| `fai_fetchqual()`    | `fai_fetchqual64()`                          |
| `faidx_fetch_seq()`  | `faidx_fetch_seq64()`                        |
| `faidx_fetch_qual()` | `faidx_fetch_qual64()`                       |
| `hts_parse_reg()`    | `hts_parse_reg64()` or `hts_parse_region()`  |
| `bam_plp_auto()`     | `bam_plp64_auto()`                           |
| `bam_plp_next()`     | `bam_plp64_next()`                           |
| `bam_mplp_auto()`    | `bam_mplp64_auto()`                          |

Limited support has been added for 64-bit INFO values in VCF files, for large
values in structural variant END tags. New functions `bcf_update_info_int64()`
and `bcf_get_info_int64()` can be used to set and fetch 64-bit INFO values.
They both take arrays of `int64_t`. `bcf_int64_missing` and
`bcf_int64_vector_end` can be used to set missing and vector end values in
these arrays. The INFO data is stored in the minimum size needed, so there
is no harm in using these functions to store smaller integer values.
# Structure members that have changed size
```
File htslib/hts.h:
hts_pair32_t::begin
hts_pair32_t::end
(typedef hts_pair_pos_t is provided as a better-named replacement for hts_pair32_t)
hts_reglist_t::min_beg
hts_reglist_t::max_end
hts_itr_t::beg
hts_itr_t::end
hts_itr_t::curr_beg
hts_itr_t::curr_end
File htslib/regidx.h:
reg_t::start
reg_t::end
File htslib/sam.h:
bam1_core_t::pos
bam1_core_t::mpos
bam1_core_t::isize
File htslib/synced_bcf_reader.h:
bcf_sr_regions_t::start
bcf_sr_regions_t::end
bcf_sr_regions_t::prev_start
File htslib/vcf.h:
bcf_idinfo_t::info
bcf_info_t::v1::i
bcf1_t::pos
bcf1_t::rlen
```
# Functions where parameters or the return value have changed size
Functions are annotated as follows:
* `[new]` The function has been added since version 1.9
* `[parameters]` Function parameters have changed size
* `[return]` Function return value has changed size
```
File htslib/faidx.h:
[new] fai_fetch64()
[new] fai_fetchqual64()
[new] faidx_fetch_seq64()
[new] faidx_fetch_qual64()
[new] fai_parse_region()
File htslib/hts.h:
[parameters] hts_idx_push()
[new] hts_parse_reg64()
[parameters] hts_itr_query()
[parameters] hts_reg2bin()
File htslib/kstring.h:
[new] kputll()
File htslib/regidx.h:
[parameters] regidx_overlap()
File htslib/sam.h:
[new] sam_hdr_tid2len()
[return] bam_cigar2qlen()
[return] bam_cigar2rlen()
[return] bam_endpos()
[parameters] bam_itr_queryi()
[parameters] sam_itr_queryi()
[new] bam_plp64_next()
[new] bam_plp64_auto()
[new] bam_mplp64_auto()
[parameters] sam_cap_mapq()
[parameters] sam_prob_realn()
File htslib/synced_bcf_reader.h:
[parameters] bcf_sr_seek()
[parameters] bcf_sr_regions_overlap()
File htslib/tbx.h:
[parameters] tbx_readrec()
File htslib/vcf.h:
[parameters] bcf_readrec()
[new] bcf_update_info_int64()
[new] bcf_get_info_int64()
[return] bcf_dec_int1()
[return] bcf_dec_typed_int1()
```