CRAM encoding internals
=======================

A quick summary of the functions involved.

The encoder works by accumulating a bunch of BAM records (via the
cram_put_bam_seq function), and at a certain point (eg a counter of
records, or switching reference) the array of BAM records is turned
into a container, which in turn creates slices, holding CRAM
data-series in blocks. The function that turns an array of BAM
objects into a container is below.

cram_encode_container func:
    Validate reference MD5s against the header, unless in no_ref mode
    If embed_ref <= 1, fetch ref
        Switch to embed_ref=2 if that failed

    Foreach slice:
        If embed_ref == 2
            call cram_generate_reference
            if failed, switch to no_ref mode
        Foreach sequence
            call process_one_read to append BAM onto each data series (DS)
                call cram_stats_add for each DS to gather metrics
                call cram_encode_aux

    # We now have cram DS, per slice
    call cram_encoder_init, per DS (based on cram_stats_add data)

    Foreach slice:
        call cram_encode_slice to turn DS into blocks
        call cram_compress_slice

    call cram_encode_compression_header
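
As a mental model of what this produces, the sketch below shows the
container -> slice -> block containment in a heavily simplified form.
The struct and field names are invented for illustration; the real
cram_container, cram_slice and cram_block definitions in htslib's cram
code carry many more fields.

    #include <stddef.h>

    /* Illustrative model only, not htslib's structs: a container holds
     * slices, each slice holds blocks, and each block carries the
     * encoded bytes of one or more data-series. */
    typedef struct {
        int content_id;       /* which data-series (or external block) this is */
        unsigned char *data;  /* encoded, then compressed, byte stream */
        size_t size;
    } toy_block;

    typedef struct {
        int nrec;             /* reads encoded into this slice */
        int nblocks;
        toy_block *blocks;    /* core block plus external blocks */
    } toy_slice;

    typedef struct {
        int nslices;
        toy_slice *slices;
        /* plus the compression header written by
         * cram_encode_compression_header, describing the encoding
         * used for each data-series */
    } toy_container;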

Threading
---------

CRAM can be multi-threaded, but this brings complications.

The cram_encode_container function above is the main CPU user, so it
is this bit which can be executed in parallel from multiple threads.
To understand this we now need to look at how the primary loop works
when writing a CRAM:

Encoding main thread:
    repeatedly calls cram_put_bam_seq
        calls cram_new_container on first time through to initialise
        calls cram_next_container when current is full or we need to flush
            calls cram_flush_container_mt to flush last container
        pushes BAM object onto current container

If non-threaded, cram_flush_container_mt does:
    call cram_flush_container
        call cram_encode_container to go from BAM to CRAM data-series
        call cram_flush_container2 (writes it out)

If threaded, cram_flush_container_mt does:
    Main:   Dispatch cram_flush_thread job
    Thread: call cram_encode_container to go from BAM to CRAM data-series
    Main:   Call cram_flush_result to drain queue of encoded containers
    Main:       Call cram_flush_container2 (writes it out)
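
From the application's point of view all of this is driven through the
normal htsFile API; the worker pool used by cram_flush_container_mt is
the one enabled with hts_set_threads on the output file. The minimal
sketch below (not taken from htslib's examples, error handling
abbreviated) converts a BAM to a CRAM and exercises the
cram_put_bam_seq path above:

    #include <htslib/sam.h>

    /* Sketch: convert BAM to CRAM with 4 worker threads. */
    int bam_to_cram(const char *in_bam, const char *out_cram,
                    const char *ref_fa) {
        samFile *in  = sam_open(in_bam, "r");
        samFile *out = sam_open(out_cram, "wc");   /* 'c' selects CRAM */
        if (!in || !out) return -1;

        hts_set_fai_filename(out, ref_fa);  /* reference used for encoding */
        hts_set_threads(out, 4);            /* worker pool for encoding */

        sam_hdr_t *hdr = sam_hdr_read(in);
        if (!hdr || sam_hdr_write(out, hdr) < 0) return -1;

        bam1_t *b = bam_init1();
        int r;
        while ((r = sam_read1(in, hdr, b)) >= 0)
            if (sam_write1(out, hdr, b) < 0)  /* ends up in cram_put_bam_seq */
                return -1;

        bam_destroy1(b);
        sam_hdr_destroy(hdr);
        sam_close(in);
        /* sam_close on the output flushes the final container and EOF block */
        return (sam_close(out) < 0 || r < -1) ? -1 : 0;
    }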

Decisions on when to create new containers, detection of sorted vs unsorted,
switching to multi-seq mode, etc occur in the main thread in
cram_put_bam_seq.

We can change our mind on container parameters at any point up until
the cram_encode_container call. At that point these parameters get
baked into the container compression header and all data-series
generated need to be in sync with those parameters.

It is possible that some parameter changes only get detected while
encoding the container, as that is where we fetch references. Eg the
need to enable embedded reference or to switch to non-ref mode.

While encoding a container, we can change the parameters for *this*
container, and we can also set the default parameters for subsequent
new containers via the global cram fd (to avoid spamming attempts to
load a reference which doesn't exist), but we cannot change other
containers that are being processed in parallel. They'll fend for
themselves.

References
----------

To avoid spamming the reference servers, there is a shared cache of
the references currently in use by all the worker threads (leading to
the confusing terminology of reference-counting of references). So
each container fetches its section of reference, but the memory for
that is handled via its own layer.

The shared references and ref meta-data are held in cram_fd -> refs (a
refs_t pointer):

    // References structure.
    struct refs_t {
        string_alloc_t *pool;  // String pool for holding filenames and SN vals

        khash_t(refs) *h_meta; // ref_entry*, index by name
        ref_entry **ref_id;    // ref_entry*, index by ID
        int nref;              // number of ref_entry

        char *fn;              // current file opened
        BGZF *fp;              // and the hFILE* to go with it.

        int count;             // how many cram_fd sharing this refs struct

        pthread_mutex_t lock;  // Mutex for multi-threaded updating
        ref_entry *last;       // Last queried sequence
        int last_id;           // Used in cram_ref_decr_locked to delay free
    };

Within this, ref_entry is the per-reference information:

    typedef struct ref_entry {
        char *name;            // sequence name (SN)
        char *fn;              // file the reference comes from
        int64_t length;        // sequence length; 0 means not yet loaded
        int64_t offset;        // offset of the sequence within fn
        int bases_per_line;    // fasta layout (from the .fai index):
        int line_length;       //   bases vs bytes per line
        int64_t count;         // for shared references so we know to dealloc seq
        char *seq;             // cached sequence, NULL until loaded
        mFILE *mf;             // in-memory copy when fetched via REF_PATH
        int is_md5;            // Reference comes from a raw seq found by MD5
        int validated_md5;     // MD5 has been checked against the header
    } ref_entry;

Sharing of references to track use between threads is via
cram_ref_incr* and cram_ref_decr* (which have locked and unlocked
variants). We free a reference when the usage count hits zero. To
avoid spamming discard and reload in single-threaded creation of a
pos-sorted CRAM, we keep track of the last reference in cram_fd and
delay the discard by one loop iteration.
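
The pattern is easier to see in isolation. Below is a small
self-contained model of that scheme (invented names, not the htslib
code): a mutex-protected cache of usage-counted sequences where the
actual free of a released entry is deferred by one call.

    #include <pthread.h>
    #include <stdlib.h>

    /* Toy model of the shared reference cache, not htslib's code:
     * usage-counted sequences behind one mutex, freed when unused, with
     * the most recently released entry kept alive for one extra call so
     * a pos-sorted single-threaded encoder doesn't free and then
     * immediately reload the same reference. */
    typedef struct {
        char *seq;      /* NULL until loaded */
        int count;      /* current number of users */
    } toy_ref;

    typedef struct {
        pthread_mutex_t lock;
        toy_ref *ref;   /* indexed by reference ID */
        int nref;
        int last_id;    /* entry whose free has been deferred, or -1 */
    } toy_refs;

    static void toy_ref_incr(toy_refs *r, int id) {
        pthread_mutex_lock(&r->lock);
        r->ref[id].count++;
        pthread_mutex_unlock(&r->lock);
    }

    static void toy_ref_decr(toy_refs *r, int id) {
        pthread_mutex_lock(&r->lock);
        if (--r->ref[id].count <= 0) {
            /* Free the previously deferred entry (if still unused), then
             * defer this one instead of freeing it straight away. */
            if (r->last_id >= 0 && r->last_id != id &&
                r->ref[r->last_id].count <= 0) {
                free(r->ref[r->last_id].seq);
                r->ref[r->last_id].seq = NULL;
            }
            r->last_id = id;
        }
        pthread_mutex_unlock(&r->lock);
    }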

There are complexities here around whether the references come from a
single ref.fa file, are from a local MD5sum cache with one file per
reference (mmapped), or whether they're fetched from some remote
REF_PATH query such as the EBI. (This latter case typically downloads
to a local md5-based ref-cache first and mmaps from there.)

The refs struct starts off by being populated from the SAM header. We
have the M5 tag and name known, maybe a filename, but length is 0 and
seq is NULL. This is done by cram_load_reference:

cram_load_reference (cram_fd, filename):
    if filename non-NULL
        call refs_load_fai
            Populates ref_entry with filename, name, length, line-len, etc
        sanitise_SQ_lines
    If no refs loaded
        call refs_from_header
            populates ref_entry with name.
            Sets length=0 as marker for not-yet-loaded
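
For illustration, the header information this starts from is the @SQ
lines, which give the sequence name, its length and (for CRAM use) the
M5 checksum, plus an optional UR location. The values below are made
up, and the fields are tab-separated in a real header:

    @SQ  SN:chr20  LN:64444167  M5:<32-hex-digit md5 of the sequence>  UR:file:///path/to/ref.fa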

The main interface used from the code is cram_get_ref(). It takes a
reference ID, start and end coordinate and returns a pointer to the
relevant sub-sequence.

cram_get_ref:
    r = fd->refs->ref_id[id]; // current ref
    call cram_populate_ref if stored length is 0 (ie no ref.fa loaded yet)
        search REF_PATH / REF_CACHE
        call bgzf_open if local_path
        call open_path_mfile otherwise
        copy to local REF_CACHE if required (eg remote fetch)

    If start = 1 and end = ref-length
        If ref seq unknown
            call cram_ref_load to load entire ref and use that

    If ref seq now known, return it

    // Otherwise known via .fai or we've errored by now.
    call load_ref_portion to return a sub-seq from the indexed fasta
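
A typical call sequence from the using code looks roughly like the
sketch below. The exact prototypes live in htslib's internal cram
headers, are not part of the installed public API, and have changed
over time (eg coordinate types), so treat the signatures here as
assumptions:

    /* Assumed prototypes (internal to htslib, subject to change):
     *   char *cram_get_ref(cram_fd *fd, int id, int start, int end);
     *   void  cram_ref_decr(refs_t *r, int id);
     * Coordinates are 1-based inclusive. */
    static int use_reference(cram_fd *fd, int ref_id, int start, int end) {
        char *seq = cram_get_ref(fd, ref_id, start, end);
        if (!seq)
            return -1;   /* eg unknown reference and not in no_ref mode */

        /* ... encode or decode against the returned bases for [start,end] ... */

        cram_ref_decr(fd->refs, ref_id);  /* release our usage count when done */
        return 0;
    }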

The encoder asks for the entire reference rather than a small portion
of it as we're usually encoding a large amount. The decoder may be
dealing with small range queries, so it only asks for the relevant
sub-section of reference as specified in the cram slice headers.

TODO
====

- Multi-ref mode is enabled when we have too many small containers in
  a row.

  Instead of firing off new containers when we switch reference, we
  could always make a new container after N records, separating off
  M <= N to make the container such that all M are the same reference,
  and shuffling any remaining N-M down as the start of the next (a
  sketch of this splitting step follows after this list).

  This means we can detect how many new containers we would create,
  and enable multi-ref mode straight away rather than keeping a recent
  history of how many small containers we've emitted.

- The cache of references currently being used is a better place than
  cram_fd to track the global embed-ref and non-ref logic. The cram_fd
  setting is a one-way change: once we enable non-ref we stick with it.

  However if it was per-ref in the ref-cache then we'd probe and try
  each reference once, and then all new containers for that ref would
  honour the per-ref parameters. So a single missing reference in the
  middle of a large file wouldn't change behaviour for all subsequent
  references.

  Optionally we could still do meta-analysis on how many references
  are failing, and switch the global cram_fd params to avoid repeated
  testing of reference availability if it's becoming obvious that none
  of them are known.
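
To make the first idea above concrete, a hypothetical helper (not
existing code) for the splitting step could look like this: given the
reference IDs of the N buffered records it returns M, the length of
the leading same-reference run; the caller would emit records [0,M) as
the container and shuffle [M,N) down to start the next.

    /* Hypothetical sketch for the proposed splitting scheme, not htslib code. */
    static int same_ref_prefix(const int *ref_ids, int n) {
        int m;
        if (n <= 0)
            return 0;
        for (m = 1; m < n && ref_ids[m] == ref_ids[0]; m++)
            ;
        return m;   /* emit [0,m) now; carry [m,n) over to the next container */
    }

If one batch of N records would split into many small same-reference
runs, that is immediate evidence for enabling multi-ref mode, without
waiting on a history of previously emitted containers.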