CRAM encoding internals
=======================

A quick summary of the functions involved.

The encoder works by accumulating a bunch of BAM records (via the
cram_put_bam_seq function), and at a certain point (eg a counter of
records, or switching reference) the array of BAM records is turned
into a container, which in turn creates slices, holding CRAM
data-series in blocks. The function that turns an array of BAM
objects into a container is below.

cram_encode_container func:
    Validate reference MD5s against the header, unless in no_ref mode
    If embed_ref <= 1, fetch ref
        Switch to embed_ref=2 if that failed

    Foreach slice:
        If embed_ref == 2
            call cram_generate_reference
            if failed, switch to no_ref mode
        Foreach sequence
            call process_one_read to append BAM onto each data series (DS)
                call cram_stats_add for each DS to gather metrics
                call cram_encode_aux

    # We now have cram DS, per slice
    call cram_encoder_init, per DS (based on cram_stats_add data)

    Foreach slice:
        call cram_encode_slice to turn DS into blocks
        call cram_compress_slice

    call cram_encode_compression_header
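
As a mental model of what this produces, the sketch below shows the
container -> slice -> block containment in a heavily simplified form.
The struct and field names are invented for illustration; the real
cram_container, cram_slice and cram_block definitions in htslib's cram
code carry many more fields.

    #include <stddef.h>

    /* Illustrative model only, not htslib's structs: a container holds
     * slices, each slice holds blocks, and each block carries the
     * encoded bytes of one or more data-series. */
    typedef struct {
        int content_id;       /* which data-series (or external block) this is */
        unsigned char *data;  /* encoded, then compressed, byte stream */
        size_t size;
    } toy_block;

    typedef struct {
        int nrec;             /* reads encoded into this slice */
        int nblocks;
        toy_block *blocks;    /* core block plus external blocks */
    } toy_slice;

    typedef struct {
        int nslices;
        toy_slice *slices;
        /* plus the compression header written by
         * cram_encode_compression_header, describing the encoding
         * used for each data-series */
    } toy_container;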

Threading
---------

CRAM can be multi-threaded, but this brings complications.

The cram_encode_container function above is the main CPU user, so it
is this bit which can be executed in parallel from multiple threads.
To understand this we now need to look at how the primary loop works
when writing a CRAM:

Encoding main thread:
    repeatedly calls cram_put_bam_seq
        calls cram_new_container on first time through to initialise
        calls cram_next_container when current is full or we need to flush
            calls cram_flush_container_mt to flush last container
        pushes BAM object onto current container

If non-threaded, cram_flush_container_mt does:
    call cram_flush_container
        call cram_encode_container to go from BAM to CRAM data-series
        call cram_flush_container2 (writes it out)

If threaded, cram_flush_container_mt does:
    Main:   Dispatch cram_flush_thread job
    Thread: call cram_encode_container to go from BAM to CRAM data-series
    Main:   Call cram_flush_result to drain queue of encoded containers
    Main:       Call cram_flush_container2 (writes it out)
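
From the application's point of view all of this is driven through the
normal htsFile API; the worker pool used by cram_flush_container_mt is
the one enabled with hts_set_threads on the output file. The minimal
sketch below (not taken from htslib's examples, error handling
abbreviated) converts a BAM to a CRAM and exercises the
cram_put_bam_seq path above:

    #include <htslib/sam.h>

    /* Sketch: convert BAM to CRAM with 4 worker threads. */
    int bam_to_cram(const char *in_bam, const char *out_cram,
                    const char *ref_fa) {
        samFile *in  = sam_open(in_bam, "r");
        samFile *out = sam_open(out_cram, "wc");   /* 'c' selects CRAM */
        if (!in || !out) return -1;

        hts_set_fai_filename(out, ref_fa);  /* reference used for encoding */
        hts_set_threads(out, 4);            /* worker pool for encoding */

        sam_hdr_t *hdr = sam_hdr_read(in);
        if (!hdr || sam_hdr_write(out, hdr) < 0) return -1;

        bam1_t *b = bam_init1();
        int r;
        while ((r = sam_read1(in, hdr, b)) >= 0)
            if (sam_write1(out, hdr, b) < 0)  /* ends up in cram_put_bam_seq */
                return -1;

        bam_destroy1(b);
        sam_hdr_destroy(hdr);
        sam_close(in);
        /* sam_close on the output flushes the final container and EOF block */
        return (sam_close(out) < 0 || r < -1) ? -1 : 0;
    }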

Decisions on when to create new containers, detection of sorted vs unsorted,
switching to multi-seq mode, etc occur in the main thread in
cram_put_bam_seq.

We can change our mind on container parameters at any point up until
the cram_encode_container call. At that point these parameters get
baked into the container compression header and all data-series
generated need to be in sync with those parameters.

It is possible that some parameter changes only get detected while
encoding the container, as that is where we fetch references. Eg the
need to enable embedded reference or to switch to non-ref mode.

While encoding a container, we can change the parameters for *this*
container, and we can also set the default parameters for subsequent
new containers via the global cram fd (to avoid spamming attempts to
load a reference which doesn't exist), but we cannot change other
containers that are being processed in parallel. They'll fend for
themselves.

References
----------

To avoid spamming the reference servers, there is a shared cache of
the references currently in use by all the worker threads (leading to
the confusing terminology of reference-counting of references). So
each container fetches its section of reference, but the memory for
that is handled via its own layer.

The shared references and ref meta-data are held in cram_fd -> refs (a
refs_t pointer):

    // References structure.
    struct refs_t {
        string_alloc_t *pool;  // String pool for holding filenames and SN vals

        khash_t(refs) *h_meta; // ref_entry*, index by name
        ref_entry **ref_id;    // ref_entry*, index by ID
        int nref;              // number of ref_entry

        char *fn;              // current file opened
        BGZF *fp;              // and the hFILE* to go with it.

        int count;             // how many cram_fd sharing this refs struct

        pthread_mutex_t lock;  // Mutex for multi-threaded updating
        ref_entry *last;       // Last queried sequence
        int last_id;           // Used in cram_ref_decr_locked to delay free
    };

Within this, ref_entry is the per-reference information:

    typedef struct ref_entry {
        char *name;            // sequence name (SN)
        char *fn;              // file the reference comes from
        int64_t length;        // sequence length; 0 means not yet loaded
        int64_t offset;        // offset of the sequence within fn
        int bases_per_line;    // fasta layout (from the .fai index):
        int line_length;       //   bases vs bytes per line
        int64_t count;         // for shared references so we know to dealloc seq
        char *seq;             // cached sequence, NULL until loaded
        mFILE *mf;             // in-memory copy when fetched via REF_PATH
        int is_md5;            // Reference comes from a raw seq found by MD5
        int validated_md5;     // MD5 has been checked against the header
    } ref_entry;

Sharing of references to track use between threads is via
cram_ref_incr* and cram_ref_decr* (which have locked and unlocked
variants). We free a reference when the usage count hits zero. To
avoid spamming discard and reload in single-threaded creation of a
pos-sorted CRAM, we keep track of the last reference in cram_fd and
delay the discard by one loop iteration.
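
The pattern is easier to see in isolation. Below is a small
self-contained model of that scheme (invented names, not the htslib
code): a mutex-protected cache of usage-counted sequences where the
actual free of a released entry is deferred by one call.

    #include <pthread.h>
    #include <stdlib.h>

    /* Toy model of the shared reference cache, not htslib's code:
     * usage-counted sequences behind one mutex, freed when unused, with
     * the most recently released entry kept alive for one extra call so
     * a pos-sorted single-threaded encoder doesn't free and then
     * immediately reload the same reference. */
    typedef struct {
        char *seq;      /* NULL until loaded */
        int count;      /* current number of users */
    } toy_ref;

    typedef struct {
        pthread_mutex_t lock;
        toy_ref *ref;   /* indexed by reference ID */
        int nref;
        int last_id;    /* entry whose free has been deferred, or -1 */
    } toy_refs;

    static void toy_ref_incr(toy_refs *r, int id) {
        pthread_mutex_lock(&r->lock);
        r->ref[id].count++;
        pthread_mutex_unlock(&r->lock);
    }

    static void toy_ref_decr(toy_refs *r, int id) {
        pthread_mutex_lock(&r->lock);
        if (--r->ref[id].count <= 0) {
            /* Free the previously deferred entry (if still unused), then
             * defer this one instead of freeing it straight away. */
            if (r->last_id >= 0 && r->last_id != id &&
                r->ref[r->last_id].count <= 0) {
                free(r->ref[r->last_id].seq);
                r->ref[r->last_id].seq = NULL;
            }
            r->last_id = id;
        }
        pthread_mutex_unlock(&r->lock);
    }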

There are complexities here around whether the references come from a
single ref.fa file, are from a local MD5sum cache with one file per
reference (mmapped), or whether they're fetched from some remote
REF_PATH query such as the EBI. (This latter case typically downloads
to a local md5-based ref-cache first and mmaps from there.)

The refs struct starts off by being populated from the SAM header. We
have the M5 tag and name known, maybe a filename, but length is 0 and
seq is NULL. This is done by cram_load_reference:

cram_load_reference (cram_fd, filename):
    if filename non-NULL
        call refs_load_fai
            Populates ref_entry with filename, name, length, line-len, etc
        sanitise_SQ_lines
    If no refs loaded
        call refs_from_header
            populates ref_entry with name.
            Sets length=0 as marker for not-yet-loaded
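
For illustration, the header information this starts from is the @SQ
lines, which give the sequence name, its length and (for CRAM use) the
M5 checksum, plus an optional UR location. The values below are made
up, and the fields are tab-separated in a real header:

    @SQ  SN:chr20  LN:64444167  M5:<32-hex-digit md5 of the sequence>  UR:file:///path/to/ref.fa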

The main interface used from the code is cram_get_ref(). It takes a
reference ID, start and end coordinate and returns a pointer to the
relevant sub-sequence.

cram_get_ref:
    r = fd->refs->ref_id[id]; // current ref
    call cram_populate_ref if stored length is 0 (ie no ref.fa loaded yet)
        search REF_PATH / REF_CACHE
        call bgzf_open if local_path
        call open_path_mfile otherwise
        copy to local REF_CACHE if required (eg remote fetch)

    If start = 1 and end = ref-length
        If ref seq unknown
            call cram_ref_load to load entire ref and use that

    If ref seq now known, return it

    // Otherwise known via .fai or we've errored by now.
    call load_ref_portion to return a sub-seq from the indexed fasta
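
A typical call sequence from the using code looks roughly like the
sketch below. The exact prototypes live in htslib's internal cram
headers, are not part of the installed public API, and have changed
over time (eg coordinate types), so treat the signatures here as
assumptions:

    /* Assumed prototypes (internal to htslib, subject to change):
     *   char *cram_get_ref(cram_fd *fd, int id, int start, int end);
     *   void  cram_ref_decr(refs_t *r, int id);
     * Coordinates are 1-based inclusive. */
    static int use_reference(cram_fd *fd, int ref_id, int start, int end) {
        char *seq = cram_get_ref(fd, ref_id, start, end);
        if (!seq)
            return -1;   /* eg unknown reference and not in no_ref mode */

        /* ... encode or decode against the returned bases for [start,end] ... */

        cram_ref_decr(fd->refs, ref_id);  /* release our usage count when done */
        return 0;
    }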

The encoder asks for the entire reference rather than a small portion
of it as we're usually encoding a large amount. The decoder may be
dealing with small range queries, so it only asks for the relevant
sub-section of reference as specified in the cram slice headers.

TODO
====

- Multi-ref mode is enabled when we have too many small containers in
  a row.

  Instead of firing off new containers when we switch reference, we
  could always make a new container after N records, separating off
  M <= N to make the container such that all M are the same reference,
  and shuffling any remaining N-M down as the start of the next (a
  sketch of this splitting step follows after this list).

  This means we can detect how many new containers we would create,
  and enable multi-ref mode straight away rather than keeping a recent
  history of how many small containers we've emitted.

- The cache of references currently being used is a better place than
  cram_fd to track the global embed-ref and non-ref logic. The cram_fd
  setting is a one-way change: once we enable non-ref we stick with it.

  However if it was per-ref in the ref-cache then we'd probe and try
  each reference once, and then all new containers for that ref would
  honour the per-ref parameters. So a single missing reference in the
  middle of a large file wouldn't change behaviour for all subsequent
  references.

  Optionally we could still do meta-analysis on how many references
  are failing, and switch the global cram_fd params to avoid repeated
  testing of reference availability if it's becoming obvious that none
  of them are known.
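
To make the first idea above concrete, a hypothetical helper (not
existing code) for the splitting step could look like this: given the
reference IDs of the N buffered records it returns M, the length of
the leading same-reference run; the caller would emit records [0,M) as
the container and shuffle [M,N) down to start the next.

    /* Hypothetical sketch for the proposed splitting scheme, not htslib code. */
    static int same_ref_prefix(const int *ref_ids, int n) {
        int m;
        if (n <= 0)
            return 0;
        for (m = 1; m < n && ref_ids[m] == ref_ids[0]; m++)
            ;
        return m;   /* emit [0,m) now; carry [m,n) over to the next container */
    }

If one batch of N records would split into many small same-reference
runs, that is immediate evidence for enabling multi-ref mode, without
waiting on a history of previously emitted containers.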