1741 lines
73 KiB
Markdown
1741 lines
73 KiB
Markdown
# HTS API
|
|
|
|
## HTSLib APIs and samtools
|
|
|
|
HTSLib is a C library implementation used to access and process the genome
|
|
sequence data. HTSLib implements multiple API interfaces, HTS API, VCF API and
|
|
SAM API. HTS API provides a framework for use by other APIs and applications,
|
|
implements bgzf compression, htscodecs and provides CRAM format support. VCF
|
|
APIs work with variant data in VCF and BCF format.
|
|
|
|
SAM API works with sequence data of different formats, SAM / BAM / CRAM /
|
|
FASTA / FASTQ, and provides methods to do operations on the data. It uses
|
|
methods from HTS API.
|
|
|
|
'samtools' is the utility used to read and modify sequence data. It uses SAM
|
|
APIs from HTSLib to work on the sequence data.
|
|
|
|
|
|
## About this document
|
|
|
|
There are a number of demonstration utilities and their source code in
|
|
'samples' directory of HTSLib and this document gives the description of them
|
|
and the usage of API of HTSLib. The samples are for demonstration
|
|
purposes only and proper error handling is required for actual usage. This
|
|
document is based on HTSLib version 1.17.
|
|
|
|
Updates to this document may be made along with later releases when required.
|
|
|
|
|
|
## The sample apps
|
|
|
|
Flags - This application showcases the basic read of alignment files and flag
|
|
access. It reads and shows the count of read1 and read2 alignments.
|
|
|
|
Split - This application showcases the basic read and write of alignment data.
|
|
It saves the read1 and read2 as separate files in given directory, one as sam
|
|
and other as bam.
|
|
|
|
Split2 - This application showcases the output file format selection. It saves
|
|
the read1 and read2 as separate files in given directory, both as compressed
|
|
sam though the extensions are different.
|
|
|
|
Cram - This application showcases the different way in which cram reference
|
|
data is used for cram output creation.
|
|
|
|
Read_fast - This application showcases the fasta/fastq data read.
|
|
|
|
Read_header - This application showcases the read and access of header data.
|
|
It can show all header line of given type, data of a given tag on a specific
|
|
header line or for all lines of given type.
|
|
|
|
Read_ref - This application showcases the read and access of header data.
|
|
It shows all reference names which has length equal or greater to given input.
|
|
|
|
Read_bam - This application showcases read of different alignment data fields.
|
|
It shows contents of each alignment.
|
|
|
|
Read_aux - This application showcases read of specific auxiliary tag data in
|
|
alignment. It shows the data retrieved using 2 APIs, one as a string with tag
|
|
data and other as raw data alternatively.
|
|
|
|
Dump_aux - This application showcases read of all auxiliary tag data one by one
|
|
in an alignment. It shows the data retrieved.
|
|
|
|
Add_header - This application showcases the write of header lines to a file.
|
|
It adds header line of types, SQ, RG, PG and CO and writes to standard output.
|
|
|
|
Remove_header - This application showcases removal of header line from a file.
|
|
It removes either all header lines of given type or one specific line of given
|
|
type with given unique identifier. Modified header is written on standard
|
|
output.
|
|
|
|
Update_header - This application shows the update of header line fields, where
|
|
update is allowed. It takes the header line type, unique identifier for the
|
|
line, tag to be modified and the new value. Updated data is written on standard
|
|
output.
|
|
|
|
Mod_bam - This application showcases the update of alignment data. It takes
|
|
alignment name, position of field to be modified and new value of it.
|
|
Modified data is written on standard output.
|
|
|
|
Mod_aux - This application showcases the update of auxiliary data in alignment.
|
|
It takes alignment name, tag to be modified, its type and new value. Modified
|
|
data is written on standard output.
|
|
|
|
Mod_aux_ba - This application showcases the update of auxiliary array data in
|
|
alignment. It adds count of ATCGN base as an array in auxiliary data, BA:I.
|
|
Modified data is written on standard output.
|
|
|
|
Write_fast - This application showcases the fasta/fastq data write. It appends
|
|
data to given file.
|
|
|
|
Index_write - This application showcases the creation of index along with
|
|
output creation. Based on file type and shift, it creates bai, csi or crai
|
|
files.
|
|
|
|
Index_fast - This application showcases the index creation on fasta/fastq
|
|
reference files.
|
|
|
|
Read_reg - This application showcases the usage of region specification in
|
|
alignment read.
|
|
|
|
Read_multireg - This application showcases the usage of multiple region
|
|
specification in alignment read.
|
|
|
|
Read_fast_index - This application showcases the fasta/fastq data read using
|
|
index.
|
|
|
|
Pileup - This application showcases the pileup api, where all alignments
|
|
covering a reference position are accessed together. It displays the bases
|
|
covering each position on standard output.
|
|
|
|
Mpileup - This application showcases the mpileup api, which supports multiple
|
|
input files for pileup and gives a side by side view of them in pileup format.
|
|
It displays the bases covering each position on standard output.
|
|
|
|
Modstate - This application showcases the access of base modifications in
|
|
alignment. It shows the modifications present in an alignment and accesses them
|
|
using available APIs. There are 2 APIs and which one to be used can be selected
|
|
through input.
|
|
|
|
Pileup_mod - This application showcases the base modification access in pileup
|
|
mode. It shows the pileup display with base modifications.
|
|
|
|
Flags_field - This application showcases the read of selected fields alone,
|
|
reducing the overhead / increasing the performance. It reads the flag field
|
|
alone and shows the count of read1 and read2. This has impact only on CRAM
|
|
files.
|
|
|
|
Split_thread1 - This application showcases the use of threads in file handling.
|
|
It saves the read1 and read2 as separate files in given directory, one as sam
|
|
and other as bam. 2 threads are used for read and 1 each dedicated for each
|
|
output file.
|
|
|
|
Split_thread2 - This application showcases the use of thread pool in file
|
|
handling. It saves the read1 and read2 as separate files in given directory,
|
|
one as sam and other as bam. A pool of 4 threads is created and shared for both
|
|
read and write.
|
|
|
|
Qtask_ordered - This application showcases the use of queues and threads for
|
|
custom processing. Alignments in input file are updated with their GC ratio
|
|
on a custom aux tag. The processing may occur in any order but the result is
|
|
retrieved in same order as it was queued and saved to disk.
|
|
|
|
Qtask_unordered - This application showcases the use of queues and threads
|
|
for custom processing. The count of bases and GC ratio are calculated and
|
|
displayed. The order of counting is irrelevant and hence ordered retrieval is
|
|
not used.
|
|
|
|
## Building the sample apps
|
|
|
|
The samples expect the HTSLib is installed, libraries and header file path are
|
|
part of the PATH environment variable. If not, these paths need to be explicitly
|
|
passed during the build time.
|
|
|
|
Gcc and compatible compilers can be used to build the samples.
|
|
|
|
These applications can be linked statically or dynamically to HTSLib.
|
|
For static linking, along with htslib other libraries and/or headers required
|
|
to build are, math, pthread, curl, lzma, z and bz2 libraries.
|
|
|
|
A makefile is available along with source files which links statically to
|
|
htslib. To use dynamic linking, update the makefile's 'LDFLAGS' and 'rpath'
|
|
path. The 'rpath' path to be set as the path to lib directory of htslib
|
|
installation.
|
|
|
|
|
|
## Usage of HTS APIs
|
|
### Sequence data file access for read
|
|
|
|
The sequence data file for read may be opened using the sam_open method. It
|
|
opens the file and returns samFile (htsFile) pointer on success or NULL on
|
|
failure. The input can be path to a file in disk, network, cloud or '-'
|
|
designating the standard input.
|
|
|
|
SAM, BAM and CRAM file formats are supported and the input file format is
|
|
detected from the file content.
|
|
|
|
Once done with the file, it needs to be closed with sam_close.
|
|
|
|
Many times, header details would be required and can be read using
|
|
sam_hdr_read api. It returns sam_hdr_t pointer or NULL. The returned header
|
|
needs to be destroyed using sam_hdr_destroy when no longer required.
|
|
|
|
The sequence data may be compressed or uncompressed on disk and on memory it
|
|
is read and kept as uncompressed BAM format. It can be read from a file using
|
|
sam_read1 api. samFile pointer, header and bam storage are to be passed as
|
|
argument and it returns 0 on success, -1 on end of file and < -1 in case of
|
|
errors.
|
|
|
|
The bam storage has to be initialized using bam_init1 api before the call and
|
|
can be reused for successive reads. Once done, it needs to be destroyed using
|
|
bam_destroy1. The member field named core - bam1_core_t - in bam storage,
|
|
bam1_t, has the sequence data in an easily accessible way. Using the fields
|
|
and macros, data can easily be read from it.
|
|
|
|
#include <htslib/sam.h>
|
|
|
|
int main(int argc, char *argv[])
|
|
{
|
|
...
|
|
//initialize
|
|
if (!(bamdata = bam_init1()))
|
|
... // error
|
|
//open input files - r reading
|
|
if (!(infile = sam_open(inname, "r")))
|
|
... // error
|
|
//read header
|
|
if (!(in_samhdr = sam_hdr_read(infile)))
|
|
... // error
|
|
|
|
//read data, check flags and update count
|
|
while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
|
|
if (bamdata->core.flag & BAM_FREAD1)
|
|
cntread1++;
|
|
...
|
|
|
|
//clean up
|
|
if (in_samhdr)
|
|
sam_hdr_destroy(in_samhdr);
|
|
|
|
if (infile)
|
|
sam_close(infile);
|
|
|
|
if (bamdata)
|
|
bam_destroy1(bamdata);
|
|
|
|
return ret;
|
|
}
|
|
Refer: flags_demo.c
|
|
|
|
This shows the count of read1 and read2 alignments.
|
|
|
|
./flags /tmp/sample.sam.gz
|
|
|
|
To read CRAM files, reference data is required and if it is not available, based
|
|
on configuration, library may try to download it from external repositories.
|
|
|
|
|
|
### Sequence data file access for write
|
|
|
|
File access for write is similar to read with a few additional optional steps.
|
|
|
|
The output file can be opened using sam_open api as in read, with "w" instead
|
|
of "r" as mode. This opens the file for writing and uses mode to select the
|
|
output file type. "w" alone denotes SAM, "wb" denotes BAM and "wc" denotes CRAM.
|
|
|
|
Another way is to use sam_open_mode method, which sets the output file type and
|
|
compression based on the file name and explicit textual format specification.
|
|
This method expects a buffer to append type and compression flags. Usually a
|
|
buffer with standard file open flag is used, the buffer past the flag is passed
|
|
to the method to ensure existing flags and updates from this method are present
|
|
in the same buffer without being overwritten. This method will add more flags
|
|
indicating file type and compression based on name. If explicit format detail
|
|
given, then extension is ignored and the explicit specification is used. This
|
|
updated buffer can be used with sam_open to select the file format.
|
|
|
|
sam_open_format method may also be used to open the file for output as more
|
|
information on the output file can be specified using this. Can use
|
|
mode buffer from sam_open_mode api or explicit format structure for this.
|
|
|
|
The header data can be written using the sam_hdr_write api. When the header
|
|
data is copied to another variable and has different lifetime, it is good to
|
|
increase the reference count of the header using sam_hdr_incr_ref and
|
|
sam_hdr_destroy called as many times as required.
|
|
|
|
The alignment data can be written using the sam_write1 api. It takes a samFile
|
|
pointer, header pointer and the alignment data. The header data is required to
|
|
set the reference name in the alignment. It returns -ve value on error.
|
|
|
|
int main(int argc, char *argv[])
|
|
{
|
|
...
|
|
if (!(infile = sam_open(inname, "r")))
|
|
... // error
|
|
outfile1 = sam_open(file1, "w"); //as SAM
|
|
outfile2 = sam_open(file2, "wb"); //as BAM
|
|
...
|
|
if (!(in_samhdr = sam_hdr_read(infile)))
|
|
... // error
|
|
|
|
//write header
|
|
if ((sam_hdr_write(outfile1, in_samhdr) == -1) ||
|
|
(sam_hdr_write(outfile2, in_samhdr) == -1))
|
|
... // error
|
|
|
|
while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
|
|
if (bamdata->core.flag & BAM_FREAD1) {
|
|
if (sam_write1(outfile1, in_samhdr, bamdata) < 0) {
|
|
... // error
|
|
}
|
|
Refer: split.c
|
|
|
|
This creates 1.sam and 2.bam in /tmp/ containing read1 and read2 respectively.
|
|
|
|
./split /tmp/sample.sam.gz /tmp/
|
|
|
|
Below code excerpt shows sam_open_mode api usage.
|
|
|
|
int main(int argc, char *argv[])
|
|
{
|
|
...
|
|
//set file open mode based on file name for 1st and as explicit for 2nd
|
|
if ((sam_open_mode(mode1+1, file1, NULL) == -1) ||
|
|
(sam_open_mode(mode2+1, file2, "sam.gz") == -1))
|
|
... // error
|
|
if (!(infile = sam_open(inname, "r")))
|
|
... // error
|
|
|
|
//open output files
|
|
outfile1 = sam_open(file1, mode1); //as compressed SAM through sam_open
|
|
outfile2 = sam_open_format(file2, mode2, NULL); //as compressed SAM through sam_open_format
|
|
...
|
|
}
|
|
Refer: split2.c
|
|
|
|
This creates 1.sam.gz and 2.sam in /tmp/ both having compressed data.
|
|
|
|
./split2 /tmp/sample.sam.gz /tmp/
|
|
|
|
An htsFormat structure filled appropriately can also be used to specify output
|
|
file format while using sam_open_format api.
|
|
|
|
|
|
### CRAM writing
|
|
|
|
CRAM files uses reference data and compresses alignment data. A CRAM file may
|
|
be created with external reference data file - most appropriate, with embedded
|
|
reference in it or with no reference data at all. It can also be created using
|
|
an autogenerated reference, based on consensus with-in the alignment data.
|
|
The reference detail can be set to an htsFormat structure using hts_parse_format
|
|
api and used with sam_open_format api to create appropriate CRAM file.
|
|
|
|
...
|
|
snprintf(reffmt1, size1, "cram,reference=%s", reffile);
|
|
snprintf(reffmt2, size2, "cram,embed_ref=1,reference=%s", reffile);
|
|
...
|
|
if (hts_parse_format(&fmt1, reffmt1) == -1 || //using external reference - uses the M5/UR tags to get
|
|
reference data during read
|
|
hts_parse_format(&fmt2, reffmt2) == -1 || //embed the reference internally
|
|
hts_parse_format(&fmt3, "cram,embed_ref=2") == -1 || //embed autogenerated reference
|
|
hts_parse_format(&fmt4, "cram,no_ref=1") == -1) { //no reference data encoding at all
|
|
... // error
|
|
outfile1 = sam_open_format(file1, "wc", &fmt1); outfile2 = sam_open_format(file2, "wc", &fmt2);
|
|
...
|
|
Refer: cram.c
|
|
|
|
|
|
### FASTA/FASTQ data access
|
|
|
|
FASTA/FASTQ files have the raw sequence data and the data can be read one by
|
|
one using sam_read1 or a selected range using a region. The data can be written
|
|
similar to alignment data using sam_write1 api. To write the file, format
|
|
can be set by updating mode buffer using sam_open_mode with file name
|
|
or explicit format text. This mode buffer can be used with sam_open or can be
|
|
used with sam_open_format with explicit format information in htsFormat
|
|
structure.
|
|
|
|
It is the FASTA format which is mainly in use to store the reference data.
|
|
|
|
...
|
|
if (!(bamdata = bam_init1()))
|
|
... // error
|
|
if (!(infile = sam_open(inname, "r")))
|
|
... // error
|
|
if (infile->format.format != fasta_format && infile->format.format != fastq_format)
|
|
... // error
|
|
if (!(in_samhdr = sam_hdr_read(infile)))
|
|
... // error
|
|
|
|
while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0)
|
|
... // error
|
|
printf("\nsequence: ");
|
|
for (c = 0; c < bamdata->core.l_qseq; ++c) {
|
|
printf("%c", seq_nt16_str[bam_seqi(bam_get_seq(bamdata), c)]);
|
|
}
|
|
if (infile->format.format == fastq_format) {
|
|
printf("\nquality: ");
|
|
for (c = 0; c < bamdata->core.l_qseq; ++c) {
|
|
printf("%c", bam_get_qual(bamdata)[c] + 33);
|
|
...
|
|
Refer: read_fast.c
|
|
|
|
...
|
|
char mode[4] = "a";
|
|
...
|
|
if (sam_open_mode(mode + 1, outname, NULL) < 0)
|
|
... // error
|
|
if (!(outfile = sam_open(outname, mode)))
|
|
... // error
|
|
if (bam_set1(bamdata, strlen(name), name, BAM_FUNMAP, -1, -1, 0, 0, NULL, -1, -1, 0, strlen(data), data, qual, 0) < 0)
|
|
... // error
|
|
if (sam_write1(outfile, out_samhdr, bamdata) < 0) {
|
|
printf("Failed to write data\n");
|
|
...
|
|
Refer: write_fast.c
|
|
|
|
|
|
### Header data read
|
|
|
|
The header gives the version, reference details, read group, change history
|
|
and comments. These data are stored inside the sam_hdr_t. Each of these
|
|
entries, except comments, have their unique identifier and it is required to
|
|
access different fields of them. The api sam_hdr_count_lines gives the count
|
|
of the specified type of header line. The value of a unique identifier to a
|
|
specific type of header line can be retrieved with sam_hdr_line_name api. The
|
|
api sam_hdr_find_tag_id and sam_hdr_find_tag_pos can get the field data from a
|
|
header line using unique identifier values or using position. The full header
|
|
line can be retrieved using sam_hdr_find_line_pos or sam_hdr_line_id with
|
|
position and unique identifier values respectively.
|
|
|
|
...
|
|
if (!(in_samhdr = sam_hdr_read(infile)))
|
|
... // error
|
|
...
|
|
if (tag)
|
|
ret = sam_hdr_find_tag_id(in_samhdr, header, id, idval, tag, &data);
|
|
else
|
|
ret = sam_hdr_find_line_id(in_samhdr, header, id, idval, &data);
|
|
...
|
|
linecnt = sam_hdr_count_lines(in_samhdr, header);
|
|
...
|
|
if (tag)
|
|
ret = sam_hdr_find_tag_pos(in_samhdr, header, c, tag, &data);
|
|
else
|
|
ret = sam_hdr_find_line_pos(in_samhdr, header, c, &data);
|
|
...
|
|
Refer: read_header.c
|
|
|
|
This will show the VN tag's value from HD header.
|
|
|
|
./read_header /tmp/sample.sam.gz HD VN
|
|
|
|
Shows the 2nd SQ line's LN field value.
|
|
|
|
./read_header /tmp/sample.sam.gz SQ SN T2 LN
|
|
|
|
Below code excerpt shows the reference names which has length above given value.
|
|
|
|
...
|
|
linecnt = sam_hdr_count_lines(in_samhdr, "SQ"); //get reference count
|
|
...
|
|
//iterate and check each reference's length
|
|
for (pos = 1, c = 0; c < linecnt; ++c) {
|
|
if ((ret = sam_hdr_find_tag_pos(in_samhdr, "SQ", c, "LN", &data) == -2))
|
|
... // error
|
|
|
|
size = atoll(data.s);
|
|
if (size < minsize) {
|
|
//not required
|
|
continue;
|
|
}
|
|
|
|
//sam_hdr_find_tag_pos(in_samhdr, "SQ", c, "SN", &data) can also do the same!
|
|
if (!(id = sam_hdr_line_name(in_samhdr, "SQ", c)))
|
|
... // error
|
|
|
|
printf("%d,%s,%s\n", pos, id, data.s);
|
|
...
|
|
Refer: read_refname.c
|
|
|
|
|
|
### Alignment data read
|
|
|
|
The alignment / sequence data contains many fields. Mainly the read/query
|
|
name, flags indicating the properties of the read, reference sequence name,
|
|
position in reference to which it matches, quality of the read, CIGAR string
|
|
indicating the match status, position of mate / reverse strand, name of
|
|
reference sequence to which mate matches, the insert length, base sequence,
|
|
quality value of each base and auxiliary fields.
|
|
|
|
Header data would be required to retrieve the reference names as alignment
|
|
contains the position of the reference in the header.
|
|
|
|
A few of the data are directly visible in bam1_t and the rest are hidden
|
|
inside data member of bam1_t and can easily be retrieved using macros.
|
|
bam_get_qname gives the name of the read, sam_hdr_tid2name gives the reference
|
|
name. bam_get_cigar retrieves the cigar operation array, which can be decoded
|
|
using bam_cigar_oplen to get count of bases to which that operation applicable
|
|
and bam_cigar_opchr to get the cigar operation. bam_seqi retrieves the base
|
|
data at a given position in alignment and it can be converted to character by
|
|
indexing the seq_nt16_str array.
|
|
|
|
...
|
|
while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0)
|
|
{
|
|
//QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL [TAG:TYPE:VALUE]
|
|
printf("NAME: %s\n", bam_get_qname(bamdata)); //get the query name using the macro
|
|
flags = bam_flag2str(bamdata->core.flag); //flags as string
|
|
...
|
|
tidname = sam_hdr_tid2name(in_samhdr, bamdata->core.tid);
|
|
...
|
|
printf("MQUAL: %d\n", bamdata->core.qual); //map quality value
|
|
cigar = bam_get_cigar(bamdata); //retrieves the cigar data
|
|
for (i = 0; i < bamdata->core.n_cigar; ++i) { //no. of cigar data entries
|
|
printf("%d%c", bam_cigar_oplen(cigar[i]), bam_cigar_opchr(cigar[i]));
|
|
//the macros gives the count of operation and the symbol of operation for given cigar entry
|
|
}
|
|
printf("\nTLEN/ISIZE: %"PRIhts_pos"\n", bamdata->core.isize);
|
|
data = bam_get_seq(bamdata);
|
|
//get the sequence data
|
|
if (bamdata->core.l_qseq != bam_cigar2qlen(bamdata->core.n_cigar, cigar)) { //checks the length with CIGAR and query
|
|
...
|
|
for (i = 0; i < bamdata->core.l_qseq ; ++i) { //sequence length
|
|
printf("%c", seq_nt16_str[bam_seqi(data, i)]); //retrieves the base from (internal compressed) sequence data
|
|
...
|
|
printf("%c", bam_get_qual(bamdata)[i]+33); //retrieves the quality value
|
|
...
|
|
Refer: read_bam.c
|
|
|
|
Shows the data from alignments.
|
|
|
|
./read_bam /tmp/sample.sam.gz
|
|
|
|
|
|
### Aux data read
|
|
|
|
Auxiliary data gives extra information about the alignment. There can be a
|
|
number of such data and can be accessed by specifying required tag or by
|
|
iterating one by one through them once the alignment is read as bam1_t. The
|
|
auxiliary data are stored along with the variable length data in the data
|
|
field of bam1_t. There are macros defined to retrieve information about
|
|
auxiliary data from the data field of bam1_t.
|
|
|
|
Data for a specific tag can be retrieved as a string or can be retrieved as raw
|
|
data. bam_aux_get_str retrieves as a string, with tag name, tag type and data.
|
|
bam_aux_get can get raw data and with bam_aux_type and bam_aux2A, bam_aux2f etc.
|
|
the raw data can be extracted.
|
|
|
|
To iterate through all data, the start of aux data is retrieved using macro
|
|
bam_aux_first and successive ones using bam_aux_next. Macro bam_aux_tag gives
|
|
the tag of the aux field and bam_aux_type gives the information about type of
|
|
the aux field.
|
|
|
|
Bam_aux2i, bam_aux2f, bam_aux2Z macros retrieve the aux data's value as
|
|
integer, float and string respectively. The integer value may be of different
|
|
precision / size and the bam_aux_type character indicates how to use the
|
|
value. The string/hex data are NULL terminated.
|
|
|
|
For array data, bam_aux_type will return 'B' and bam_auxB_len gives the length
|
|
of the array. bam_aux_type with the next byte will give the type of data in
|
|
the array. bam_auxB2i, bam_auxB2f will give integer and float data from a
|
|
given position of the array.
|
|
|
|
...
|
|
while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
|
|
//option 1 - get data as string with tag and type
|
|
if ((c = bam_aux_get_str(bamdata, tag, &sdata)) == 1) {
|
|
printf("%s\n",sdata.s);
|
|
...
|
|
//option 2 - get raw data
|
|
if ((data = bam_aux_get(bamdata, tag)) != NULL) {
|
|
printauxdata(stdout, bam_aux_type(data), -1, data);
|
|
...
|
|
Refer: read_aux.c
|
|
|
|
Shows the MD aux tag from alignments.
|
|
|
|
./read_aux ../../samtools/test/mpileup/mpileup.1.bam MD
|
|
|
|
...
|
|
while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
|
|
data = bam_aux_first(bamdata); //get the first aux data
|
|
while (data) {
|
|
printf("%.2s:%c:", bam_aux_tag(data), NULL != strchr("cCsSiI", bam_aux_type(data)) ? 'i' : bam_aux_type(data));
|
|
//macros gets the tag and type of aux data
|
|
//dump the data
|
|
printauxdata(stdout, bam_aux_type(data), -1, data);
|
|
...
|
|
data = bam_aux_next(bamdata, data); //get the next aux data
|
|
...
|
|
Refer: dump_aux.c
|
|
|
|
Shows all the tags from all alignments.
|
|
|
|
./dump_aux ../../samtools/test/mpileup/mpileup.1.bam
|
|
|
|
|
|
### Add/Remove/Update header
|
|
|
|
There are specific types of data that can be part of header data. They have
|
|
a tag from HD, SQ, RG, PG and CO. Fully formatted header lines, separated by new
|
|
line, can be added with sam_hdr_add_lines api. A single header line can be added
|
|
using sam_hdr_add_line api where the header type, tag and value pair are passed
|
|
as arguments, terminated with a NULL argument. The PG header lines are special
|
|
that they have a kind of linkage to previous PG lines. This linkage can be auto
|
|
generated by using sam_hdr_add_pg api which sets the 'PP' field used in linkage.
|
|
sam_hdr_write api does the write of the header data to file.
|
|
|
|
...
|
|
//add SQ line with SN as TR1 and TR2
|
|
if (sam_hdr_add_lines(in_samhdr, &sq[0], 0)) //length as 0 for NULL terminated data
|
|
... // error
|
|
|
|
//add RG line with ID as RG1
|
|
if (sam_hdr_add_line(in_samhdr, "RG", "ID", "RG1", "LB", "Test", "SM", "S1", NULL))
|
|
... // error
|
|
|
|
//add PG/CO lines
|
|
if (sam_hdr_add_pg(in_samhdr, "add_header", "VN", "Test", "CL", data.s, NULL)) //NULL is to indicate end of args
|
|
... // error
|
|
if (sam_hdr_add_line(in_samhdr, "CO", "Test data", NULL)) //NULL is to indicate end of args
|
|
... // error
|
|
|
|
//write output
|
|
if (sam_hdr_write(outfile, in_samhdr) < 0)
|
|
... // error
|
|
Refer: add_header.c
|
|
|
|
Not all type of header data can be removed but where it is possible, either a
|
|
specific header line can be removed or all of a header type can be removed. To
|
|
remove a specific line, header type, unique identifier field tag and its value
|
|
to be used. To remove all lines of a type, header type and unique identifier
|
|
field tag are to be used.
|
|
|
|
...
|
|
|
|
//remove specific line
|
|
if (sam_hdr_remove_line_id(in_samhdr, header, id, idval) < 0)
|
|
... // error
|
|
|
|
//remove multiple lines of a header type
|
|
if (sam_hdr_remove_lines(in_samhdr, header, id, NULL) < 0)
|
|
... // error
|
|
Refer: rem_header.c
|
|
|
|
Shows the file content after removing SQ line with SN 2.
|
|
./rem_header ../../samtools/test/mpileup/mpileup.1.bam SQ 2
|
|
|
|
The unique identifier for the line needs to be found to update a field, though
|
|
not all types in the header may be modifiable. The api sam_hdr_update_line
|
|
takes the unique identifier for the header line type, its value, the field
|
|
which needs to be modified and the new value with which to modify it, followed
|
|
by a NULL.
|
|
e.g. To change LN field from 2000 to 2250 in SQ line with unique identifier SN
|
|
as 'chr1', sam_hdr_update_line( header, "SQ", "SN", "chr1", "LN", "2250",
|
|
NULL). To change PP field from ABC to DEF in PG line with ID APP.10,
|
|
sam_hdr_update_line( header, "PG", "ID", "APP.10", "PP", "DEF", NULL).
|
|
|
|
...
|
|
//update with new data
|
|
if (sam_hdr_update_line(in_samhdr, header, id, idval, tag, val, NULL) < 0) {
|
|
printf("Failed to update data\n");
|
|
goto end;
|
|
}
|
|
...
|
|
Refer: update_header.c
|
|
|
|
Shows new sam file with 2nd SQ line having length as 38.
|
|
|
|
./update_header /tmp/sample.sam.gz SQ T1 LN 38
|
|
|
|
|
|
### Update alignment data
|
|
|
|
Many of the bam data fields may be updated by setting new value to appropriate
|
|
field in bam1_core_t structure and for a few, creating a new bam1_t record would
|
|
be easier than update of existing record.
|
|
|
|
...
|
|
while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0)
|
|
{
|
|
...
|
|
case 1:// QNAME
|
|
ret = bam_set_qname(bamdata, val);
|
|
break;
|
|
case 2:// FLAG
|
|
bamdata->core.flag = atol(val) & 0xFFFF;
|
|
break;
|
|
case 3:// RNAME
|
|
case 7:// RNEXT
|
|
if ((ret = sam_hdr_name2tid(in_samhdr, val)) < 0)
|
|
... // error
|
|
if (field == 3) {
|
|
//reference
|
|
bamdata->core.tid = ret;
|
|
} else {
|
|
//mate reference
|
|
bamdata->core.mtid = ret;
|
|
}
|
|
break;
|
|
case 4:// POS
|
|
bamdata->core.pos = atoll(val);
|
|
break;
|
|
case 5:// MAPQ
|
|
bamdata->core.qual = atoi(val) & 0x0FF;
|
|
break;
|
|
case 6:// CIGAR
|
|
{
|
|
...
|
|
//get cigar array and set all data in new bam record
|
|
if ((ncigar = sam_parse_cigar(val, NULL, &cigar, &size)) < 0)
|
|
... // error
|
|
if (bam_set1(newbam, bamdata->core.l_qname, bam_get_qname(bamdata), bamdata->core.flag, bamdata->core.tid,
|
|
bamdata->core.pos, bamdata->core.qual, ncigar, cigar, bamdata->core.mtid, bamdata->core.mpos,
|
|
bamdata->core.isize, bamdata->core.l_qseq, (const char*)bam_get_seq(bamdata),
|
|
(const char*)bam_get_qual(bamdata), bam_get_l_aux(bamdata)) < 0)
|
|
... // error
|
|
|
|
//correct sequence data as input is expected in ascii format and not as compressed inside bam!
|
|
memcpy(bam_get_seq(newbam), bam_get_seq(bamdata), (bamdata->core.l_qseq + 1) / 2);
|
|
//copy the aux data
|
|
memcpy(bam_get_aux(newbam), bam_get_aux(bamdata), bam_get_l_aux(bamdata));
|
|
...
|
|
break;
|
|
case 8:// PNEXT
|
|
bamdata->core.mpos = atoll(val);
|
|
break;
|
|
case 9:// TLEN
|
|
bamdata->core.isize = atoll(val);
|
|
break;
|
|
case 10:// SEQ
|
|
...
|
|
for( c = 0; c < i; ++c) {
|
|
bam_set_seqi(bam_get_seq(bamdata), c, seq_nt16_table[(unsigned char)val[c]]);
|
|
}
|
|
break;
|
|
case 11:// QUAL
|
|
...
|
|
for (c = 0; c < i; ++c)
|
|
val[c] -= 33; //phred score from ascii value
|
|
memcpy(bam_get_qual(bamdata), val, i);
|
|
Refer: mod_bam.c
|
|
|
|
Shows data with RNAME modified to T2.
|
|
|
|
./mod_bam /tmp/sample.sam ITR1 3 T2
|
|
|
|
The auxiliary data in bam1_t structure can be modified using
|
|
bam_aux_update_float, bam_aux_update_int etc. apis. If the aux field is not
|
|
present at all, it can be appended using bam_aux_append.
|
|
|
|
...
|
|
//matched to qname, update aux
|
|
if (!(data = bam_aux_get(bamdata, tag))) {
|
|
//tag not present append
|
|
... // cut: computed length and val based on tag type
|
|
if (bam_aux_append(bamdata, tag, type, length, (const uint8_t*)val))
|
|
... // error
|
|
} else {
|
|
//update the tag with newer value
|
|
char auxtype = bam_aux_type(data);
|
|
switch (type) {
|
|
case 'f':
|
|
case 'd':
|
|
...
|
|
if (bam_aux_update_float(bamdata, tag, atof(val)))
|
|
... // error
|
|
case 'C':
|
|
case 'S':
|
|
case 'I':
|
|
...
|
|
if (bam_aux_update_int(bamdata, tag, atoll(val)))
|
|
... // error
|
|
case 'Z':
|
|
...
|
|
if (bam_aux_update_str(bamdata, tag, length, val))
|
|
... // error
|
|
case 'A':
|
|
...
|
|
//update the char data directly on buffer
|
|
*(data+1) = val[0];
|
|
Refer: mod_aux.c
|
|
|
|
Shows the given record's MD tag set to Test.
|
|
|
|
./mod_aux samtools/test/mpileup/mpileup.1.bam ERR013140.6157908 MD Z Test
|
|
|
|
The array aux fields can be updated using bam_aux_update_array api.
|
|
|
|
...
|
|
if (bam_aux_update_array(bamdata, "BA", 'I', sizeof(cnt)/sizeof(cnt[0]), cnt))
|
|
... // error
|
|
Refer: mod_aux_ba.c
|
|
|
|
Shows the records updated with an array of integers, containing count of ACGT
|
|
and N in that order. The bases are decoded before count for the sake of
|
|
simplicity. Refer qtask_ordered.c for a better counting where decoding is made
|
|
outside the loop.
|
|
|
|
./mod_aux_ba samtools/test/mpileup/mpileup.1.bam
|
|
|
|
|
|
### Create an index
|
|
|
|
Indexes help to read data faster without iterating sequentially through the
|
|
file. Indexes contain the position information about alignments and that they
|
|
can be read easily. There are different type of indices, BAI, CSI, CRAI, TBI,
|
|
FAI etc. and are usually used with iterators.
|
|
|
|
Indexing of plain/textual files are not supported, compressed SAM&FASTA/Q, BAM,
|
|
and CRAM files can be indexed. CRAM files are indexed as .crai and the others
|
|
as .bai, .csi, .fai etc. Each of these types have different internal
|
|
representations of the index information. Bai uses a fixed configuration values
|
|
where as csi has them dynamically updated based on the alignment data.
|
|
|
|
Indexes can be created either with save of alignment data or explicitly by
|
|
read of existing alignment file for alignment data (SAM/BAM/CRAM). For reference
|
|
data it has to be explicitly created (FASTA).
|
|
|
|
To create index along with alignment write, the sam_idx_init api need to be
|
|
invoked before the start of alignment data write. This api takes the output
|
|
samFile pointer, header pointer, minimum shift and index file path. For BAI
|
|
index, the min shift has to be 0.
|
|
|
|
At the end of write, sam_idx_save api need to be invoked to save the index.
|
|
|
|
...
|
|
//write header
|
|
if (sam_hdr_write(outfile, in_samhdr))
|
|
... // error
|
|
// initialize indexing, before start of write
|
|
if (sam_idx_init(outfile, in_samhdr, size, fileidx))
|
|
... // error
|
|
if (sam_write1(outfile, in_samhdr, bamdata) < 0)
|
|
... // error
|
|
if (sam_idx_save(outfile))
|
|
... // error
|
|
Refer:index_write.c
|
|
|
|
Creates mpileup.1.bam and mpileup.1.bam.bai in /tmp/.
|
|
|
|
./idx_on_write ../../samtools/test/mpileup/mpileup.1.bam 0 /tmp/
|
|
|
|
To create index explicitly on an existing alignment data file, the
|
|
sam_index_build api or its alike can be used. sam_index_build takes the
|
|
alignment file path, min shift for the index and creates the index file in
|
|
same path. The output name will be based on the alignment file format and min
|
|
shift passed.
|
|
|
|
The sam_index_build2 api takes the index file path as well and gives more
|
|
control than the previous one. The sam_index_build3 api provides an option to
|
|
configure the number of threads in index creation.
|
|
|
|
Index for reference data can be created using fai_build3 api. This creates
|
|
index file with .fai extension. If the file is bgzip-ped, a .gzi file is
|
|
created as well. It takes the path to input file and that of fai and gzi files.
|
|
When fai/gzi path are NULL, they are created along with input file.
|
|
These index files will be useful for reference data access.
|
|
|
|
...
|
|
if (fai_build3(filename, NULL, NULL) == -1)
|
|
... // error
|
|
Refer: index_fast.c
|
|
|
|
A tabix index can be created for compressed vcf/sam/bed and other data using
|
|
tbx_index_build. It is mainly used with vcf and non-sam type files.
|
|
|
|
|
|
### Read with iterators
|
|
|
|
Index file helps to read required data without sequentially accessing the file
|
|
and are required to use iterators. The interested reference, start and end
|
|
position etc. are required to read data with iterators. With index and these
|
|
information, an iterator is created and relevant alignments can be accessed by
|
|
iterating it.
|
|
|
|
The api sam_index_load and the like does the index loading. It takes input
|
|
samFile pointer and file path. It loads the index file based on the input file
|
|
name, from the same path and with implicit index file extension - cram file
|
|
with .crai and others with .bai. The sam_index_load2 api accepts explicit path
|
|
to index file, which allows loading it from a different location and explicit
|
|
extensions. The sam_index_load3 api supports download/save of the index
|
|
locally from a remote location. These apis returns NULL on failure and index
|
|
pointer on success.
|
|
|
|
The index file path can be appended to alignment file path and used as well.
|
|
In this case the paths are expected to be separated by '##idx##'.
|
|
|
|
The sam_iter_queryi or sam_iter_querys apis may be used to create an iterator
|
|
and sam_itr_next api does the alignment data retrieval. Along with retrieval
|
|
of current data, it advances the iterator to next relevant data. The
|
|
sam_iter_queryi takes the interested positions as numeric values and
|
|
sam_iter_querys takes the interested position as a string.
|
|
|
|
With sam_iter_queryi, the reference id can be the 0 based index of reference
|
|
data, -2 for unmapped alignments, -3 to start read from beginning of file, -4
|
|
to continue from current position, -5 to return nothing. Based on the
|
|
reference id given, alignment covering the given start and end positions will
|
|
be read with sam_iter_next api.
|
|
|
|
With sam_iter_querys, the reference sequence is identified with the name and
|
|
interested positions can be described with start and end separated by '-' as
|
|
string. When sequence is identified as '.', it begins from the start of file
|
|
and when it is '*', unmapped alignments are read. Reference with <name>[:],
|
|
<name>:S, <name>:S-E, <name>:-E retrieves all data, all data covering position
|
|
S onwards, all data covering position S to E, all data covering upto position
|
|
E of reference with ID <name> respectively on read using sam_iter_next.
|
|
|
|
The index and iterator created are to be destroyed once the need is over.
|
|
sam_itr_destroy and hts_idx_destroy apis does this.
|
|
|
|
...
|
|
//load index file
|
|
if (!(idx = sam_index_load2(infile, inname, idxfile)))
|
|
... // error
|
|
//create iterator
|
|
if (!(iter = sam_itr_querys(idx, in_samhdr, region)))
|
|
... // error
|
|
|
|
//read using iterator
|
|
while ((c = sam_itr_next(infile, iter, bamdata)) >= 0)
|
|
... // error
|
|
|
|
if (iter)
|
|
sam_itr_destroy(iter);
|
|
if (idx)
|
|
hts_idx_destroy(idx);
|
|
...
|
|
Refer:index_reg_read.c
|
|
|
|
With sample.sam, region as \* will show alignments with name UNMAP2 and UNMAP3
|
|
|
|
./read_reg /tmp/sample.sam.gz \*
|
|
|
|
With region as \., it shows all alignments
|
|
|
|
./read_reg /tmp/sample.sam.gz \.
|
|
|
|
With region as T1:1-4, start 1 and end 4 it shows nothing and with T1:1-5 it
|
|
shows alignment with name ITR1.
|
|
|
|
./read_reg /tmp/sample.sam.gz T1:1-5
|
|
|
|
With region as T2:30-100, it shows alignment with name ITR2M which refers the
|
|
reference data T2.
|
|
|
|
./read_reg /tmp/sample.sam.gz T2:30-100
|
|
|
|
|
|
Multiple interested regions can be specified for read using sam_itr_regarray.
|
|
It takes index path, header, count of regions and region descriptions as array
|
|
of char array / string. This array passed need to be released by the user
|
|
itself.
|
|
|
|
...
|
|
//load index file, assume it to be present in same location
|
|
if (!(idx = sam_index_load(infile, inname)))
|
|
... // error
|
|
//create iterator
|
|
if (!(iter = sam_itr_regarray(idx, in_samhdr, regions, regcnt)))
|
|
... // error
|
|
if (regions) {
|
|
//can be freed as it is no longer required
|
|
free(regions);
|
|
regions = NULL;
|
|
}
|
|
|
|
//get required area
|
|
while ((c = sam_itr_multi_next(infile, iter, bamdata) >= 0))
|
|
... // process bamdata
|
|
Refer:index_multireg_read.c
|
|
|
|
With compressed sample.sam and 2 regions from reference T1 (30 to 32) and 1
|
|
region from T2 (34 onwards), alignments with name A1, B1, A2 and ITR2M would
|
|
be shown.
|
|
|
|
./read_multireg /tmp/sample.sam.gz 2 T1:30-32,T2:34
|
|
|
|
To use numeric indices instead of textual regions, sam_itr_regions can be used.
|
|
It takes index file path, header, count of regions and an array of region
|
|
description (hts_reglist_t*), which has the start end positions as numerals.
|
|
|
|
The index and iterators are to be destroyed using the sam_itr_destroy and
|
|
hts_idx_destroy. The hts_reglist_t* array passed is destroyed by the library
|
|
on iterator destroy. The regions array (array of char array/string) needs to be
|
|
destroyed by the user itself.
|
|
|
|
For fasta/fastq files, the index has to be loaded using fai_load3_format which
|
|
takes the file, index file names and format. With single region specification
|
|
fai_fetch64 can be used to get bases, and fai_fetchqual64 for quality in case
|
|
of fastq data. With multiple region specification, with comma separation,
|
|
faidx_fetch_seq64 and faidx_fetch_qual64 does the job. Regions has to be parsed
|
|
using fai_parse_region in case of multiregion specifications. fai_adjust_region
|
|
is used to adjust the start-end points based on available data.
|
|
|
|
Below excerpt shows fasta/q access with single and multiregions,
|
|
|
|
...
|
|
//load index
|
|
if (!(idx = fai_load3_format(inname, NULL, NULL, FAI_CREATE, fmt)))
|
|
... // error
|
|
|
|
...
|
|
if (!usemulti) {
|
|
//get data from single given region
|
|
if (!(data = fai_fetch64(idx, region, &len)))
|
|
... // region not found
|
|
|
|
printf("Data: %"PRId64" %s\n", len, data);
|
|
free((void*)data);
|
|
//get quality for fastq type
|
|
if (fmt == FAI_FASTQ) {
|
|
if (!(data = fai_fetchqual64(idx, region, &len)))
|
|
... // region not found
|
|
...
|
|
|
|
} else { // usemulti
|
|
//parse, get each region and get data for each
|
|
while ((remaining = fai_parse_region(idx, region, &tid, &beg, &end, HTS_PARSE_LIST))) { //here expects regions as csv
|
|
//parsed the region, correct end points based on actual data
|
|
if (fai_adjust_region(idx, tid, &beg, &end) == -1)
|
|
... // error
|
|
//get data for given region
|
|
if (!(data = faidx_fetch_seq64(idx, faidx_iseq(idx, tid), beg, end, &len)))
|
|
... // region not found
|
|
|
|
printf("Data: %"PRIhts_pos" %s\n", len, data);
|
|
free((void*)data);
|
|
data = NULL;
|
|
//get quality data for fastq
|
|
if (fmt == FAI_FASTQ) {
|
|
if (!(data = faidx_fetch_qual64(idx, faidx_iseq(idx, tid), beg, end, &len)))
|
|
... // error
|
|
printf("Qual: %"PRIhts_pos" %s\n", len, data);
|
|
free((void*)data);
|
|
...
|
|
region = remaining; //parse remaining region defs
|
|
|
|
...
|
|
if (idx) {
|
|
fai_destroy(idx);
|
|
...
|
|
Refer: read_fast_index.c
|
|
|
|
|
|
### Pileup and MPileup
|
|
|
|
Pileup shows the transposed view of the SAM alignment data, i.e. it shows the
|
|
reference positions and bases which cover that position through different reads
|
|
side by side. MPileup facilitates the piling up of multiple sam files against
|
|
each other and same reference at the same time.
|
|
|
|
Mpileup has replaced the pileup. The input expects the data to be sorted by
|
|
position.
|
|
|
|
Pileup needs to be initialized with bam_pileup_init method which takes pointer
|
|
to a method, which will be called by pileup to read data from required files,
|
|
and pointer to data which might be required for this read method to do the
|
|
read operation. It returns a pointer to the pileup iterator.
|
|
|
|
User can specify methods which need to be invoked during the load and unload
|
|
of an alignment, like constructor and destructor of objects.
|
|
Bam_plp_constructor and bam_plp_destructor methods does the setup of
|
|
these methods in the pileup iterator. During invocation of these methods, the
|
|
pointer to data passed in the initialization is passed as well. If user want
|
|
to do any custom status handling or actions during load or unload, it can be
|
|
done in these methods. Alignment specific data can be created and stored in
|
|
an argument passed to the constructor and the same will be accessible during
|
|
pileup status return. The same will be accessible during destructor as well
|
|
where any deallocation can be made.
|
|
|
|
User is expected to invoke bam_plp_auto api to get the pileup status. It
|
|
returns the pileup status or NULL on end. During this all alignments are read
|
|
one by one, using the method given in initialization for data read, until one
|
|
for a new reference is found or all alignment covering a position is read. On
|
|
such condition, the pileup status is returned and the same continuous on next
|
|
bam_plp_auto call. The pileup status returned is an array for all positions
|
|
for which the processing is completed. Along with the result, the reference
|
|
index, position in reference data and number of alignments which covers this
|
|
position are passed. User can iterate the result array and get bases from each
|
|
alignment which covers the given reference position. The alignment specific
|
|
custom data which were created in constructor function will also be available
|
|
in the result.
|
|
|
|
The bam_plp_auto api invokes the data read method to load an alignment and the
|
|
constructor method is invoked during the load. Once the end of alignment is
|
|
passed, it is removed from the processing and destructor method is invoked,
|
|
that user could do deallocations and custom actions as in load during this
|
|
time. The custom data passed during the initialization is passed to the
|
|
constructor and destructor methods during invocation.
|
|
|
|
Once the forward and reverse strands are identified, the better of the quality
|
|
is identified and used. Both reads are required for this and hence reads are
|
|
cached until its mate is read. The maximum number of reads that can be cached
|
|
is controlled by bam_plp_set_maxcnt. Reads covering a position are cached and
|
|
as soon as mate is found, quality is adjusted and is removed from cache. Reads
|
|
above the cache limit are discarded.
|
|
|
|
Once done, the pileup iterator to be discarded by sam_plp_destroy api.
|
|
|
|
...
|
|
if (!(plpiter = bam_plp_init(readdata, &conf)))
|
|
... // error
|
|
//set constructor destructor callbacks
|
|
bam_plp_constructor(plpiter, plpconstructor);
|
|
bam_plp_destructor(plpiter, plpdestructor);
|
|
|
|
while ((plp = bam_plp_auto(plpiter, &tid, &refpos, &n))) {
|
|
printf("%d\t%d\t", tid+1, refpos+1);
|
|
for (j = 0; j < n; ++j) {
|
|
//doesnt detect succeeding insertion and deletion together here, only insertion is identified
|
|
//deletion is detected in plp->is_del as and when pos reaches the position
|
|
//if detection ahead is required, use bam_plp_insertion here which gives deletion length along with insertion
|
|
if (plp[j].is_del || plp[j].is_refskip) {
|
|
printf("*");
|
|
continue;
|
|
}
|
|
//start and end are displayed in UPPER and rest on LOWER
|
|
printf("%c", plp[j].is_head ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
|
|
(plp[j].is_tail ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
|
|
tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)])));
|
|
if (plp[j].indel > 0) {
|
|
//insertions, anyway not start or end
|
|
printf("+%d", plp[j].indel);
|
|
for (k = 0; k < plp[j].indel; ++k) {
|
|
printf("%c", tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos + k + 1)]));
|
|
}
|
|
}
|
|
else if (plp[j].indel < 0) {
|
|
printf("%d", plp[j].indel);
|
|
for (k = 0; k < -plp[j].indel; ++k) {
|
|
printf("?");
|
|
}
|
|
...
|
|
if (plpiter)
|
|
bam_plp_destroy(plpiter);
|
|
...
|
|
Refer:pileup.c
|
|
|
|
The read method may use a simple read or it could be an advanced read using
|
|
indices, iterators and region specifications based on the need. The constructor
|
|
method may create any custom data and store it in the pointer passed to it. The
|
|
same need to be released by use on destructor method.
|
|
|
|
MPileup works same as the pileup and supports multiple inputs against the same
|
|
reference, giving side by side view of reference and alignments from different
|
|
inputs.
|
|
|
|
MPileup needs to be initialized with bam_mpileup_init method which takes
|
|
pointer to a method, which will be called by pileup to read data from required
|
|
files, and an array of pointer to data which might be required for this read
|
|
method to do the read operation. It returns a pointer to the mpileup iterator.
|
|
|
|
User can specify methods which need to be invoked during the load and unload
|
|
of an alignment, like constructor and destructor of objects.
|
|
bam_mplp_constructor and bam_mplp_destructor methods does the setup
|
|
of these methods in the pileup iterator. During invocation of these methods,
|
|
the pointer to data passed in the initialization is passed as well. If user
|
|
want to do any custom status handling or actions during load or unload, it can
|
|
be done on these methods. Alignment specific data can be created and
|
|
stored in the custom data pointer and the same will be accessible during
|
|
return of pileup status. The same will be accessible during destructor as well
|
|
where any deallocation can be made.
|
|
|
|
User is expected to invoke bam_mplp_auto api to get the pileup status. It
|
|
returns the pileup status. During this all alignments are read one by one,
|
|
using the method given in initialization for data read, until one for a new
|
|
reference is found or all alignment covering a position is read. On such
|
|
condition, the pileup status is returned and the same continuous on next
|
|
bam_mplp_auto call.
|
|
|
|
The pileup status is returned through a parameter in the method itself, is an
|
|
array for all inputs, each containing array for positions on which the
|
|
processing is completed. Along with the result, the reference index, position
|
|
in reference data and number of alignments which covers this position are
|
|
passed. User can iterate the result array and get bases from each alignment
|
|
which covers the given reference position. The alignment specific custom data
|
|
which were created in constructor function will also be available in the
|
|
result.
|
|
|
|
Once the forward and reverse strands are identified, the better of the quality
|
|
is identified and used. Both reads are required for this and hence reads are
|
|
cached until its mate is read. The maximum number of reads that can be cached
|
|
is controlled by bam_mplp_set_maxcnt. Reads covering a position are cached and
|
|
as soon as mate is found, quality is adjusted and is removed from cache. Reads
|
|
above the cache limit are discarded.
|
|
|
|
Once done, the pileup iterator to be discarded by sam_mplp_destroy api.
|
|
|
|
...
|
|
if (!(mplpiter = bam_mplp_init(argc - 1, readdata, (void**) conf)))
|
|
... // error
|
|
//set constructor destructor callbacks
|
|
bam_mplp_constructor(mplpiter, plpconstructor);
|
|
bam_mplp_destructor(mplpiter, plpdestructor);
|
|
|
|
while (bam_mplp64_auto(mplpiter, &tid, &refpos, depth, plp) > 0) {
|
|
printf("%d\t%"PRIhts_pos"\t", tid+1, refpos+1);
|
|
|
|
for (input = 0; input < argc - 1; ++input) {
|
|
for (dpt = 0; dpt < depth[input]; ++dpt) {
|
|
if (plp[input][dpt].is_del || plp[input][dpt].is_refskip) {
|
|
printf("*");
|
|
continue;
|
|
}
|
|
//start and end are displayed in UPPER and rest on LOWER
|
|
printf("%c", plp[input][dpt].is_head ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
|
|
plp[input][dpt].qpos)]) : (plp[input]->is_tail ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
|
|
plp[input][dpt].qpos)]) : tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
|
|
plp[input][dpt].qpos)])));
|
|
if (plp[input][dpt].indel > 0) {
|
|
//insertions, anyway not start or end
|
|
printf("+%d", plp[input][dpt].indel);
|
|
for (k = 0; k < plp[input][dpt].indel; ++k) {
|
|
printf("%c", tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[input][dpt].b),
|
|
plp[input][dpt].qpos + k + 1)]));
|
|
}
|
|
}
|
|
else if (plp[input][dpt].indel < 0) {
|
|
printf("%d", plp[input][dpt].indel);
|
|
for (k = 0; k < -plp[input][dpt].indel; ++k) {
|
|
printf("?");
|
|
...
|
|
if (mplpiter) {
|
|
bam_mplp_destroy(mplpiter);
|
|
}
|
|
...
|
|
if (plp) {
|
|
free(plp);
|
|
...
|
|
Refer:mpileup.c
|
|
|
|
This sample takes multiple sam files and shows the pileup of data side by side.
|
|
|
|
./mpileup /tmp/mp.bam /tmp/mp.sam
|
|
|
|
|
|
### Base modifications
|
|
|
|
The alignment data may contain base modification information as well. This
|
|
gives the base, modifications found, orientation in which it was found and the
|
|
quality for the modification. The base modification can be identified using
|
|
hts_parse_basemod api. It stores the modification details on hts_base_mod_state
|
|
and this has to be initialized using hts_base_mod_state_alloc api.
|
|
|
|
Once the modifications are identified, they can be accessed through different
|
|
ways. bam_mods_recorded api gives the modifications identified for an alignment.
|
|
Modifications can be queried for each base position iteratively using
|
|
bam_mods_at_next_pos api. Check the returned value with buffer size to see
|
|
whether the buffer is big enough to retrieve all modifications.
|
|
Instead of querying for each position, the next modified position can be
|
|
directly retrieved directly using bam_next_basemod api. An alignment can be
|
|
queried to have a specific modification using bam_mods_query_type api. At the
|
|
end of processing, the state need to be released using hts_base_mod_state_free
|
|
api.
|
|
|
|
...
|
|
if (!(ms = hts_base_mod_state_alloc()))
|
|
... // error
|
|
while ((ret_r = sam_read1(infile, in_samhdr, bamdata)) >= 0)
|
|
{
|
|
...
|
|
if (bam_parse_basemod(bamdata, ms))
|
|
... // error
|
|
bm = bam_mods_recorded(ms, &cnt);
|
|
for (k = 0; k < cnt; ++k) {
|
|
printf("%c", bm[k]);
|
|
}
|
|
printf("\n");
|
|
hts_base_mod mod[5] = {0}; //for ATCGN
|
|
if (opt) {
|
|
//option 1
|
|
for (; i < bamdata->core.l_qseq; ++i) {
|
|
if ((r = bam_mods_at_next_pos(bamdata, ms, mod, sizeof(mod)/sizeof(mod[0]))) <= -1) {
|
|
printf("Failed to get modifications\n");
|
|
goto end;
|
|
}
|
|
else if (r > (sizeof(mod) / sizeof(mod[0]))) {
|
|
printf("More modifications than this app can handle, update the app\n");
|
|
goto end;
|
|
}
|
|
else if (!r) {
|
|
//no modification at this pos
|
|
printf("%c", seq_nt16_str[bam_seqi(data, i)]);
|
|
}
|
|
//modifications
|
|
for (j = 0; j < r; ++j) {
|
|
printf("%c%c%c", mod[j].canonical_base, mod[j].strand ? '-' : '+', mod[j].modified_base);
|
|
...
|
|
else {
|
|
//option 2
|
|
while ((r = bam_next_basemod(bamdata, ms, mod, sizeof(mod)/sizeof(mod[0]), &pos)) >= 0) {
|
|
for (; i < bamdata->core.l_qseq && i < pos; ++i) {
|
|
printf("%c", seq_nt16_str[bam_seqi(data, i)]);
|
|
}
|
|
//modifications
|
|
for (j = 0; j < r; ++j) {
|
|
printf("%c%c%c", mod[j].canonical_base, mod[j].strand ? '-' : '+', mod[j].modified_base);
|
|
}
|
|
...
|
|
//check last alignment's base modification
|
|
int strand = 0, impl = 0;
|
|
char canonical = 0, modification[] = "mhfcgebaon"; //possible modifications
|
|
printf("\n\nLast alignment has \n");
|
|
for (k = 0; k < sizeof(modification) - 1; ++k) { //avoiding NUL termination
|
|
if (bam_mods_query_type(ms, modification[k], &strand, &impl, &canonical)) {
|
|
printf ("No modification of %c type\n", modification[k]);
|
|
}
|
|
else {
|
|
printf("%s strand has %c modified with %c, can %sassume unlisted as unmodified\n", strand ? "-/bottom/reverse" :
|
|
"+/top/forward", canonical, modification[k], impl?"" : "not " );
|
|
}
|
|
}
|
|
...
|
|
if (ms)
|
|
hts_base_mod_state_free(ms);
|
|
...
|
|
Refer:modstate.c
|
|
|
|
The modification can be accessed in pileup mode as well. bam_mods_at_qpos gives
|
|
the modification at given pileup position. Insertion and deletion to the given
|
|
position with possible modification can be retrieved using bam_plp_insertion_mod
|
|
api.
|
|
|
|
...
|
|
int plpconstructor(void *data, const bam1_t *b, bam_pileup_cd *cd) {
|
|
//when using cd, initialize and use as it will be reused after destructor
|
|
cd->p = hts_base_mod_state_alloc();
|
|
//parse the bam data and gather modification data from MM tags
|
|
return (-1 == bam_parse_basemod(b, (hts_base_mod_state*)cd->p)) ? 1 : 0;
|
|
}
|
|
|
|
int plpdestructor(void *data, const bam1_t *b, bam_pileup_cd *cd) {
|
|
if (cd->p) {
|
|
hts_base_mod_state_free((hts_base_mod_state *)cd->p);
|
|
cd->p = NULL;
|
|
}
|
|
return 0;
|
|
}
|
|
|
|
int main(int argc, char *argv[])
|
|
{
|
|
...
|
|
if (!(plpiter = bam_plp_init(readdata, &conf))) {
|
|
... // error
|
|
//set constructor destructor callbacks
|
|
bam_plp_constructor(plpiter, plpconstructor);
|
|
bam_plp_destructor(plpiter, plpdestructor);
|
|
|
|
while ((plp = bam_plp_auto(plpiter, &tid, &refpos, &depth))) {
|
|
memset(&mods, 0, sizeof(mods));
|
|
printf("%d\t%d\t", tid+1, refpos+1);
|
|
|
|
for (j = 0; j < depth; ++j) {
|
|
dellen = 0;
|
|
if (plp[j].is_del || plp[j].is_refskip) {
|
|
printf("*");
|
|
continue;
|
|
}
|
|
/*invoke bam mods_mods_at_qpos before bam_plp_insertion_mod that the base modification
|
|
is retrieved before change in pileup pos thr' plp_insertion_mod call*/
|
|
if ((modlen = bam_mods_at_qpos(plp[j].b, plp[j].qpos, plp[j].cd.p, mods, NMODS)) == -1)
|
|
... // error
|
|
//use plp_insertion/_mod to get insertion and del at the same position
|
|
if ((inslen = bam_plp_insertion_mod(&plp[j], (hts_base_mod_state*)plp[j].cd.p, &insdata, &dellen)) == -1)
|
|
... // error
|
|
//start and end are displayed in UPPER and rest on LOWER, only 1st modification considered
|
|
//base and modification
|
|
printf("%c%c%c", plp[j].is_head ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
|
|
(plp[j].is_tail ? toupper(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)]) :
|
|
tolower(seq_nt16_str[bam_seqi(bam_get_seq(plp[j].b), plp[j].qpos)])),
|
|
modlen > 0 ? mods[0].strand ? '-' : '+' : '\0', modlen > 0 ? mods[0].modified_base : '\0');
|
|
//insertion and deletions
|
|
if (plp[j].indel > 0) {
|
|
//insertion
|
|
/*insertion data from plp_insertion_mod, note this shows the quality value as well
|
|
which is different from base and modification above;the lower case display is not attempted either*/
|
|
printf("+%d%s", plp[j].indel, insdata.s);
|
|
//handle deletion if any
|
|
if (dellen) {
|
|
printf("-%d", dellen);
|
|
for (k = 0; k < dellen; ++k) {
|
|
printf("?");
|
|
...
|
|
else if (plp[j].indel < 0) {
|
|
//deletion
|
|
printf("%d", plp[j].indel);
|
|
for (k = 0; k < -plp[j].indel; ++k) {
|
|
printf("?");
|
|
}
|
|
}
|
|
...
|
|
Refer:pileup_mod.c
|
|
|
|
|
|
### Read selected fields
|
|
|
|
At times the whole alignment data may not be of interest and it would be
|
|
better to read required fields alone from the alignment data. CRAM file format
|
|
supports such specific data read and HTSLib provides an option to use this.
|
|
This can improve the performance on read operation.
|
|
|
|
The hts_set_opt method does the selection of specified fields. There are flags
|
|
indicating specific fields, like SAM_FLAG, SAM_SEQ, SAM_QNAME, in alignment
|
|
data and a combination of flags for the required fields can be passed with
|
|
CRAM_OPT_REQUIRED_FIELDS to this api.
|
|
|
|
...
|
|
//select required field alone, this is useful for CRAM alone
|
|
if (hts_set_opt(infile, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG) < 0)
|
|
... // error
|
|
|
|
//read header
|
|
in_samhdr = sam_hdr_read(infile);
|
|
...
|
|
//read data, check flags and update count
|
|
while ((c = sam_read1(infile, in_samhdr, bamdata)) >= 0) {
|
|
if (bamdata->core.flag & BAM_FREAD1)
|
|
cntread1++;
|
|
...
|
|
Refer: flags_htsopt_field.c
|
|
|
|
|
|
### Thread-pool to read / write
|
|
|
|
The HTSLib api supports thread pooling for better performance. There are a few
|
|
ways in which this can be used. The pool can be made specific for a file or a
|
|
generic pool can be created and shared across multiple files. Thread pool can
|
|
also be used to execute user defined tasks. The tasks are to be added to queue,
|
|
threads in pool executes them and results can be queued back if required.
|
|
|
|
To have a thread pool specific for a file, hts_set_opt api can be used with the
|
|
file pointer, HTS_OPT_NTHREADS and the number of threads to be in the pool.
|
|
Thread pool is released on closure of file. To have a thread pool which can be
|
|
shared across different files, it needs to be initialized using hts_tpool_init
|
|
api, passing number of threads as an argument. This thread pool can be
|
|
associated with a file using hts_set_opt api. The file pointer,
|
|
HTS_OPT_THREAD_POOL and the thread pool address are to be passed as arguments to
|
|
the api. The thread pool has to be released with hts_tpool_destroy.
|
|
|
|
The samples are trivial ones to showcase the usage of api. The number of threads
|
|
to use for different tasks has to be identified based on complexity and
|
|
parallelism of the task.
|
|
|
|
Below excerpt shows file specific thread pool,
|
|
|
|
...
|
|
//create file specific threads
|
|
if (hts_set_opt(infile, HTS_OPT_NTHREADS, 1) < 0 || //1 thread specific for reading
|
|
hts_set_opt(outfile1, HTS_OPT_NTHREADS, 1) < 0 || //1 thread specific for sam write
|
|
hts_set_opt(outfile2, HTS_OPT_NTHREADS, 2) < 0) { //2 thread specific for bam write
|
|
printf("Failed to set thread options\n");
|
|
goto end;
|
|
}
|
|
Refer: split_thread1.c
|
|
|
|
Below excerpt shows a thread pool shared across files,
|
|
|
|
...
|
|
//create a pool of 4 threads
|
|
if (!(tpool.pool = hts_tpool_init(4)))
|
|
... // error
|
|
//share the pool with all the 3 files
|
|
if (hts_set_opt(infile, HTS_OPT_THREAD_POOL, &tpool) < 0 ||
|
|
hts_set_opt(outfile1, HTS_OPT_THREAD_POOL, &tpool) < 0 ||
|
|
hts_set_opt(outfile2, HTS_OPT_THREAD_POOL, &tpool) < 0) {
|
|
... // error
|
|
|
|
... // do something
|
|
|
|
//tidy up at end
|
|
if (tpool.pool)
|
|
hts_tpool_destroy(tpool.pool);
|
|
...
|
|
Refer: split_thread2.c
|
|
|
|
Note that it is important to analyze the task in hand to decide the number of
|
|
threads to be used. As an example, if the number of threads for reading is set
|
|
to 2 and bam write to 1, keeping total number of threads the same, the
|
|
performance may decrease as bam decoding is easier than encoding.
|
|
|
|
Custom task / user defined functions can be performed on data using thread pool
|
|
and for that, the task has to be scheduled to a queue. Thread pool associated
|
|
with the queue will perform the task. There can be multiple pools and queues.
|
|
The order of execution of threads are decided based on many factors and load on
|
|
each task may vary, so the completion of the tasks may not be in the order of
|
|
their queueing. The queues can be used in two different ways, one where the
|
|
result is enqueued to queue again to be read in same order as initial queueing,
|
|
second where the resuls are not enqueued and completed possibly in a different
|
|
order than initial queueing. Explicitly created threads can also be used along
|
|
with hts thread pool usage.
|
|
|
|
hts_tpool_process_init initializes the queue / process, associates a queue with
|
|
thread pool and reserves space for given number of tasks on queue. It takes a
|
|
parameter indicating whether the result need to be enqueued for retrieval or
|
|
not. If the result is enqueued, it is retrieved in the order of scheduling of
|
|
task. Another parameter sets the maximum number of slots for tasks in queue,
|
|
usually 2 times the number of threads are used. The input and output have their
|
|
own queues and they grow as required upto the max set. hts_tpool_dispatch api
|
|
enqueues the task to the queue. The api blocks when there is no space in queue.
|
|
This behavior can be controlled with hts_tpool_dispatch2 api. The queue can be
|
|
reset using hts_tpool_process_reset api where all tasks are discarded. The api
|
|
hts_tpool_dispatch3 supports configuring cleanup routines which are to be run
|
|
when reset occurs on the queue. hts_tpool_process_flush api can ensure that
|
|
all the piled up tasks are processed, a possible case when the queueing and
|
|
processing happen at different speeds. hts_tpool_process_shutdown api stops the
|
|
processing of queue.
|
|
|
|
There are a few apis which let the user to check the status of processing. The
|
|
api hts_tpool_process_empty shows whether all the tasks are completed or not.
|
|
The api hts_tpool_process_sz gives the number of tasks, at different states of
|
|
processing. The api hts_tpool_process_len gives the number of results in output
|
|
queue waiting to be collected.
|
|
|
|
The order of execution of tasks depends on the number of threads involved and
|
|
how the threads are scheduled by operating system. When the results are enqueued
|
|
back to queue, they are read in same order of enqueueing of task and in that
|
|
case the order of execution will not be noticed. When the results are not
|
|
enqueued the results are available right away and the order of execution may be
|
|
noticeable. Based on the nature of task and the need of order maintenance, users
|
|
can select either of the queueing.
|
|
|
|
Below excerpts shows the usage of queues and threads in both cases. In the 1st,
|
|
alignments are updated with an aux tag indicating GC ratio. The order of data
|
|
has to be maintained even after update, hence the result queueing is used to
|
|
ensure same order as initial. A number of alignments are bunched together and
|
|
reuse of allocated memory is made to make it perform better. A sentinel job is
|
|
used to identify the completion of all tasks at the result collection side.
|
|
...
|
|
void *thread_ordered_proc(void *args)
|
|
{
|
|
...
|
|
for ( i = 0; i < bamdata->count; ++i) {
|
|
...
|
|
for (pos = 0; pos < bamdata->bamarray[i]->core.l_qseq; ++pos)
|
|
count[bam_seqi(data,pos)]++;
|
|
...
|
|
gcratio = (count[2] /*C*/ + count[4] /*G*/) / (float) (count[1] /*A*/ + count[8] /*T*/ + count[2] + count[4]);
|
|
|
|
if (bam_aux_append(bamdata->bamarray[i], "xr", 'f', sizeof(gcratio), (const uint8_t*)&gcratio) < 0) {
|
|
|
|
...
|
|
void *threadfn_orderedwrite(void *args)
|
|
{
|
|
...
|
|
//get result and write; wait if no result is in queue - until shutdown of queue
|
|
while (tdata->result == 0 &&
|
|
(r = hts_tpool_next_result_wait(tdata->queue)) != NULL) {
|
|
bamdata = (data*) hts_tpool_result_data(r);
|
|
...
|
|
for (i = 0; i < bamdata->count; ++i) {
|
|
if (sam_write1(tdata->outfile, tdata->samhdr, bamdata->bamarray[i]) < 0) {
|
|
... // error
|
|
...
|
|
hts_tpool_delete_result(r, 0); //release the result memory
|
|
...
|
|
|
|
// Shut down the process queue. If we stopped early due to a write failure,
|
|
// this will signal to the other end that something has gone wrong.
|
|
hts_tpool_process_shutdown(tdata->queue);
|
|
|
|
...
|
|
int main(int argc, char *argv[])
|
|
{
|
|
...
|
|
if (!(pool = hts_tpool_init(cnt))) //thread pool
|
|
... // error
|
|
tpool.pool = pool; //to share the pool for file read and write as well
|
|
//queue to use with thread pool, for task and results
|
|
if (!(queue = hts_tpool_process_init(pool, cnt * 2, 0))) {
|
|
...
|
|
//share the thread pool with i/o files
|
|
if (hts_set_opt(infile, HTS_OPT_THREAD_POOL, &tpool) < 0 ||
|
|
hts_set_opt(outfile, HTS_OPT_THREAD_POOL, &tpool) < 0)
|
|
... // error
|
|
if (pthread_create(&thread, NULL, threadfn_orderedwrite, &twritedata))
|
|
... // error
|
|
while (c >= 0) {
|
|
if (!(bamdata = getbamstorage(chunk, &bamcache)))
|
|
... // error
|
|
for (cnt = 0; cnt < bamdata->maxsize; ++cnt) {
|
|
c = sam_read1(infile, in_samhdr, bamdata->bamarray[cnt]);
|
|
...
|
|
if (hts_tpool_dispatch3(pool, queue, thread_ordered_proc, bamdata,
|
|
cleanup_bamstorage, cleanup_bamstorage,
|
|
0) == -1)
|
|
... // error
|
|
...
|
|
if (queue) {
|
|
if (-1 == c) {
|
|
// EOF read, send a marker to tell the threadfn_orderedwrite()
|
|
// function to shut down.
|
|
if (hts_tpool_dispatch(pool, queue, thread_ordered_proc,
|
|
NULL) == -1) {
|
|
... // error
|
|
hts_tpool_process_shutdown(queue);
|
|
|
|
...
|
|
// Wait for threadfn_orderedwrite to finish.
|
|
if (started_thread) {
|
|
pthread_join(thread, NULL);
|
|
|
|
...
|
|
if (queue) {
|
|
// Once threadfn_orderedwrite has stopped, the queue can be
|
|
// cleaned up.
|
|
hts_tpool_process_destroy(queue);
|
|
}
|
|
...
|
|
Refer: qtask_ordered.c
|
|
|
|
In this 2nd, the bases are counted and GC ratio of whole file is calculated.
|
|
Order in which bases are counted is not relevant and no result queue required.
|
|
The queue is created as input only.
|
|
...
|
|
void *thread_unordered_proc(void *args)
|
|
{
|
|
...
|
|
for ( i = 0; i < bamdata->count; ++i) {
|
|
data = bam_get_seq(bamdata->bamarray[i]);
|
|
for (pos = 0; pos < bamdata->bamarray[i]->core.l_qseq; ++pos)
|
|
counts[bam_seqi(data, pos)]++;
|
|
|
|
...
|
|
//update result and add the memory block for reuse
|
|
pthread_mutex_lock(&bamdata->cache->lock);
|
|
for (i = 0; i < 16; i++) {
|
|
bamdata->bases->counts[i] += counts[i];
|
|
}
|
|
|
|
bamdata->next = bamdata->cache->list;
|
|
bamdata->cache->list = bamdata;
|
|
pthread_mutex_unlock(&bamdata->cache->lock);
|
|
|
|
...
|
|
int main(int argc, char *argv[])
|
|
{
|
|
...
|
|
if (!(queue = hts_tpool_process_init(pool, cnt * 2, 1)))
|
|
... // error
|
|
c = 0;
|
|
while (c >= 0) {
|
|
...
|
|
for (cnt = 0; cnt < bamdata->maxsize; ++cnt) {
|
|
c = sam_read1(infile, in_samhdr, bamdata->bamarray[cnt]);
|
|
|
|
...
|
|
if (c >= -1 ) {
|
|
...
|
|
if (hts_tpool_dispatch3(pool, queue, thread_unordered_proc, bamdata,
|
|
cleanup_bamstorage, cleanup_bamstorage,
|
|
0) == -1)
|
|
... // error
|
|
...
|
|
if (-1 == c) {
|
|
// EOF read, ensure all are processed, waits for all to finish
|
|
if (hts_tpool_process_flush(queue) == -1) {
|
|
fprintf(stderr, "Failed to flush queue\n");
|
|
} else { //all done
|
|
//refer seq_nt16_str to find position of required bases
|
|
fprintf(stdout, "GCratio: %f\nBase counts:\n",
|
|
(gccount.counts[2] /*C*/ + gccount.counts[4] /*G*/) / (float)
|
|
(gccount.counts[1] /*A*/ + gccount.counts[8] /*T*/ +
|
|
gccount.counts[2] + gccount.counts[4]));
|
|
...
|
|
if (queue) {
|
|
hts_tpool_process_destroy(queue);
|
|
}
|
|
Refer: qtask_unordered.c
|
|
|
|
## More Information
|
|
|
|
### CRAM reference files
|
|
|
|
The cram reference data is required for the read of sequence data in CRAM
|
|
format. The sequence data file may have it as embedded or as a reference to
|
|
the actual file. When it is a reference, it is downloaded locally, in the
|
|
cache directory for later usage. It will be stored in a directory structure
|
|
based on the MD5 checksum in the cache directory.
|
|
|
|
Each chromosome in a reference file gets saved as a separate file with md5sum
|
|
as its path and name. The initial 4 numerals make the directory name and rest
|
|
as the file name (<cache path>/<1st 2 of md5sum>/<2nd 2 of md5sum>/<rest of
|
|
md5sum>).
|
|
|
|
The download would be attempted from standard location, EBI ENA
|
|
(https://www.ebi.ac.uk/ena).
|
|
|
|
|
|
### Bam1_t
|
|
|
|
This structure holds the sequence data in BAM format. There are fixed and
|
|
variable size fields, basic and extended information on sequence
|
|
data. Variable size data and extended information are kept together in a
|
|
buffer, named data in bam1_t. Fields in the member named core, bam1_core_t,
|
|
and a few macros together support the storage and handling of the whole
|
|
sequence data.
|
|
|
|
- core has a link to reference as a 0 based index in field tid. The mate /
|
|
reverse strand's link to reference is given by mtid.
|
|
|
|
- Field pos and mpos gives the position in reference to which the sequence and
|
|
its mate / reverse strand match.
|
|
|
|
- Field flag gives the properties of the given alignment. It shows the
|
|
alignment's orientation, mate status, read order etc.
|
|
|
|
- Field qual gives the quality of the alignment read.
|
|
|
|
- l_qname gives the length of the name of the alignment / read, l_extranul gives
|
|
the extra space used internally in the data field.
|
|
|
|
- l_qseq gives the length of the alignment / read in the data field.
|
|
|
|
-- n_cigar gives the number of CIGAR operations for the given alignment.
|
|
|
|
- isize gives the insert size of the read / alignment.
|
|
|
|
The bases in sequence data are stored by compressing 2 bases together in a
|
|
byte. When the reverse flag is set, the base data is reversed and
|
|
complemented from the actual read (i.e. if the forward read is ACTG, the
|
|
reverse read to be CAGT; it will be stored in SAM format with reversed and
|
|
complemented format as ACTG with reverse flag set).
|
|
|
|
Macros bam_get_qname, bam_get_seq, bam_get_qual, bam_get_aux, bam_get_l_aux,
|
|
bam_seqi etc access the data field and retrieve the required data. The aux
|
|
macros support the retrieval of auxiliary data from the data field.
|
|
|
|
|
|
### Sam_hdr_t
|
|
|
|
This structure holds the header information. This holds the number of targets
|
|
/ SQ lines in the file, each one's length, name and reference count to this
|
|
structure. It also has this information in an internal data structure for
|
|
easier access of each field of this data.
|
|
|
|
When this data is shared or assigned to another variable of a different scope
|
|
or purpose, the reference count needs to be incremented to ensure that it is
|
|
valid till the end of the variable's scope. sam_hdr_incr_ref and it needs to
|
|
be destroyed as many times with sam_hdr_destroy api.
|
|
|
|
|
|
### Index
|
|
|
|
Indices need the data to be sorted by position. They can be of different
|
|
types with extension .bai, .csi or .tbi for compressed SAM/BAM/VCF files and
|
|
.crai for CRAM files. The index name can be passed along with the alignment
|
|
file itself by appending a specific character sequence. The apis can detect this
|
|
sequence and extract the index path. ##idx## is the sequence which separates
|
|
the file path and index path.
|
|
|
|
|
|
### Data files
|
|
|
|
The data files can be a local file, a network file, a file accessible through
|
|
the web or in cloud storage like google and amazon. The data files can be
|
|
represented with URIs like file://, file://localhost/.., ,ftp://..,
|
|
gs+http[s].., s3+http[s]://
|
|
|