210 lines
7.0 KiB
Groff
210 lines
7.0 KiB
Groff
.TH tabix 1 "12 September 2024" "htslib-1.21" "Bioinformatics tools"
|
|
.SH NAME
|
|
.PP
|
|
tabix \- Generic indexer for TAB-delimited genome position files
|
|
.\"
|
|
.\" Copyright (C) 2009-2011 Broad Institute.
|
|
.\" Copyright (C) 2014, 2016, 2018, 2020, 2022, 2024 Genome Research Ltd.
|
|
.\"
|
|
.\" Author: Heng Li <lh3@sanger.ac.uk>
|
|
.\"
|
|
.\" Permission is hereby granted, free of charge, to any person obtaining a
|
|
.\" copy of this software and associated documentation files (the "Software"),
|
|
.\" to deal in the Software without restriction, including without limitation
|
|
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
|
|
.\" and/or sell copies of the Software, and to permit persons to whom the
|
|
.\" Software is furnished to do so, subject to the following conditions:
|
|
.\"
|
|
.\" The above copyright notice and this permission notice shall be included in
|
|
.\" all copies or substantial portions of the Software.
|
|
.\"
|
|
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
|
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
|
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
|
|
.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
|
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
|
|
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
|
|
.\" DEALINGS IN THE SOFTWARE.
|
|
.\"
|
|
.SH SYNOPSIS
|
|
.PP
|
|
.B tabix
|
|
.RB [ -0lf ]
|
|
.RB [ -p
|
|
gff|bed|sam|vcf]
|
|
.RB [ -s
|
|
.IR seqCol ]
|
|
.RB [ -b
|
|
.IR begCol ]
|
|
.RB [ -e
|
|
.IR endCol ]
|
|
.RB [ -S
|
|
.IR lineSkip ]
|
|
.RB [ -c
|
|
.IR metaChar ]
|
|
.I in.tab.bgz
|
|
.RI [ "region1 " [ "region2 " [ ... "]]]"
|
|
|
|
.SH DESCRIPTION
|
|
.PP
|
|
Tabix indexes a TAB-delimited genome position file
|
|
.I in.tab.bgz
|
|
and creates an index file
|
|
.RI ( in.tab.bgz.tbi
|
|
or
|
|
.IR in.tab.bgz.csi )
|
|
when
|
|
.I region
|
|
is absent from the command-line. The input data file must be position
|
|
sorted and compressed by
|
|
.B bgzip
|
|
which has a
|
|
.BR gzip (1)
|
|
like interface.
|
|
|
|
After indexing, tabix is able to quickly retrieve data lines overlapping
|
|
.I regions
|
|
specified in the format "chr:beginPos-endPos".
|
|
(Coordinates specified in this region format are 1-based and inclusive.)
|
|
|
|
Fast data retrieval also
|
|
works over network if URI is given as a file name and in this case the
|
|
index file will be downloaded if it is not present locally.
|
|
|
|
The tabix
|
|
.RI ( .tbi )
|
|
and BAI index formats can handle individual chromosomes up to 512 Mbp
|
|
(2^29 bases) in length.
|
|
If your input file might contain data lines with begin or end positions
|
|
greater than that, you will need to use a CSI index.
|
|
|
|
Multiple threads can be used for operations except listing of sequence names.
|
|
|
|
.SH INDEXING OPTIONS
|
|
.TP 10
|
|
.B -0, --zero-based
|
|
Specify that the position in the data file is 0-based half-open
|
|
(e.g. UCSC files) rather than 1-based.
|
|
.TP
|
|
.BI "-b, --begin " INT
|
|
Column of start chromosomal position. [4]
|
|
.TP
|
|
.BI "-c, --comment " CHAR
|
|
Skip lines started with character CHAR. [#]
|
|
.TP
|
|
.BI "-C, --csi"
|
|
Produce CSI format index instead of classical tabix or BAI style indices.
|
|
.TP
|
|
.BI "-e, --end " INT
|
|
Column of end chromosomal position. The end column can be the same as the
|
|
start column. [5]
|
|
.TP
|
|
.B "-f, --force "
|
|
Force to overwrite the index file if it is present.
|
|
.TP
|
|
.BI "-m, --min-shift " INT
|
|
Set minimal interval size for CSI indices to 2^INT [14]
|
|
.TP
|
|
.BI "-p, --preset " STR
|
|
Input format for indexing. Valid values are: gff, bed, sam, vcf.
|
|
This option should not be applied together with any of
|
|
.BR -s ", " -b ", " -e ", " -c " and " -0 ;
|
|
it is not used for data retrieval because this setting is stored in
|
|
the index file. [gff]
|
|
.TP
|
|
.BI "-s, --sequence " INT
|
|
Column of sequence name. Option
|
|
.BR -s ", " -b ", " -e ", " -S ", " -c " and " -0
|
|
are all stored in the index file and thus not used in data retrieval. [1]
|
|
.TP
|
|
.BI "-S, --skip-lines " INT
|
|
Skip first INT lines in the data file. [0]
|
|
|
|
.SH QUERYING AND OTHER OPTIONS
|
|
.TP
|
|
.B "-h, --print-header "
|
|
Print also the header/meta lines.
|
|
.TP
|
|
.B "-H, --only-header "
|
|
Print only the header/meta lines.
|
|
.TP
|
|
.B "-l, --list-chroms "
|
|
List the sequence names stored in the index file.
|
|
.TP
|
|
.BI "-r, --reheader " FILE
|
|
Replace the header with the content of FILE
|
|
.TP
|
|
.BI "-R, --regions " FILE
|
|
Restrict to regions listed in the FILE. The FILE can be BED file (requires .bed, .bed.gz, .bed.bgz
|
|
file name extension) or a TAB-delimited file with CHROM, POS, and, optionally,
|
|
POS_TO columns, where positions are 1-based and inclusive. When this option is in use, the input
|
|
file may not be sorted.
|
|
.TP
|
|
.BI "-T, --targets " FILE
|
|
Similar to
|
|
.B -R
|
|
but the entire input will be read sequentially and regions not listed in FILE will be skipped.
|
|
.TP
|
|
.BI "-D "
|
|
Do not download the index file before opening it. Valid for remote files only.
|
|
.TP
|
|
.BI "--cache " INT
|
|
Set the BGZF block cache size to INT megabytes. [10]
|
|
|
|
This is of most benefit when the
|
|
.B -R
|
|
option is used, which can cause blocks to be read more than once.
|
|
Setting the size to 0 will disable the cache.
|
|
.TP
|
|
.B --separate-regions
|
|
This option can be used when multiple regions are supplied in the command line
|
|
and the user needs to quickly see which file records belong to which region.
|
|
For this, a line with the name of the region, preceded by the file specific
|
|
comment symbol, is inserted in the output before its corresponding group of
|
|
records.
|
|
.TP
|
|
.BI "--verbosity " INT
|
|
Set verbosity of logging messages printed to stderr.
|
|
The default is 3, which turns on error and warning messages;
|
|
2 reduces warning messages;
|
|
1 prints only error messages and 0 is mostly silent.
|
|
Values higher than 3 produce additional informational and debugging messages.
|
|
.TP
|
|
.BI "-@, --threads " INT
|
|
Set number of threads to use for the operation.
|
|
The default is 0, where no extra threads are in use.
|
|
.PP
|
|
.SH EXAMPLE
|
|
(grep "^#" in.gff; grep -v "^#" in.gff | sort -t"`printf '\(rst'`" -k1,1 -k4,4n) | bgzip > sorted.gff.gz;
|
|
|
|
tabix -p gff sorted.gff.gz;
|
|
|
|
tabix sorted.gff.gz chr1:10,000,000-20,000,000;
|
|
|
|
.SH NOTES
|
|
It is straightforward to achieve overlap queries using the standard
|
|
B-tree index (with or without binning) implemented in all SQL databases,
|
|
or the R-tree index in PostgreSQL and Oracle. But there are still many
|
|
reasons to use tabix. Firstly, tabix directly works with a lot of widely
|
|
used TAB-delimited formats such as GFF/GTF and BED. We do not need to
|
|
design database schema or specialized binary formats. Data do not need
|
|
to be duplicated in different formats, either. Secondly, tabix works on
|
|
compressed data files while most SQL databases do not. The GenCode
|
|
annotation GTF can be compressed down to 4%. Thirdly, tabix is
|
|
fast. The same indexing algorithm is known to work efficiently for an
|
|
alignment with a few billion short reads. SQL databases probably cannot
|
|
easily handle data at this scale. Last but not the least, tabix supports
|
|
remote data retrieval. One can put the data file and the index at an FTP
|
|
or HTTP server, and other users or even web services will be able to get
|
|
a slice without downloading the entire file.
|
|
|
|
.SH AUTHOR
|
|
.PP
|
|
Tabix was written by Heng Li. The BGZF library was originally
|
|
implemented by Bob Handsaker and modified by Heng Li for remote file
|
|
access and in-memory caching.
|
|
|
|
.SH SEE ALSO
|
|
.IR bgzip (1),
|
|
.IR samtools (1)
|