'\" t
.TH faidx 5 "June 2018" "htslib" "Bioinformatics formats"
.SH NAME
faidx \- an index enabling random access to FASTA and FASTQ files
.\"
.\" Copyright (C) 2013, 2015, 2018 Genome Research Ltd.
.\"
.\" Author: John Marshall <jm18@sanger.ac.uk>
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the "Software"),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, sublicense,
.\" and/or sell copies of the Software, and to permit persons to whom the
.\" Software is furnished to do so, subject to the following conditions:
.\"
.\" The above copyright notice and this permission notice shall be included in
.\" all copies or substantial portions of the Software.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
.\" IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
.\" FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
.\" THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
.\" LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
.\" FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
.\" DEALINGS IN THE SOFTWARE.
.\"
.SH SYNOPSIS
.IR file.fa .fai,
.IR file.fasta .fai,
.IR file.fq .fai,
.IR file.fastq .fai
.SH DESCRIPTION
Using an \fBfai index\fP file in conjunction with a FASTA/FASTQ file containing
reference sequences enables efficient access to arbitrary regions within
those reference sequences.
The index file typically has the same filename as the corresponding FASTA/FASTQ
file, with \fB.fai\fP appended.
.P
An \fBfai index\fP file is a text file consisting of lines each with
five TAB-delimited columns for a FASTA file and six for FASTQ:
.TS
lbl.
NAME	Name of this reference sequence
LENGTH	Total length of this reference sequence, in bases
OFFSET	Offset in the FASTA/FASTQ file of this sequence's first base
LINEBASES	The number of bases on each line
LINEWIDTH	The number of bytes in each line, including the newline
QUALOFFSET	Offset of sequence's first quality within the FASTQ file
.TE
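.P
As an illustration, the columns above could be read into a typed record with a
short Python sketch such as the following.
The \fBFaiRecord\fP and \fBparse_fai_line\fP names are hypothetical and are not
part of htslib or samtools; the sketch assumes each index line is well formed.
.nf

```python
from typing import NamedTuple, Optional


class FaiRecord(NamedTuple):
    """One line of an fai index (names chosen for this sketch only)."""
    name: str
    length: int
    offset: int
    linebases: int
    linewidth: int
    qualoffset: Optional[int] = None  # sixth column, present only for FASTQ


def parse_fai_line(line: str) -> FaiRecord:
    """Split one TAB-delimited index line into typed fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 columns, got {len(fields)}")
    name, *numbers = fields
    return FaiRecord(name, *(int(n) for n in numbers))


print(parse_fai_line("one\t66\t5\t30\t31"))
```

.fi
.P
A five-column (FASTA) line leaves \fBqualoffset\fP unset, so callers can tell
the two index flavours apart without a separate flag.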
.P
The \fBNAME\fP and \fBLENGTH\fP columns contain the same
data as would appear in the \fBSN\fP and \fBLN\fP fields of a
SAM \fB@SQ\fP header for the same reference sequence.
.P
The \fBOFFSET\fP column contains the offset within the FASTA/FASTQ file, in
bytes starting from zero, of the first base of this reference sequence, i.e., of
the character following the newline at the end of the header line (the
"\fB>\fP" line in FASTA, "\fB@\fP" in FASTQ). Typically the lines of a
\fBfai index\fP file appear in the order in which the reference sequences
appear in the FASTA/FASTQ file, so \fB.fai\fP files are typically sorted
according to this column.
.P
The \fBLINEBASES\fP column contains the number of bases in each of the sequence
lines that form the body of this reference sequence, apart from the final line
which may be shorter.
The \fBLINEWIDTH\fP column contains the number of \fIbytes\fP in each of
the sequence lines (except perhaps the final line), thus differing from
\fBLINEBASES\fP in that it also counts the bytes forming the line terminator.
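.P
Together, \fBOFFSET\fP, \fBLINEBASES\fP and \fBLINEWIDTH\fP are enough to
locate any base without reading the preceding lines.
A minimal Python sketch of that arithmetic follows; the function name
\fBbase_byte_offset\fP is hypothetical, and positions are assumed to be
zero-based and within the sequence.
.nf

```python
def base_byte_offset(offset: int, linebases: int, linewidth: int, pos: int) -> int:
    """Byte offset in the file of the base at 0-based position pos."""
    # Whole sequence lines before the base, and the column within its line.
    full_lines, column = divmod(pos, linebases)
    # Each whole line costs LINEWIDTH bytes (bases plus line terminator).
    return offset + full_lines * linewidth + column


# Entry "two" of a Unix-formatted index: OFFSET 98, LINEBASES 14, LINEWIDTH 15.
print(base_byte_offset(98, 14, 15, 0))   # first base of the sequence
print(base_byte_offset(98, 14, 15, 14))  # first base of the second line
```

.fi
.P
With these values the two calls give 98 and 113: the second line starts one
full \fBLINEWIDTH\fP (15 bytes) after the first.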
.P
The \fBQUALOFFSET\fP column works the same way as \fBOFFSET\fP but for the first
quality score of this reference sequence. This would be the first character
following the newline at the end of the "\fB+\fP" line.
This column is present for FASTQ files only.
.SS FASTA Files
In order to be indexed with \fBsamtools faidx\fP, a FASTA file must be a text
file of the form
.LP
.RS
.RI > name
.RI [ description ...]
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
.RI > name
.RI [ description ...]
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
[...]
.RE
.LP
In particular, each reference sequence must be "well-formatted", i.e., all
of its sequence lines must be the same length, apart from the final sequence
line which may be shorter.
(While this sequence line length must be the same within each sequence,
it may vary between different reference sequences in the same FASTA file.)
.P
This also means that although the FASTA file may have Unix- or Windows-style
or other line termination, the newline characters present must be consistent,
at least within each reference sequence.
.P
The \fBsamtools\fP implementation uses the first word of the "\fB>\fP" header
line text (i.e., up to the first whitespace character, having skipped any
initial whitespace after the ">") as the \fBNAME\fP column.
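.P
The naming rule just described can be sketched in a few lines of Python.
This is an illustration of the rule only, not the samtools code, and the
function name \fBfasta_name\fP is hypothetical.
.nf

```python
def fasta_name(header_line: str) -> str:
    """Return the first word after '>' in a FASTA header line."""
    if not header_line.startswith(">"):
        raise ValueError("not a FASTA header line")
    # str.split() with no argument skips leading whitespace and
    # splits on any run of whitespace, matching the rule above.
    words = header_line[1:].split()
    return words[0] if words else ""


print(fasta_name(">two another chromosome"))
```

.fi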
.SS FASTQ Files
FASTQ files to be indexed follow the same rules as FASTA files, and take the
form
.LP
.RS
.RI @ name
.RI [ description ...]
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
.RI +
.br
FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
.br
HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
.br
8011<<
.br
.RI @ name
.RI [ description ...]
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
.RI +
.br
IIA94445EEII==
.br
=>IIIIIIIIICCC
.br
[...]
.RE
.LP
Quality lines must be wrapped at the same length as the corresponding
sequence lines.
.SH EXAMPLE
For example, given this FASTA file
.LP
.RS
>one
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
>two another chromosome
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
.RE
.LP
formatted with Unix-style (LF) line termination, the corresponding fai index
would be
.RS
.TS
lnnnn.
one	66	5	30	31
two	28	98	14	15
.TE
.RE
.LP
If the FASTA file were formatted with Windows-style (CR-LF) line termination,
the fai index would be
.RS
.TS
lnnnn.
one	66	6	30	32
two	28	103	14	16
.TE
.RE
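.P
The two tables above can be reproduced by walking the file line by line and
counting bytes.
The following Python sketch does so for in-memory FASTA data; it is not the
samtools implementation, the name \fBbuild_fai\fP is hypothetical, and it
assumes well-formatted input (it does not check that line lengths are
consistent).
.nf

```python
from typing import List, Tuple


def build_fai(data: bytes) -> List[Tuple[str, int, int, int, int]]:
    """Index FASTA bytes as (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH)."""
    entries = []
    position = 0    # byte offset of the start of the current line
    current = None  # [name, length, offset, linebases, linewidth]
    for raw in data.splitlines(keepends=True):
        if raw.startswith(b">"):
            if current:
                entries.append(tuple(current))
            words = raw[1:].split()
            name = words[0].decode("ascii") if words else ""
            # The first base follows the header's line terminator.
            current = [name, 0, position + len(raw), 0, 0]
        elif current is not None:
            bases = len(raw.rstrip(b"\r\n"))
            current[1] += bases
            if current[3] == 0:  # the first sequence line sets the layout
                current[3], current[4] = bases, len(raw)
        position += len(raw)
    if current:
        entries.append(tuple(current))
    return entries


LINES = [
    b">one",
    b"ATGCATGCATGCATGCATGCATGCATGCAT",
    b"GCATGCATGCATGCATGCATGCATGCATGC",
    b"ATGCAT",
    b">two another chromosome",
    b"ATGCATGCATGCAT",
    b"GCATGCATGCATGC",
]
unix = b"".join(line + b"\n" for line in LINES)
windows = b"".join(line + b"\r\n" for line in LINES)

for entry in build_fai(unix) + build_fai(windows):
    print(*entry, sep="\t")
```

.fi
.P
Only \fBOFFSET\fP and \fBLINEWIDTH\fP change between the two runs, because
each CR-LF terminator adds one byte per line while the base counts stay the
same.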
.LP
Similarly, this example FASTQ file
.LP
.RS
@fastq1
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
+
.br
FFFA@@FFFFFFFFFFHHB:::@BFFFFGG
.br
HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF
.br
8011<<
.br
@fastq2
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
+
.br
IIA94445EEII==
.br
=>IIIIIIIIICCC
.br
.RE
.LP
formatted with Unix-style (LF) line termination, would give this fai index
.RS
.TS
lnnnnn.
fastq1	66	8	30	31	79
fastq2	28	156	14	15	188
.TE
.RE
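.P
As a check on the FASTQ index above, the following Python sketch rebuilds the
example file in memory and uses the \fBfastq2\fP entry to pull out its bases
and qualities.
The helper name \fBfetch\fP is hypothetical, and the sketch assumes Unix-style
line termination.
.nf

```python
def fetch(data: bytes, start: int, length: int, linebases: int, linewidth: int) -> bytes:
    """Read `length` symbols starting at byte `start`, skipping line terminators."""
    out = bytearray()
    # Successive sequence (or quality) lines start LINEWIDTH bytes apart.
    for line_start in range(start, len(data), linewidth):
        remaining = length - len(out)
        if remaining <= 0:
            break
        out += data[line_start:line_start + min(linebases, remaining)]
    return bytes(out)


LINES = [
    b"@fastq1",
    b"ATGCATGCATGCATGCATGCATGCATGCAT",
    b"GCATGCATGCATGCATGCATGCATGCATGC",
    b"ATGCAT",
    b"+",
    b"FFFA@@FFFFFFFFFFHHB:::@BFFFFGG",
    b"HIHIIIIIIIIIIIIIIIIIIIIIIIFFFF",
    b"8011<<",
    b"@fastq2",
    b"ATGCATGCATGCAT",
    b"GCATGCATGCATGC",
    b"+",
    b"IIA94445EEII==",
    b"=>IIIIIIIIICCC",
]
data = b"".join(line + b"\n" for line in LINES)

# fastq2: LENGTH 28, OFFSET 156, LINEBASES 14, LINEWIDTH 15, QUALOFFSET 188.
sequence = fetch(data, 156, 28, 14, 15)
quality = fetch(data, 188, 28, 14, 15)
print(sequence.decode("ascii"))
print(quality.decode("ascii"))
```

.fi
.P
The same helper serves both columns because quality lines are wrapped at the
same length as the sequence lines; only the starting offset differs.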
.SH SEE ALSO
.IR samtools (1)
.TP
https://en.wikipedia.org/wiki/FASTA_format
.TP
https://en.wikipedia.org/wiki/FASTQ_format
Further description of the FASTA and FASTQ formats