updated manual
parent 7544aca718
commit f1517b845c
113
bwa.1
|
|
@ -1,4 +1,4 @@
|
||||||
.TH bwa 1 "24 October 2011" "bwa-0.6.0" "Bioinformatics tools"
|
.TH bwa 1 "12 November 2011" "bwa-0.6.0" "Bioinformatics tools"
|
||||||
.SH NAME
|
.SH NAME
|
||||||
.PP
|
.PP
|
||||||
bwa - Burrows-Wheeler Alignment Tool
|
bwa - Burrows-Wheeler Alignment Tool
|
||||||
|
|
@ -20,19 +20,19 @@ BWA is a fast light-weighted tool that aligns relatively short sequences
|
||||||
(queries) to a sequence database (target), such as the human reference
|
(queries) to a sequence database (target), such as the human reference
|
||||||
genome. It implements two different algorithms, both based on
|
genome. It implements two different algorithms, both based on
|
||||||
Burrows-Wheeler Transform (BWT). The first algorithm is designed for
|
Burrows-Wheeler Transform (BWT). The first algorithm is designed for
|
||||||
short queries up to ~200bp with low error rate (<3%). It does gapped
|
short queries up to ~150bp with low error rate (<3%). It does gapped
|
||||||
global alignment w.r.t. queries, supports paired-end reads, and is one
|
global alignment w.r.t. queries, supports paired-end reads, and is one
|
||||||
of the fastest short read alignment algorithms to date while also
|
of the fastest short read alignment algorithms to date while also
|
||||||
visiting suboptimal hits. The second algorithm, BWA-SW, is designed for
|
visiting suboptimal hits. The second algorithm, BWA-SW, is designed for
|
||||||
long reads with more errors. It performs heuristic Smith-Waterman-like
|
reads longer than 100bp with more errors. It performs a heuristic Smith-Waterman-like
|
||||||
alignment to find high-scoring local hits (and thus chimera). On
|
alignment to find high-scoring local hits and split hits. On
|
||||||
low-error short queries, BWA-SW is slower and less accurate than the
|
low-error short queries, BWA-SW is a little slower and less accurate than the
|
||||||
first algorithm, but on long queries, it is better.
|
first algorithm, but on long queries, it is better.
|
||||||
.PP
|
.PP
|
||||||
For both algorithms, the database file in the FASTA format must be
|
For both algorithms, the database file in the FASTA format must be
|
||||||
first indexed with the
|
first indexed with the
|
||||||
.B `index'
|
.B `index'
|
||||||
command, which typically takes a few hours. The first algorithm is
|
command, which typically takes a few hours for a 3GB genome. The first algorithm is
|
||||||
implemented via the
|
implemented via the
|
||||||
.B `aln'
|
.B `aln'
|
||||||
command, which finds the suffix array (SA) coordinates of good hits of
|
command, which finds the suffix array (SA) coordinates of good hits of
|
||||||
|
|
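The description in the hunk above says that `aln` reports suffix array (SA) coordinates of hits in a BWT-based index. As a rough illustration only, here is a toy exact-match backward search in Python. It is a sketch under simplifying assumptions (naive suffix-array construction, no mismatches or gaps, a `$` terminator on the text) and is not BWA's implementation; the function names are hypothetical.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; acceptable only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    # BWT character = the character preceding each sorted suffix
    # (text[-1] is the '$' terminator when the suffix starts at 0).
    return "".join(text[i - 1] for i in sa)


def backward_search(text, pattern):
    """Return (sa, (lo, hi)): suffixes sa[lo:hi] start with pattern."""
    assert text.endswith("$"), "text must end with the '$' terminator"
    sa = suffix_array(text)
    bwt = bwt_from_sa(text, sa)
    alphabet = sorted(set(text))

    # C[c] = number of characters in text that sort strictly before c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)

    # Extend the match one character at a time, from the pattern's end.
    lo, hi = 0, len(text)
    for ch in reversed(pattern):
        if ch not in C:
            return sa, (0, 0)
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return sa, (0, 0)
    return sa, (lo, hi)


if __name__ == "__main__":
    sa, (lo, hi) = backward_search("GATTACATTA$", "ATTA")
    # SA interval -> positions in the original text
    print(sorted(sa[lo:hi]))
```

The SA interval `[lo, hi)` plays the role of the "SA coordinates" mentioned in the text; converting it to positions in the reference is a separate lookup into the suffix array.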
@ -72,8 +72,7 @@ reimplemented by Yuta Mori.
|
||||||
.TP
|
.TP
|
||||||
.B bwtsw
|
.B bwtsw
|
||||||
Algorithm implemented in BWT-SW. This method works with the whole human
|
Algorithm implemented in BWT-SW. This method works with the whole human
|
||||||
genome, but it does not work with database smaller than 10MB and it is
|
genome.
|
||||||
usually slower than IS.
|
|
||||||
.RE
|
.RE
|
||||||
.RE
|
.RE
|
||||||
|
|
||||||
|
|
@ -260,9 +259,17 @@ Specify the read group in a format like `@RG\\tID:foo\\tSM:bar'. [null]
|
||||||
.B bwasw
|
.B bwasw
|
||||||
bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t
|
bwa bwasw [-a matchScore] [-b mmPen] [-q gapOpenPen] [-r gapExtPen] [-t
|
||||||
nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N
|
nThreads] [-w bandWidth] [-T thres] [-s hspIntv] [-z zBest] [-N
|
||||||
nHspRev] [-c thresCoef] <in.db.fasta> <in.fq>
|
nHspRev] [-c thresCoef] <in.db.fasta> <in.fq> [mate.fq]
|
||||||
|
|
||||||
Align query sequences in the <in.fq> file.
|
Align query sequences in the
|
||||||
|
.I in.fq
|
||||||
|
file. When
|
||||||
|
.I mate.fq
|
||||||
|
is present, perform paired-end alignment. The paired-end mode only works
|
||||||
|
for reads from Illumina short-insert libraries. In the paired-end mode, BWA-SW
|
||||||
|
may still output split alignments but they are all marked as not properly
|
||||||
|
paired; the mate positions will not be written if the mate has multiple
|
||||||
|
local hits.
|
||||||
|
|
||||||
.B OPTIONS:
|
.B OPTIONS:
|
||||||
.RS
|
.RS
|
||||||
|
|
@ -413,20 +420,19 @@ subsequence contains no more than
|
||||||
differences.
|
differences.
|
||||||
.PP
|
.PP
|
||||||
When gapped alignment is disabled, BWA is expected to generate the same
|
When gapped alignment is disabled, BWA is expected to generate the same
|
||||||
alignment as Eland, the Illumina alignment program. However, as BWA
|
alignment as Eland version 1, the Illumina alignment program. However, as BWA
|
||||||
change `N' in the database sequence to random nucleotides, hits to these
|
change `N' in the database sequence to random nucleotides, hits to these
|
||||||
random sequences will also be counted. As a consequence, BWA may mark a
|
random sequences will also be counted. As a consequence, BWA may mark a
|
||||||
unique hit as a repeat, if the random sequences happen to be identical
|
unique hit as a repeat, if the random sequences happen to be identical
|
||||||
to the sequences which should be unique in the database. This random
|
to the sequences which should be unique in the database.
|
||||||
behaviour will be avoided in future releases.
|
|
||||||
.PP
|
.PP
|
||||||
By default, if the best hit is no so repetitive (controlled by -R), BWA
|
By default, if the best hit is not highly repetitive (controlled by -R), BWA
|
||||||
also finds all hits contains one more mismatch; otherwise, BWA finds all
|
also finds all hits contains one more mismatch; otherwise, BWA finds all
|
||||||
equally best hits only. Base quality is NOT considered in evaluating
|
equally best hits only. Base quality is NOT considered in evaluating
|
||||||
hits. In paired-end alignment, BWA pairs all hits it found. It further
|
hits. In the paired-end mode, BWA pairs all hits it found. It further
|
||||||
performs Smith-Waterman alignment for unmapped reads with mates mapped
|
performs Smith-Waterman alignment for unmapped reads to rescue reads with a
|
||||||
to rescue mapped mates, and for high-quality anomalous pairs to fix
|
high error rate, and for high-quality anomalous pairs to fix potential alignment
|
||||||
potential alignment errors.
|
errors.
|
||||||
|
|
||||||
.SS Estimating Insert Size Distribution
|
.SS Estimating Insert Size Distribution
|
||||||
.PP
|
.PP
|
||||||
|
|
@ -447,20 +453,20 @@ error output.
|
||||||
|
|
||||||
.SS Memory Requirement
|
.SS Memory Requirement
|
||||||
.PP
|
.PP
|
||||||
With bwtsw algorithm, 2.5GB memory is required for indexing the complete
|
With bwtsw algorithm, 5GB memory is required for indexing the complete
|
||||||
human genome sequences. For short reads, the
|
human genome sequences. For short reads, the
|
||||||
.B `aln'
|
.B aln
|
||||||
command uses ~2.3GB memory and the
|
command uses ~3.2GB memory and the
|
||||||
.B `sampe'
|
.B sampe
|
||||||
command uses ~3.5GB.
|
command uses ~5.4GB.
|
||||||
|
|
||||||
.SS Speed
|
.SS Speed
|
||||||
.PP
|
.PP
|
||||||
Indexing the human genome sequences takes 3 hours with bwtsw
|
Indexing the human genome sequences takes 3 hours with bwtsw
|
||||||
algorithm. Indexing smaller genomes with IS or divsufsort algorithms is
|
algorithm. Indexing smaller genomes with the IS algorithm is
|
||||||
several times faster, but requires more memory.
|
faster, but requires more memory.
|
||||||
.PP
|
.PP
|
||||||
Speed of alignment is largely determined by the error rate of the query
|
The speed of alignment is largely determined by the error rate of the query
|
||||||
sequences (r). Firstly, BWA runs much faster for near perfect hits than
|
sequences (r). Firstly, BWA runs much faster for near perfect hits than
|
||||||
for hits with many differences, and it stops searching for a hit with
|
for hits with many differences, and it stops searching for a hit with
|
||||||
l+2 differences if a l-difference hit is found. This means BWA will be
|
l+2 differences if a l-difference hit is found. This means BWA will be
|
||||||
|
|
@ -475,36 +481,39 @@ r>0.02.
|
||||||
Pairing is slower for shorter reads. This is mainly because shorter
|
Pairing is slower for shorter reads. This is mainly because shorter
|
||||||
reads have more spurious hits and converting SA coordinates to
|
reads have more spurious hits and converting SA coordinates to
|
||||||
chromosomal coordinates are very costly.
|
chromosomal coordinates are very costly.
|
||||||
.PP
|
|
||||||
In a practical experiment, BWA is able to map 2 million 32bp reads to a
|
|
||||||
bacterial genome in several minutes, map the same amount of reads to
|
|
||||||
human X chromosome in 8-15 minutes and to the human genome in 15-25
|
|
||||||
minutes. This result implies that the speed of BWA is insensitive to the
|
|
||||||
size of database and therefore BWA is more efficient when the database
|
|
||||||
is sufficiently large. On smaller genomes, hash based algorithms are
|
|
||||||
usually much faster.
|
|
||||||
|
|
||||||
.SH NOTES ON LONG-READ ALIGNMENT
|
.SH NOTES ON LONG-READ ALIGNMENT
|
||||||
.PP
|
.PP
|
||||||
Command
|
Command
|
||||||
.B `bwasw'
|
.B bwasw
|
||||||
is designed for long-read alignment. The algorithm behind, BWA-SW, is
|
is designed for long-read alignment. BWA-SW essentially aligns the trie
|
||||||
similar to BWT-SW, but does not guarantee to find all local hits due to
|
of the reference genome against the directed acyclic word graph (DAWG) of a
|
||||||
the heuristic acceleration. It tends to be faster and more accurate if
|
read to find seeds not highly repetitive in the genome, and then performs a
|
||||||
the resultant alignment is supported by more seeds, and therefore
|
standard Smith-Waterman algorithm to extend the seeds. A key heuristic, called
|
||||||
BWA-SW usually performs better on long queries than on short ones.
|
the Z-best heuristic, is that at each vertex in the DAWG, BWA-SW only keeps the
|
||||||
|
top Z reference suffix intervals that match the vertex. BWA-SW is more accurate
|
||||||
|
if the resultant alignment is supported by more seeds, and therefore BWA-SW
|
||||||
|
usually performs better on long queries or queries with low divergence to the
|
||||||
|
reference genome.
|
||||||
|
|
||||||
On 350-1000bp reads, BWA-SW is several to tens of times faster than the
|
BWA-SW is perhaps a better choice than BWA-short for 100bp single-end HiSeq reads
|
||||||
existing programs. Its accuracy is comparable to SSAHA2, more accurate
|
mainly because it gives better gapped alignment. For paired-end reads, it is not yet
|
||||||
than BLAT. Like BLAT, BWA-SW also finds chimera which may pose a
|
known whether BWA-short or BWA-SW yields better results overall.
|
||||||
challenge to SSAHA2. On 10-100kbp queries where chimera detection is
|
|
||||||
important, BWA-SW is over 10X faster than BLAT while being more
|
|
||||||
sensitive.
|
|
||||||
|
|
||||||
BWA-SW can also be used to align ~100bp reads, but it is slower than
|
.SH CHANGES IN BWA-0.6
|
||||||
the short-read algorithm. Its sensitivity and accuracy is lower than
|
.PP
|
||||||
SSAHA2 especially when the sequencing error rate is above 2%. This is
|
Since version 0.6, BWA has been able to work with a reference genome longer than 4GB.
|
||||||
the trade-off of the 30X speed up in comparison to SSAHA2's -454 mode.
|
This feature makes it possible to integrate the forward and reverse complemented
|
||||||
|
genome in one FM-index, which speeds up both BWA-short and BWA-SW. As a tradeoff,
|
||||||
|
BWA uses more memory because it has to keep all positions and ranks in 64-bit
|
||||||
|
integers, twice the size of the 32-bit integers used in the previous versions.
|
||||||
|
|
||||||
|
The latest BWA-SW also works for paired-end reads longer than 100bp. In
|
||||||
|
comparison to BWA-short, BWA-SW tends to be more accurate for highly unique
|
||||||
|
reads and more robust to relatively long INDELs and structural variants.
|
||||||
|
Nonetheless, BWA-short usually has higher power to distinguish the optimal hit
|
||||||
|
from many suboptimal hits. The choice of the mapping algorithm may depend on
|
||||||
|
the application.
|
||||||
|
|
||||||
.SH SEE ALSO
|
.SH SEE ALSO
|
||||||
BWA website <http://bio-bwa.sourceforge.net>, Samtools website
|
BWA website <http://bio-bwa.sourceforge.net>, Samtools website
|
||||||
|
|
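The "CHANGES IN BWA-0.6" text in the hunk above links the 4GB limit and the extra memory to the switch from 32-bit to 64-bit integers. The arithmetic can be sketched as follows. The 3-billion-base genome size is an illustrative assumption, the helper names are hypothetical, and the byte counts describe a plain array with one integer per entry, not BWA's actual index layout.

```python
MAX_32BIT_POSITIONS = 2**32  # distinct values of an unsigned 32-bit integer


def fits_in_32_bits(n_positions):
    """True if every position 0..n_positions-1 fits in an unsigned 32-bit int."""
    return n_positions <= MAX_32BIT_POSITIONS


def array_bytes(n_entries, bytes_per_entry):
    """Memory for an array storing one integer per entry."""
    return n_entries * bytes_per_entry


if __name__ == "__main__":
    genome = 3_000_000_000      # assumed genome length in bases (illustrative)
    both_strands = 2 * genome   # forward plus reverse-complemented sequence

    # One strand fits under the 32-bit limit; both strands together do not.
    print(fits_in_32_bits(genome), fits_in_32_bits(both_strands))

    # The same number of entries costs twice the memory at 8 bytes per entry.
    n = 1_000_000               # arbitrary number of stored positions
    print(array_bytes(n, 4), array_bytes(n, 8))
```

Under these assumptions, a single 3-billion-base strand is addressable with 32-bit positions, but the combined forward and reverse-complemented sequence is not, which is consistent with the manual's statement that a single combined index requires 64-bit positions and ranks.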
@ -529,12 +538,12 @@ If you use the short-read alignment component, please cite the following
|
||||||
paper:
|
paper:
|
||||||
.PP
|
.PP
|
||||||
Li H. and Durbin R. (2009) Fast and accurate short read alignment with
|
Li H. and Durbin R. (2009) Fast and accurate short read alignment with
|
||||||
Burrows-Wheeler transform. Bioinformatics, 25, 1754-60. [PMID: 19451168]
|
Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]
|
||||||
.PP
|
.PP
|
||||||
If you use the long-read component (BWA-SW), please cite:
|
If you use the long-read component (BWA-SW), please cite:
|
||||||
.PP
|
.PP
|
||||||
Li H. and Durbin R. (2010) Fast and accurate long-read alignment with
|
Li H. and Durbin R. (2010) Fast and accurate long-read alignment with
|
||||||
Burrows-Wheeler transform. Bioinformatics. [PMID: 20080505]
|
Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505]
|
||||||
|
|
||||||
.SH HISTORY
|
.SH HISTORY
|
||||||
BWA is largely influenced by BWT-SW. It uses source codes from BWT-SW
|
BWA is largely influenced by BWT-SW. It uses source codes from BWT-SW
|
||||||
|
|
|
||||||