updated README

Heng Li 2014-05-14 14:32:09 -04:00
parent 061c63f36a
commit 1627f9dfae
1 changed files with 55 additions and 20 deletions


@ -8,15 +8,14 @@
###Introduction ###Introduction
BWA is a software package for mapping low-divergent sequences against a large BWA is a software package for mapping DNA sequences against a large reference
reference genome, such as the human genome. It consists of three algorithms: genome, such as the human genome. It consists of three algorithms:
BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina
sequence reads up to 100bp, while the rest two for longer sequences ranged from sequence reads up to 100bp, while the rest two for longer sequences ranged from
70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as the support of 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the
long reads and chimeric alignment, but BWA-MEM, which is the latest, is support of long reads and chimeric alignment, but BWA-MEM, which is the latest,
generally recommended as it is faster and more is generally recommended as it is faster and more accurate. BWA-MEM also has
accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp better performance than BWA-backtrack for 70-100bp Illumina reads.
Illumina reads.
For all the algorithms, BWA first needs to construct the FM-index for the For all the algorithms, BWA first needs to construct the FM-index for the
reference genome (the **index** command). Alignment algorithms are invoked with reference genome (the **index** command). Alignment algorithms are invoked with
@ -26,10 +25,10 @@ different sub-commands: **aln/samse/sampe** for BWA-backtrack,
###Availability ###Availability
BWA is released under [GPLv3][1]. The latest souce code is [freely BWA is released under [GPLv3][1]. The latest souce code is [freely
available][2] at github. Released packages can [be downloaded][3] at available at github][2]. Released packages can [be downloaded][3] at
SourceForge. After you acquire the source code, simply use `make` to compile SourceForge. After you acquire the source code, simply use `make` to compile
and copy the single executable `bwa` to the destination you want. The only and copy the single executable `bwa` to the destination you want. The only
dependency of BWA is [zlib][14]. dependency required to build BWA is [zlib][14].
###Seeking helps ###Seeking helps
@ -59,21 +58,37 @@ do not have plan to submit it to a peer-reviewed journal in the near future.
###Frequently asked questions (FAQs) ###Frequently asked questions (FAQs)
####What types of data does BWA work with? 1. [What types of data does BWA work with?](#type)
2. [Why does a read appear multiple times in the output SAM?](#multihit)
3. [Does BWA work on reference sequences longer than 4GB in total?](#4gb)
4. [Why can one read in a pair has high mapping quality but the other has zero?](#pe0)
5. [How can a BWA-backtrack alignment stands out of the end of a chromosome?](endref)
6. [How to map sequences to GRCh38 with ALT contigs?](#h38)
####<a href="type"></a>1. What types of data does BWA work with?
BWA works with a variety types of DNA sequence data, though the optimal BWA works with a variety types of DNA sequence data, though the optimal
algorithm and setting may vary. The following list gives the recommended algorithm and setting may vary. The following list gives the recommended
settings: settings:
* Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly * Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly
contigs up to a few megabases: contigs up to a few megabases mapped to a close related reference genome:
bwa mem ref.fa reads.fq > aln.sam bwa mem ref.fa reads.fq > aln.sam
* Illumina single-end reads no longer than ~70bp:
bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam
* Illumina/454/IonTorrent paired-end reads longer than ~70bp: * Illumina/454/IonTorrent paired-end reads longer than ~70bp:
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
* Illumina paired-end reads no longer than ~70bp:
bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
bwa samse ref.fa reads.sai reads.fq > aln-pe.sam
* PacBio subreads to a reference genome: * PacBio subreads to a reference genome:
bwa mem -x pacbio ref.fa reads.fq > aln.sam bwa mem -x pacbio ref.fa reads.fq > aln.sam
@ -82,20 +97,40 @@ settings:
bwa mem -x pbread reads.fq reads.fq > overlap.pas bwa mem -x pbread reads.fq reads.fq > overlap.pas
* Illumina single-end reads no longer than ~70bp: BWA-MEM is recommended for query sequences longer than ~70bp for a variety of
error rates (or sequence divergence). Generally, BWA-MEM is more tolerant with
errors given longer query sequences as the chance of missing all seeds is small.
As is shown above, with non-default settings, BWA-MEM works with PacBio subreads
with a sequencing error rate as high as ~15%.
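As a rough illustration of the ~70bp rule of thumb in the list above, the choice between BWA-MEM and BWA-backtrack can be sketched as a small shell helper. The function name `suggest_bwa_algorithm` is hypothetical and not part of BWA, and the hard cut-off at 70 is an assumption, since the README only gives an approximate threshold.

```shell
#!/bin/sh
# Hypothetical helper, not part of BWA: suggest an algorithm from the
# read length using the approximate 70bp threshold quoted in the README.
suggest_bwa_algorithm() {
    read_len=$1
    if [ "$read_len" -gt 70 ]; then
        echo "mem"        # reads longer than ~70bp: bwa mem
    else
        echo "backtrack"  # reads no longer than ~70bp: bwa aln + samse/sampe
    fi
}

suggest_bwa_algorithm 100
suggest_bwa_algorithm 36
```

This ignores the data type (PacBio subreads, for example, need `-x pacbio`), so it is only a starting point.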
bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam ####<a href="multihit"></a>2. Why does a read appear multiple times in the output SAM?
* Illumina paired-end reads no longer than ~70bp: BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene
fusion or a long deletion, a read bridging the break point may have two hits,
occupying two lines in the SAM output. With the default setting of BWA-MEM, one
and only one line is primary and is soft clipped; other lines are tagged with
0x800 SAM flag (supplementary alignment) and are hard clipped.
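To tell the lines of a multi-hit read apart, one can test bits of the SAM FLAG column. The sketch below is a minimal example; the helper name `classify_sam_flag` is made up for illustration, and it checks only three bits defined by the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary).

```shell
#!/bin/sh
# Hypothetical helper: classify a SAM FLAG value (column 2 of a SAM record).
# Only the 0x4, 0x100 and 0x800 bits are inspected; all others are ignored.
classify_sam_flag() {
    flag=$1
    if [ $(( flag & 0x4 )) -ne 0 ]; then
        echo "unmapped"
    elif [ $(( flag & 0x800 )) -ne 0 ]; then
        echo "supplementary"
    elif [ $(( flag & 0x100 )) -ne 0 ]; then
        echo "secondary"
    else
        echo "primary"
    fi
}

classify_sam_flag 2064   # 0x800 | 0x10: supplementary, reverse strand
classify_sam_flag 16     # 0x10 only: primary, reverse strand
```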
bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai ####<a href="4gb"></a>3. Does BWA work on reference sequences longer than 4GB in total?
bwa samse ref.fa reads.sai reads.fq > aln-pe.sam
####Why does a read appear multiple times in the output SAM? Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over
4GB. However, individual chromosome should not be longer than 2GB.
BWA-SW and BWA-MEM perform local alignments. ####<a href="pe0"></a>4. Why can one read in a pair has high mapping quality but the other has zero?
####How to map sequences to GRCh38 with ALT contigs? This is correct. Mapping quality is assigned for individual read, not for a read
pair. It is possible that one read can be mapped unambiguously, but its mate
falls in a tandem repeat and thus its accurate position cannot be determined.
####<a href="endref"></a>5. How can a BWA-backtrack alignment stands out of the end of a chromosome?
Internally BWA concatenates all reference sequences into one long sequence. A
read may be mapped to the junction of two adjacent reference sequences. In this
case, BWA-backtrack will flag the read as unmapped (0x4), but you will see
position, CIGAR and all the tags. A similar issue may occur to BWA-SW alignment
as well. BWA-MEM does not have this problem.
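One way to spot such records is to compare each alignment's end coordinate with the reference length declared in the `@SQ` header. The following is a sketch under assumptions: the function name `find_overhanging_reads` is hypothetical, the input is assumed to be a SAM stream that includes its header, and the reference span is computed from the CIGAR operations M, D, N, = and X.

```shell
#!/bin/sh
# Hypothetical helper: print the names of reads whose alignment, as given by
# POS and CIGAR, extends past the end of the reference sequence in RNAME.
# Reads SAM (with @SQ header lines) from the given files or from stdin.
find_overhanging_reads() {
    awk -F '\t' '
        /^@SQ/ {
            # Record the length (LN) of each reference sequence (SN).
            name = ""; len = 0
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^SN:/) name = substr($i, 4)
                if ($i ~ /^LN:/) len = substr($i, 4) + 0
            }
            ref_len[name] = len
            next
        }
        /^@/ { next }   # skip all other header lines
        $6 != "*" && ($3 in ref_len) {
            # Sum the CIGAR operations that consume reference bases.
            cigar = $6; span = 0
            while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
                n = substr(cigar, 1, RLENGTH - 1) + 0
                op = substr(cigar, RLENGTH, 1)
                if (op ~ /[MDN=X]/) span += n
                cigar = substr(cigar, RLENGTH + 1)
            }
            if ($4 + span - 1 > ref_len[$3]) print $1
        }
    ' "$@"
}

# Usage (aln.sam is a placeholder file name):
#   find_overhanging_reads aln.sam
```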
####<a href="h38"></a>6. How to map sequences to GRCh38 with ALT contigs?
BWA-backtrack and BWA-MEM partially support mapping to a reference containing BWA-backtrack and BWA-MEM partially support mapping to a reference containing
ALT contigs that represent alternative alleles highly divergent from the ALT contigs that represent alternative alleles highly divergent from the