Based on bwa, with some optimizations and annotations
 
 
 
 
 
 
Heng Li c8ae5029b3 minor change to example 2014-10-17 16:19:21 -04:00
.gitignore Added Makefile.bak and bwamem-lite to .gitignore 2013-03-13 09:18:18 +00:00
COPYING Imported from my local bwa repository, the master repository. 2011-01-13 20:52:12 -05:00
ChangeLog Update to the latest modfication 0.5.9rc1-2. Update ChangeLog 2011-01-13 20:54:10 -05:00
Makefile added -lrt on Linux 2014-10-15 23:23:01 -04:00
NEWS.md removed <> from markdown 2014-09-18 11:54:23 -04:00
QSufSort.c removed a few unused variables 2013-02-23 13:26:50 -05:00
QSufSort.h move bwt_gen/* to the root directory 2011-10-20 11:56:24 -04:00
README-alt.md added section about the completeness of GRCh38 2014-10-17 15:42:17 -04:00
README.md r898: read the index into a single memory block 2014-10-15 12:27:45 -04:00
bamlite.c Removed more dependencies on utils.h 2013-05-03 11:38:48 +01:00
bamlite.h Removed more dependencies on utils.h 2013-05-03 11:38:48 +01:00
bntseq.c r896: more flexible ALT reading 2014-10-14 23:37:24 -04:00
bntseq.h r816: read .alt file (not tested) 2014-09-05 12:49:50 -04:00
bwa-helper.js separate postalt to a separate script 2014-09-19 23:23:54 -04:00
bwa-postalt.js wrong flag in the SAM output 2014-10-14 16:57:57 -04:00
bwa.1 Release bwa-0.7.9-r783 2014-05-19 09:09:11 -04:00
bwa.c r915: fixed broken example.c 2014-10-17 16:17:28 -04:00
bwa.h replaced Sys V shm with POSIX shm 2014-10-15 23:06:03 -04:00
bwamem.c r907: revert to -g.8 by default 2014-10-16 15:56:33 -04:00
bwamem.h r878: XA is given to the best alignment 2014-09-30 13:50:51 -04:00
bwamem_extra.c r878: XA is given to the best alignment 2014-09-30 13:50:51 -04:00
bwamem_pair.c These files were committed on a wrong branch 2014-09-18 10:49:35 -04:00
bwape.c r737: fixed an assertion when failed to convert sa 2014-04-30 14:55:44 -04:00
bwase.c Release bwa-0.7.9-r782 2014-05-19 09:08:07 -04:00
bwase.h removed color-space support 2013-02-12 10:21:17 -05:00
bwaseqio.c Removed more dependencies on utils.h 2013-05-03 11:38:48 +01:00
bwashm.c r912: very minor improvement 2014-10-17 12:14:18 -04:00
bwt.c r809: new strategy for the -a mode 2014-08-25 11:59:27 -04:00
bwt.h r809: new strategy for the -a mode 2014-08-25 11:59:27 -04:00
bwt_gen.c r807: allow to change block size in bwt_gen 2014-08-25 10:31:54 -04:00
bwt_lite.c r770: fixed a compiling warning 2014-05-14 14:44:03 -04:00
bwt_lite.h Fixed clang compiling warnings 2014-03-16 15:18:22 -04:00
bwtaln.c Release bwa-0.7.9-r782 2014-05-19 09:08:07 -04:00
bwtaln.h r397: multi changes/bugfixes to bwa-backtrack 2013-05-24 16:28:18 -04:00
bwtgap.c r397: multi changes/bugfixes to bwa-backtrack 2013-05-24 16:28:18 -04:00
bwtgap.h r397: multi changes/bugfixes to bwa-backtrack 2013-05-24 16:28:18 -04:00
bwtindex.c r808: a minor bug with the new index -b 2014-08-25 10:36:24 -04:00
bwtsw2.h r132: optionally copy FASTA/Q comment to SAM 2012-10-26 12:54:32 -04:00
bwtsw2_aux.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
bwtsw2_chain.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
bwtsw2_core.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
bwtsw2_main.c Ensure exit status of 1 if given invalid options or index files are not found. 2013-04-29 13:58:28 +01:00
bwtsw2_pair.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
example.c minor change to example 2014-10-17 16:19:21 -04:00
fastmap.c shm works on small files, but not large ones 2014-10-15 15:44:06 -04:00
is.c Removed more dependencies on utils.h 2013-05-03 11:38:48 +01:00
kbtree.h Release bwa-0.7.9-r782 2014-05-19 09:08:07 -04:00
khash.h Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
kopen.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
kseq.h Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
ksort.h Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
kstring.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
kstring.h Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
ksw.c r744: int overflow given MB query 2014-05-01 15:30:36 -04:00
ksw.h dev-448: different ins/del penalties 2014-03-28 10:54:23 -04:00
kthread.c use kthread for multi-threading 2013-11-02 12:13:11 -04:00
kvec.h Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
main.c r915: fixed broken example.c 2014-10-17 16:17:28 -04:00
malloc_wrap.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
malloc_wrap.h Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
pemerge.c Reduce dependency on utils.h - new malloc wrapping scheme. 2013-05-02 15:12:01 +01:00
qualfa2fq.pl Imported from my local bwa repository, the master repository. 2011-01-13 20:52:12 -05:00
utils.c r810: add err_puts() 2014-08-26 11:07:24 -04:00
utils.h r810: add err_puts() 2014-08-26 11:07:24 -04:00
xa2multi.pl Bugfix: reverse (complement) sequence and phred string if alternative alignment has different orientation than primary alignment 2011-09-07 14:31:28 +02:00

README.md

## Getting started

    git clone https://github.com/lh3/bwa.git
    cd bwa; make
    ./bwa index ref.fa
    ./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
    ./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz

## Introduction

BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are designed for longer sequences ranging from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features, such as support for long reads and chimeric alignment, but BWA-MEM, the latest of the three, is generally recommended because it is faster and more accurate. BWA-MEM also performs better than BWA-backtrack on 70-100bp Illumina reads.

For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the `index` command). Alignment algorithms are invoked with different sub-commands: `aln/samse/sampe` for BWA-backtrack, `bwasw` for BWA-SW and `mem` for the BWA-MEM algorithm.

## Availability

BWA is released under GPLv3. The latest source code is freely available on GitHub. Released packages can be downloaded from SourceForge. After you acquire the source code, simply run `make` to compile, then copy the single executable `bwa` to the destination you want. The only dependency required to build BWA is zlib.

## Seeking help

The detailed usage is described in the man page distributed with the source code. You can use `man ./bwa.1` to view the man page in a terminal. The HTML version of the man page can be found at the BWA website. If you have questions about BWA, you may sign up for the mailing list and then send your questions to bio-bwa-help@sourceforge.net. You may also ask questions in forums such as BioStar and SEQanswers.

## Citing BWA

  • Li H. and Durbin R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754-1760. [PMID: 19451168]. (if you use the BWA-backtrack algorithm)

  • Li H. and Durbin R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589-595. [PMID: 20080505]. (if you use the BWA-SW algorithm)

  • Li H. (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN]. (if you use the BWA-MEM algorithm or the fastmap command, or want to cite the whole BWA package)

Please note that the last reference is a preprint hosted at arXiv.org. I do not plan to submit it to a peer-reviewed journal in the near future.

## Frequently asked questions (FAQs)

  1. What types of data does BWA work with?
  2. Why does a read appear multiple times in the output SAM?
  3. Does BWA work on reference sequences longer than 4GB in total?
  4. Why can one read in a pair have high mapping quality while the other has zero?
  5. How can a BWA-backtrack alignment extend beyond the end of a chromosome?

#### 1. What types of data does BWA work with?

BWA works with a variety of DNA sequence data types, though the optimal algorithm and settings may vary. The following list gives the recommended settings:

  • Illumina/454/IonTorrent single-end reads longer than ~70bp or assembly contigs up to a few megabases mapped to a closely related reference genome:

      bwa mem ref.fa reads.fq > aln.sam
    
  • Illumina single-end reads shorter than ~70bp:

      bwa aln ref.fa reads.fq > reads.sai; bwa samse ref.fa reads.sai reads.fq > aln-se.sam
    
  • Illumina/454/IonTorrent paired-end reads longer than ~70bp:

      bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
    
  • Illumina paired-end reads shorter than ~70bp:

      bwa aln ref.fa read1.fq > read1.sai; bwa aln ref.fa read2.fq > read2.sai
      bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam
    
  • PacBio subreads or Oxford Nanopore reads to a reference genome:

      bwa mem -x pacbio ref.fa reads.fq > aln.sam
      bwa mem -x ont2d ref.fa reads.fq > aln.sam
    

BWA-MEM is recommended for query sequences longer than ~70bp across a variety of error rates (or levels of sequence divergence). Generally, BWA-MEM is more tolerant of errors given longer query sequences, as the chance of missing all seeds is small. As shown above, with non-default settings, BWA-MEM works with Oxford Nanopore reads with a sequencing error rate over 20%.
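The recommendations in the list above can be condensed into a small lookup. The following Python sketch is only an illustration: `recommended_commands` is a hypothetical helper, not part of BWA, the file names are the placeholders used in this README, and the hard cutoff of 70 stands in for the approximate "~70bp" boundary.

```python
def recommended_commands(platform, read_len, paired=False):
    """Return the bwa command lines suggested by the list above.

    Hypothetical helper: it only restates the README recommendations and
    treats the approximate ~70bp boundary as a hard cutoff.
    """
    platform = platform.lower()
    if platform == "pacbio":
        return ["bwa mem -x pacbio ref.fa reads.fq > aln.sam"]
    if platform == "nanopore":
        return ["bwa mem -x ont2d ref.fa reads.fq > aln.sam"]
    if platform not in ("illumina", "454", "iontorrent"):
        raise ValueError("platform not covered by the list: " + platform)

    if read_len >= 70:
        # Longer reads (and assembly contigs): BWA-MEM.
        if paired:
            return ["bwa mem ref.fa read1.fq read2.fq > aln-pe.sam"]
        return ["bwa mem ref.fa reads.fq > aln.sam"]

    # Short reads: the list only covers Illumina, using BWA-backtrack.
    if platform != "illumina":
        raise ValueError("short non-Illumina reads are not covered by the list")
    if paired:
        return [
            "bwa aln ref.fa read1.fq > read1.sai",
            "bwa aln ref.fa read2.fq > read2.sai",
            "bwa sampe ref.fa read1.sai read2.sai read1.fq read2.fq > aln-pe.sam",
        ]
    return [
        "bwa aln ref.fa reads.fq > reads.sai",
        "bwa samse ref.fa reads.sai reads.fq > aln-se.sam",
    ]


if __name__ == "__main__":
    for cmd in recommended_commands("illumina", 50, paired=True):
        print(cmd)
```

The function returns command strings rather than running them, so the output can be inspected or adapted before anything is executed.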

#### 2. Why does a read appear multiple times in the output SAM?

BWA-SW and BWA-MEM perform local alignments. If there is a translocation, a gene fusion or a long deletion, a read bridging the break point may have two hits, occupying two lines in the SAM output. With the default setting of BWA-MEM, one and only one line is primary and is soft clipped; the other lines are tagged with the 0x800 SAM flag (supplementary alignment) and are hard clipped.
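To tell the lines of such a read apart, the FLAG and CIGAR columns are enough. The sketch below is a minimal illustration rather than BWA code: `describe_record` is a hypothetical helper, and the two SAM lines are made-up records for one chimeric read.

```python
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def describe_record(sam_line):
    """Return (read name, alignment kind, clipping style) for one SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, cigar = fields[0], int(fields[1]), fields[5]

    if flag & UNMAPPED:
        kind = "unmapped"
    elif flag & SUPPLEMENTARY:
        kind = "supplementary"
    elif flag & SECONDARY:
        kind = "secondary"
    else:
        kind = "primary"

    # H = hard clip (bases removed from SEQ), S = soft clip (bases kept).
    if "H" in cigar:
        clip = "hard"
    elif "S" in cigar:
        clip = "soft"
    else:
        clip = "none"
    return name, kind, clip


# Made-up records: one read split across two chromosomes.
primary_line = "read1\t0\tchr1\t100\t60\t60M40S\t*\t0\t0\t*\t*"
supplementary_line = "read1\t2048\tchr5\t5000\t60\t60H40M\t*\t0\t0\t*\t*"

if __name__ == "__main__":
    print(describe_record(primary_line))
    print(describe_record(supplementary_line))
```

Downstream tools that expect one line per read can keep only the records classified as primary.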

#### 3. Does BWA work on reference sequences longer than 4GB in total?

Yes. Since 0.6.x, all BWA algorithms work with a genome whose total length exceeds 4GB. However, an individual chromosome should not be longer than 2GB.

#### 4. Why can one read in a pair have high mapping quality while the other has zero?

This is correct. Mapping quality is assigned to each individual read, not to a read pair. It is possible that one read can be mapped unambiguously while its mate falls in a tandem repeat, so that the mate's exact position cannot be determined.
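Because MAPQ is stored per record, such pairs can be listed directly from the SAM output. The following sketch assumes well-formed paired-end records; `mate_mapqs` and `one_mate_ambiguous` are hypothetical helpers, the threshold `high=30` is an arbitrary choice for illustration, and the two records are made up.

```python
PAIRED, READ1 = 0x1, 0x40
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def mate_mapqs(sam_lines):
    """Map read name -> {1: MAPQ of first mate, 2: MAPQ of second mate}."""
    result = {}
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if not flag & PAIRED or flag & (SECONDARY | SUPPLEMENTARY):
            continue                       # keep one primary line per mate
        mate = 1 if flag & READ1 else 2    # assumes 0x40 or 0x80 is set
        result.setdefault(name, {})[mate] = mapq
    return result


def one_mate_ambiguous(sam_lines, high=30):
    """Names of pairs where one mate has MAPQ 0 and the other >= high."""
    return [
        name
        for name, q in mate_mapqs(sam_lines).items()
        if len(q) == 2 and min(q.values()) == 0 and max(q.values()) >= high
    ]


# Made-up pair: the second mate falls in a repeat and gets MAPQ 0.
records = [
    "@SQ\tSN:chr1\tLN:100000",
    "pairA\t99\tchr1\t1000\t60\t100M\t=\t1300\t400\t*\t*",
    "pairA\t147\tchr1\t1300\t0\t100M\t=\t1000\t-400\t*\t*",
]

if __name__ == "__main__":
    print(mate_mapqs(records))
    print(one_mate_ambiguous(records))
```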

#### 5. How can a BWA-backtrack alignment extend beyond the end of a chromosome?

Internally, BWA concatenates all reference sequences into one long sequence. A read may be mapped to the junction of two adjacent reference sequences. In this case, BWA-backtrack will flag the read as unmapped (0x4), but you will still see the position, CIGAR and all the tags. A similar issue may occur with BWA-SW alignments as well. BWA-MEM does not have this problem.
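Records of this kind can be spotted by comparing the alignment end, derived from POS and CIGAR, with the reference length from the `@SQ` header. The sketch below is an illustration under stated assumptions: `reference_span` and `overhangs_reference` are hypothetical helpers, and the reference length and SAM lines are made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")   # CIGAR operations that advance on the reference


def reference_span(cigar):
    """Number of reference bases covered by a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def overhangs_reference(sam_line, ref_lengths):
    """True if the record's alignment runs past the end of its reference."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, pos, cigar = fields[2], int(fields[3]), fields[5]
    if rname == "*" or cigar == "*":
        return False                       # no coordinates to check
    end = pos + reference_span(cigar) - 1  # 1-based, inclusive
    return end > ref_lengths[rname]


# Made-up example: chr1 is 1000bp long; the first read starts at 981
# with a 36bp match, so it would end at position 1016.
ref_lengths = {"chr1": 1000}
junction_read = "r1\t4\tchr1\t981\t0\t36M\t*\t0\t0\t*\t*"
normal_read = "r2\t0\tchr1\t100\t37\t36M\t*\t0\t0\t*\t*"

if __name__ == "__main__":
    print(overhangs_reference(junction_read, ref_lengths))
    print(overhangs_reference(normal_read, ref_lengths))
```

In practice `ref_lengths` would be filled from the `SN` and `LN` fields of the `@SQ` header lines.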