Merge branch 'dev'

This commit is contained in:
Heng Li 2014-10-21 00:55:19 -04:00
commit acf4922fe4
1 changed files with 23 additions and 20 deletions

View File

@ -3,7 +3,7 @@
Since version 0.7.11, BWA-MEM supports read mapping against a reference genome Since version 0.7.11, BWA-MEM supports read mapping against a reference genome
with long alternative haplotypes present in separate ALT contigs. To use the with long alternative haplotypes present in separate ALT contigs. To use the
ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the
SAM format and rename the file to ""*idxbase*.alt". For GRCh38, this alignment SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment
is available from the [BWA resource bundle for GRCh38][res]. is available from the [BWA resource bundle for GRCh38][res].
#### Option 1: Mapping to the official GRCh38 with ALT contigs #### Option 1: Mapping to the official GRCh38 with ALT contigs
@ -20,7 +20,6 @@ cp bwa-hs38-res/hs38d4.fa.alt hs38a.fa.alt
Perform mapping: Perform mapping:
```sh ```sh
bwa mem hs38a.fa read1.fq read2.fq \ bwa mem hs38a.fa read1.fq read2.fq \
| samblaster \
| bwa-hs38-res/k8-linux bwa-postalt.js hs38a.fa.alt \ | bwa-hs38-res/k8-linux bwa-postalt.js hs38a.fa.alt \
| samtools view -bS - > aln.unsrt.bam | samtools view -bS - > aln.unsrt.bam
``` ```
@ -38,15 +37,14 @@ cp bwa-hs38-res/hs38d4.fa.alt .
Perform mapping: Perform mapping:
```sh ```sh
bwa mem hs38d4.fa read1.fq read2.fq \ bwa mem hs38d4.fa read1.fq read2.fq \
| samblaster \
| bwa-hs38-res/k8-linux bwa-postalt.js -p postinfo hs38d4.fa.alt \ | bwa-hs38-res/k8-linux bwa-postalt.js -p postinfo hs38d4.fa.alt \
| samtools view -bS - > aln.unsrt.bam | samtools view -bS - > aln.unsrt.bam
``` ```
This command line generates `postinfo.ctw` which loosely evaluates the presence This command line generates `postinfo.ctw` which loosely evaluates the presence
of an ALT contig with an empirical score at the last column. of an ALT contig with an empirical score at the last column.
**If you are not interested in the way BWA-MEM performs ALT mapping, you can ***If you are not interested in the way BWA-MEM performs ALT mapping, you can
skip the rest of this documentation.** skip the rest of this documentation.***
## Background ## Background
@ -80,20 +78,19 @@ the ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*,
depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports
alignments and assigns mapQ following these two rules: alignments and assigns mapQ following these two rules:
* The original mapQ of a non-ALT hit is computed across non-ALT hits only. 1. The original mapQ of a non-ALT hit is computed across non-ALT hits only.
The reported mapQ of an ALT hit is computed across all hits. The reported mapQ of an ALT hit is computed across all hits.
* An ALT hit is only reported if its score is strictly better than all 2. If there are no non-ALT hits, the best ALT hit is outputted as the primary
overlapping non-ALT hits. A reported ALT hit is flagged with 0x800 alignment. If there are both ALT and non-ALT hits, non-ALT hits will be
(supplementary) unless there are no non-ALT hits. primary. ALT hits are reported as supplementary alignments (flag 0x800) only
if they are better than all overlapping non-ALT hits.
When option `-g FLOAT` is in use (which is the default), a third rule kicks in: In theory, non-ALT alignments from step 1 should be identical to alignments
against a reference genome with ALT contigs. In practice, the two types of
alignments may differ in rare cases due to seeding heuristics.
* The mapQ of a non-ALT hit is reduced to zero if its score is less than FLOAT If we don't care about ALT hits, we may skip postprocessing (step 2).
times the score of an overlapping ALT hit. In this case, the original mapQ is
moved to the `om` tag.
If we don't care about ALT hits, we may actually skip postprocessing (step 2).
Nonetheless, postprocessing is recommended as it improves mapQ and gives more Nonetheless, postprocessing is recommended as it improves mapQ and gives more
information about ALT hits. information about ALT hits.
@ -103,8 +100,9 @@ Postprocessing is done with a separate script `bwa-postalt.js`. It reads all
potential hits reported in the XA tag, lifts ALT hits to the chromosomal potential hits reported in the XA tag, lifts ALT hits to the chromosomal
positions using the ALT-to-ref alignment, groups them after lifting and then positions using the ALT-to-ref alignment, groups them after lifting and then
reassigns mapQ based on the best scoring hit in each group with all the hits in reassigns mapQ based on the best scoring hit in each group with all the hits in
a group get the same mapQ. Knowing the ALT-to-ref alignment, this script can a group get the same mapQ. Being aware of the ALT-to-ref alignment, this script
greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. can greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT
hits.
The script also measures the presence of each ALT contig. For a group of The script also measures the presence of each ALT contig. For a group of
overlapping ALT contigs c_1, ..., c_m, the weight for c_k equals `\frac{\sum_j overlapping ALT contigs c_1, ..., c_m, the weight for c_k equals `\frac{\sum_j
@ -134,6 +132,10 @@ In addition to decoy, we also put multiple alleles of HLA genes in
version 3.18.0. Script `bwa-postalt.js` also helps to genotype HLA genes, though version 3.18.0. Script `bwa-postalt.js` also helps to genotype HLA genes, though
not to high resolution for now. not to high resolution for now.
### Evaluating ALT Mapping
(To come later...)
## Problems and Future Development ## Problems and Future Development
There are some uncertainties about ALT mappings - we are not sure whether they There are some uncertainties about ALT mappings - we are not sure whether they
@ -141,9 +143,10 @@ help biological discovery and don't know the best way to analyze them. Without
clear demand from downstream analyses, it is very difficult to design the clear demand from downstream analyses, it is very difficult to design the
optimal mapping strategy. The current BWA-MEM method is just a start. If it optimal mapping strategy. The current BWA-MEM method is just a start. If it
turns out to be useful in research, we will probably rewrite bwa-postalt.js in C turns out to be useful in research, we will probably rewrite bwa-postalt.js in C
for performance; if not, we will try new designs. It is also possible that we for performance; if not, we may make changes. It is also possible that we might
may make breakthrough on the representation of multiple genomes, in which case, make breakthrough on the representation of multiple genomes, in which case, we
we can even get rid of ALT contigs once for all. can even get rid of ALT contigs for good.
[res]: https://sourceforge.net/projects/bio-bwa/files/ [res]: https://sourceforge.net/projects/bio-bwa/files/