Updated README for ALT mapping

This commit is contained in:
Heng Li 2014-11-17 13:20:23 -05:00
parent ad9d12c04d
commit cee8149b12
3 changed files with 388 additions and 54 deletions

View File

@ -1,57 +1,63 @@
## Getting Started
Since version 0.7.11, BWA-MEM supports read mapping against a reference genome
with long alternative haplotypes present in separate ALT contigs. To use the
ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the
SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment
is available from the [binary package of BWA][res].
```sh
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
# Download the bwa-0.7.11 binary package
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \
| gzip -dc | tar xf -
bwa.kit/run-gen-hs38d6 # download GRCh38 and write hs38d6.fa
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
bwa.kit/bwa index hs38d6.fa # create BWA index
# mapping
bwa.kit/run-bwamem hs38d6.fa read1.fq read2.fq | sh
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
```
In the final alignment, a read may be placed on the [primary assembly][grcdef]
and multiple overlapping ALT contigs at the same time (on multiple SAM lines).
Mapping quality (mapQ) is properly adjusted by the postprocessing script
`bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For
details, see the [Methods section](#methods).
This will generate the following files:
* `out.aln.sam.gz`: unsorted alignments with ALT-aware mapping quality. In this
file, one read may be placed on multiple overlapping ALT contigs at the same
time even if the read is mapped better to some contigs than others. This makes
it possible to analyze each contig independent of others.
* `out.hla.top`: best genotypes for HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1 genes.
* `out.hla.all`: other possible genotypes on the six HLA genes.
* `out.log.*`: bwa-mem, samblaster and HLA typing log files.
Note that `run-bwamem` only prints command lines but doesn't execute them. It
is advised to have a look at the command lines before passing them to `sh` for
actual execution.
## Background
GRCh38 consists of several components: chromosomal assembly, unlocalized contigs
(chromosome known but location unknown), unplaced contigs (chromosome unknown)
and ALT contigs (long clustered variations). The combination of the first three
components is called the *primary assembly*. You can find the more exact
definitions from the [GRC website][grcdef].
components is called the *primary assembly*. It is recommended to use the
complete primary assembly for all analyses. Using ALT contigs in read mapping is
tricky.
GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions.
However, sequences that are highly diverged from the primary assembly only
contribute a few million bp. Most subsequences of ALT contigs are nearly
GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp of the primary
assembly. However, sequences that are highly diverged from the primary assembly
only contribute a few million bp. Most subsequences of ALT contigs are nearly
identical to the primary assembly. If we align sequence reads to GRCh38+ALT
treating ALT equal to the primary assembly, we will get many reads with zero
mapping quality and lose variants on them. It is crucial to make the mapper
aware of ALTs.
blindly, we will get many additional reads with zero mapping quality and miss
variants on them. It is crucial to make mappers aware of ALTs.
BWA-MEM is designed to minimize the interference of ALT contigs such that on the
primary assembly, the ALT-aware alignment is highly similar to the alignment
without using ALT contigs in the index. This design choice makes it almost
always safe to map reads to GRCh38+ALT. Although we don't know yet how much
variations on ALT contigs contribute to phenotypes, we would not get the answer
without mapping large cohorts to these extra sequences. We hope our current
implementation encourages researchers to use ALT contigs soon and often.
BWA-MEM is ALT-aware. It essentially computes mapping quality across the
non-redundant content of the primary assembly plus the ALT contigs and is free
of the problem above.
## Methods
### Sequence alignment
As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and
postprocessing.
postprocessing. The `bwa.kit/run-bwamem` script performs the two steps when ALT
contigs are present. The following picture shows an example about how BWA-MEM
infers mapping quality and reports alignment after step 2:
![](https://raw.githubusercontent.com/lh3/bwa/dev/extras/alt-demo.png)
#### Step 1: BWA-MEM mapping
@ -65,11 +71,11 @@ alignments and assigns mapQ following these two rules:
2. If there are no non-ALT hits, the best ALT hit is outputted as the primary
alignment. If there are both ALT and non-ALT hits, non-ALT hits will be
primary. ALT hits are reported as supplementary alignments (flag 0x800) only
if they are better than all overlapping non-ALT hits.
primary and ALT hits be supplementary (SAM flag 0x800) if ALT hits are better
than the best overlapping non-ALT hits.
In theory, non-ALT alignments from step 1 should be identical to alignments
against a reference genome with ALT contigs. In practice, the two types of
against the reference genome with ALT contigs. In practice, the two types of
alignments may differ in rare cases due to seeding heuristics. When an ALT hit
is significantly better than non-ALT hits, BWA-MEM may miss seeds on the
non-ALT hits.
@ -102,32 +108,32 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
For a more complete reference genome, we compiled a new set of decoy sequences
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
The sequences are included in `hs38d4-extra.fa` from the [BWA resource bundle
for GRCh38][res].
The sequences are included in `hs38d6-extra.fa` from the [BWA binary
package][res].
In addition to decoy, we also put multiple alleles of HLA genes in
`hs38d4-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
version 3.18.0. Script `bwa-postalt.js` also helps to genotype HLA genes, though
not to high resolution for now.
`hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
version 3.18.0 and are used to collect reads sequenced from these genes.
### More on HLA typing
### HLA typing
It is [well known][hlalink] that HLA genes are associated with many autoimmunity
infectious diseases and drug responses. However, many HLA alleles are highly
diverged from the reference genome. If we map whole-genome shotgun (WGS) reads
to the reference only, many allele-informative will get lost. As a result, the
vast majority of WGS projects have ignored these important genes.
HLA genes are known to be associated with many autoimmune diseases, infectious
diseases and drug responses. They are among the most important genes but are
rarely studied by WGS projects due to the high sequence divergence between
HLA genes and the reference genome in these regions.
We recommend to include the genomic regions of classical HLA genes in the BWA
index. This way we will be able to get a more complete collection of reads
mapped to HLA. We can then isolate these reads with little computational cost
and type HLA genes with another program, such as [Warren et al (2012)][hla4],
[Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1]
or others from [this list][hlatools].
### Evaluating ALT Mapping
(Coming soon...)
By including the HLA gene regions in the reference assembly as ALT contigs, we
are able to effectively identify reads coming from these genes. We also provide
a pipeline, which is included in the [BWA binary package][res], to type the
several classic HLA genes. The pipeline is conceptually simple. It de novo
assembles sequence reads mapped to each gene, aligns exon sequences of each
allele to the assembled contigs and then finds the pairs of alleles that best
explain the contigs. In practice, however, the completeness of IMGT/HLA and
copy-number changes related to these genes are not so straightforward to
resolve. HLA typing may not always be successful. Users may also consider to use
other programs for typing such as [Warren et al (2012)][hla4], [Liu et al
(2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though
most of them are distributed under restrictive licenses.
## Problems and Future Development

View File

@ -0,0 +1,328 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>ActiveLayerIndex</key>
<integer>0</integer>
<key>ApplicationVersion</key>
<array>
<string>com.omnigroup.OmniGraffle</string>
<string>139.18.0.187838</string>
</array>
<key>AutoAdjust</key>
<true/>
<key>BackgroundGraphic</key>
<dict>
<key>Bounds</key>
<string>{{0, 0}, {576, 733}}</string>
<key>Class</key>
<string>SolidGraphic</string>
<key>ID</key>
<integer>2</integer>
<key>Style</key>
<dict>
<key>shadow</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
<key>stroke</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
</dict>
</dict>
<key>BaseZoom</key>
<integer>0</integer>
<key>CanvasOrigin</key>
<string>{0, 0}</string>
<key>ColumnAlign</key>
<integer>1</integer>
<key>ColumnSpacing</key>
<real>36</real>
<key>CreationDate</key>
<string>2014-11-17 16:51:42 +0000</string>
<key>Creator</key>
<string>Heng Li</string>
<key>DisplayScale</key>
<string>1 0/72 in = 1 0/72 in</string>
<key>GraphDocumentVersion</key>
<integer>8</integer>
<key>GraphicsList</key>
<array>
<dict>
<key>Bounds</key>
<string>{{35.699992179870605, 151.89999580383301}, {476, 224}}</string>
<key>Class</key>
<string>ShapedGraphic</string>
<key>FitText</key>
<string>YES</string>
<key>Flow</key>
<string>Resize</string>
<key>FontInfo</key>
<dict>
<key>Font</key>
<string>AndaleMono</string>
<key>Size</key>
<real>12</real>
</dict>
<key>ID</key>
<integer>28</integer>
<key>Shape</key>
<string>Rectangle</string>
<key>Style</key>
<dict>
<key>fill</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
<key>shadow</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
<key>stroke</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
</dict>
<key>Text</key>
<dict>
<key>Align</key>
<integer>0</integer>
<key>Pad</key>
<integer>0</integer>
<key>Text</key>
<string>{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210
\cocoascreenfonts1{\fonttbl\f0\fnil\fcharset0 Consolas;\f1\fnil\fcharset0 Consolas-Bold;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red127\green127\blue127;\red255\green0\blue0;
\red204\green204\blue204;\red0\green0\blue255;\red0\green128\blue0;\red255\green128\blue0;}
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
\f0\fs24 \cf2 Read: A\cf0 TCAGCATC\
\cf2 \
ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATCA
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf3 GAACTAGTCACAT\cf2 \
\cf3 |||||\cf5 (high div) \cf3 |||\cf5 (novel ins)\cf3 ||||||||||\cf2 \
Chromosome:\cf3 GCGTACATGATACGA
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf3 ATC-------------CTAGTCACATCGTAATCGA\
\cf2 \cf3 |||||||||||| |||||||\cf5 (novel ins) \cf3 ||||||||||\
\cf2 ALT ctg 2:\cf3 TGATACGA
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf3 ATCA
\f1\b \cf8 ATCgcCAgC
\f0\b0 \cf3 GAACTAGTCACAT\
\
\cf2 4 potential hits:
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf0 &gt;
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf0 &gt;
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf2 &gt;
\f1\b \cf8 ATCgcCAgC\
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
\f0\b0 \cf0 2 hit groups: \{
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf0 ,
\f1\b \cf8 ATCgcCAgC
\f0\b0 \cf2 \} and\cf0 \{
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf2 ,
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf2 \}\
\cf0 Hits considered in mapQ:
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf0 and
\f1\b \cf6 ATCgGCATC\
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
\f0\b0 \cf3 \
\cf2 In the output SAM:
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf2 as the primary SAM line with mapQ=0\
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf2 as a supplementary line with mapQ&gt;0\
\f1\b \cf8 ATCgcCAgC
\f0\b0 \cf2 as a supplementary line with mapQ&gt;0\
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf2 in an XA tag, not as a separate line}</string>
<key>VerticalPad</key>
<integer>0</integer>
</dict>
<key>Wrap</key>
<string>NO</string>
</dict>
</array>
<key>GridInfo</key>
<dict>
<key>GridSpacing</key>
<real>7.1999998092651367</real>
<key>MajorGridSpacing</key>
<integer>10</integer>
<key>SnapsToGrid</key>
<string>YES</string>
</dict>
<key>GuidesLocked</key>
<string>NO</string>
<key>GuidesVisible</key>
<string>YES</string>
<key>HPages</key>
<integer>1</integer>
<key>ImageCounter</key>
<integer>1</integer>
<key>KeepToScale</key>
<false/>
<key>Layers</key>
<array>
<dict>
<key>Lock</key>
<string>NO</string>
<key>Name</key>
<string>Layer 1</string>
<key>Print</key>
<string>YES</string>
<key>View</key>
<string>YES</string>
</dict>
</array>
<key>LayoutInfo</key>
<dict>
<key>Animate</key>
<string>NO</string>
<key>circoMinDist</key>
<real>18</real>
<key>circoSeparation</key>
<real>0.0</real>
<key>layoutEngine</key>
<string>dot</string>
<key>neatoSeparation</key>
<real>0.0</real>
<key>twopiSeparation</key>
<real>0.0</real>
</dict>
<key>LinksVisible</key>
<string>NO</string>
<key>MagnetsVisible</key>
<string>NO</string>
<key>MasterSheets</key>
<array/>
<key>ModificationDate</key>
<string>2014-11-17 18:01:49 +0000</string>
<key>Modifier</key>
<string>Heng Li</string>
<key>NotesVisible</key>
<string>NO</string>
<key>Orientation</key>
<integer>2</integer>
<key>OriginVisible</key>
<string>NO</string>
<key>PageBreaks</key>
<string>YES</string>
<key>PrintInfo</key>
<dict>
<key>NSBottomMargin</key>
<array>
<string>float</string>
<string>41</string>
</array>
<key>NSHorizonalPagination</key>
<array>
<string>coded</string>
<string>BAtzdHJlYW10eXBlZIHoA4QBQISEhAhOU051bWJlcgCEhAdOU1ZhbHVlAISECE5TT2JqZWN0AIWEASqEhAFxlwCG</string>
</array>
<key>NSLeftMargin</key>
<array>
<string>float</string>
<string>18</string>
</array>
<key>NSPaperSize</key>
<array>
<string>size</string>
<string>{612, 792}</string>
</array>
<key>NSPrintReverseOrientation</key>
<array>
<string>int</string>
<string>0</string>
</array>
<key>NSRightMargin</key>
<array>
<string>float</string>
<string>18</string>
</array>
<key>NSTopMargin</key>
<array>
<string>float</string>
<string>18</string>
</array>
</dict>
<key>PrintOnePage</key>
<false/>
<key>ReadOnly</key>
<string>NO</string>
<key>RowAlign</key>
<integer>1</integer>
<key>RowSpacing</key>
<real>36</real>
<key>SheetTitle</key>
<string>Canvas 1</string>
<key>SmartAlignmentGuidesActive</key>
<string>YES</string>
<key>SmartDistanceGuidesActive</key>
<string>YES</string>
<key>UniqueID</key>
<integer>1</integer>
<key>UseEntirePage</key>
<false/>
<key>VPages</key>
<integer>1</integer>
<key>WindowInfo</key>
<dict>
<key>CurrentSheet</key>
<integer>0</integer>
<key>ExpandedCanvases</key>
<array>
<dict>
<key>name</key>
<string>Canvas 1</string>
</dict>
</array>
<key>Frame</key>
<string>{{367, 6}, {710, 872}}</string>
<key>ListView</key>
<true/>
<key>OutlineWidth</key>
<integer>142</integer>
<key>RightSidebar</key>
<false/>
<key>ShowRuler</key>
<true/>
<key>Sidebar</key>
<true/>
<key>SidebarWidth</key>
<integer>120</integer>
<key>VisibleRegion</key>
<string>{{0, 0}, {575, 733}}</string>
<key>Zoom</key>
<real>1</real>
<key>ZoomValues</key>
<array>
<array>
<string>Canvas 1</string>
<real>1</real>
<real>1</real>
</array>
</array>
</dict>
</dict>
</plist>

BIN
extras/alt-demo.png 100644

Binary file not shown.

After

Width:  |  Height:  |  Size: 45 KiB