Updated README for ALT mapping

This commit is contained in:
Heng Li 2014-11-17 13:20:23 -05:00
parent ad9d12c04d
commit cee8149b12
3 changed files with 388 additions and 54 deletions

@ -1,57 +1,63 @@
## Getting Started ## Getting Started
Since version 0.7.11, BWA-MEM supports read mapping against a reference genome
with long alternative haplotypes present in separate ALT contigs. To use the
ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the
SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment
is available from the [binary package of BWA][res].
```sh ```sh
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index # Download the bwa-0.7.11 binary package
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \ wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \
| gzip -dc | tar xf - | gzip -dc | tar xf -
bwa.kit/run-gen-hs38d6 # download GRCh38 and write hs38d6.fa # Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
bwa.kit/bwa index hs38d6.fa # create BWA index bwa.kit/bwa index hs38d6.fa # create BWA index
# mapping # mapping
bwa.kit/run-bwamem hs38d6.fa read1.fq read2.fq | sh bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
``` ```
In the final alignment, a read may be placed on the [primary assembly][grcdef] This will generate the following files:
and multiple overlapping ALT contigs at the same time (on multiple SAM lines).
Mapping quality (mapQ) is properly adjusted by the postprocessing script * `out.aln.sam.gz`: unsorted alignments with ALT-aware mapping quality. In this
`bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For file, one read may be placed on multiple overlapping ALT contigs at the same
details, see the [Methods section](#methods). time even if the read is mapped better to some contigs than others. This makes
it possible to analyze each contig independent of others.
* `out.hla.top`: best genotypes for HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1 genes.
* `out.hla.all`: other possible genotypes on the six HLA genes.
* `out.log.*`: bwa-mem, samblaster and HLA typing log files.
Note that `run-bwamem` only prints command lines but doesn't execute them. It
is advised to have a look at the command lines before passing them to `sh` for
actual execution.
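
One way to follow the advice above is to save the generated command lines to a file, read them, and only then run them. The sketch below is a hedged illustration, not part of the BWA package: `save_cmds` and the file name `out.cmd.sh` are hypothetical names chosen here, and the usage lines assume the same input files as the example above.

```sh
# Hypothetical helper: run a command-line generator and store what it
# prints in a file, so the commands can be reviewed before execution.
save_cmds() {
    out_file=$1; shift
    # "$@" is the generator and its arguments; nothing is executed by sh yet
    "$@" > "$out_file" || return 1
    printf 'wrote %s line(s) to %s\n' "$(wc -l < "$out_file" | tr -d ' ')" "$out_file"
}

# Usage (assumes the files from the example above exist):
#   save_cmds out.cmd.sh bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq
#   less out.cmd.sh     # inspect the command lines
#   sh out.cmd.sh       # execute them once they look right
```

Keeping the commands in a file also leaves a record of exactly what was run for a given output prefix.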
## Background ## Background
GRCh38 consists of several components: chromosomal assembly, unlocalized contigs GRCh38 consists of several components: chromosomal assembly, unlocalized contigs
(chromosome known but location unknown), unplaced contigs (chromosome unknown) (chromosome known but location unknown), unplaced contigs (chromosome unknown)
and ALT contigs (long clustered variations). The combination of the first three and ALT contigs (long clustered variations). The combination of the first three
components is called the *primary assembly*. You can find the more exact components is called the *primary assembly*. It is recommended to use the
definitions from the [GRC website][grcdef]. complete primary assembly for all analyses. Using ALT contigs in read mapping is
tricky.
GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions. GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp of the primary
However, sequences that are highly diverged from the primary assembly only assembly. However, sequences that are highly diverged from the primary assembly
contribute a few million bp. Most subsequences of ALT contigs are nearly only contribute a few million bp. Most subsequences of ALT contigs are nearly
identical to the primary assembly. If we align sequence reads to GRCh38+ALT identical to the primary assembly. If we align sequence reads to GRCh38+ALT
treating ALT equal to the primary assembly, we will get many reads with zero blindly, we will get many additional reads with zero mapping quality and miss
mapping quality and lose variants on them. It is crucial to make the mapper variants on them. It is crucial to make mappers aware of ALTs.
aware of ALTs.
BWA-MEM is designed to minimize the interference of ALT contigs such that on the BWA-MEM is ALT-aware. It essentially computes mapping quality across the
primary assembly, the ALT-aware alignment is highly similar to the alignment non-redundant content of the primary assembly plus the ALT contigs and is free
without using ALT contigs in the index. This design choice makes it almost of the problem above.
always safe to map reads to GRCh38+ALT. Although we don't know yet how much
variations on ALT contigs contribute to phenotypes, we would not get the answer
without mapping large cohorts to these extra sequences. We hope our current
implementation encourages researchers to use ALT contigs soon and often.
## Methods ## Methods
### Sequence alignment ### Sequence alignment
As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and
postprocessing. postprocessing. The `bwa.kit/run-bwamem` script performs the two steps when ALT
contigs are present. The following picture shows an example of how BWA-MEM
infers mapping quality and reports alignment after step 2:
![](https://raw.githubusercontent.com/lh3/bwa/dev/extras/alt-demo.png)
#### Step 1: BWA-MEM mapping #### Step 1: BWA-MEM mapping
@ -65,11 +71,11 @@ alignments and assigns mapQ following these two rules:
2. If there are no non-ALT hits, the best ALT hit is outputted as the primary 2. If there are no non-ALT hits, the best ALT hit is outputted as the primary
alignment. If there are both ALT and non-ALT hits, non-ALT hits will be alignment. If there are both ALT and non-ALT hits, non-ALT hits will be
primary. ALT hits are reported as supplementary alignments (flag 0x800) only primary and ALT hits be supplementary (SAM flag 0x800) if ALT hits are better
if they are better than all overlapping non-ALT hits. than the best overlapping non-ALT hits.
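
Rule 2 means that ALT hits reported as separate lines can be recognized by the supplementary bit (0x800, decimal 2048) in the SAM FLAG column. The following is a minimal sketch for checking that bit with plain `sh` and `awk`; the function names are made up for this example, and it assumes uncompressed, tab-delimited SAM text with the FLAG in column 2.

```sh
# Succeeds (exit status 0) if the given decimal FLAG has bit 0x800 set.
is_supplementary() {
    [ $(( $1 & 0x800 )) -ne 0 ]
}

# Counts SAM records whose FLAG has bit 0x800 set.
# Header lines (starting with '@') are skipped. POSIX awk has no bitwise
# operators, so the bit is tested with integer division by 2048.
count_supplementary() {
    awk -F'\t' '!/^@/ && int($2 / 2048) % 2 == 1 { n++ } END { print n + 0 }' "$@"
}

# Usage (assumes out.aln.sam.gz from the Getting Started example):
#   gzip -dc out.aln.sam.gz | count_supplementary
```

Note that 0x800 marks supplementary alignments in general, so this count is not limited to ALT hits; restricting it to ALT contigs would additionally require checking the reference name in column 3.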
In theory, non-ALT alignments from step 1 should be identical to alignments In theory, non-ALT alignments from step 1 should be identical to alignments
against a reference genome with ALT contigs. In practice, the two types of against the reference genome with ALT contigs. In practice, the two types of
alignments may differ in rare cases due to seeding heuristics. When an ALT hit alignments may differ in rare cases due to seeding heuristics. When an ALT hit
is significantly better than non-ALT hits, BWA-MEM may miss seeds on the is significantly better than non-ALT hits, BWA-MEM may miss seeds on the
non-ALT hits. non-ALT hits.
@ -102,32 +108,32 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
For a more complete reference genome, we compiled a new set of decoy sequences For a more complete reference genome, we compiled a new set of decoy sequences
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples. from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
The sequences are included in `hs38d4-extra.fa` from the [BWA resource bundle The sequences are included in `hs38d6-extra.fa` from the [BWA binary
for GRCh38][res]. package][res].
In addition to decoy, we also put multiple alleles of HLA genes in In addition to decoy, we also put multiple alleles of HLA genes in
`hs38d4-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb], `hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
version 3.18.0. Script `bwa-postalt.js` also helps to genotype HLA genes, though version 3.18.0 and are used to collect reads sequenced from these genes.
not to high resolution for now.
### More on HLA typing ### HLA typing
It is [well known][hlalink] that HLA genes are associated with many autoimmunity HLA genes are known to be associated with many autoimmune diseases, infectious
infectious diseases and drug responses. However, many HLA alleles are highly diseases and drug responses. They are among the most important genes but are
diverged from the reference genome. If we map whole-genome shotgun (WGS) reads rarely studied by WGS projects due to the high sequence divergence between
to the reference only, many allele-informative will get lost. As a result, the HLA genes and the reference genome in these regions.
vast majority of WGS projects have ignored these important genes.
We recommend to include the genomic regions of classical HLA genes in the BWA By including the HLA gene regions in the reference assembly as ALT contigs, we
index. This way we will be able to get a more complete collection of reads are able to effectively identify reads coming from these genes. We also provide
mapped to HLA. We can then isolate these reads with little computational cost a pipeline, which is included in the [BWA binary package][res], to type the
and type HLA genes with another program, such as [Warren et al (2012)][hla4], several classic HLA genes. The pipeline is conceptually simple. It de novo
[Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1] assembles sequence reads mapped to each gene, aligns exon sequences of each
or others from [this list][hlatools]. allele to the assembled contigs and then finds the pairs of alleles that best
explain the contigs. In practice, however, the completeness of IMGT/HLA and
### Evaluating ALT Mapping copy-number changes related to these genes are not so straightforward to
resolve. HLA typing may not always be successful. Users may also consider using
(Coming soon...) other programs for typing such as [Warren et al (2012)][hla4], [Liu et al
(2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though
most of them are distributed under restrictive licenses.
## Problems and Future Development ## Problems and Future Development

@ -0,0 +1,328 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>ActiveLayerIndex</key>
<integer>0</integer>
<key>ApplicationVersion</key>
<array>
<string>com.omnigroup.OmniGraffle</string>
<string>139.18.0.187838</string>
</array>
<key>AutoAdjust</key>
<true/>
<key>BackgroundGraphic</key>
<dict>
<key>Bounds</key>
<string>{{0, 0}, {576, 733}}</string>
<key>Class</key>
<string>SolidGraphic</string>
<key>ID</key>
<integer>2</integer>
<key>Style</key>
<dict>
<key>shadow</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
<key>stroke</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
</dict>
</dict>
<key>BaseZoom</key>
<integer>0</integer>
<key>CanvasOrigin</key>
<string>{0, 0}</string>
<key>ColumnAlign</key>
<integer>1</integer>
<key>ColumnSpacing</key>
<real>36</real>
<key>CreationDate</key>
<string>2014-11-17 16:51:42 +0000</string>
<key>Creator</key>
<string>Heng Li</string>
<key>DisplayScale</key>
<string>1 0/72 in = 1 0/72 in</string>
<key>GraphDocumentVersion</key>
<integer>8</integer>
<key>GraphicsList</key>
<array>
<dict>
<key>Bounds</key>
<string>{{35.699992179870605, 151.89999580383301}, {476, 224}}</string>
<key>Class</key>
<string>ShapedGraphic</string>
<key>FitText</key>
<string>YES</string>
<key>Flow</key>
<string>Resize</string>
<key>FontInfo</key>
<dict>
<key>Font</key>
<string>AndaleMono</string>
<key>Size</key>
<real>12</real>
</dict>
<key>ID</key>
<integer>28</integer>
<key>Shape</key>
<string>Rectangle</string>
<key>Style</key>
<dict>
<key>fill</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
<key>shadow</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
<key>stroke</key>
<dict>
<key>Draws</key>
<string>NO</string>
</dict>
</dict>
<key>Text</key>
<dict>
<key>Align</key>
<integer>0</integer>
<key>Pad</key>
<integer>0</integer>
<key>Text</key>
<string>{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210
\cocoascreenfonts1{\fonttbl\f0\fnil\fcharset0 Consolas;\f1\fnil\fcharset0 Consolas-Bold;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red127\green127\blue127;\red255\green0\blue0;
\red204\green204\blue204;\red0\green0\blue255;\red0\green128\blue0;\red255\green128\blue0;}
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
\f0\fs24 \cf2 Read: A\cf0 TCAGCATC\
\cf2 \
ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATCA
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf3 GAACTAGTCACAT\cf2 \
\cf3 |||||\cf5 (high div) \cf3 |||\cf5 (novel ins)\cf3 ||||||||||\cf2 \
Chromosome:\cf3 GCGTACATGATACGA
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf3 ATC-------------CTAGTCACATCGTAATCGA\
\cf2 \cf3 |||||||||||| |||||||\cf5 (novel ins) \cf3 ||||||||||\
\cf2 ALT ctg 2:\cf3 TGATACGA
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf3 ATCA
\f1\b \cf8 ATCgcCAgC
\f0\b0 \cf3 GAACTAGTCACAT\
\
\cf2 4 potential hits:
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf0 &gt;
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf0 &gt;
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf2 &gt;
\f1\b \cf8 ATCgcCAgC\
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
\f0\b0 \cf0 2 hit groups: \{
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf0 ,
\f1\b \cf8 ATCgcCAgC
\f0\b0 \cf2 \} and\cf0 \{
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf2 ,
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf2 \}\
\cf0 Hits considered in mapQ:
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf0 and
\f1\b \cf6 ATCgGCATC\
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
\f0\b0 \cf3 \
\cf2 In the output SAM:
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf2 as the primary SAM line with mapQ=0\
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf2 as a supplementary line with mapQ&gt;0\
\f1\b \cf8 ATCgcCAgC
\f0\b0 \cf2 as a supplementary line with mapQ&gt;0\
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf2 in an XA tag, not as a separate line}</string>
<key>VerticalPad</key>
<integer>0</integer>
</dict>
<key>Wrap</key>
<string>NO</string>
</dict>
</array>
<key>GridInfo</key>
<dict>
<key>GridSpacing</key>
<real>7.1999998092651367</real>
<key>MajorGridSpacing</key>
<integer>10</integer>
<key>SnapsToGrid</key>
<string>YES</string>
</dict>
<key>GuidesLocked</key>
<string>NO</string>
<key>GuidesVisible</key>
<string>YES</string>
<key>HPages</key>
<integer>1</integer>
<key>ImageCounter</key>
<integer>1</integer>
<key>KeepToScale</key>
<false/>
<key>Layers</key>
<array>
<dict>
<key>Lock</key>
<string>NO</string>
<key>Name</key>
<string>Layer 1</string>
<key>Print</key>
<string>YES</string>
<key>View</key>
<string>YES</string>
</dict>
</array>
<key>LayoutInfo</key>
<dict>
<key>Animate</key>
<string>NO</string>
<key>circoMinDist</key>
<real>18</real>
<key>circoSeparation</key>
<real>0.0</real>
<key>layoutEngine</key>
<string>dot</string>
<key>neatoSeparation</key>
<real>0.0</real>
<key>twopiSeparation</key>
<real>0.0</real>
</dict>
<key>LinksVisible</key>
<string>NO</string>
<key>MagnetsVisible</key>
<string>NO</string>
<key>MasterSheets</key>
<array/>
<key>ModificationDate</key>
<string>2014-11-17 18:01:49 +0000</string>
<key>Modifier</key>
<string>Heng Li</string>
<key>NotesVisible</key>
<string>NO</string>
<key>Orientation</key>
<integer>2</integer>
<key>OriginVisible</key>
<string>NO</string>
<key>PageBreaks</key>
<string>YES</string>
<key>PrintInfo</key>
<dict>
<key>NSBottomMargin</key>
<array>
<string>float</string>
<string>41</string>
</array>
<key>NSHorizonalPagination</key>
<array>
<string>coded</string>
<string>BAtzdHJlYW10eXBlZIHoA4QBQISEhAhOU051bWJlcgCEhAdOU1ZhbHVlAISECE5TT2JqZWN0AIWEASqEhAFxlwCG</string>
</array>
<key>NSLeftMargin</key>
<array>
<string>float</string>
<string>18</string>
</array>
<key>NSPaperSize</key>
<array>
<string>size</string>
<string>{612, 792}</string>
</array>
<key>NSPrintReverseOrientation</key>
<array>
<string>int</string>
<string>0</string>
</array>
<key>NSRightMargin</key>
<array>
<string>float</string>
<string>18</string>
</array>
<key>NSTopMargin</key>
<array>
<string>float</string>
<string>18</string>
</array>
</dict>
<key>PrintOnePage</key>
<false/>
<key>ReadOnly</key>
<string>NO</string>
<key>RowAlign</key>
<integer>1</integer>
<key>RowSpacing</key>
<real>36</real>
<key>SheetTitle</key>
<string>Canvas 1</string>
<key>SmartAlignmentGuidesActive</key>
<string>YES</string>
<key>SmartDistanceGuidesActive</key>
<string>YES</string>
<key>UniqueID</key>
<integer>1</integer>
<key>UseEntirePage</key>
<false/>
<key>VPages</key>
<integer>1</integer>
<key>WindowInfo</key>
<dict>
<key>CurrentSheet</key>
<integer>0</integer>
<key>ExpandedCanvases</key>
<array>
<dict>
<key>name</key>
<string>Canvas 1</string>
</dict>
</array>
<key>Frame</key>
<string>{{367, 6}, {710, 872}}</string>
<key>ListView</key>
<true/>
<key>OutlineWidth</key>
<integer>142</integer>
<key>RightSidebar</key>
<false/>
<key>ShowRuler</key>
<true/>
<key>Sidebar</key>
<true/>
<key>SidebarWidth</key>
<integer>120</integer>
<key>VisibleRegion</key>
<string>{{0, 0}, {575, 733}}</string>
<key>Zoom</key>
<real>1</real>
<key>ZoomValues</key>
<array>
<array>
<string>Canvas 1</string>
<real>1</real>
<real>1</real>
</array>
</array>
</dict>
</dict>
</plist>

BIN
extras/alt-demo.png 100644

Binary file not shown.