Updated README for ALT mapping
parent
ad9d12c04d
commit
cee8149b12
114
README-alt.md
|
|
@@ -1,57 +1,63 @@
|
|||
## Getting Started
|
||||
|
||||
Since version 0.7.11, BWA-MEM supports read mapping against a reference genome
|
||||
with long alternative haplotypes present in separate ALT contigs. To use the
|
||||
ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the
|
||||
SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment
|
||||
is available from the [binary package of BWA][res].
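
As a minimal sketch of the naming convention above, the shell function below checks whether a non-empty `.alt` file sits next to an index base. The function name `has_alt_file` and the file names in the usage note are illustrative assumptions, not part of BWA.

```sh
# Report whether an ALT-to-reference alignment file accompanies a BWA index.
# "$1" is the index base given to "bwa index" and "bwa mem", e.g. hs38d6.fa.
# has_alt_file is a hypothetical helper name used only for this sketch.
has_alt_file() {
    idxbase=$1
    if [ -s "$idxbase.alt" ]; then
        echo "found $idxbase.alt"
    else
        echo "$idxbase.alt is missing or empty"
        return 1
    fi
}
```

For example, after copying an ALT-to-reference SAM file to `hs38d6.fa.alt`, `has_alt_file hs38d6.fa` should report that the file was found.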
|
||||
|
||||
```sh
|
||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||
# Download the bwa-0.7.11 binary package
|
||||
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
||||
| bzip2 -dc | tar xf -
|
||||
bwa.kit/run-gen-hs38d6 # download GRCh38 and write hs38d6.fa
|
||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
|
||||
bwa.kit/bwa index hs38d6.fa # create BWA index
|
||||
# mapping
|
||||
bwa.kit/run-bwamem hs38d6.fa read1.fq read2.fq | sh
|
||||
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
|
||||
```
|
||||
|
||||
In the final alignment, a read may be placed on the [primary assembly][grcdef]
|
||||
and multiple overlapping ALT contigs at the same time (on multiple SAM lines).
|
||||
Mapping quality (mapQ) is properly adjusted by the postprocessing script
|
||||
`bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For
|
||||
details, see the [Methods section](#methods).
|
||||
This will generate the following files:
|
||||
|
||||
* `out.aln.sam.gz`: unsorted alignments with ALT-aware mapping quality. In this
|
||||
file, one read may be placed on multiple overlapping ALT contigs at the same
|
||||
time even if the read is mapped better to some contigs than others. This makes
|
||||
it possible to analyze each contig independently of others.
|
||||
|
||||
* `out.hla.top`: best genotypes for HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1 genes.
|
||||
|
||||
* `out.hla.all`: other possible genotypes on the six HLA genes.
|
||||
|
||||
* `out.log.*`: bwa-mem, samblaster and HLA typing log files.
|
||||
|
||||
Note that `run-bwamem` only prints command lines but doesn't execute them. It
|
||||
is advised to have a look at the command lines before passing them to `sh` for
|
||||
actual execution.
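
One way to follow this advice is to save the generated command lines to a file, read them, and only then run them. The sketch below assumes a hypothetical helper name, `preview_cmds`, and a hypothetical script name, `out.cmd.sh`; neither is part of the BWA package.

```sh
# Save generated command lines to a file and print them for review.
# With a second argument "run", also execute the saved script.
# Command lines are read from standard input.
# preview_cmds is a hypothetical helper name used only for this sketch.
preview_cmds() {
    script=$1
    cat > "$script"       # store the command lines
    cat "$script" >&2     # show them on stderr for inspection
    if [ "${2:-}" = run ]; then
        sh "$script"
    fi
}
```

A typical use would be `bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | preview_cmds out.cmd.sh`, followed by `sh out.cmd.sh` once the commands look right.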
|
||||
|
||||
## Background
|
||||
|
||||
GRCh38 consists of several components: chromosomal assembly, unlocalized contigs
|
||||
(chromosome known but location unknown), unplaced contigs (chromosome unknown)
|
||||
and ALT contigs (long clustered variations). The combination of the first three
|
||||
components is called the *primary assembly*. You can find more precise
|
||||
definitions from the [GRC website][grcdef].
|
||||
components is called the *primary assembly*. It is recommended to use the
|
||||
complete primary assembly for all analyses. Using ALT contigs in read mapping is
|
||||
tricky.
|
||||
|
||||
GRCh38 ALT contigs total 109Mbp in length, spanning 60Mbp of genomic regions.
|
||||
However, sequences that are highly diverged from the primary assembly only
|
||||
contribute a few million bp. Most subsequences of ALT contigs are nearly
|
||||
GRCh38 ALT contigs total 109Mbp in length, spanning 60Mbp of the primary
|
||||
assembly. However, sequences that are highly diverged from the primary assembly
|
||||
only contribute a few million bp. Most subsequences of ALT contigs are nearly
|
||||
identical to the primary assembly. If we align sequence reads to GRCh38+ALT
|
||||
treating ALT equal to the primary assembly, we will get many reads with zero
|
||||
mapping quality and lose variants on them. It is crucial to make the mapper
|
||||
aware of ALTs.
|
||||
blindly, we will get many additional reads with zero mapping quality and miss
|
||||
variants on them. It is crucial to make mappers aware of ALTs.
|
||||
|
||||
BWA-MEM is designed to minimize the interference of ALT contigs such that on the
|
||||
primary assembly, the ALT-aware alignment is highly similar to the alignment
|
||||
without using ALT contigs in the index. This design choice makes it almost
|
||||
always safe to map reads to GRCh38+ALT. Although we don't know yet how much
|
||||
variation on ALT contigs contributes to phenotypes, we would not get the answer
|
||||
without mapping large cohorts to these extra sequences. We hope our current
|
||||
implementation encourages researchers to use ALT contigs soon and often.
|
||||
BWA-MEM is ALT-aware. It essentially computes mapping quality across the
|
||||
non-redundant content of the primary assembly plus the ALT contigs and is free
|
||||
of the problem above.
|
||||
|
||||
## Methods
|
||||
|
||||
### Sequence alignment
|
||||
|
||||
As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and
|
||||
postprocessing.
|
||||
postprocessing. The `bwa.kit/run-bwamem` script performs the two steps when ALT
|
||||
contigs are present. The following picture shows an example of how BWA-MEM
|
||||
infers mapping quality and reports alignment after step 2:
|
||||
|
||||

|
||||
|
||||
#### Step 1: BWA-MEM mapping
|
||||
|
||||
|
|
@@ -65,11 +71,11 @@ alignments and assigns mapQ following these two rules:
|
|||
|
||||
2. If there are no non-ALT hits, the best ALT hit is outputted as the primary
|
||||
alignment. If there are both ALT and non-ALT hits, non-ALT hits will be
|
||||
primary. ALT hits are reported as supplementary alignments (flag 0x800) only
|
||||
if they are better than all overlapping non-ALT hits.
|
||||
primary and ALT hits will be supplementary (SAM flag 0x800) if ALT hits are better
|
||||
than the best overlapping non-ALT hits.
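
To see how these rules play out in an output file, one can tally SAM records by supplementary flag and contig type. This is a rough sketch: the function name `tally_alt_hits` is hypothetical, and it assumes ALT contig names end in `_alt`, which may not hold for every reference build.

```sh
# Count SAM records by (supplementary or not) and (ALT or non-ALT contig).
# Assumption: ALT contig names end in "_alt"; adjust the pattern otherwise.
# tally_alt_hits is a hypothetical helper name used only for this sketch.
tally_alt_hits() {
    awk -F'\t' '
        /^@/ { next }                           # skip SAM header lines
        {
            supp = (int($2 / 2048) % 2 == 1)    # test SAM flag bit 0x800
            alt  = ($3 ~ /_alt$/)               # reference name looks like an ALT contig
            key  = (supp ? "supp" : "non-supp") "\t" (alt ? "ALT" : "non-ALT")
            n[key]++
        }
        END { for (k in n) print k "\t" n[k] }
    ' | sort
}
```

It reads SAM on standard input, for example `gzip -dc out.aln.sam.gz | tally_alt_hits`. The flag test uses integer arithmetic rather than a bitwise function so that it works in any POSIX awk.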
|
||||
|
||||
In theory, non-ALT alignments from step 1 should be identical to alignments
|
||||
against a reference genome with ALT contigs. In practice, the two types of
|
||||
against the reference genome without ALT contigs. In practice, the two types of
|
||||
alignments may differ in rare cases due to seeding heuristics. When an ALT hit
|
||||
is significantly better than non-ALT hits, BWA-MEM may miss seeds on the
|
||||
non-ALT hits.
|
||||
|
|
@@ -102,32 +108,32 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
|
|||
|
||||
For a more complete reference genome, we compiled a new set of decoy sequences
|
||||
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
|
||||
The sequences are included in `hs38d4-extra.fa` from the [BWA resource bundle
|
||||
for GRCh38][res].
|
||||
The sequences are included in `hs38d6-extra.fa` from the [BWA binary
|
||||
package][res].
|
||||
|
||||
In addition to decoy, we also put multiple alleles of HLA genes in
|
||||
`hs38d4-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
||||
version 3.18.0. Script `bwa-postalt.js` also helps to genotype HLA genes, though
|
||||
not to high resolution for now.
|
||||
`hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
||||
version 3.18.0 and are used to collect reads sequenced from these genes.
|
||||
|
||||
### More on HLA typing
|
||||
### HLA typing
|
||||
|
||||
It is [well known][hlalink] that HLA genes are associated with many autoimmune and
|
||||
infectious diseases and drug responses. However, many HLA alleles are highly
|
||||
diverged from the reference genome. If we map whole-genome shotgun (WGS) reads
|
||||
to the reference only, many allele-informative reads will get lost. As a result, the
|
||||
vast majority of WGS projects have ignored these important genes.
|
||||
HLA genes are known to be associated with many autoimmune diseases, infectious
|
||||
diseases and drug responses. They are among the most important genes but are
|
||||
rarely studied by WGS projects due to the high sequence divergence between
|
||||
HLA genes and the reference genome in these regions.
|
||||
|
||||
We recommend including the genomic regions of classical HLA genes in the BWA
|
||||
index. This way we will be able to get a more complete collection of reads
|
||||
mapped to HLA. We can then isolate these reads with little computational cost
|
||||
and type HLA genes with another program, such as [Warren et al (2012)][hla4],
|
||||
[Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1]
|
||||
or others from [this list][hlatools].
|
||||
|
||||
### Evaluating ALT Mapping
|
||||
|
||||
(Coming soon...)
|
||||
By including the HLA gene regions in the reference assembly as ALT contigs, we
|
||||
are able to effectively identify reads coming from these genes. We also provide
|
||||
a pipeline, which is included in the [BWA binary package][res], to type the
|
||||
several classic HLA genes. The pipeline is conceptually simple. It de novo
|
||||
assembles sequence reads mapped to each gene, aligns exon sequences of each
|
||||
allele to the assembled contigs and then finds the pairs of alleles that best
|
||||
explain the contigs. In practice, however, the completeness of IMGT/HLA and
|
||||
copy-number changes related to these genes are not so straightforward to
|
||||
resolve. HLA typing may not always be successful. Users may also consider using
|
||||
other programs for typing such as [Warren et al (2012)][hla4], [Liu et al
|
||||
(2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though
|
||||
most of them are distributed under restrictive licenses.
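
For users who want to pass HLA reads to another typing program, a simple filter on the reference name may be enough to isolate them. The sketch below assumes that HLA contig names start with `HLA-`; the function name `extract_hla_reads` and the output file name in the usage note are hypothetical.

```sh
# Keep the SAM header plus records aligned to HLA contigs.
# Assumption: HLA contig names start with "HLA-"; adjust the pattern otherwise.
# extract_hla_reads is a hypothetical helper name used only for this sketch.
extract_hla_reads() {
    awk -F'\t' '/^@/ || $3 ~ /^HLA-/'
}
```

For example, `gzip -dc out.aln.sam.gz | extract_hla_reads > out.hla.sam` would write the header and the HLA-mapped records to a separate file.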
|
||||
|
||||
## Problems and Future Development
|
||||
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,328 @@
|
|||
<?xml version="1.0" encoding="UTF-8"?>
|
||||
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
|
||||
<plist version="1.0">
|
||||
<dict>
|
||||
<key>ActiveLayerIndex</key>
|
||||
<integer>0</integer>
|
||||
<key>ApplicationVersion</key>
|
||||
<array>
|
||||
<string>com.omnigroup.OmniGraffle</string>
|
||||
<string>139.18.0.187838</string>
|
||||
</array>
|
||||
<key>AutoAdjust</key>
|
||||
<true/>
|
||||
<key>BackgroundGraphic</key>
|
||||
<dict>
|
||||
<key>Bounds</key>
|
||||
<string>{{0, 0}, {576, 733}}</string>
|
||||
<key>Class</key>
|
||||
<string>SolidGraphic</string>
|
||||
<key>ID</key>
|
||||
<integer>2</integer>
|
||||
<key>Style</key>
|
||||
<dict>
|
||||
<key>shadow</key>
|
||||
<dict>
|
||||
<key>Draws</key>
|
||||
<string>NO</string>
|
||||
</dict>
|
||||
<key>stroke</key>
|
||||
<dict>
|
||||
<key>Draws</key>
|
||||
<string>NO</string>
|
||||
</dict>
|
||||
</dict>
|
||||
</dict>
|
||||
<key>BaseZoom</key>
|
||||
<integer>0</integer>
|
||||
<key>CanvasOrigin</key>
|
||||
<string>{0, 0}</string>
|
||||
<key>ColumnAlign</key>
|
||||
<integer>1</integer>
|
||||
<key>ColumnSpacing</key>
|
||||
<real>36</real>
|
||||
<key>CreationDate</key>
|
||||
<string>2014-11-17 16:51:42 +0000</string>
|
||||
<key>Creator</key>
|
||||
<string>Heng Li</string>
|
||||
<key>DisplayScale</key>
|
||||
<string>1 0/72 in = 1 0/72 in</string>
|
||||
<key>GraphDocumentVersion</key>
|
||||
<integer>8</integer>
|
||||
<key>GraphicsList</key>
|
||||
<array>
|
||||
<dict>
|
||||
<key>Bounds</key>
|
||||
<string>{{35.699992179870605, 151.89999580383301}, {476, 224}}</string>
|
||||
<key>Class</key>
|
||||
<string>ShapedGraphic</string>
|
||||
<key>FitText</key>
|
||||
<string>YES</string>
|
||||
<key>Flow</key>
|
||||
<string>Resize</string>
|
||||
<key>FontInfo</key>
|
||||
<dict>
|
||||
<key>Font</key>
|
||||
<string>AndaleMono</string>
|
||||
<key>Size</key>
|
||||
<real>12</real>
|
||||
</dict>
|
||||
<key>ID</key>
|
||||
<integer>28</integer>
|
||||
<key>Shape</key>
|
||||
<string>Rectangle</string>
|
||||
<key>Style</key>
|
||||
<dict>
|
||||
<key>fill</key>
|
||||
<dict>
|
||||
<key>Draws</key>
|
||||
<string>NO</string>
|
||||
</dict>
|
||||
<key>shadow</key>
|
||||
<dict>
|
||||
<key>Draws</key>
|
||||
<string>NO</string>
|
||||
</dict>
|
||||
<key>stroke</key>
|
||||
<dict>
|
||||
<key>Draws</key>
|
||||
<string>NO</string>
|
||||
</dict>
|
||||
</dict>
|
||||
<key>Text</key>
|
||||
<dict>
|
||||
<key>Align</key>
|
||||
<integer>0</integer>
|
||||
<key>Pad</key>
|
||||
<integer>0</integer>
|
||||
<key>Text</key>
|
||||
<string>{\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210
|
||||
\cocoascreenfonts1{\fonttbl\f0\fnil\fcharset0 Consolas;\f1\fnil\fcharset0 Consolas-Bold;}
|
||||
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red127\green127\blue127;\red255\green0\blue0;
|
||||
\red204\green204\blue204;\red0\green0\blue255;\red0\green128\blue0;\red255\green128\blue0;}
|
||||
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
|
||||
|
||||
\f0\fs24 \cf2 Read: A\cf0 TCAGCATC\
|
||||
\cf2 \
|
||||
ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATCA
|
||||
\f1\b \cf4 ATCAGCATC
|
||||
\f0\b0 \cf3 GAACTAGTCACAT\cf2 \
|
||||
\cf3 |||||\cf5 (high div) \cf3 |||\cf5 (novel ins)\cf3 ||||||||||\cf2 \
|
||||
Chromosome:\cf3 GCGTACATGATACGA
|
||||
\f1\b \cf6 ATCgGCATC
|
||||
\f0\b0 \cf3 ATC-------------CTAGTCACATCGTAATCGA\
|
||||
\cf2 \cf3 |||||||||||| |||||||\cf5 (novel ins) \cf3 ||||||||||\
|
||||
\cf2 ALT ctg 2:\cf3 TGATACGA
|
||||
\f1\b \cf7 ATCgcCATC
|
||||
\f0\b0 \cf3 ATCA
|
||||
\f1\b \cf8 ATCgcCAgC
|
||||
\f0\b0 \cf3 GAACTAGTCACAT\
|
||||
\
|
||||
\cf2 4 potential hits:
|
||||
\f1\b \cf4 ATCAGCATC
|
||||
\f0\b0 \cf0 >
|
||||
\f1\b \cf6 ATCgGCATC
|
||||
\f0\b0 \cf0 >
|
||||
\f1\b \cf7 ATCgcCATC
|
||||
\f0\b0 \cf2 >
|
||||
\f1\b \cf8 ATCgcCAgC\
|
||||
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
|
||||
|
||||
\f0\b0 \cf0 2 hit groups: \{
|
||||
\f1\b \cf4 ATCAGCATC
|
||||
\f0\b0 \cf0 ,
|
||||
\f1\b \cf8 ATCgcCAgC
|
||||
\f0\b0 \cf2 \} and\cf0 \{
|
||||
\f1\b \cf6 ATCgGCATC
|
||||
\f0\b0 \cf2 ,
|
||||
\f1\b \cf7 ATCgcCATC
|
||||
\f0\b0 \cf2 \}\
|
||||
\cf0 Hits considered in mapQ:
|
||||
\f1\b \cf4 ATCAGCATC
|
||||
\f0\b0 \cf0 and
|
||||
\f1\b \cf6 ATCgGCATC\
|
||||
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
|
||||
|
||||
\f0\b0 \cf3 \
|
||||
\cf2 In the output SAM:
|
||||
\f1\b \cf6 ATCgGCATC
|
||||
\f0\b0 \cf2 as the primary SAM line with mapQ=0\
|
||||
|
||||
\f1\b \cf4 ATCAGCATC
|
||||
\f0\b0 \cf2 as a supplementary line with mapQ>0\
|
||||
|
||||
\f1\b \cf8 ATCgcCAgC
|
||||
\f0\b0 \cf2 as a supplementary line with mapQ>0\
|
||||
|
||||
\f1\b \cf7 ATCgcCATC
|
||||
\f0\b0 \cf2 in an XA tag, not as a separate line}</string>
|
||||
<key>VerticalPad</key>
|
||||
<integer>0</integer>
|
||||
</dict>
|
||||
<key>Wrap</key>
|
||||
<string>NO</string>
|
||||
</dict>
|
||||
</array>
|
||||
<key>GridInfo</key>
|
||||
<dict>
|
||||
<key>GridSpacing</key>
|
||||
<real>7.1999998092651367</real>
|
||||
<key>MajorGridSpacing</key>
|
||||
<integer>10</integer>
|
||||
<key>SnapsToGrid</key>
|
||||
<string>YES</string>
|
||||
</dict>
|
||||
<key>GuidesLocked</key>
|
||||
<string>NO</string>
|
||||
<key>GuidesVisible</key>
|
||||
<string>YES</string>
|
||||
<key>HPages</key>
|
||||
<integer>1</integer>
|
||||
<key>ImageCounter</key>
|
||||
<integer>1</integer>
|
||||
<key>KeepToScale</key>
|
||||
<false/>
|
||||
<key>Layers</key>
|
||||
<array>
|
||||
<dict>
|
||||
<key>Lock</key>
|
||||
<string>NO</string>
|
||||
<key>Name</key>
|
||||
<string>Layer 1</string>
|
||||
<key>Print</key>
|
||||
<string>YES</string>
|
||||
<key>View</key>
|
||||
<string>YES</string>
|
||||
</dict>
|
||||
</array>
|
||||
<key>LayoutInfo</key>
|
||||
<dict>
|
||||
<key>Animate</key>
|
||||
<string>NO</string>
|
||||
<key>circoMinDist</key>
|
||||
<real>18</real>
|
||||
<key>circoSeparation</key>
|
||||
<real>0.0</real>
|
||||
<key>layoutEngine</key>
|
||||
<string>dot</string>
|
||||
<key>neatoSeparation</key>
|
||||
<real>0.0</real>
|
||||
<key>twopiSeparation</key>
|
||||
<real>0.0</real>
|
||||
</dict>
|
||||
<key>LinksVisible</key>
|
||||
<string>NO</string>
|
||||
<key>MagnetsVisible</key>
|
||||
<string>NO</string>
|
||||
<key>MasterSheets</key>
|
||||
<array/>
|
||||
<key>ModificationDate</key>
|
||||
<string>2014-11-17 18:01:49 +0000</string>
|
||||
<key>Modifier</key>
|
||||
<string>Heng Li</string>
|
||||
<key>NotesVisible</key>
|
||||
<string>NO</string>
|
||||
<key>Orientation</key>
|
||||
<integer>2</integer>
|
||||
<key>OriginVisible</key>
|
||||
<string>NO</string>
|
||||
<key>PageBreaks</key>
|
||||
<string>YES</string>
|
||||
<key>PrintInfo</key>
|
||||
<dict>
|
||||
<key>NSBottomMargin</key>
|
||||
<array>
|
||||
<string>float</string>
|
||||
<string>41</string>
|
||||
</array>
|
||||
<key>NSHorizonalPagination</key>
|
||||
<array>
|
||||
<string>coded</string>
|
||||
<string>BAtzdHJlYW10eXBlZIHoA4QBQISEhAhOU051bWJlcgCEhAdOU1ZhbHVlAISECE5TT2JqZWN0AIWEASqEhAFxlwCG</string>
|
||||
</array>
|
||||
<key>NSLeftMargin</key>
|
||||
<array>
|
||||
<string>float</string>
|
||||
<string>18</string>
|
||||
</array>
|
||||
<key>NSPaperSize</key>
|
||||
<array>
|
||||
<string>size</string>
|
||||
<string>{612, 792}</string>
|
||||
</array>
|
||||
<key>NSPrintReverseOrientation</key>
|
||||
<array>
|
||||
<string>int</string>
|
||||
<string>0</string>
|
||||
</array>
|
||||
<key>NSRightMargin</key>
|
||||
<array>
|
||||
<string>float</string>
|
||||
<string>18</string>
|
||||
</array>
|
||||
<key>NSTopMargin</key>
|
||||
<array>
|
||||
<string>float</string>
|
||||
<string>18</string>
|
||||
</array>
|
||||
</dict>
|
||||
<key>PrintOnePage</key>
|
||||
<false/>
|
||||
<key>ReadOnly</key>
|
||||
<string>NO</string>
|
||||
<key>RowAlign</key>
|
||||
<integer>1</integer>
|
||||
<key>RowSpacing</key>
|
||||
<real>36</real>
|
||||
<key>SheetTitle</key>
|
||||
<string>Canvas 1</string>
|
||||
<key>SmartAlignmentGuidesActive</key>
|
||||
<string>YES</string>
|
||||
<key>SmartDistanceGuidesActive</key>
|
||||
<string>YES</string>
|
||||
<key>UniqueID</key>
|
||||
<integer>1</integer>
|
||||
<key>UseEntirePage</key>
|
||||
<false/>
|
||||
<key>VPages</key>
|
||||
<integer>1</integer>
|
||||
<key>WindowInfo</key>
|
||||
<dict>
|
||||
<key>CurrentSheet</key>
|
||||
<integer>0</integer>
|
||||
<key>ExpandedCanvases</key>
|
||||
<array>
|
||||
<dict>
|
||||
<key>name</key>
|
||||
<string>Canvas 1</string>
|
||||
</dict>
|
||||
</array>
|
||||
<key>Frame</key>
|
||||
<string>{{367, 6}, {710, 872}}</string>
|
||||
<key>ListView</key>
|
||||
<true/>
|
||||
<key>OutlineWidth</key>
|
||||
<integer>142</integer>
|
||||
<key>RightSidebar</key>
|
||||
<false/>
|
||||
<key>ShowRuler</key>
|
||||
<true/>
|
||||
<key>Sidebar</key>
|
||||
<true/>
|
||||
<key>SidebarWidth</key>
|
||||
<integer>120</integer>
|
||||
<key>VisibleRegion</key>
|
||||
<string>{{0, 0}, {575, 733}}</string>
|
||||
<key>Zoom</key>
|
||||
<real>1</real>
|
||||
<key>ZoomValues</key>
|
||||
<array>
|
||||
<array>
|
||||
<string>Canvas 1</string>
|
||||
<real>1</real>
|
||||
<real>1</real>
|
||||
</array>
|
||||
</array>
|
||||
</dict>
|
||||
</dict>
|
||||
</plist>
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 45 KiB |