improved doc and don't type HLA by default
This commit is contained in:
parent
5aba188969
commit
c05a721f28
11
NEWS.md
11
NEWS.md
|
|
@ -1,17 +1,18 @@
|
||||||
Release 0.7.11 (XX November, 2014)
|
Release 0.7.11 (XX December, 2014)
|
||||||
-----------------------------------
|
----------------------------------
|
||||||
|
|
||||||
A major change to BWA-MEM is the support of mapping to ALT contigs in addition
|
A major change to BWA-MEM is the support of mapping to ALT contigs in addition
|
||||||
to the primary assembly. Part of the ALT mapping strategy is implemented in
|
to the primary assembly. Part of the ALT mapping strategy is implemented in
|
||||||
BWA-MEM and the rest in a postprocessing script for now. Due to the extra
|
BWA-MEM and the rest in a postprocessing script for now. Due to the extra
|
||||||
layer of complexity on generating the reference genome and on the two-step
|
layer of complexity on generating the reference genome and on the two-step
|
||||||
mapping, we start to provide a wrapper script and precompiled binaries since
|
mapping, we start to provide a wrapper script and precompiled binaries since
|
||||||
this release. Please check README-alt.md for details.
|
this release. The package may be more convenient to some specific use cases.
|
||||||
|
For general uses, the single BWA binary still works like the old way.
|
||||||
|
|
||||||
Another major addition to BWA-MEM is HLA typing, which made possible with the
|
Another major addition to BWA-MEM is HLA typing, which made possible with the
|
||||||
new ALT mapping strategy. Necessary data and programs are included in the
|
new ALT mapping strategy. Necessary data and programs are included in the
|
||||||
binary release. The wrapper script also performs HLA typing when HLA genes are
|
binary release. The wrapper script also performs HLA typing when HLA genes are
|
||||||
also included in the reference genome as additional ALT contigs.
|
included in the reference genome as additional ALT contigs.
|
||||||
|
|
||||||
Other notable changes to BWA-MEM:
|
Other notable changes to BWA-MEM:
|
||||||
|
|
||||||
|
|
@ -44,7 +45,7 @@ Other notable changes to BWA-MEM:
|
||||||
writing SAM. This saves significant wall-clock time when reading from
|
writing SAM. This saves significant wall-clock time when reading from
|
||||||
or writing to a slow Unix pipe.
|
or writing to a slow Unix pipe.
|
||||||
|
|
||||||
(0.7.11: XX November 2014, rXXX)
|
(0.7.11: XX December 2014, r10XX)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -5,10 +5,10 @@
|
||||||
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
||||||
| gzip -dc | tar xf -
|
| gzip -dc | tar xf -
|
||||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||||
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
|
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
|
||||||
bwa.kit/bwa index hs38d6.fa # create BWA index
|
bwa.kit/bwa index hs38D1.fa # create BWA index
|
||||||
# mapping
|
# mapping
|
||||||
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
|
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
|
||||||
```
|
```
|
||||||
|
|
||||||
This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA
|
This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA
|
||||||
|
|
@ -94,11 +94,11 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
|
||||||
|
|
||||||
For a more complete reference genome, we compiled a new set of decoy sequences
|
For a more complete reference genome, we compiled a new set of decoy sequences
|
||||||
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
|
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
|
||||||
The sequences are included in `hs38d6-extra.fa` from the [BWA binary
|
The sequences are included in `hs38D1-extra.fa` from the [BWA binary
|
||||||
package][res].
|
package][res].
|
||||||
|
|
||||||
In addition to decoy, we also put multiple alleles of HLA genes in
|
In addition to decoy, we also put multiple alleles of HLA genes in
|
||||||
`hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
`hs38D1-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
||||||
version 3.18.0 and are used to collect reads sequenced from these genes.
|
version 3.18.0 and are used to collect reads sequenced from these genes.
|
||||||
|
|
||||||
### HLA typing
|
### HLA typing
|
||||||
|
|
@ -125,26 +125,26 @@ most of them are distributed under restrictive licenses.
|
||||||
|
|
||||||
To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
|
To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
|
||||||
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
|
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
|
||||||
(hs38d6), and called small variants from the alignment. CHM1 is haploid.
|
(hs38D1), and called small variants from the alignment. CHM1 is haploid.
|
||||||
Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
|
Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
|
||||||
true positive (TP) heterozygous calls from NA12878 are approximately equal
|
true positive (TP) heterozygous calls from NA12878 are approximately equal
|
||||||
to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
|
to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
|
||||||
should yield higher TP and lower FP. The following table shows the numbers for
|
should yield higher TP and lower FP. The following table shows the numbers for
|
||||||
these assemblies:
|
these assemblies:
|
||||||
|
|
||||||
|Assembly|hs37 |hs38 |hs38d6|CHM1_1.1| huref|
|
|Assembly|hs37 |hs38 |hs38D1|CHM1_1.1| huref|
|
||||||
|:------:|------:|------:|------:|------:|------:|
|
|:------:|------:|------:|------:|------:|------:|
|
||||||
|FP | 255706| 168068| 142516|307172 | 575634|
|
|FP | 255706| 168068| 142516|307172 | 575634|
|
||||||
|TP |2142260|2163113|2150844|2167235|2137053|
|
|TP |2142260|2163113|2150844|2167235|2137053|
|
||||||
|
|
||||||
With this measurement, hs38 is clearly better than hs37. Genome hs38d6 reduces
|
With this measurement, hs38 is clearly better than hs37. Genome hs38D1 reduces
|
||||||
FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
|
FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
|
||||||
from hs38 only and found the majority of them are associated with excessive read
|
from hs38 only and found the majority of them are associated with excessive read
|
||||||
depth, clustered variants or weak alignment. We believe most hs38-only calls are
|
depth, clustered variants or weak alignment. We believe most hs38-only calls are
|
||||||
problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
|
problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
|
||||||
with nearly identical library construction, the difference is ~140k, an order
|
with nearly identical library construction, the difference is ~140k, an order
|
||||||
of magnitude higher than the difference between hs38 and hs38d6. ALT contigs,
|
of magnitude higher than the difference between hs38 and hs38D1. ALT contigs,
|
||||||
decoy and HLA genes in hs38d6 improve variant calling and enable the analyses of
|
decoy and HLA genes in hs38D1 improve variant calling and enable the analyses of
|
||||||
ALT contigs and HLA typing at little cost.
|
ALT contigs and HLA typing at little cost.
|
||||||
|
|
||||||
## Problems and Future Development
|
## Problems and Future Development
|
||||||
|
|
|
||||||
|
|
@ -21,10 +21,10 @@ how to use bwakit:
|
||||||
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
||||||
| gzip -dc | tar xf -
|
| gzip -dc | tar xf -
|
||||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||||
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
|
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
|
||||||
bwa.kit/bwa index hs38d6.fa # create BWA index
|
bwa.kit/bwa index hs38D1.fa # create BWA index
|
||||||
# mapping
|
# mapping
|
||||||
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh
|
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh
|
||||||
```
|
```
|
||||||
|
|
||||||
The last mapping command line will generate the following files:
|
The last mapping command line will generate the following files:
|
||||||
|
|
@ -44,7 +44,31 @@ Bwakit can be [downloaded here][res]. It is only available to x86_64-linux. The
|
||||||
scripts in the package are available in the [bwa/bwakit][kit] directory.
|
scripts in the package are available in the [bwa/bwakit][kit] directory.
|
||||||
Packaging is done manually for now.
|
Packaging is done manually for now.
|
||||||
|
|
||||||
## Contents
|
## Limitations
|
||||||
|
|
||||||
|
* HLA typing only works for high-coverage human data. The typing accuracy can
|
||||||
|
still be improved. We encourage researchers to develop better HLA typing tools
|
||||||
|
based on the intermediate output of bwakit (for each HLA gene included in the
|
||||||
|
index, bwakit writes all reads matching it in a separate file).
|
||||||
|
|
||||||
|
* Duplicate marking only works when all reads from a single paired-end library
|
||||||
|
are provided as the input. This limitation is the necessary tradeoff of fast
|
||||||
|
MarkDuplicate provided by samblaster.
|
||||||
|
|
||||||
|
* The adapter trimmer is chosen as it is fast, pipe friendly and does not
|
||||||
|
discard reads. However, it is conservative and suboptimal. If this is a
|
||||||
|
concern, it is recommended to preprocess input reads with a more sophisticated
|
||||||
|
adapter trimmer. We also hope existing trimmers can be modified to operate on
|
||||||
|
an interleaved FASTQ stream. We will replace trimadap once a better trimmer
|
||||||
|
meets our needs.
|
||||||
|
|
||||||
|
* Bwakit can be memory demanding depends on the functionality invoked. For 30X
|
||||||
|
human data, bwa-mem takes about 6GB RAM, samblaster uses close to 10GB and BAM
|
||||||
|
shuffling (if the input is sorted BAM) uses several GB. In the current
|
||||||
|
setting, sorting uses about 10GB.
|
||||||
|
|
||||||
|
|
||||||
|
## Package Contents
|
||||||
```
|
```
|
||||||
bwa.kit
|
bwa.kit
|
||||||
|-- README.md This README file.
|
|-- README.md This README file.
|
||||||
|
|
|
||||||
|
|
@ -18,29 +18,38 @@ Options: -o STR prefix for output files [inferred from
|
||||||
ont2d: Oxford Nanopore reads (~10kb query, higher error rate)
|
ont2d: Oxford Nanopore reads (~10kb query, higher error rate)
|
||||||
-t INT number of threads [1]
|
-t INT number of threads [1]
|
||||||
|
|
||||||
|
-H apply HLA typing
|
||||||
-a trim HiSeq2000/2500 PE resequencing adapters (via trimadap)
|
-a trim HiSeq2000/2500 PE resequencing adapters (via trimadap)
|
||||||
-d mark duplicate (via samblaster)
|
-d mark duplicate (via samblaster)
|
||||||
-S for SAM/BAM input, don\'t shuffle
|
-S for BAM input, don\'t shuffle
|
||||||
-s sort the output alignment (requring more RAM)
|
-s sort the output alignment (via samtools; requring more RAM)
|
||||||
-H skip HLA typing
|
|
||||||
-k keep temporary files generated by typeHLA
|
-k keep temporary files generated by typeHLA
|
||||||
|
|
||||||
Examples:
|
Examples:
|
||||||
|
|
||||||
* Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing:
|
* Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing:
|
||||||
|
|
||||||
run-bwamem -o prefix -t8 -R"@RG\tID:foo\tSM:bar" hs38d6.fa read1.fq.gz read2.fq.gz
|
run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38D1.fa read1.fq.gz read2.fq.gz
|
||||||
|
|
||||||
|
Note: HLA typing is only effective for high-coverage data. The typing accuracy varies
|
||||||
|
with the quality of input. It is only intended for research purpose, not for diagnostic.
|
||||||
|
|
||||||
* Remap coordinate-sorted BAM, transfer read groups tags, trim Illumina PE adapters and
|
* Remap coordinate-sorted BAM, transfer read groups tags, trim Illumina PE adapters and
|
||||||
sort the output. The BAM may contain single-end or paired-end reads, or a mixture of
|
sort the output. The BAM may contain single-end or paired-end reads, or a mixture of
|
||||||
the two types. Specifying -R stops read group transfer.
|
the two types. Specifying -R stops read group transfer.
|
||||||
|
|
||||||
run-bwamem -sao prefix hs38d6.fa old-srt.bam
|
run-bwamem -sao prefix hs38D1.fa old-srt.bam
|
||||||
|
|
||||||
* Remap name-grouped BAM and mark duplicates. Note that in this case, all reads from
|
Note: the adaptor trimmer included in bwa.kit is chosen because it fits the current
|
||||||
a single library should be aligned at the same time. Paired-end only.
|
mapping pipeline better. It is conservative and suboptimal. A more sophisticated
|
||||||
|
trimmer is recommended if this becomes a concern.
|
||||||
|
|
||||||
run-bwamem -Sdo prefix hs38d6.fa old-unsrt.bam
|
* Remap name-grouped BAM and mark duplicates:
|
||||||
|
|
||||||
|
run-bwamem -Sdo prefix hs38D1.fa old-unsrt.bam
|
||||||
|
|
||||||
|
Note: streamed duplicate marking requires all reads from a single paired-end library
|
||||||
|
to be aligned at the same time.
|
||||||
|
|
||||||
Output files:
|
Output files:
|
||||||
|
|
||||||
|
|
@ -156,7 +165,7 @@ if (-f "$ARGV[0].alt") {
|
||||||
my $t_sort = $opts{t} < 4? $opts{t} : 4;
|
my $t_sort = $opts{t} < 4? $opts{t} : 4;
|
||||||
$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
|
$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
|
||||||
|
|
||||||
if ($has_hla && !defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
|
if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
|
||||||
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
|
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
|
||||||
$cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
|
$cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
|
||||||
$cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k});
|
$cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k});
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue