improved doc and don't type HLA by default
parent 5aba188969
commit c05a721f28
11
NEWS.md
|
|
@@ -1,17 +1,18 @@
|
|||
Release 0.7.11 (XX November, 2014)
|
||||
-----------------------------------
|
||||
Release 0.7.11 (XX December, 2014)
|
||||
----------------------------------
|
||||
|
||||
A major change to BWA-MEM is the support of mapping to ALT contigs in addition
|
||||
to the primary assembly. Part of the ALT mapping strategy is implemented in
|
||||
BWA-MEM and the rest in a postprocessing script for now. Due to the extra
|
||||
layer of complexity on generating the reference genome and on the two-step
|
||||
mapping, we start to provide a wrapper script and precompiled binaries since
|
||||
this release. Please check README-alt.md for details.
|
||||
this release. The package may be more convenient for some specific use cases.
|
||||
For general use, the single BWA binary still works as before.
|
||||
|
||||
Another major addition to BWA-MEM is HLA typing, which is made possible by the
|
||||
new ALT mapping strategy. Necessary data and programs are included in the
|
||||
binary release. The wrapper script also performs HLA typing when HLA genes are
|
||||
also included in the reference genome as additional ALT contigs.
|
||||
included in the reference genome as additional ALT contigs.
|
||||
|
||||
Other notable changes to BWA-MEM:
|
||||
|
||||
|
|
@@ -44,7 +45,7 @@ Other notable changes to BWA-MEM:
|
|||
writing SAM. This saves significant wall-clock time when reading from
|
||||
or writing to a slow Unix pipe.
|
||||
|
||||
(0.7.11: XX November 2014, rXXX)
|
||||
(0.7.11: XX December 2014, r10XX)
|
||||
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@@ -5,10 +5,10 @@
|
|||
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
||||
| gzip -dc | tar xf -
|
||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
|
||||
bwa.kit/bwa index hs38d6.fa # create BWA index
|
||||
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
|
||||
bwa.kit/bwa index hs38D1.fa # create BWA index
|
||||
# mapping
|
||||
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
|
||||
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
|
||||
```
|
||||
|
||||
This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA
|
||||
|
|
@@ -94,11 +94,11 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
|
|||
|
||||
For a more complete reference genome, we compiled a new set of decoy sequences
|
||||
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
|
||||
The sequences are included in `hs38d6-extra.fa` from the [BWA binary
|
||||
The sequences are included in `hs38D1-extra.fa` from the [BWA binary
|
||||
package][res].
|
||||
|
||||
In addition to decoy, we also put multiple alleles of HLA genes in
|
||||
`hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
||||
`hs38D1-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
||||
version 3.18.0 and are used to collect reads sequenced from these genes.
|
||||
|
||||
### HLA typing
|
||||
|
|
@@ -125,26 +125,26 @@ most of them are distributed under restrictive licenses.
|
|||
|
||||
To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
|
||||
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
|
||||
(hs38d6), and called small variants from the alignment. CHM1 is haploid.
|
||||
(hs38D1), and called small variants from the alignment. CHM1 is haploid.
|
||||
Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
|
||||
true positive (TP) heterozygous calls from NA12878 are approximately equal
|
||||
to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
|
||||
should yield higher TP and lower FP. The following table shows the numbers for
|
||||
these assemblies:
|
||||
|
||||
|Assembly|hs37 |hs38 |hs38d6|CHM1_1.1| huref|
|
||||
|Assembly|hs37 |hs38 |hs38D1|CHM1_1.1| huref|
|
||||
|:------:|------:|------:|------:|------:|------:|
|
||||
|FP | 255706| 168068| 142516|307172 | 575634|
|
||||
|TP |2142260|2163113|2150844|2167235|2137053|
|
||||
|
||||
With this measurement, hs38 is clearly better than hs37. Genome hs38d6 reduces
|
||||
With this measurement, hs38 is clearly better than hs37. Genome hs38D1 reduces
|
||||
FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
|
||||
from hs38 only and found the majority of them are associated with excessive read
|
||||
depth, clustered variants or weak alignment. We believe most hs38-only calls are
|
||||
problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
|
||||
with nearly identical library construction, the difference is ~140k, an order
|
||||
of magnitude higher than the difference between hs38 and hs38d6. ALT contigs,
|
||||
decoy and HLA genes in hs38d6 improve variant calling and enable the analyses of
|
||||
of magnitude higher than the difference between hs38 and hs38D1. ALT contigs,
|
||||
decoy and HLA genes in hs38D1 improve variant calling and enable the analyses of
|
||||
ALT contigs and HLA typing at little cost.
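
The rounded differences quoted above (~25k FP, ~12k TP) can be recomputed from the table. The snippet below is only a small sketch using POSIX shell arithmetic; the variable names are illustrative and the counts are copied from the hs38 and hs38D1 columns.

```shell
#!/bin/sh
# Counts copied from the table above (hs38 vs. hs38D1 columns).
fp_hs38=168068
fp_hs38D1=142516
tp_hs38=2163113
tp_hs38D1=2150844

# Difference in false-positive and true-positive heterozygous calls.
fp_drop=$((fp_hs38 - fp_hs38D1))
tp_drop=$((tp_hs38 - tp_hs38D1))

echo "FP reduced by $fp_drop, TP reduced by $tp_drop"
```

Both differences are well below the ~140k replicate-to-replicate difference mentioned in the text.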
|
||||
|
||||
## Problems and Future Development
|
||||
|
|
|
|||
|
|
@@ -21,10 +21,10 @@ how to use bwakit:
|
|||
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
||||
| gzip -dc | tar xf -
|
||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
|
||||
bwa.kit/bwa index hs38d6.fa # create BWA index
|
||||
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
|
||||
bwa.kit/bwa index hs38D1.fa # create BWA index
|
||||
# mapping
|
||||
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh
|
||||
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh
|
||||
```
|
||||
|
||||
The last mapping command line will generate the following files:
|
||||
|
|
@@ -44,7 +44,31 @@ Bwakit can be [downloaded here][res]. It is only available to x86_64-linux. The
|
|||
scripts in the package are available in the [bwa/bwakit][kit] directory.
|
||||
Packaging is done manually for now.
|
||||
|
||||
## Contents
|
||||
## Limitations
|
||||
|
||||
* HLA typing only works for high-coverage human data. The typing accuracy can
|
||||
still be improved. We encourage researchers to develop better HLA typing tools
|
||||
based on the intermediate output of bwakit (for each HLA gene included in the
|
||||
index, bwakit writes all reads matching it in a separate file).
|
||||
|
||||
* Duplicate marking only works when all reads from a single paired-end library
|
||||
are provided as the input. This limitation is a necessary tradeoff for the fast
|
||||
MarkDuplicate provided by samblaster.
|
||||
|
||||
* The adapter trimmer was chosen because it is fast, pipe-friendly and does not
|
||||
discard reads. However, it is conservative and suboptimal. If this is a
|
||||
concern, it is recommended to preprocess input reads with a more sophisticated
|
||||
adapter trimmer. We also hope existing trimmers can be modified to operate on
|
||||
an interleaved FASTQ stream. We will replace trimadap once a better trimmer
|
||||
meets our needs.
|
||||
|
||||
* Bwakit can be memory demanding, depending on the functionality invoked. For 30X
|
||||
human data, bwa-mem takes about 6GB RAM, samblaster uses close to 10GB and BAM
|
||||
shuffling (if the input is sorted BAM) uses several GB. In the current
|
||||
setting, sorting uses about 10GB.
|
||||
|
||||
|
||||
## Package Contents
|
||||
```
|
||||
bwa.kit
|
||||
|-- README.md This README file.
|
||||
|
|
|
|||
|
|
@@ -18,29 +18,38 @@ Options: -o STR prefix for output files [inferred from
|
|||
ont2d: Oxford Nanopore reads (~10kb query, higher error rate)
|
||||
-t INT number of threads [1]
|
||||
|
||||
-H apply HLA typing
|
||||
-a trim HiSeq2000/2500 PE resequencing adapters (via trimadap)
|
||||
-d mark duplicate (via samblaster)
|
||||
-S for SAM/BAM input, don\'t shuffle
|
||||
-s sort the output alignment (requiring more RAM)
|
||||
-H skip HLA typing
|
||||
-S for BAM input, don\'t shuffle
|
||||
-s sort the output alignment (via samtools; requiring more RAM)
|
||||
-k keep temporary files generated by typeHLA
|
||||
|
||||
Examples:
|
||||
|
||||
* Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing:
|
||||
|
||||
run-bwamem -o prefix -t8 -R"@RG\tID:foo\tSM:bar" hs38d6.fa read1.fq.gz read2.fq.gz
|
||||
run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38D1.fa read1.fq.gz read2.fq.gz
|
||||
|
||||
Note: HLA typing is only effective for high-coverage data. The typing accuracy varies
|
||||
with the quality of input. It is intended for research purposes only, not for diagnostics.
|
||||
|
||||
* Remap coordinate-sorted BAM, transfer read group tags, trim Illumina PE adapters and
|
||||
sort the output. The BAM may contain single-end or paired-end reads, or a mixture of
|
||||
the two types. Specifying -R stops read group transfer.
|
||||
|
||||
run-bwamem -sao prefix hs38d6.fa old-srt.bam
|
||||
run-bwamem -sao prefix hs38D1.fa old-srt.bam
|
||||
|
||||
* Remap name-grouped BAM and mark duplicates. Note that in this case, all reads from
|
||||
a single library should be aligned at the same time. Paired-end only.
|
||||
Note: the adaptor trimmer included in bwa.kit is chosen because it fits the current
|
||||
mapping pipeline better. It is conservative and suboptimal. A more sophisticated
|
||||
trimmer is recommended if this becomes a concern.
|
||||
|
||||
run-bwamem -Sdo prefix hs38d6.fa old-unsrt.bam
|
||||
* Remap name-grouped BAM and mark duplicates:
|
||||
|
||||
run-bwamem -Sdo prefix hs38D1.fa old-unsrt.bam
|
||||
|
||||
Note: streamed duplicate marking requires all reads from a single paired-end library
|
||||
to be aligned at the same time.
|
||||
|
||||
Output files:
|
||||
|
||||
|
|
@@ -156,7 +165,7 @@ if (-f "$ARGV[0].alt") {
|
|||
my $t_sort = $opts{t} < 4? $opts{t} : 4;
|
||||
$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
|
||||
|
||||
if ($has_hla && !defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
|
||||
if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
|
||||
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
|
||||
$cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
|
||||
$cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k});
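
The effect of the changed condition above can be sketched in shell. This is an illustration only, not code from bwakit: `should_type_hla` is a hypothetical helper, its three arguments stand in for `$has_hla`, `$opts{H}` and `$opts{x}`, and an empty string is assumed to mean that `-x` was not given.

```shell
#!/bin/sh
# Hypothetical helper mirroring the post-change condition in run-bwamem:
# HLA typing runs only if the index has HLA contigs, -H was given,
# and the -x preset is either unset or 'intractg'.
should_type_hla() {
    has_hla=$1   # 1 if the reference includes HLA contigs, else 0
    opt_H=$2     # 1 if -H was passed, else 0
    preset=$3    # value of -x; empty string if -x was not given (assumption)
    [ "$has_hla" -eq 1 ] && [ "$opt_H" -eq 1 ] &&
        { [ -z "$preset" ] || [ "$preset" = "intractg" ]; }
}

# Report the decision for a few combinations of inputs.
for args in "1 1 -" "1 0 -" "1 1 intractg" "1 1 pacbio" "0 1 -"; do
    set -- $args
    preset=$3
    [ "$preset" = "-" ] && preset=""   # '-' is a placeholder for "no -x"
    if should_type_hla "$1" "$2" "$preset"; then
        echo "has_hla=$1 H=$2 x=${preset:-unset}: run HLA typing"
    else
        echo "has_hla=$1 H=$2 x=${preset:-unset}: skip HLA typing"
    fi
done
```

Before this commit the test was `!defined($opts{H})`, so typing ran unless `-H` was given; after it, typing is opt-in via `-H`.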
|
||||