hs38D1=>hs38DH; cmd help for run-gen-ref

This commit is contained in:
Heng Li 2014-12-23 11:48:59 -05:00
parent ed95769a33
commit d7ce7e1395
4 changed files with 37 additions and 28 deletions

View File

@ -5,10 +5,10 @@
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
| gzip -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
bwa.kit/bwa index hs38D1.fa # create BWA index
bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
bwa.kit/bwa index hs38DH.fa # create BWA index
# mapping
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
```
This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA
@ -94,11 +94,11 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
For a more complete reference genome, we compiled a new set of decoy sequences
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
The sequences are included in `hs38D1-extra.fa` from the [BWA binary
The sequences are included in `hs38DH-extra.fa` from the [BWA binary
package][res].
In addition to decoy, we also put multiple alleles of HLA genes in
`hs38D1-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
`hs38DH-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
version 3.18.0 and are used to collect reads sequenced from these genes.
### HLA typing
@ -125,26 +125,26 @@ most of them are distributed under restrictive licenses.
To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
(hs38D1), and called small variants from the alignment. CHM1 is haploid.
(hs38DH), and called small variants from the alignment. CHM1 is haploid.
Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
true positive (TP) heterozygous calls from NA12878 are approximately equal
to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
should yield higher TP and lower FP. The following table shows the numbers for
these assemblies:
|Assembly|hs37 |hs38 |hs38D1|CHM1_1.1| huref|
|Assembly|hs37 |hs38 |hs38DH|CHM1_1.1| huref|
|:------:|------:|------:|------:|------:|------:|
|FP | 255706| 168068| 142516|307172 | 575634|
|TP |2142260|2163113|2150844|2167235|2137053|
With this measurement, hs38 is clearly better than hs37. Genome hs38D1 reduces
With this measurement, hs38 is clearly better than hs37. Genome hs38DH reduces
FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
from hs38 only and found the majority of them are associated with excessive read
depth, clustered variants or weak alignment. We believe most hs38-only calls are
problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
with nearly identical library construction, the difference is ~140k, an order
of magnitude higher than the difference between hs38 and hs38D1. ALT contigs,
decoy and HLA genes in hs38D1 improve variant calling and enable the analyses of
of magnitude higher than the difference between hs38 and hs38DH. ALT contigs,
decoy and HLA genes in hs38DH improve variant calling and enable the analyses of
ALT contigs and HLA typing at little cost.
## Problems and Future Development

View File

@ -21,10 +21,10 @@ how to use bwakit:
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
| gzip -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
bwa.kit/bwa index hs38D1.fa # create BWA index
bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
bwa.kit/bwa index hs38DH.fa # create BWA index
# mapping
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh
bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh
```
The last mapping command line will generate the following files:
@ -63,9 +63,9 @@ Packaging is done manually for now.
meets our needs.
* Bwakit can be memory demanding depends on the functionality invoked. For 30X
human data, bwa-mem takes about 6GB RAM, samblaster uses close to 10GB and BAM
shuffling (if the input is sorted BAM) uses several GB. In the current
setting, sorting uses about 10GB.
human data, bwa-mem takes about 11GB RAM with 32 threads, samblaster uses
close to 10GB and BAM shuffling (if the input is sorted BAM) uses several GB.
In the current setting, sorting uses about 10GB.
## Package Contents
@ -84,8 +84,8 @@ bwa.kit
|
|-- run-gen-ref *Entry script* for generating human reference genomes.
|-- resource-GRCh38 Resources for generating GRCh38
| |-- hs38D1-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref.
| `-- hs38D1.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref.
| |-- hs38DH-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref.
| `-- hs38DH.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref.
|
|-- run-HLA HLA typing for sequences extracted by bwa-postalt.js.
|-- typeHLA.sh Type one HLA-gene. Called by run-HLA.

View File

@ -29,7 +29,7 @@ Examples:
* Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing:
run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38D1.fa read1.fq.gz read2.fq.gz
run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38DH.fa read1.fq.gz read2.fq.gz
Note: HLA typing is only effective for high-coverage data. The typing accuracy varies
with the quality of input. It is only intended for research purpose, not for diagnostic.
@ -38,7 +38,7 @@ Examples:
sort the output. The BAM may contain single-end or paired-end reads, or a mixture of
the two types. Specifying -R stops read group transfer.
run-bwamem -sao prefix hs38D1.fa old-srt.bam
run-bwamem -sao prefix hs38DH.fa old-srt.bam
Note: the adaptor trimmer included in bwa.kit is chosen because it fits the current
mapping pipeline better. It is conservative and suboptimal. A more sophisticated
@ -46,7 +46,7 @@ Examples:
* Remap name-grouped BAM and mark duplicates:
run-bwamem -Sdo prefix hs38D1.fa old-unsrt.bam
run-bwamem -Sdo prefix hs38DH.fa old-unsrt.bam
Note: streamed duplicate marking requires all reads from a single paired-end library
to be aligned at the same time.
@ -93,7 +93,7 @@ if (defined $opts{o}) {
} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
$prefix = $1;
}
die("ERROR: failed to identify the prefix for output. Please specify -p.\n") unless defined($prefix);
die("ERROR: failed to identify the prefix for output. Please specify -o.\n") unless defined($prefix);
my $size = 0;
my $comp_ratio = 3.;
@ -167,7 +167,7 @@ $cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.al
if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
$cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
$cmd .= "touch $prefix.hla.HLA-dummy.gt; cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
$cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k});
}

View File

@ -6,16 +6,25 @@ url38="ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals
url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
if [ $# -eq 0 ]; then
echo "Usage: $0 <hs38|hs38a|hs38D1|hs37|hs37d5>"
echo "Usage: $0 <hs38|hs38a|hs38DH|hs37|hs37d5>"
echo "Analysis sets:"
echo " hs38 primary assembly of GRCh38 (incl. chromosomes, unplaced and unlocalized contigs) and EBV"
echo " hs38a hs38 plus ALT contigs"
echo " hs38DH hs38a plus decoy contigs and HLA genes (recommended for GRCh38 mapping)"
echo " hs37 primary assembly of GRCh37 (used by 1000g phase 1) plus the EBV genome"
echo " hs37d5 hs37 plus decoy contigs (used by 1000g phase 3)"
echo ""
echo "Note: This script downloads human reference genomes. For hs38a and hs38DH, it needs additional"
echo " sequences and ALT-to-REF mapping included in the bwa.kit package."
exit 1;
fi
if [ $1 == "hs38D1" ]; then
(wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38D1-extra.fa) > $1.fa
[ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38D1.fa.alt $1.fa.alt
if [ $1 == "hs38DH" ]; then
(wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38DH-extra.fa) > $1.fa
[ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38DH.fa.alt $1.fa.alt
elif [ $1 == "hs38a" ]; then
wget -O- $url38 | gzip -dc > $1.fa
[ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38D1.fa.alt > $1.fa.alt
[ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38DH.fa.alt > $1.fa.alt
elif [ $1 == "hs38" ]; then
wget -O- $url38 | gzip -dc | awk '/^>/{f=/_alt/?0:1}f' > $1.fa
elif [ $1 == "hs37d5" ]; then