hs38D1=>hs38DH; cmd help for run-gen-ref
This commit is contained in:
parent
ed95769a33
commit
d7ce7e1395
|
|
@ -5,10 +5,10 @@
|
|||
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
||||
| gzip -dc | tar xf -
|
||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
|
||||
bwa.kit/bwa index hs38D1.fa # create BWA index
|
||||
bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
|
||||
bwa.kit/bwa index hs38DH.fa # create BWA index
|
||||
# mapping
|
||||
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
|
||||
bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
|
||||
```
|
||||
|
||||
This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA
|
||||
|
|
@ -94,11 +94,11 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
|
|||
|
||||
For a more complete reference genome, we compiled a new set of decoy sequences
|
||||
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
|
||||
The sequences are included in `hs38D1-extra.fa` from the [BWA binary
|
||||
The sequences are included in `hs38DH-extra.fa` from the [BWA binary
|
||||
package][res].
|
||||
|
||||
In addition to decoy, we also put multiple alleles of HLA genes in
|
||||
`hs38D1-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
||||
`hs38DH-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
|
||||
version 3.18.0 and are used to collect reads sequenced from these genes.
|
||||
|
||||
### HLA typing
|
||||
|
|
@ -125,26 +125,26 @@ most of them are distributed under restrictive licenses.
|
|||
|
||||
To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
|
||||
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
|
||||
(hs38D1), and called small variants from the alignment. CHM1 is haploid.
|
||||
(hs38DH), and called small variants from the alignment. CHM1 is haploid.
|
||||
Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
|
||||
true positive (TP) heterozygous calls from NA12878 are approximately equal
|
||||
to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
|
||||
should yield higher TP and lower FP. The following table shows the numbers for
|
||||
these assemblies:
|
||||
|
||||
|Assembly|hs37 |hs38 |hs38D1|CHM1_1.1| huref|
|
||||
|Assembly|hs37 |hs38 |hs38DH|CHM1_1.1| huref|
|
||||
|:------:|------:|------:|------:|------:|------:|
|
||||
|FP | 255706| 168068| 142516|307172 | 575634|
|
||||
|TP |2142260|2163113|2150844|2167235|2137053|
|
||||
|
||||
With this measurement, hs38 is clearly better than hs37. Genome hs38D1 reduces
|
||||
With this measurement, hs38 is clearly better than hs37. Genome hs38DH reduces
|
||||
FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
|
||||
from hs38 only and found the majority of them are associated with excessive read
|
||||
depth, clustered variants or weak alignment. We believe most hs38-only calls are
|
||||
problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
|
||||
with nearly identical library construction, the difference is ~140k, an order
|
||||
of magnitude higher than the difference between hs38 and hs38D1. ALT contigs,
|
||||
decoy and HLA genes in hs38D1 improve variant calling and enable the analyses of
|
||||
of magnitude higher than the difference between hs38 and hs38DH. ALT contigs,
|
||||
decoy and HLA genes in hs38DH improve variant calling and enable the analyses of
|
||||
ALT contigs and HLA typing at little cost.
|
||||
|
||||
## Problems and Future Development
|
||||
|
|
|
|||
|
|
@ -21,10 +21,10 @@ how to use bwakit:
|
|||
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
|
||||
| gzip -dc | tar xf -
|
||||
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
|
||||
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
|
||||
bwa.kit/bwa index hs38D1.fa # create BWA index
|
||||
bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
|
||||
bwa.kit/bwa index hs38DH.fa # create BWA index
|
||||
# mapping
|
||||
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh
|
||||
bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh
|
||||
```
|
||||
|
||||
The last mapping command line will generate the following files:
|
||||
|
|
@ -63,9 +63,9 @@ Packaging is done manually for now.
|
|||
meets our needs.
|
||||
|
||||
* Bwakit can be memory demanding depends on the functionality invoked. For 30X
|
||||
human data, bwa-mem takes about 6GB RAM, samblaster uses close to 10GB and BAM
|
||||
shuffling (if the input is sorted BAM) uses several GB. In the current
|
||||
setting, sorting uses about 10GB.
|
||||
human data, bwa-mem takes about 11GB RAM with 32 threads, samblaster uses
|
||||
close to 10GB and BAM shuffling (if the input is sorted BAM) uses several GB.
|
||||
In the current setting, sorting uses about 10GB.
|
||||
|
||||
|
||||
## Package Contents
|
||||
|
|
@ -84,8 +84,8 @@ bwa.kit
|
|||
|
|
||||
|-- run-gen-ref *Entry script* for generating human reference genomes.
|
||||
|-- resource-GRCh38 Resources for generating GRCh38
|
||||
| |-- hs38D1-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref.
|
||||
| `-- hs38D1.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref.
|
||||
| |-- hs38DH-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref.
|
||||
| `-- hs38DH.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref.
|
||||
|
|
||||
|-- run-HLA HLA typing for sequences extracted by bwa-postalt.js.
|
||||
|-- typeHLA.sh Type one HLA-gene. Called by run-HLA.
|
||||
|
|
|
|||
|
|
@ -29,7 +29,7 @@ Examples:
|
|||
|
||||
* Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing:
|
||||
|
||||
run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38D1.fa read1.fq.gz read2.fq.gz
|
||||
run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38DH.fa read1.fq.gz read2.fq.gz
|
||||
|
||||
Note: HLA typing is only effective for high-coverage data. The typing accuracy varies
|
||||
with the quality of input. It is only intended for research purpose, not for diagnostic.
|
||||
|
|
@ -38,7 +38,7 @@ Examples:
|
|||
sort the output. The BAM may contain single-end or paired-end reads, or a mixture of
|
||||
the two types. Specifying -R stops read group transfer.
|
||||
|
||||
run-bwamem -sao prefix hs38D1.fa old-srt.bam
|
||||
run-bwamem -sao prefix hs38DH.fa old-srt.bam
|
||||
|
||||
Note: the adaptor trimmer included in bwa.kit is chosen because it fits the current
|
||||
mapping pipeline better. It is conservative and suboptimal. A more sophisticated
|
||||
|
|
@ -46,7 +46,7 @@ Examples:
|
|||
|
||||
* Remap name-grouped BAM and mark duplicates:
|
||||
|
||||
run-bwamem -Sdo prefix hs38D1.fa old-unsrt.bam
|
||||
run-bwamem -Sdo prefix hs38DH.fa old-unsrt.bam
|
||||
|
||||
Note: streamed duplicate marking requires all reads from a single paired-end library
|
||||
to be aligned at the same time.
|
||||
|
|
@ -93,7 +93,7 @@ if (defined $opts{o}) {
|
|||
} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
|
||||
$prefix = $1;
|
||||
}
|
||||
die("ERROR: failed to identify the prefix for output. Please specify -p.\n") unless defined($prefix);
|
||||
die("ERROR: failed to identify the prefix for output. Please specify -o.\n") unless defined($prefix);
|
||||
|
||||
my $size = 0;
|
||||
my $comp_ratio = 3.;
|
||||
|
|
@ -167,7 +167,7 @@ $cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.al
|
|||
|
||||
if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
|
||||
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
|
||||
$cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
|
||||
$cmd .= "touch $prefix.hla.HLA-dummy.gt; cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
|
||||
$cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k});
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -6,16 +6,25 @@ url38="ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals
|
|||
url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
|
||||
|
||||
if [ $# -eq 0 ]; then
|
||||
echo "Usage: $0 <hs38|hs38a|hs38D1|hs37|hs37d5>"
|
||||
echo "Usage: $0 <hs38|hs38a|hs38DH|hs37|hs37d5>"
|
||||
echo "Analysis sets:"
|
||||
echo " hs38 primary assembly of GRCh38 (incl. chromosomes, unplaced and unlocalized contigs) and EBV"
|
||||
echo " hs38a hs38 plus ALT contigs"
|
||||
echo " hs38DH hs38a plus decoy contigs and HLA genes (recommended for GRCh38 mapping)"
|
||||
echo " hs37 primary assembly of GRCh37 (used by 1000g phase 1) plus the EBV genome"
|
||||
echo " hs37d5 hs37 plus decoy contigs (used by 1000g phase 3)"
|
||||
echo ""
|
||||
echo "Note: This script downloads human reference genomes. For hs38a and hs38DH, it needs additional"
|
||||
echo " sequences and ALT-to-REF mapping included in the bwa.kit package."
|
||||
exit 1;
|
||||
fi
|
||||
|
||||
if [ $1 == "hs38D1" ]; then
|
||||
(wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38D1-extra.fa) > $1.fa
|
||||
[ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38D1.fa.alt $1.fa.alt
|
||||
if [ $1 == "hs38DH" ]; then
|
||||
(wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38DH-extra.fa) > $1.fa
|
||||
[ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38DH.fa.alt $1.fa.alt
|
||||
elif [ $1 == "hs38a" ]; then
|
||||
wget -O- $url38 | gzip -dc > $1.fa
|
||||
[ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38D1.fa.alt > $1.fa.alt
|
||||
[ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38DH.fa.alt > $1.fa.alt
|
||||
elif [ $1 == "hs38" ]; then
|
||||
wget -O- $url38 | gzip -dc | awk '/^>/{f=/_alt/?0:1}f' > $1.fa
|
||||
elif [ $1 == "hs37d5" ]; then
|
||||
|
|
|
|||
Loading…
Reference in New Issue