diff --git a/README-alt.md b/README-alt.md index 08da754..4f72042 100644 --- a/README-alt.md +++ b/README-alt.md @@ -5,10 +5,10 @@ wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \ | gzip -dc | tar xf - # Generate the GRCh38+ALT+decoy+HLA and create the BWA index -bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa -bwa.kit/bwa index hs38D1.fa # create BWA index +bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa +bwa.kit/bwa index hs38DH.fa # create BWA index # mapping -bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh # skip "|sh" to show command lines +bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh # skip "|sh" to show command lines ``` This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA @@ -94,11 +94,11 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or For a more complete reference genome, we compiled a new set of decoy sequences from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples. -The sequences are included in `hs38D1-extra.fa` from the [BWA binary +The sequences are included in `hs38DH-extra.fa` from the [BWA binary package][res]. In addition to decoy, we also put multiple alleles of HLA genes in -`hs38D1-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb], +`hs38DH-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb], version 3.18.0 and are used to collect reads sequenced from these genes. ### HLA typing @@ -125,26 +125,26 @@ most of them are distributed under restrictive licenses. To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878 unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy -(hs38D1), and called small variants from the alignment. CHM1 is haploid. +(hs38DH), and called small variants from the alignment. CHM1 is haploid. Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The true positive (TP) heterozygous calls from NA12878 are approximately equal to the difference between NA12878 and CHM1 heterozygous calls. A better assembly should yield higher TP and lower FP. The following table shows the numbers for these assemblies: -|Assembly|hs37 |hs38 |hs38D1|CHM1_1.1| huref| +|Assembly|hs37 |hs38 |hs38DH|CHM1_1.1| huref| |:------:|------:|------:|------:|------:|------:| |FP | 255706| 168068| 142516|307172 | 575634| |TP |2142260|2163113|2150844|2167235|2137053| -With this measurement, hs38 is clearly better than hs37. Genome hs38D1 reduces +With this measurement, hs38 is clearly better than hs37. Genome hs38DH reduces FP by ~25k but also reduces TP by ~12k. We manually inspected variants called from hs38 only and found the majority of them are associated with excessive read depth, clustered variants or weak alignment. We believe most hs38-only calls are problematic. In addition, if we compare two NA12878 replicates from HiSeq X10 with nearly identical library construction, the difference is ~140k, an order -of magnitude higher than the difference between hs38 and hs38D1. ALT contigs, -decoy and HLA genes in hs38D1 improve variant calling and enable the analyses of +of magnitude higher than the difference between hs38 and hs38DH. ALT contigs, +decoy and HLA genes in hs38DH improve variant calling and enable the analyses of ALT contigs and HLA typing at little cost. ## Problems and Future Development diff --git a/bwakit/README.md b/bwakit/README.md index 5c1490a..b3a4033 100644 --- a/bwakit/README.md +++ b/bwakit/README.md @@ -21,10 +21,10 @@ how to use bwakit: wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \ | gzip -dc | tar xf - # Generate the GRCh38+ALT+decoy+HLA and create the BWA index -bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa -bwa.kit/bwa index hs38D1.fa # create BWA index +bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa +bwa.kit/bwa index hs38DH.fa # create BWA index # mapping -bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh +bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh ``` The last mapping command line will generate the following files: @@ -63,9 +63,9 @@ Packaging is done manually for now. meets our needs. * Bwakit can be memory demanding depends on the functionality invoked. For 30X - human data, bwa-mem takes about 6GB RAM, samblaster uses close to 10GB and BAM - shuffling (if the input is sorted BAM) uses several GB. In the current - setting, sorting uses about 10GB. + human data, bwa-mem takes about 11GB RAM with 32 threads, samblaster uses + close to 10GB and BAM shuffling (if the input is sorted BAM) uses several GB. + In the current setting, sorting uses about 10GB. ## Package Contents @@ -84,8 +84,8 @@ bwa.kit | |-- run-gen-ref *Entry script* for generating human reference genomes. |-- resource-GRCh38 Resources for generating GRCh38 -| |-- hs38D1-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref. -| `-- hs38D1.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref. +| |-- hs38DH-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref. +| `-- hs38DH.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref. | |-- run-HLA HLA typing for sequences extracted by bwa-postalt.js. |-- typeHLA.sh Type one HLA-gene. Called by run-HLA. diff --git a/bwakit/run-bwamem b/bwakit/run-bwamem index d931067..165f93e 100755 --- a/bwakit/run-bwamem +++ b/bwakit/run-bwamem @@ -29,7 +29,7 @@ Examples: * Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing: - run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38D1.fa read1.fq.gz read2.fq.gz + run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38DH.fa read1.fq.gz read2.fq.gz Note: HLA typing is only effective for high-coverage data. The typing accuracy varies with the quality of input. It is only intended for research purpose, not for diagnostic. @@ -38,7 +38,7 @@ Examples: sort the output. The BAM may contain single-end or paired-end reads, or a mixture of the two types. Specifying -R stops read group transfer. - run-bwamem -sao prefix hs38D1.fa old-srt.bam + run-bwamem -sao prefix hs38DH.fa old-srt.bam Note: the adaptor trimmer included in bwa.kit is chosen because it fits the current mapping pipeline better. It is conservative and suboptimal. A more sophisticated @@ -46,7 +46,7 @@ Examples: * Remap name-grouped BAM and mark duplicates: - run-bwamem -Sdo prefix hs38D1.fa old-unsrt.bam + run-bwamem -Sdo prefix hs38DH.fa old-unsrt.bam Note: streamed duplicate marking requires all reads from a single paired-end library to be aligned at the same time. @@ -93,7 +93,7 @@ if (defined $opts{o}) { } elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) { $prefix = $1; } -die("ERROR: failed to identify the prefix for output. Please specify -p.\n") unless defined($prefix); +die("ERROR: failed to identify the prefix for output. Please specify -o.\n") unless defined($prefix); my $size = 0; my $comp_ratio = 3.; @@ -167,7 +167,7 @@ $cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.al if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) { $cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n"; - $cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n"; + $cmd .= "touch $prefix.hla.HLA-dummy.gt; cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n"; $cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k}); } diff --git a/bwakit/run-gen-ref b/bwakit/run-gen-ref index 8a6baa0..e129b62 100755 --- a/bwakit/run-gen-ref +++ b/bwakit/run-gen-ref @@ -6,16 +6,25 @@ url38="ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz" if [ $# -eq 0 ]; then - echo "Usage: $0 " + echo "Usage: $0 " + echo "Analysis sets:" + echo " hs38 primary assembly of GRCh38 (incl. chromosomes, unplaced and unlocalized contigs) and EBV" + echo " hs38a hs38 plus ALT contigs" + echo " hs38DH hs38a plus decoy contigs and HLA genes (recommended for GRCh38 mapping)" + echo " hs37 primary assembly of GRCh37 (used by 1000g phase 1) plus the EBV genome" + echo " hs37d5 hs37 plus decoy contigs (used by 1000g phase 3)" + echo "" + echo "Note: This script downloads human reference genomes. For hs38a and hs38DH, it needs additional" + echo " sequences and ALT-to-REF mapping included in the bwa.kit package." exit 1; fi -if [ $1 == "hs38D1" ]; then - (wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38D1-extra.fa) > $1.fa - [ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38D1.fa.alt $1.fa.alt +if [ $1 == "hs38DH" ]; then + (wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38DH-extra.fa) > $1.fa + [ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38DH.fa.alt $1.fa.alt elif [ $1 == "hs38a" ]; then wget -O- $url38 | gzip -dc > $1.fa - [ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38D1.fa.alt > $1.fa.alt + [ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38DH.fa.alt > $1.fa.alt elif [ $1 == "hs38" ]; then wget -O- $url38 | gzip -dc | awk '/^>/{f=/_alt/?0:1}f' > $1.fa elif [ $1 == "hs37d5" ]; then