smarter prefix finder

This commit is contained in:
Heng Li 2014-11-13 16:16:21 -05:00
parent 4373a077d5
commit 65c637b036
2 changed files with 51 additions and 89 deletions

View File

@ -4,24 +4,16 @@ Since version 0.7.11, BWA-MEM supports read mapping against a reference genome
with long alternative haplotypes present in separate ALT contigs. To use the with long alternative haplotypes present in separate ALT contigs. To use the
ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the
SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment
is available from the [BWA resource bundle for GRCh38][res]. is available from the [binary package of BWA][res].
#### Option 1: Mapping to the official GRCh38 with ALT contigs
Construct the index:
```sh ```sh
wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz # Generate the GRCh38+ALT+decoy+HLA and create the BWA index
gzip -d GCA_000001405.15_GRCh38_full_analysis_set.fna.gz wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \
mv GCA_000001405.15_GRCh38_full_analysis_set.fna hs38a.fa | gzip -dc | tar xf -
bwa index hs38a.fa bwa.kit/run-gen-hs38d6 # download GRCh38 and write hs38d6.fa
cp bwa-hs38-bundle/hs38d4.fa.alt hs38a.fa.alt bwa.kit/bwa index hs38d6.fa # create BWA index
``` # mapping
bwa.kit/run-bwamem hs38d6.fa read1.fq read2.fq
Perform mapping:
```sh
bwa mem hs38a.fa read1.fq read2.fq \
| bwa-hs38-bundle/k8-linux bwa-postalt.js hs38a.fa.alt \
| samtools view -bS - > aln.unsrt.bam
``` ```
In the final alignment, a read may be placed on the [primary assembly][grcdef] In the final alignment, a read may be placed on the [primary assembly][grcdef]
@ -30,26 +22,6 @@ Mapping quality (mapQ) is properly adjusted by the postprocessing script
`bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For `bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For
details, see the [Methods section](#methods). details, see the [Methods section](#methods).
#### Option 2: Mapping to the collection of GRCh38, decoy and HLA genes
Construct the index:
```sh
cat hs38a.fa bwa-hs38-bundle/hs38d4-extra.fa > hs38d4.fa
bwa index hs38d4.fa
cp bwa-hs38-bundle/hs38d4.fa.alt .
```
Perform mapping:
```sh
bwa mem hs38d4.fa read1.fq read2.fq \
| bwa-hs38-bundle/k8-linux bwa-postalt.js -p postinfo hs38d4.fa.alt \
| samtools view -bS - > aln.unsrt.bam
```
The benefit of this option is to have a more complete reference sequence and
to facilitate HLA typing with a 3rd-party tool (see below).
***If you are not interested in the way BWA-MEM performs ALT mapping, you can
skip the rest of this documentation.***
## Background ## Background
GRCh38 consists of several components: chromosomal assembly, unlocalized contigs GRCh38 consists of several components: chromosomal assembly, unlocalized contigs
@ -60,8 +32,8 @@ definitions from the [GRC website][grcdef].
GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions. GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions.
However, sequences that are highly diverged from the primary assembly only However, sequences that are highly diverged from the primary assembly only
contribute a few million bp. Most subsequences of ALT contigs are highly similar contribute a few million bp. Most subsequences of ALT contigs are nearly
or identical to the primary assembly. If we align sequence reads to GRCh38+ALT identical to the primary assembly. If we align sequence reads to GRCh38+ALT
treating ALT equal to the primary assembly, we will get many reads with zero treating ALT equal to the primary assembly, we will get many reads with zero
mapping quality and lose variants on them. It is crucial to make the mapper mapping quality and lose variants on them. It is crucial to make the mapper
aware of ALTs. aware of ALTs.
@ -88,8 +60,8 @@ the ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*,
depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports
alignments and assigns mapQ following these two rules: alignments and assigns mapQ following these two rules:
1. The original mapQ of a non-ALT hit is computed across non-ALT hits only. 1. The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of
The reported mapQ of an ALT hit is computed across all hits. an ALT hit is computed across all hits.
2. If there are no non-ALT hits, the best ALT hit is outputted as the primary 2. If there are no non-ALT hits, the best ALT hit is outputted as the primary
alignment. If there are both ALT and non-ALT hits, non-ALT hits will be alignment. If there are both ALT and non-ALT hits, non-ALT hits will be
@ -100,7 +72,7 @@ In theory, non-ALT alignments from step 1 should be identical to alignments
against a reference genome with ALT contigs. In practice, the two types of against a reference genome with ALT contigs. In practice, the two types of
alignments may differ in rare cases due to seeding heuristics. When an ALT hit alignments may differ in rare cases due to seeding heuristics. When an ALT hit
is significantly better than non-ALT hits, BWA-MEM may miss seeds on the is significantly better than non-ALT hits, BWA-MEM may miss seeds on the
non-ALT hits. This happens more often for contig mapping. non-ALT hits.
If we don't care about ALT hits, we may skip postprocessing (step 2). If we don't care about ALT hits, we may skip postprocessing (step 2).
Nonetheless, postprocessing is recommended as it improves mapQ and gives more Nonetheless, postprocessing is recommended as it improves mapQ and gives more
@ -110,18 +82,12 @@ information about ALT hits.
Postprocessing is done with a separate script `bwa-postalt.js`. It reads all Postprocessing is done with a separate script `bwa-postalt.js`. It reads all
potential hits reported in the XA tag, lifts ALT hits to the chromosomal potential hits reported in the XA tag, lifts ALT hits to the chromosomal
positions using the ALT-to-ref alignment, groups them after lifting and then positions using the ALT-to-ref alignment, groups them based on overlaps between
reassigns mapQ based on the best scoring hit in each group with all the hits in their lifted positions, and then re-estimates mapQ across the best scoring hit
a group get the same mapQ. Being aware of the ALT-to-ref alignment, this script in each group. Being aware of the ALT-to-ref alignment, this script can greatly
can greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. It also
hits. writes each hit overlapping the reported hit into a separate SAM line. This
enables variant calling on each ALT contig independent of others.
The script also measures the presence of each ALT contig. For a group of
overlapping ALT contigs c_1, ..., c_m, the weight for c_k equals `\frac{\sum_j
P(c_k|r_j)}{\sum_j\max_i P(c_i|r_j)}`, where `P(c_k|r)=\frac{pow(4,s_k)}{\sum_i
pow(4,s_i)}` is the posterior of c_k given a read r mapped to it with a
Smith-Waterman score s_k. This weight is reported in `postinfo.ctw` in the
option 2 above.
### On the completeness of GRCh38+ALT ### On the completeness of GRCh38+ALT
@ -129,10 +95,10 @@ While GRCh38 is much more complete than GRCh37, it is still missing some true
human sequences. To make sure every piece of sequence in the reference assembly human sequences. To make sure every piece of sequence in the reference assembly
is correct, the [Genome Reference Consortium][grc] (GRC) require each ALT contig is correct, the [Genome Reference Consortium][grc] (GRC) require each ALT contig
to have enough support from multiple sources before considering to add it to the to have enough support from multiple sources before considering to add it to the
reference assembly. This careful procedure has left out some sequences, one of reference assembly. This careful and sophisticated procedure has left out some
which is [this example][novel], a 10kb contig assembled from CHM1 short sequences, one of which is [this example][novel], a 10kb contig assembled from
reads and present also in NA12878. You can try [BLAT][blat] or [BLAST][blast] to CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
see where it maps. [BLAST][blast] to see where it maps.
For a more complete reference genome, we compiled a new set of decoy sequences For a more complete reference genome, we compiled a new set of decoy sequences
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples. from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
@ -146,12 +112,11 @@ not to high resolution for now.
### More on HLA typing ### More on HLA typing
It is [well known][hlalink] that HLA genes are associated with many autoimmune It is [well known][hlalink] that HLA genes are associated with many autoimmunity
diseases as well as some others not directly related to the immune system. infectious diseases and drug responses. However, many HLA alleles are highly
However, many HLA alleles are highly diverged from the reference genome. If we diverged from the reference genome. If we map whole-genome shotgun (WGS) reads
map whole-genome shotgun (WGS) reads to the reference only, many to the reference only, many allele-informative will get lost. As a result, the
allele-informative will get lost. As a result, the vast majority of WGS projects vast majority of WGS projects have ignored these important genes.
have ignored these important genes.
We recommend to include the genomic regions of classical HLA genes in the BWA We recommend to include the genomic regions of classical HLA genes in the BWA
index. This way we will be able to get a more complete collection of reads index. This way we will be able to get a more complete collection of reads
@ -160,13 +125,6 @@ and type HLA genes with another program, such as [Warren et al (2012)][hla4],
[Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1] [Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1]
or others from [this list][hlatools]. or others from [this list][hlatools].
If the postprocessing script `bwa-postalt.js` is invoked with `-p prefix`, it
will also write the top three alleles to file `prefix.hla`. However, as most HLA
alleles from IMGT/HLA don't have intronic sequences and thus are not included in
the BWA index from option 2, we are unable to type HLA genes to high resolution
with the BWA-MEM mapping alone. A dedicated tool is recommended for accurate
typing.
### Evaluating ALT Mapping ### Evaluating ALT Mapping
(Coming soon...) (Coming soon...)

View File

@ -5,16 +5,16 @@ use warnings;
use Getopt::Std; use Getopt::Std;
my %opts = (t=>1, n=>64); my %opts = (t=>1, n=>64);
getopts("SPADsp:R:x:t:", \%opts); getopts("SpADso:R:x:t:", \%opts);
die(' die('
Usage: run-bwamem [options] <idxbase> <file1> [file2] Usage: run-bwamem [options] <idxbase> <file1> [file2]
Options: -p STR prefix for output files [inferred from file1] Options: -o STR prefix for output files [inferred from input]
-R STR read group header line such as \'@RG\tID:foo\tSM:bar\' [null] -R STR read group header line such as \'@RG\tID:foo\tSM:bar\' [null]
-x STR read type: pacbio, ont2d or intractg [default] -x STR read type: pacbio, ont2d or intractg [default]
-t INT number of threads [1] -t INT number of threads [1]
-P input are paired-end reads if file2 absent -p input are paired-end reads if file2 absent
-A skip HiSeq2000/2500 PE resequencing adapter trimming (via trimadap) -A skip HiSeq2000/2500 PE resequencing adapter trimming (via trimadap)
-D skip duplicate marking (via samblaster) -D skip duplicate marking (via samblaster)
@ -35,18 +35,6 @@ my $exepath = $0 =~/^\S+\/[^\/\s]+/? $0 : &which($0);
my $root = $0 =~/^(\S+)\/[^\/\s]+/? $1 : undef; my $root = $0 =~/^(\S+)\/[^\/\s]+/? $1 : undef;
die "ERROR: failed to locate the 'bwa.kit' directory\n" if !defined($root); die "ERROR: failed to locate the 'bwa.kit' directory\n" if !defined($root);
my $prefix;
if (defined $opts{p}) {
$prefix = $opts{p};
} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
$prefix = $1;
} else {
die("ERROR: failed to identify the prefix for output. Please specify -p.\n")
}
my $prefix_dir = $prefix =~ /^(\S+)\/[^\/\s]+$/? $1 : ".";
die("ERROR: directory $prefix_dir is not writable. Please specify a new output prefix with -p.\n") unless (-w $prefix_dir);
die("ERROR: failed to locate the BWA index. Please run '$root/bwa index -p $idx ref.fa'.\n") die("ERROR: failed to locate the BWA index. Please run '$root/bwa index -p $idx ref.fa'.\n")
unless (-f "$idx.bwt" && -f "$idx.pac" && -f "$idx.sa" && -f "$idx.ann" && -f "$idx.amb"); unless (-f "$idx.bwt" && -f "$idx.pac" && -f "$idx.sa" && -f "$idx.ann" && -f "$idx.amb");
@ -55,21 +43,37 @@ if (@ARGV >= 3 && $ARGV[1] =~ /\.(bam|sam|sam\.gz)$/) {
@ARGV = 2; @ARGV = 2;
} }
if (defined($opts{P}) && @ARGV >= 3) { if (defined($opts{p}) && @ARGV >= 3) {
warn("WARNING: option -P is ignored as there are two input sequence files.\n"); warn("WARNING: option -P is ignored as there are two input sequence files.\n");
delete $opts{P}; delete $opts{p};
} }
my $prefix;
if (defined $opts{o}) {
$prefix = $opts{o};
} elsif (@ARGV >= 3) {
my $len = length($ARGV[1]) < length($ARGV[2])? length($ARGV[1]) : length($ARGV[2]);
my $i;
for ($i = 0; $i < $len; ++$i) {
last if substr($ARGV[1], $i, 1) ne substr($ARGV[2], $i, 1)
}
$prefix = substr($ARGV[1], 0, $i) if $i > 0;
} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
$prefix = $1;
}
die("ERROR: failed to identify the prefix for output. Please specify -p.\n") unless defined($prefix);
my $size = 0; my $size = 0;
my $comp_ratio = 3.; my $comp_ratio = 3.;
for my $f (@ARGV[1..$#ARGV]) { for my $f (@ARGV[1..$#ARGV]) {
my @a = stat($f); my @a = stat($f);
my $s = $a[7]; my $s = $a[7];
die("ERROR: failed to read file $f\n") if !defined($s);
$s *= $comp_ratio if $f =~ /\.(gz|bam)$/; $s *= $comp_ratio if $f =~ /\.(gz|bam)$/;
$size += int($s) + 1; $size += int($s) + 1;
} }
my $is_pe = (defined($opts{P}) || @ARGV >= 3)? 1 : 0; my $is_pe = (defined($opts{p}) || @ARGV >= 3)? 1 : 0;
my $is_sam = $ARGV[1] =~ /\.(sam|sam\.gz)$/? 1 : 0; my $is_sam = $ARGV[1] =~ /\.(sam|sam\.gz)$/? 1 : 0;
my $is_bam = $ARGV[1] =~ /\.bam$/? 1 : 0; my $is_bam = $ARGV[1] =~ /\.bam$/? 1 : 0;
@ -79,7 +83,7 @@ if ($is_sam || $is_bam) {
my $ntmps = int($size / 4e9) + 1; my $ntmps = int($size / 4e9) + 1;
my $cmd_shuf = ($is_bam || $is_sam) && !defined($opts{S})? " | $root/htsbox bamshuf -uOn$ntmps - $prefix.shuf \\\n" : ""; my $cmd_shuf = ($is_bam || $is_sam) && !defined($opts{S})? " | $root/htsbox bamshuf -uOn$ntmps - $prefix.shuf \\\n" : "";
my $cmd_bam2fq = ""; my $cmd_bam2fq = "";
$cmd_bam2fq = " | $root/htsbox bam2fq -O " . (defined($opts{P})? "-s $prefix.out.se.fq.gz " : "") . "- \\\n"; $cmd_bam2fq = " | $root/htsbox bam2fq -O " . (defined($opts{p})? "-s $prefix.out.se.fq.gz " : "") . "- \\\n";
$cmd = $cmd_sam2bam . $cmd_shuf . $cmd_bam2fq; $cmd = $cmd_sam2bam . $cmd_shuf . $cmd_bam2fq;
} elsif (@ARGV >= 3) { } elsif (@ARGV >= 3) {
$cmd = " | $root/seqtk mergepe $ARGV[1] $ARGV[2] \\\n"; $cmd = " | $root/seqtk mergepe $ARGV[1] $ARGV[2] \\\n";