smarter prefix finder

This commit is contained in:
Heng Li 2014-11-13 16:16:21 -05:00
parent 4373a077d5
commit 65c637b036
2 changed files with 51 additions and 89 deletions

View File

@ -4,24 +4,16 @@ Since version 0.7.11, BWA-MEM supports read mapping against a reference genome
with long alternative haplotypes present in separate ALT contigs. To use the
ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the
SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment
is available from the [BWA resource bundle for GRCh38][res].
is available from the [binary package of BWA][res].
#### Option 1: Mapping to the official GRCh38 with ALT contigs
Construct the index:
```sh
wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
gzip -d GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
mv GCA_000001405.15_GRCh38_full_analysis_set.fna hs38a.fa
bwa index hs38a.fa
cp bwa-hs38-bundle/hs38d4.fa.alt hs38a.fa.alt
```
Perform mapping:
```sh
bwa mem hs38a.fa read1.fq read2.fq \
| bwa-hs38-bundle/k8-linux bwa-postalt.js hs38a.fa.alt \
| samtools view -bS - > aln.unsrt.bam
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \
| gzip -dc | tar xf -
bwa.kit/run-gen-hs38d6 # download GRCh38 and write hs38d6.fa
bwa.kit/bwa index hs38d6.fa # create BWA index
# mapping
bwa.kit/run-bwamem hs38d6.fa read1.fq read2.fq
```
In the final alignment, a read may be placed on the [primary assembly][grcdef]
@ -30,26 +22,6 @@ Mapping quality (mapQ) is properly adjusted by the postprocessing script
`bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For
details, see the [Methods section](#methods).
#### Option 2: Mapping to the collection of GRCh38, decoy and HLA genes
Construct the index:
```sh
cat hs38a.fa bwa-hs38-bundle/hs38d4-extra.fa > hs38d4.fa
bwa index hs38d4.fa
cp bwa-hs38-bundle/hs38d4.fa.alt .
```
Perform mapping:
```sh
bwa mem hs38d4.fa read1.fq read2.fq \
| bwa-hs38-bundle/k8-linux bwa-postalt.js -p postinfo hs38d4.fa.alt \
| samtools view -bS - > aln.unsrt.bam
```
The benefit of this option is to have a more complete reference sequence and
to facilitate HLA typing with a 3rd-party tool (see below).
***If you are not interested in the way BWA-MEM performs ALT mapping, you can
skip the rest of this documentation.***
## Background
GRCh38 consists of several components: chromosomal assembly, unlocalized contigs
@ -60,8 +32,8 @@ definitions from the [GRC website][grcdef].
GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions.
However, sequences that are highly diverged from the primary assembly only
contribute a few million bp. Most subsequences of ALT contigs are highly similar
or identical to the primary assembly. If we align sequence reads to GRCh38+ALT
contribute a few million bp. Most subsequences of ALT contigs are nearly
identical to the primary assembly. If we align sequence reads to GRCh38+ALT
treating ALT equal to the primary assembly, we will get many reads with zero
mapping quality and lose variants on them. It is crucial to make the mapper
aware of ALTs.
@ -88,8 +60,8 @@ the ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*,
depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports
alignments and assigns mapQ following these two rules:
1. The original mapQ of a non-ALT hit is computed across non-ALT hits only.
The reported mapQ of an ALT hit is computed across all hits.
1. The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of
an ALT hit is computed across all hits.
2. If there are no non-ALT hits, the best ALT hit is outputted as the primary
alignment. If there are both ALT and non-ALT hits, non-ALT hits will be
@ -100,7 +72,7 @@ In theory, non-ALT alignments from step 1 should be identical to alignments
against a reference genome with ALT contigs. In practice, the two types of
alignments may differ in rare cases due to seeding heuristics. When an ALT hit
is significantly better than non-ALT hits, BWA-MEM may miss seeds on the
non-ALT hits. This happens more often for contig mapping.
non-ALT hits.
If we don't care about ALT hits, we may skip postprocessing (step 2).
Nonetheless, postprocessing is recommended as it improves mapQ and gives more
@ -110,18 +82,12 @@ information about ALT hits.
Postprocessing is done with a separate script `bwa-postalt.js`. It reads all
potential hits reported in the XA tag, lifts ALT hits to the chromosomal
positions using the ALT-to-ref alignment, groups them after lifting and then
reassigns mapQ based on the best scoring hit in each group with all the hits in
a group get the same mapQ. Being aware of the ALT-to-ref alignment, this script
can greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT
hits.
The script also measures the presence of each ALT contig. For a group of
overlapping ALT contigs c_1, ..., c_m, the weight for c_k equals `\frac{\sum_j
P(c_k|r_j)}{\sum_j\max_i P(c_i|r_j)}`, where `P(c_k|r)=\frac{pow(4,s_k)}{\sum_i
pow(4,s_i)}` is the posterior of c_k given a read r mapped to it with a
Smith-Waterman score s_k. This weight is reported in `postinfo.ctw` in the
option 2 above.
positions using the ALT-to-ref alignment, groups them based on overlaps between
their lifted positions, and then re-estimates mapQ across the best scoring hit
in each group. Being aware of the ALT-to-ref alignment, this script can greatly
improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. It also
writes each hit overlapping the reported hit into a separate SAM line. This
enables variant calling on each ALT contig independent of others.
### On the completeness of GRCh38+ALT
@ -129,10 +95,10 @@ While GRCh38 is much more complete than GRCh37, it is still missing some true
human sequences. To make sure every piece of sequence in the reference assembly
is correct, the [Genome Reference Consortium][grc] (GRC) require each ALT contig
to have enough support from multiple sources before considering to add it to the
reference assembly. This careful procedure has left out some sequences, one of
which is [this example][novel], a 10kb contig assembled from CHM1 short
reads and present also in NA12878. You can try [BLAT][blat] or [BLAST][blast] to
see where it maps.
reference assembly. This careful and sophisticated procedure has left out some
sequences, one of which is [this example][novel], a 10kb contig assembled from
CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
[BLAST][blast] to see where it maps.
For a more complete reference genome, we compiled a new set of decoy sequences
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
@ -146,12 +112,11 @@ not to high resolution for now.
### More on HLA typing
It is [well known][hlalink] that HLA genes are associated with many autoimmune
diseases as well as some others not directly related to the immune system.
However, many HLA alleles are highly diverged from the reference genome. If we
map whole-genome shotgun (WGS) reads to the reference only, many
allele-informative will get lost. As a result, the vast majority of WGS projects
have ignored these important genes.
It is [well known][hlalink] that HLA genes are associated with many autoimmunity
infectious diseases and drug responses. However, many HLA alleles are highly
diverged from the reference genome. If we map whole-genome shotgun (WGS) reads
to the reference only, many allele-informative will get lost. As a result, the
vast majority of WGS projects have ignored these important genes.
We recommend to include the genomic regions of classical HLA genes in the BWA
index. This way we will be able to get a more complete collection of reads
@ -160,13 +125,6 @@ and type HLA genes with another program, such as [Warren et al (2012)][hla4],
[Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1]
or others from [this list][hlatools].
If the postprocessing script `bwa-postalt.js` is invoked with `-p prefix`, it
will also write the top three alleles to file `prefix.hla`. However, as most HLA
alleles from IMGT/HLA don't have intronic sequences and thus are not included in
the BWA index from option 2, we are unable to type HLA genes to high resolution
with the BWA-MEM mapping alone. A dedicated tool is recommended for accurate
typing.
### Evaluating ALT Mapping
(Coming soon...)

View File

@ -5,16 +5,16 @@ use warnings;
use Getopt::Std;
my %opts = (t=>1, n=>64);
getopts("SPADsp:R:x:t:", \%opts);
getopts("SpADso:R:x:t:", \%opts);
die('
Usage: run-bwamem [options] <idxbase> <file1> [file2]
Options: -p STR prefix for output files [inferred from file1]
Options: -o STR prefix for output files [inferred from input]
-R STR read group header line such as \'@RG\tID:foo\tSM:bar\' [null]
-x STR read type: pacbio, ont2d or intractg [default]
-t INT number of threads [1]
-P input are paired-end reads if file2 absent
-p input are paired-end reads if file2 absent
-A skip HiSeq2000/2500 PE resequencing adapter trimming (via trimadap)
-D skip duplicate marking (via samblaster)
@ -35,18 +35,6 @@ my $exepath = $0 =~/^\S+\/[^\/\s]+/? $0 : &which($0);
my $root = $0 =~/^(\S+)\/[^\/\s]+/? $1 : undef;
die "ERROR: failed to locate the 'bwa.kit' directory\n" if !defined($root);
my $prefix;
if (defined $opts{p}) {
$prefix = $opts{p};
} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
$prefix = $1;
} else {
die("ERROR: failed to identify the prefix for output. Please specify -p.\n")
}
my $prefix_dir = $prefix =~ /^(\S+)\/[^\/\s]+$/? $1 : ".";
die("ERROR: directory $prefix_dir is not writable. Please specify a new output prefix with -p.\n") unless (-w $prefix_dir);
die("ERROR: failed to locate the BWA index. Please run '$root/bwa index -p $idx ref.fa'.\n")
unless (-f "$idx.bwt" && -f "$idx.pac" && -f "$idx.sa" && -f "$idx.ann" && -f "$idx.amb");
@ -55,21 +43,37 @@ if (@ARGV >= 3 && $ARGV[1] =~ /\.(bam|sam|sam\.gz)$/) {
@ARGV = 2;
}
if (defined($opts{P}) && @ARGV >= 3) {
if (defined($opts{p}) && @ARGV >= 3) {
warn("WARNING: option -P is ignored as there are two input sequence files.\n");
delete $opts{P};
delete $opts{p};
}
my $prefix;
if (defined $opts{o}) {
$prefix = $opts{o};
} elsif (@ARGV >= 3) {
my $len = length($ARGV[1]) < length($ARGV[2])? length($ARGV[1]) : length($ARGV[2]);
my $i;
for ($i = 0; $i < $len; ++$i) {
last if substr($ARGV[1], $i, 1) ne substr($ARGV[2], $i, 1)
}
$prefix = substr($ARGV[1], 0, $i) if $i > 0;
} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
$prefix = $1;
}
die("ERROR: failed to identify the prefix for output. Please specify -p.\n") unless defined($prefix);
my $size = 0;
my $comp_ratio = 3.;
for my $f (@ARGV[1..$#ARGV]) {
my @a = stat($f);
my $s = $a[7];
die("ERROR: failed to read file $f\n") if !defined($s);
$s *= $comp_ratio if $f =~ /\.(gz|bam)$/;
$size += int($s) + 1;
}
my $is_pe = (defined($opts{P}) || @ARGV >= 3)? 1 : 0;
my $is_pe = (defined($opts{p}) || @ARGV >= 3)? 1 : 0;
my $is_sam = $ARGV[1] =~ /\.(sam|sam\.gz)$/? 1 : 0;
my $is_bam = $ARGV[1] =~ /\.bam$/? 1 : 0;
@ -79,7 +83,7 @@ if ($is_sam || $is_bam) {
my $ntmps = int($size / 4e9) + 1;
my $cmd_shuf = ($is_bam || $is_sam) && !defined($opts{S})? " | $root/htsbox bamshuf -uOn$ntmps - $prefix.shuf \\\n" : "";
my $cmd_bam2fq = "";
$cmd_bam2fq = " | $root/htsbox bam2fq -O " . (defined($opts{P})? "-s $prefix.out.se.fq.gz " : "") . "- \\\n";
$cmd_bam2fq = " | $root/htsbox bam2fq -O " . (defined($opts{p})? "-s $prefix.out.se.fq.gz " : "") . "- \\\n";
$cmd = $cmd_sam2bam . $cmd_shuf . $cmd_bam2fq;
} elsif (@ARGV >= 3) {
$cmd = " | $root/seqtk mergepe $ARGV[1] $ARGV[2] \\\n";