smarter prefix finder

2014-11-13 16:16:21 -05:00 · 2014-11-13 16:16:21 -05:00 · 65c637b036
parent 4373a077d5
commit 65c637b036
2 changed files with 51 additions and 89 deletions
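
The prefix inference that this commit adds to `extras/run-bwamem` can be illustrated outside the script. What follows is a minimal Python sketch of the same logic, not the script's own code: the function name `infer_prefix` and its `explicit` parameter are hypothetical, and the suffix list is copied from the single-file regex in the diff. With two input files the prefix is their longest common leading substring; with one file it is the name with a recognized sequence-file suffix stripped.

```python
import os
import re

# Suffix list copied from the single-file regex in extras/run-bwamem.
SEQ_SUFFIX = re.compile(
    r"^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz"
    r"|fa\.gz|fastq\.gz|fq\.gz|bam)$"
)


def infer_prefix(files, explicit=None):
    """Return an output prefix for one or two input files (hypothetical helper)."""
    if explicit is not None:
        # Corresponds to passing -o STR on the command line.
        return explicit
    prefix = None
    if len(files) >= 2:
        # Longest common leading substring, compared character by character.
        prefix = os.path.commonprefix(files[:2]) or None
    else:
        m = SEQ_SUFFIX.match(files[0])
        if m:
            prefix = m.group(1)
    if prefix is None:
        raise ValueError("failed to identify the prefix for output; specify -o")
    return prefix
```

Under these assumptions, `infer_prefix(["read1.fq", "read2.fq"])` yields `read` and `infer_prefix(["sample.fq.gz"])` yields `sample`. Because the comparison is purely character-wise, a pair such as `dir/s_1.fq` and `dir/s_2.fq` yields `dir/s_`, trailing underscore included, so an explicit `-o` may be preferable when file names share a long stem.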
--- a/README-alt.md
+++ b/README-alt.md
@@ -4,24 +4,16 @@ Since version 0.7.11, BWA-MEM supports read mapping against a reference genome
 with long alternative haplotypes present in separate ALT contigs. To use the
 ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the
 SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment
-is available from the [BWA resource bundle for GRCh38][res].
+is available from the [binary package of BWA][res].

-#### Option 1: Mapping to the official GRCh38 with ALT contigs
-
-Construct the index:
 ```sh
-wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
-gzip -d GCA_000001405.15_GRCh38_full_analysis_set.fna.gz
-mv GCA_000001405.15_GRCh38_full_analysis_set.fna hs38a.fa
-bwa index hs38a.fa
-cp bwa-hs38-bundle/hs38d4.fa.alt hs38a.fa.alt
-```
-
-Perform mapping:
-```sh
-bwa mem hs38a.fa read1.fq read2.fq \
-  | bwa-hs38-bundle/k8-linux bwa-postalt.js hs38a.fa.alt \
-  | samtools view -bS - > aln.unsrt.bam
+# Generate the GRCh38+ALT+decoy+HLA reference and create the BWA index
+wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \
+  | bzip2 -dc | tar xf -
+bwa.kit/run-gen-hs38d6       # download GRCh38 and write hs38d6.fa
+bwa.kit/bwa index hs38d6.fa  # create BWA index
+# mapping
+bwa.kit/run-bwamem hs38d6.fa read1.fq read2.fq
 ```

 In the final alignment, a read may be placed on the [primary assembly][grcdef]
@@ -30,26 +22,6 @@ Mapping quality (mapQ) is properly adjusted by the postprocessing script
 `bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For
 details, see the [Methods section](#methods).

-#### Option 2: Mapping to the collection of GRCh38, decoy and HLA genes
-
-Construct the index:
-```sh
-cat hs38a.fa bwa-hs38-bundle/hs38d4-extra.fa > hs38d4.fa
-bwa index hs38d4.fa
-cp bwa-hs38-bundle/hs38d4.fa.alt .
-```
-Perform mapping:
-```sh
-bwa mem hs38d4.fa read1.fq read2.fq \
-  | bwa-hs38-bundle/k8-linux bwa-postalt.js -p postinfo hs38d4.fa.alt \
-  | samtools view -bS - > aln.unsrt.bam
-```
-The benefit of this option is to have a more complete reference sequence and
-to facilitate HLA typing with a 3rd-party tool (see below).
-
-***If you are not interested in the way BWA-MEM performs ALT mapping, you can
-skip the rest of this documentation.***
-
 ## Background

 GRCh38 consists of several components: chromosomal assembly, unlocalized contigs
@@ -60,8 +32,8 @@ definitions from the [GRC website][grcdef].

 GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions.
 However, sequences that are highly diverged from the primary assembly only
-contribute a few million bp. Most subsequences of ALT contigs are highly similar
-or identical to the primary assembly. If we align sequence reads to GRCh38+ALT
+contribute a few million bp. Most subsequences of ALT contigs are nearly
+identical to the primary assembly. If we align sequence reads to GRCh38+ALT
 treating ALT equal to the primary assembly, we will get many reads with zero
 mapping quality and lose variants on them. It is crucial to make the mapper
 aware of ALTs.
@@ -88,8 +60,8 @@ the ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*,
 depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports
 alignments and assigns mapQ following these two rules:

-1. The original mapQ of a non-ALT hit is computed across non-ALT hits only.
-   The reported mapQ of an ALT hit is computed across all hits.
+1. The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of
+   an ALT hit is computed across all hits.

 2. If there are no non-ALT hits, the best ALT hit is outputted as the primary
   alignment. If there are both ALT and non-ALT hits, non-ALT hits will be
@@ -100,7 +72,7 @@ In theory, non-ALT alignments from step 1 should be identical to alignments
 against a reference genome with ALT contigs. In practice, the two types of
 alignments may differ in rare cases due to seeding heuristics. When an ALT hit
 is significantly better than non-ALT hits, BWA-MEM may miss seeds on the
-non-ALT hits. This happens more often for contig mapping.
+non-ALT hits.

 If we don't care about ALT hits, we may skip postprocessing (step 2).
 Nonetheless, postprocessing is recommended as it improves mapQ and gives more
@@ -110,18 +82,12 @@ information about ALT hits.

 Postprocessing is done with a separate script `bwa-postalt.js`. It reads all
 potential hits reported in the XA tag, lifts ALT hits to the chromosomal
-positions using the ALT-to-ref alignment, groups them after lifting and then
-reassigns mapQ based on the best scoring hit in each group with all the hits in
-a group get the same mapQ. Being aware of the ALT-to-ref alignment, this script
-can greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT
-hits.
-
-The script also measures the presence of each ALT contig. For a group of
-overlapping ALT contigs c_1, ..., c_m, the weight for c_k equals `\frac{\sum_j
-P(c_k|r_j)}{\sum_j\max_i P(c_i|r_j)}`, where `P(c_k|r)=\frac{pow(4,s_k)}{\sum_i
-pow(4,s_i)}` is the posterior of c_k given a read r mapped to it with a
-Smith-Waterman score s_k. This weight is reported in `postinfo.ctw` in the
-option 2 above.
+positions using the ALT-to-ref alignment, groups them based on overlaps between
+their lifted positions, and then re-estimates mapQ across the best scoring hit
+in each group. Being aware of the ALT-to-ref alignment, this script can greatly
+improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. It also
+writes each hit overlapping the reported hit into a separate SAM line. This
+enables variant calling on each ALT contig independently of the others.

 ### On the completeness of GRCh38+ALT

@@ -129,10 +95,10 @@ While GRCh38 is much more complete than GRCh37, it is still missing some true
 human sequences. To make sure every piece of sequence in the reference assembly
 is correct, the [Genome Reference Consortium][grc] (GRC) require each ALT contig
 to have enough support from multiple sources before considering to add it to the
-reference assembly. This careful procedure has left out some sequences, one of
-which is [this example][novel], a 10kb contig assembled from CHM1 short
-reads and present also in NA12878. You can try [BLAT][blat] or [BLAST][blast] to
-see where it maps.
+reference assembly. This careful and sophisticated procedure has left out some
+sequences, one of which is [this example][novel], a 10kb contig assembled from
+CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
+[BLAST][blast] to see where it maps.

 For a more complete reference genome, we compiled a new set of decoy sequences
 from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
@@ -146,12 +112,11 @@ not to high resolution for now.

 ### More on HLA typing

-It is [well known][hlalink] that HLA genes are associated with many autoimmune
-diseases as well as some others not directly related to the immune system.
-However, many HLA alleles are highly diverged from the reference genome. If we
-map whole-genome shotgun (WGS) reads to the reference only, many
-allele-informative will get lost. As a result, the vast majority of WGS projects
-have ignored these important genes.
+It is [well known][hlalink] that HLA genes are associated with many autoimmune
+and infectious diseases and with drug responses. However, many HLA alleles are
+highly diverged from the reference genome. If we map whole-genome shotgun (WGS)
+reads to the reference only, many allele-informative reads will get lost. As a
+result, the vast majority of WGS projects have ignored these important genes.

 We recommend to include the genomic regions of classical HLA genes in the BWA
 index. This way we will be able to get a more complete collection of reads
@@ -160,13 +125,6 @@ and type HLA genes with another program, such as [Warren et al (2012)][hla4],
 [Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1]
 or others from [this list][hlatools].

-If the postprocessing script `bwa-postalt.js` is invoked with `-p prefix`, it
-will also write the top three alleles to file `prefix.hla`. However, as most HLA
-alleles from IMGT/HLA don't have intronic sequences and thus are not included in
-the BWA index from option 2, we are unable to type HLA genes to high resolution
-with the BWA-MEM mapping alone. A dedicated tool is recommended for accurate
-typing.
-
 ### Evaluating ALT Mapping

 (Coming soon...)
--- a/extras/run-bwamem
+++ b/extras/run-bwamem
@@ -5,16 +5,16 @@ use warnings;
 use Getopt::Std;

 my %opts = (t=>1, n=>64);
-getopts("SPADsp:R:x:t:", \%opts);
+getopts("SpADso:R:x:t:", \%opts);

 die('
 Usage:   run-bwamem [options] <idxbase> <file1> [file2]

-Options: -p STR    prefix for output files                       [inferred from file1]
+Options: -o STR    prefix for output files                       [inferred from input]
         -R STR    read group header line such as \'@RG\tID:foo\tSM:bar\'         [null]
         -x STR    read type: pacbio, ont2d or intractg                      [default]
         -t INT    number of threads                                               [1]
-         -P        input are paired-end reads if file2 absent
+         -p        input are paired-end reads if file2 absent

         -A        skip HiSeq2000/2500 PE resequencing adapter trimming (via trimadap)
         -D        skip duplicate marking (via samblaster)
@@ -35,18 +35,6 @@ my $exepath = $0 =~/^\S+\/[^\/\s]+/? $0 : &which($0);
 my $root = $0 =~/^(\S+)\/[^\/\s]+/? $1 : undef;
 die "ERROR: failed to locate the 'bwa.kit' directory\n" if !defined($root);

-my $prefix;
-if (defined $opts{p}) {
-	$prefix = $opts{p};
-} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
-	$prefix = $1;
-} else {
-	die("ERROR: failed to identify the prefix for output. Please specify -p.\n")
-}
-
-my $prefix_dir = $prefix =~ /^(\S+)\/[^\/\s]+$/? $1 : ".";
-die("ERROR: directory $prefix_dir is not writable. Please specify a new output prefix with -p.\n") unless (-w $prefix_dir);
-
 die("ERROR: failed to locate the BWA index. Please run '$root/bwa index -p $idx ref.fa'.\n")
  unless (-f "$idx.bwt" && -f "$idx.pac" && -f "$idx.sa" && -f "$idx.ann" && -f "$idx.amb");

@@ -55,21 +43,37 @@ if (@ARGV >= 3 && $ARGV[1] =~ /\.(bam|sam|sam\.gz)$/) {
 	@ARGV = 2;
 }

-if (defined($opts{P}) && @ARGV >= 3) {
+if (defined($opts{p}) && @ARGV >= 3) {
 	warn("WARNING: option -P is ignored as there are two input sequence files.\n");
-	delete $opts{P};
+	delete $opts{p};
 }

+my $prefix;
+if (defined $opts{o}) {
+	$prefix = $opts{o};
+} elsif (@ARGV >= 3) {
+	my $len = length($ARGV[1]) < length($ARGV[2])? length($ARGV[1]) : length($ARGV[2]);
+	my $i;
+	for ($i = 0; $i < $len; ++$i) {
+		last if substr($ARGV[1], $i, 1) ne substr($ARGV[2], $i, 1)
+	}
+	$prefix = substr($ARGV[1], 0, $i) if $i > 0;
+} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
+	$prefix = $1;
+}
+die("ERROR: failed to identify the prefix for output. Please specify -o.\n") unless defined($prefix);
+
 my $size = 0;
 my $comp_ratio = 3.;
 for my $f (@ARGV[1..$#ARGV]) {
 	my @a = stat($f);
 	my $s = $a[7];
+	die("ERROR: failed to read file $f\n") if !defined($s);
 	$s *= $comp_ratio if $f =~ /\.(gz|bam)$/;
 	$size += int($s) + 1;
 }

-my $is_pe = (defined($opts{P}) || @ARGV >= 3)? 1 : 0;
+my $is_pe = (defined($opts{p}) || @ARGV >= 3)? 1 : 0;
 my $is_sam = $ARGV[1] =~ /\.(sam|sam\.gz)$/? 1 : 0;
 my $is_bam = $ARGV[1] =~ /\.bam$/? 1 : 0;

@@ -79,7 +83,7 @@ if ($is_sam || $is_bam) {
 	my $ntmps = int($size / 4e9) + 1;
 	my $cmd_shuf = ($is_bam || $is_sam) && !defined($opts{S})? "  | $root/htsbox bamshuf -uOn$ntmps - $prefix.shuf \\\n" : "";
 	my $cmd_bam2fq = "";
-	$cmd_bam2fq = "  | $root/htsbox bam2fq -O " . (defined($opts{P})? "-s $prefix.out.se.fq.gz " : "") . "- \\\n";
+	$cmd_bam2fq = "  | $root/htsbox bam2fq -O " . (defined($opts{p})? "-s $prefix.out.se.fq.gz " : "") . "- \\\n";
 	$cmd = $cmd_sam2bam . $cmd_shuf . $cmd_bam2fq;
 } elsif (@ARGV >= 3) {
 	$cmd = "  | $root/seqtk mergepe $ARGV[1] $ARGV[2] \\\n";