From fab1f2b561ef67670116010236013d945ef343ba Mon Sep 17 00:00:00 2001 From: Heng Li Date: Thu, 16 Oct 2014 17:15:39 -0400 Subject: [PATCH] README on ALT mapping --- README-alt.md | 123 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 123 insertions(+) create mode 100644 README-alt.md diff --git a/README-alt.md b/README-alt.md new file mode 100644 index 0000000..7a578f9 --- /dev/null +++ b/README-alt.md @@ -0,0 +1,123 @@ +## Getting Started + +Since version 0.7.11, BWA-MEM supports read mapping against a reference genome +with long alternative haplotypes present in separate ALT contigs. To use the +ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the +SAM format and rename the file to ""*idxbase*.alt". For GRCh38, this alignment is +available from the [BWA resource bundle for GRCh38][res]. + +### Option 1: Mapping to the official GRCh38 with ALT contigs + +Construct the index: +```sh +wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz +gzip -d GCA_000001405.15_GRCh38_full_analysis_set.fna.gz +mv GCA_000001405.15_GRCh38_full_analysis_set.fna hs38a.fa +bwa index hs38a.fa +cp bwa-hs38-res/hs38d4.fa.alt hs38a.fa.alt +``` + +Perform mapping: +```sh +bwa mem hs38a.fa read1.fq read2.fq \ + | samblaster \ + | bwa-hs38-res/k8-linux bwa-postalt.js hs38a.fa.alt \ + | samtools view -bS - > aln.unsrt.bam +``` +For short reads, the postprocessing script `bwa-postalt.js` runs at about the +same speed as BAM compression. + +### Option 2: Mapping to the collection of GRCh38, decoy and HLA genes + +Construct the index: +```sh +cat hs38a.fa bwa-hs38-res/hs38d4-extra.fa > hs38d4.fa +bwa index hs38d4.fa +cp bwa-hs38-res/hs38d4.fa.alt . +``` +Perform mapping: +```sh +bwa mem -g.8 hs38d4.fa read1.fq read2.fq \ + | samblaster \ + | bwa-hs38-res/k8-linux bwa-postalt.js -p postinfo hs38d4.fa.alt \ + | samtools view -bS - > aln.unsrt.bam +``` +This command line generates `postinfo.ctw` which loosely evaluates the presence +of an ALT contig with an empirical score at the last column. + +## Background + +GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions. +However, sequences that are highly diverged from the primary assembly only +contribute a few million bp. Most subsequences of ALT contigs are highly similar +or identical to the primary assembly. If we align sequence reads to GRCh38+ALT +treating ALT equal to the primary assembly, we will get many reads with zero +mapping quality and lose variants on them. It is crucial to make the mapper +aware of ALTs. + +BWA-MEM is designed to minimize the interference of ALT contigs such that on the +primary assembly, the ALT-aware alignment is highly similar to the alignment +without using ALT contigs in the index. This design choice makes it almost +always safe to map reads to GRCh38+ALT. Although we don't know yet how much +variations on ALT contigs contribute to phenotypes, we would not get the answer +without mapping large cohorts to these extra sequences. We hope our current +implementation encourages researchers to use ALT contigs soon and often. + +## Methods + +As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and +postprocessing. + +### Step 1: BWA-MEM mapping + +At this step, BWA-MEM reads the ALT contig names from "*idxbase*.alt", ignoring +the ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*, +depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports +alignments and assigns mapQ following these two rules: + + * The original mapQ of a non-ALT hit is computed across non-ALT hits only. + The reported mapQ of an ALT hit is computed across all hits. + + * An ALT hit is only reported if its score is strictly better than all + overlapping non-ALT hits. A reported ALT hit is flagged with 0x800 + (supplementary) unless there are no non-ALT hits. + +When option `-g FLOAT` is in use (which is the default), a third rule kicks in: + + * The mapQ of a non-ALT hit is reduced to zero if its score is less than FLOAT + times the score of an overlapping ALT hit. In this case, the original mapQ is + moved to the `om` tag. + +If we don't care about ALT hits, we may actually skip postprocessing (step 2). +Nonetheless, postprocessing is recommended as it improves mapQ and gives more +information about ALT hits. + +### Step 2: Postprocessing + +Postprocessing is done with a separate script `bwa-postalt.js`. It reads all +potential hits reported in the XA tag, lifts ALT hits to the chromosomal +positions using the ALT-to-ref alignment, groups them after lifting and then +reassigns mapQ based on the best scoring hit in each group with all the hits in +a group get the same mapQ. Knowing the ALT-to-ref alignment, this script can +greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. + +The script also measures the presence of each ALT contig. For a group of +overlapping ALT contigs c_1, ..., c_m, the weight for c_k equals `\frac{\sum_j +P(c_k|r_j)}{\sum_j\max_i P(c_i|r_j)}`, where `P(c_k|r)=\frac{pow(4,s_k)}{\sum_i +pow(4,s_i)}` is the posterior of c_k given a read r mapped to it with a +Smith-Waterman score s_k. This weight is reported in `postinfo.ctw` in the +option 2 above. + +## Problems and Future Development + +There are some uncertainties about ALT mappings - we are not sure whether they +help biological discovery and don't know the best way to analyze them. Without +clear demand from downstream analyses, it is very difficult to design the +optimal mapping strategy. The current BWA-MEM method is just a start. If it +turns out to be useful in research, we will probably rewrite bwa-postalt.js in C +for performance; if not, we will try new designs. It is also possible that we +may make breakthrough on the representation of multiple genomes, in which case, +we can even get rid of ALT contigs once for all. + +[res]: https://sourceforge.net/projects/bio-bwa/files/ +[sb]: https://github.com/GregoryFaust/samblaster