diff --git a/NEWS.md b/NEWS.md index 3d8dec9..4cf69ab 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,29 +1,17 @@ -Release 0.7.11 (XX September, 2014) +Release 0.7.11 (XX November, 2014) ----------------------------------- -A major change to BWA-MEM is the support of mapping to ALT contigs. To use this -feature, users need to manually create a file `indexbase.alt` with each line -giving the name of an ALT contig. During alignment, BWA-MEM will be able to -classify potential hits to ALT and non-ALT hits. It reports alignments and -assigns mapping quality (mapQ) loosely following these rules: +A major change to BWA-MEM is the support of mapping to ALT contigs in addition +to the primary assembly. Part of the ALT mapping strategy is implemented in +BWA-MEM and the rest in a postprocessing script for now. Due to the extra +layer of complexity on generating the reference genome and on the two-step +mapping, we start to provide a wrapper script and precompiled binaries since +this release. Please check README-alt.md for details. - 1. The original mapQ of a non-ALT hit is computed across non-ALT hits only. - The reported mapQ of an ALT hit is always computed across all hits. - - 2. An ALT hit is only reported if its score is strictly better than all - overlapping non-ALT hits. A reported ALT hit is flagged with 0x800 - (supplementary) unless there are no non-ALT hits. - - 3. The mapQ of a non-ALT hit is reduced to zero if its score is less than 80% - (controlled by option `-g`) of the score of an overlapping ALT hit. In this - case, the original mapQ is moved to the `om` tag. - -This way, non-ALT alignments are only affected by ALT contigs if there are -significantly better ALT alignments. BWA-MEM is carefully engineered such that -ALT contigs do not interfere with the alignments to the primary assembly. - -Users may consider to use ALT contigs from GRCh38. I am also constructing a -non-redundant and more complete set of sequences missing from GRCh38. +Another major addition to BWA-MEM is HLA typing, which made possible with the +new ALT mapping strategy. Necessary data and programs are included in the +binary release. The wrapper script also performs HLA typing when HLA genes are +also included in the reference genome as additional ALT contigs. Other notable changes to BWA-MEM: @@ -42,7 +30,9 @@ Other notable changes to BWA-MEM: * Added new pre-setting for Oxford Nanopore 2D reads. For small genomes, though, LAST is still more sensitive. -(0.7.11: XX September 2014, rXXX) + * Added LAST-like seeding. This improves the accuracy for longer reads. + +(0.7.11: XX November 2014, rXXX) diff --git a/README-alt.md b/README-alt.md index 1ee2e9f..7d3a3c7 100644 --- a/README-alt.md +++ b/README-alt.md @@ -1,85 +1,63 @@ ## Getting Started -Since version 0.7.11, BWA-MEM supports read mapping against a reference genome -with long alternative haplotypes present in separate ALT contigs. To use the -ALT-aware mode, users need to provide pairwise ALT-to-reference alignment in the -SAM format and rename the file to "*idxbase*.alt". For GRCh38, this alignment -is available from the [BWA resource bundle for GRCh38][res]. - -#### Option 1: Mapping to the official GRCh38 with ALT contigs - -Construct the index: ```sh -wget ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -gzip -d GCA_000001405.15_GRCh38_full_analysis_set.fna.gz -mv GCA_000001405.15_GRCh38_full_analysis_set.fna hs38a.fa -bwa index hs38a.fa -cp bwa-hs38-bundle/hs38d4.fa.alt hs38a.fa.alt +# Download the bwa-0.7.11 binary package +wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit-0.7.11_x64-linux.tar.bz2/download \ + | gzip -dc | tar xf - +# Generate the GRCh38+ALT+decoy+HLA and create the BWA index +bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa +bwa.kit/bwa index hs38d6.fa # create BWA index +# mapping +bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines ``` -Perform mapping: -```sh -bwa mem hs38a.fa read1.fq read2.fq \ - | bwa-hs38-bundle/k8-linux bwa-postalt.js hs38a.fa.alt \ - | samtools view -bS - > aln.unsrt.bam -``` +This will generate the following files: -In the final alignment, a read may be placed on the [primary assembly][grcdef] -and multiple overlapping ALT contigs at the same time (on multiple SAM lines). -Mapping quality (mapQ) is properly adjusted by the postprocessing script -`bwa-postalt.js` using the ALT-to-reference alignment `hs38a.fa.alt`. For -details, see the [Methods section](#methods). +* `out.aln.bam`: unsorted alignments with ALT-aware mapping quality. In this + file, one read may be placed on multiple overlapping ALT contigs at the same + time even if the read is mapped better to some contigs than others. This makes + it possible to analyze each contig independent of others. -#### Option 2: Mapping to the collection of GRCh38, decoy and HLA genes +* `out.hla.top`: best genotypes for HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1 genes. -Construct the index: -```sh -cat hs38a.fa bwa-hs38-bundle/hs38d4-extra.fa > hs38d4.fa -bwa index hs38d4.fa -cp bwa-hs38-bundle/hs38d4.fa.alt . -``` -Perform mapping: -```sh -bwa mem hs38d4.fa read1.fq read2.fq \ - | bwa-hs38-bundle/k8-linux bwa-postalt.js -p postinfo hs38d4.fa.alt \ - | samtools view -bS - > aln.unsrt.bam -``` -The benefit of this option is to have a more complete reference sequence and -to facilitate HLA typing with a 3rd-party tool (see below). +* `out.hla.all`: other possible genotypes on the six HLA genes. -***If you are not interested in the way BWA-MEM performs ALT mapping, you can -skip the rest of this documentation.*** +* `out.log.*`: bwa-mem, samblaster and HLA typing log files. + +Note that `run-bwamem` only prints command lines but doesn't execute them. It +is advised to have a look at the command lines before passing them to `sh` for +actual execution. ## Background GRCh38 consists of several components: chromosomal assembly, unlocalized contigs (chromosome known but location unknown), unplaced contigs (chromosome unknown) and ALT contigs (long clustered variations). The combination of the first three -components is called the *primary assembly*. You can find the more exact -definitions from the [GRC website][grcdef]. +components is called the *primary assembly*. It is recommended to use the +complete primary assembly for all analyses. Using ALT contigs in read mapping is +tricky. -GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp genomic regions. -However, sequences that are highly diverged from the primary assembly only -contribute a few million bp. Most subsequences of ALT contigs are highly similar -or identical to the primary assembly. If we align sequence reads to GRCh38+ALT -treating ALT equal to the primary assembly, we will get many reads with zero -mapping quality and lose variants on them. It is crucial to make the mapper -aware of ALTs. +GRCh38 ALT contigs are totaled 109Mb in length, spanning 60Mbp of the primary +assembly. However, sequences that are highly diverged from the primary assembly +only contribute a few million bp. Most subsequences of ALT contigs are nearly +identical to the primary assembly. If we align sequence reads to GRCh38+ALT +blindly, we will get many additional reads with zero mapping quality and miss +variants on them. It is crucial to make mappers aware of ALTs. -BWA-MEM is designed to minimize the interference of ALT contigs such that on the -primary assembly, the ALT-aware alignment is highly similar to the alignment -without using ALT contigs in the index. This design choice makes it almost -always safe to map reads to GRCh38+ALT. Although we don't know yet how much -variations on ALT contigs contribute to phenotypes, we would not get the answer -without mapping large cohorts to these extra sequences. We hope our current -implementation encourages researchers to use ALT contigs soon and often. +BWA-MEM is ALT-aware. It essentially computes mapping quality across the +non-redundant content of the primary assembly plus the ALT contigs and is free +of the problem above. ## Methods ### Sequence alignment As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and -postprocessing. +postprocessing. The `bwa.kit/run-bwamem` script performs the two steps when ALT +contigs are present. The following picture shows an example about how BWA-MEM +infers mapping quality and reports alignment after step 2: + +![](https://raw.githubusercontent.com/lh3/bwa/dev/extras/alt-demo.png) #### Step 1: BWA-MEM mapping @@ -88,19 +66,19 @@ the ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*, depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports alignments and assigns mapQ following these two rules: -1. The original mapQ of a non-ALT hit is computed across non-ALT hits only. - The reported mapQ of an ALT hit is computed across all hits. +1. The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of + an ALT hit is computed across all hits. 2. If there are no non-ALT hits, the best ALT hit is outputted as the primary alignment. If there are both ALT and non-ALT hits, non-ALT hits will be - primary. ALT hits are reported as supplementary alignments (flag 0x800) only - if they are better than all overlapping non-ALT hits. + primary and ALT hits be supplementary (SAM flag 0x800) if ALT hits are better + than the best overlapping non-ALT hits. In theory, non-ALT alignments from step 1 should be identical to alignments -against a reference genome with ALT contigs. In practice, the two types of +against the reference genome with ALT contigs. In practice, the two types of alignments may differ in rare cases due to seeding heuristics. When an ALT hit is significantly better than non-ALT hits, BWA-MEM may miss seeds on the -non-ALT hits. This happens more often for contig mapping. +non-ALT hits. If we don't care about ALT hits, we may skip postprocessing (step 2). Nonetheless, postprocessing is recommended as it improves mapQ and gives more @@ -110,18 +88,12 @@ information about ALT hits. Postprocessing is done with a separate script `bwa-postalt.js`. It reads all potential hits reported in the XA tag, lifts ALT hits to the chromosomal -positions using the ALT-to-ref alignment, groups them after lifting and then -reassigns mapQ based on the best scoring hit in each group with all the hits in -a group get the same mapQ. Being aware of the ALT-to-ref alignment, this script -can greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT -hits. - -The script also measures the presence of each ALT contig. For a group of -overlapping ALT contigs c_1, ..., c_m, the weight for c_k equals `\frac{\sum_j -P(c_k|r_j)}{\sum_j\max_i P(c_i|r_j)}`, where `P(c_k|r)=\frac{pow(4,s_k)}{\sum_i -pow(4,s_i)}` is the posterior of c_k given a read r mapped to it with a -Smith-Waterman score s_k. This weight is reported in `postinfo.ctw` in the -option 2 above. +positions using the ALT-to-ref alignment, groups them based on overlaps between +their lifted positions, and then re-estimates mapQ across the best scoring hit +in each group. Being aware of the ALT-to-ref alignment, this script can greatly +improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. It also +writes each hit overlapping the reported hit into a separate SAM line. This +enables variant calling on each ALT contig independent of others. ### On the completeness of GRCh38+ALT @@ -129,47 +101,65 @@ While GRCh38 is much more complete than GRCh37, it is still missing some true human sequences. To make sure every piece of sequence in the reference assembly is correct, the [Genome Reference Consortium][grc] (GRC) require each ALT contig to have enough support from multiple sources before considering to add it to the -reference assembly. This careful procedure has left out some sequences, one of -which is [this example][novel], a 10kb contig assembled from CHM1 short -reads and present also in NA12878. You can try [BLAT][blat] or [BLAST][blast] to -see where it maps. +reference assembly. This careful and sophisticated procedure has left out some +sequences, one of which is [this example][novel], a 10kb contig assembled from +CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or +[BLAST][blast] to see where it maps. For a more complete reference genome, we compiled a new set of decoy sequences from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples. -The sequences are included in `hs38d4-extra.fa` from the [BWA resource bundle -for GRCh38][res]. +The sequences are included in `hs38d6-extra.fa` from the [BWA binary +package][res]. In addition to decoy, we also put multiple alleles of HLA genes in -`hs38d4-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb], -version 3.18.0. Script `bwa-postalt.js` also helps to genotype HLA genes, though -not to high resolution for now. +`hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb], +version 3.18.0 and are used to collect reads sequenced from these genes. -### More on HLA typing +### HLA typing -It is [well known][hlalink] that HLA genes are associated with many autoimmune -diseases as well as some others not directly related to the immune system. -However, many HLA alleles are highly diverged from the reference genome. If we -map whole-genome shotgun (WGS) reads to the reference only, many -allele-informative will get lost. As a result, the vast majority of WGS projects -have ignored these important genes. +HLA genes are known to be associated with many autoimmune diseases, infectious +diseases and drug responses. They are among the most important genes but are +rarely studied by WGS projects due to the high sequence divergence between +HLA genes and the reference genome in these regions. -We recommend to include the genomic regions of classical HLA genes in the BWA -index. This way we will be able to get a more complete collection of reads -mapped to HLA. We can then isolate these reads with little computational cost -and type HLA genes with another program, such as [Warren et al (2012)][hla4], -[Liu et al (2013)][hla2], [Bai et al (2014)][hla3], [Dilthey et al (2014)][hla1] -or others from [this list][hlatools]. +By including the HLA gene regions in the reference assembly as ALT contigs, we +are able to effectively identify reads coming from these genes. We also provide +a pipeline, which is included in the [BWA binary package][res], to type the +several classic HLA genes. The pipeline is conceptually simple. It de novo +assembles sequence reads mapped to each gene, aligns exon sequences of each +allele to the assembled contigs and then finds the pairs of alleles that best +explain the contigs. In practice, however, the completeness of IMGT/HLA and +copy-number changes related to these genes are not so straightforward to +resolve. HLA typing may not always be successful. Users may also consider to use +other programs for typing such as [Warren et al (2012)][hla4], [Liu et al +(2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though +most of them are distributed under restrictive licenses. -If the postprocessing script `bwa-postalt.js` is invoked with `-p prefix`, it -will also write the top three alleles to file `prefix.hla`. However, as most HLA -alleles from IMGT/HLA don't have intronic sequences and thus are not included in -the BWA index from option 2, we are unable to type HLA genes to high resolution -with the BWA-MEM mapping alone. A dedicated tool is recommended for accurate -typing. +## Preliminary Evaluation -### Evaluating ALT Mapping +To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878 +unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy +(hs38d6), and called small variants from the alignment. CHM1 is haploid. +Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The +true positive (TP) heterozygous calls from NA12878 are approximately equal +to the difference between NA12878 and CHM1 heterozygous calls. A better assembly +should yield higher TP and lower FP. The following table shows the numbers for +these assemblies: -(Coming soon...) +|Assembly|hs37 |hs38 |hs38d6|CHM1_1.1| huref| +|:------:|------:|------:|------:|------:|------:| +|FP | 255706| 168068| 142516|307172 | 575634| +|TP |2142260|2163113|2150844|2167235|2137053| + +With this measurement, hs38 is clearly better than hs37. Genome hs38d6 reduces +FP by ~25k but also reduces TP by ~12k. We manually inspected variants called +from hs38 only and found the majority of them are associated with excessive read +depth, clustered variants or weak alignment. We believe most hs38-only calls are +problematic. In addition, if we compare two NA12878 replicates from HiSeq X10 +with nearly identical library construction, the difference is ~140k, an order +of magnitude higher than the difference between hs38 and hs38d6. ALT contigs, +decoy and HLA genes in hs38d6 improve variant calling and enable the analyses of +ALT contigs and HLA typing at little cost. ## Problems and Future Development diff --git a/bwa-helper.js b/bwa-helper.js deleted file mode 100644 index d8477fc..0000000 --- a/bwa-helper.js +++ /dev/null @@ -1,373 +0,0 @@ -/***************************************************************** - * The K8 Javascript interpreter is required to run this script. * - * * - * Source code: https://github.com/attractivechaos/k8 * - * Binary: http://sourceforge.net/projects/lh3/files/k8/ * - * * - * Data file used for generating GRCh38 ALT alignments: * - * * - * http://sourceforge.net/projects/bio-bwa/files/ * - *****************************************************************/ - -/****************** - *** From k8.js *** - ******************/ - -var getopt = function(args, ostr) { - var oli; // option letter list index - if (typeof(getopt.place) == 'undefined') - getopt.ind = 0, getopt.arg = null, getopt.place = -1; - if (getopt.place == -1) { // update scanning pointer - if (getopt.ind >= args.length || args[getopt.ind].charAt(getopt.place = 0) != '-') { - getopt.place = -1; - return null; - } - if (getopt.place + 1 < args[getopt.ind].length && args[getopt.ind].charAt(++getopt.place) == '-') { // found "--" - ++getopt.ind; - getopt.place = -1; - return null; - } - } - var optopt = args[getopt.ind].charAt(getopt.place++); // character checked for validity - if (optopt == ':' || (oli = ostr.indexOf(optopt)) < 0) { - if (optopt == '-') return null; // if the user didn't specify '-' as an option, assume it means null. - if (getopt.place < 0) ++getopt.ind; - return '?'; - } - if (oli+1 >= ostr.length || ostr.charAt(++oli) != ':') { // don't need argument - getopt.arg = null; - if (getopt.place < 0 || getopt.place >= args[getopt.ind].length) ++getopt.ind, getopt.place = -1; - } else { // need an argument - if (getopt.place >= 0 && getopt.place < args[getopt.ind].length) - getopt.arg = args[getopt.ind].substr(getopt.place); - else if (args.length <= ++getopt.ind) { // no arg - getopt.place = -1; - if (ostr.length > 0 && ostr.charAt(0) == ':') return ':'; - return '?'; - } else getopt.arg = args[getopt.ind]; // white space - getopt.place = -1; - ++getopt.ind; - } - return optopt; -} - -/************************ - *** command markovlp *** - ************************/ - -function bwa_markOvlp(args) -{ - var c, min_aln_ratio = .9, min_ext = 50; - while ((c = getopt(args, "r:e:")) != null) { - if (c == 'r') min_aln_ratio = parseFloat(getopt.arg); - else if (c == 'e') min_ext = parseInt(getopt.arg); - } - - var file = args.length == getopt.ind? new File() : new File(args[getopt.ind]); - var buf = new Bytes(); - var dir4 = ['>>', '><', '<>', '<<']; - - while (file.readline(buf) >= 0) { - var t = buf.toString().split("\t") - for (var i = 0; i < t.length; ++i) - if (i != 0 && i != 4) - t[i] = parseInt(t[i]); - var el, a1, a2, e1, e2, r1, r2; // a: aligned length; e: extended length; r: remaining length - e2 = a2 = t[7] - t[6]; - if (t[2] < t[3]) { // forward-forward match - e1 = a1 = t[3] - t[2]; - r1 = t[2] - t[6]; - r2 = (t[5] - t[7]) - (t[1] - t[3]); - el = r1 > 0? t[6] : t[2]; - el += r2 > 0? t[1] - t[3] : t[5] - t[7]; - } else { // reverse-forward match - e1 = a1 = t[2] - t[3]; - r1 = (t[1] - t[2]) - t[6]; - r2 = (t[5] - t[7]) - t[3]; - el = r1 > 0? t[6] : t[1] - t[2]; - el += r2 > 0? t[3] : t[5] - t[7]; - } - e1 += el; e2 += el; - var type; - if (a1 / e1 >= min_aln_ratio && a2 / e2 >= min_aln_ratio) { - if ((r1 >= min_ext && r2 >= min_ext) || (r1 <= -min_ext && r2 <= -min_ext)) { // suffix-prefix match - var d = t[2] < t[3]? 0 : 2; - if (r1 < 0) d ^= 3; // reverse the direction - type = 'O' + dir4[d]; - } else type = 'C' + (e1 / t[1] > e2 / t[5]? 1 : 2); - } else type = 'I'; // internal local match; not a suffix-prefix match - //print(t[1], e1, a1, t[5], e2, a2); - print(type, buf); - } - - buf.destroy(); - file.close(); -} - -/*********************** - *** command pas2bed *** - ***********************/ - -function bwa_pas2reg(args) -{ - var file = args.length? new File(args[0]) : new File(); - var buf = new Bytes(); - - while (file.readline(buf) >= 0) { - var t = buf.toString().split("\t"); - if (t[0] == t[4]) continue; - if (parseInt(t[2]) < parseInt(t[3])) print(t[0], t[1], t[2], t[3], t[8]); - else print(t[0], t[1], t[3], t[2], t[8]); - print(t[4], t[5], t[6], t[7], t[8]); - } - - buf.destroy(); - file.close(); -} - -/******************* - * command sam2pas * - *******************/ - -function bwa_sam2pas(args) -{ - var file = args.length == 0? new File() : new File(args[0]); - var buf = new Bytes(); - var seq_dict = {}; - - while (file.readline(buf) >= 0) { - var line = buf.toString(); - var m; - if (/^@SQ/.test(line)) { - var name = null, len = null; - if ((m = /\tSN:(\S+)/.exec(line)) != null) name = m[1]; - if ((m = /\tLN:(\S+)/.exec(line)) != null) len = parseInt(m[1]); - if (name != null && len != null) seq_dict[name] = len; - } - if (/^@/.test(line)) continue; - var t = line.split("\t"); - var pos = parseInt(t[3]) - 1; - var x = 0, y = 0, i = 0, clip = [0, 0], n_ins = 0, n_del = 0, o_ins = 0, o_del = 0, n_M = 0; - var re = /(\d+)([MIDSH])/g; - while ((m = re.exec(t[5])) != null) { - var l = parseInt(m[1]); - if (m[2] == 'M') x += l, y += l, n_M += l; - else if (m[2] == 'I') y += l, n_ins += l, ++o_ins; - else if (m[2] == 'D') x += l, n_del += l, ++o_del; - else if (m[2] == 'S' || m[2] == 'H') - clip[i == 0? 0 : 1] = l; - ++i; - } - var is_rev = (parseInt(t[1]) & 16)? true : false; - var misc = 'mapQ=' + t[4] + ';'; - var usc = 1; - if ((m = /\tNM:i:(\d+)/.exec(line)) != null) { - var NM = parseInt(m[1]); - var diff = (NM / (n_M + n_ins + n_del)).toFixed(3); - misc += 'diff=' + diff + ';n_mis=' + (NM - n_del - n_ins) + ';'; - } - misc += 'n_del='+n_del+';n_ins='+n_ins+';o_del='+o_del+';o_ins='+o_ins + ';'; - if ((m = /\tAS:i:(\d+)/.exec(line)) != null) { - misc += 'AS='+m[1] + ';'; - usc = (parseInt(m[1]) / (x > y? x : y)).toFixed(3); - } - if ((m = /\tXS:i:(\d+)/.exec(line)) != null) misc += 'XS='+m[1] + ';'; - var len = y + clip[0] + clip[1]; - var z = [t[0], len, clip[0], clip[0] + y, t[2], seq_dict[t[2]], pos, pos + x, usc, misc]; - if (parseInt(t[1]) & 16) z[2] = clip[1] + y, z[3] = clip[1]; - print(z.join("\t")); - } - - buf.destroy(); - file.close(); -} - -/*********************** - *** command reg2cut *** - ***********************/ - -function bwa_reg2cut(args) -{ - var c, min_usc = 0.5, min_ext = 100, min_len = 5000, cut = 250; - while ((c = getopt(args, "s:e:l:c:")) != null) { - if (c == 's') min_usc = parseFloat(getopt.arg); - else if (c == 'e') min_ext = parseInt(getopt.arg); - else if (c == 'l') min_len = parseInt(getopt.arg); - else if (c == 'c') cut = parseInt(getopt.arg); - } - - var file = args.length == getopt.ind? new File () : new File(args[getopt.ind]); - var buf = new Bytes(); - - function print_bed() { - for (var i = 0; i < a.length; ++i) { - var start = a[i][0] - cut > 0? a[i][0] : 0; - var end = a[i][1] + cut < last_len? a[i][1] : last_len; - if (end - start >= min_len) print(last_chr, start, end); - } - } - - var last_chr = null, last_len = null, max_c_usc = 0, start = 0, end = 0; - var a = []; - while (file.readline(buf) >= 0) { - var t = buf.toString().split("\t"); - t[1] = parseInt(t[1]); - t[2] = parseInt(t[2]); - t[3] = parseInt(t[3]); - t[4] = parseFloat(t[4]); - var is_contained = t[2] < min_ext && t[1] - t[3] < min_ext? true : false; - if (t[3] - t[2] < cut<<1) continue; - t[2] += cut; t[3] -= cut; - if (t[0] != last_chr) { - a.push([start, end]); - if (last_chr != null && max_c_usc < min_usc) print_bed(); - last_chr = t[0]; last_len = t[1]; start = t[2]; end = t[3]; - max_c_usc = is_contained? t[4] : 0; - a.length = 0; - } else { - if (is_contained) - max_c_usc = max_c_usc > t[4]? max_c_usc : t[4]; - if (t[4] < min_usc) continue; - if (t[2] > end) { - a.push([start, end]); - start = t[2]; - end = end > t[3]? end : t[3]; - } else end = end > t[3]? end : t[3]; - } - } - a.push([start, end]); - if (max_c_usc < min_usc) print_bed(); // the last sequence - - buf.destroy(); - file.close(); -} - -function bwa_shortname(args) -{ - var file = args.length? new File(args[0]) : new File(); - var buf = new Bytes(); - - var re = /(\S+)\/(\d+)_(\d+)((:\d+-\d+)+)/g; - var re2 = /:(\d+)-(\d+)/g; - while (file.readline(buf) >= 0) { - var match, match2; - var line = buf.toString(); - var x = []; - while ((match = re.exec(line)) != null) { - var start = parseInt(match[2]), len = parseInt(match[3]) - start; - while ((match2 = re2.exec(match[4])) != null) { - var a = parseInt(match2[1]) - 1; - var b = parseInt(match2[2]); - start += a; len = b - a; - } - x.push([match[0], match[1] + '/' + start.toString() + '_' + (start+len).toString()]); - } - for (var i = 0; i < x.length; ++i) - line = line.replace(x[i][0], x[i][1]); - print(line); - } - - buf.destroy(); - file.close(); -} - -/******************* - * Command gff2sam * - *******************/ - -function bwa_gff2sam(args) -{ - if (args.length < 2) { - print("Usage: k8 bwa-helper.js "); - exit(1); - } - - var file = new File(args[1]); - var buf = new Bytes(); - var len = {}; - - while (file.readline(buf) >= 0) { - var t = buf.toString().split(/\s+/); - len[t[0]] = parseInt(t[1]); - } - file.close(); - - file = new File(args[0]); - var re_cigar = /([MID])(\d+)/g; - var lineno = 0; - while (file.readline(buf) >= 0) { - ++lineno; - var t = buf.toString().split("\t"); - var m = /Target=(\S+)\s+(\d+)\s+(\d+)\s+([+-])/.exec(t[8]); - if (m == null) { - warn("WARNING: skipped line "+lineno+" due to the lack of Target."); - continue; - } - var qname = m[1]; - var flag = t[6] == m[4]? 0 : 16; - var qb = parseInt(m[2]) - 1, qe = parseInt(m[3]), qlen = len[qname]; - if (qlen == null) - throw Error("Sequence "+qname+" is not present in the query-length.txt"); - var clip5 = qb, clip3 = qlen - qe; - if (flag&16) clip5 ^= clip3, clip3 ^= clip5, clip5 ^= clip3; // swap - - m = /Gap\s*=\s*(([MID]\d+\s*)+)/.exec(t[8]); - var cigar = clip5? clip5 + 'S' : ''; - var n_ins = 0, n_del = 0, n_match = 0, NM = null; - if (m) { - var mc; - while ((mc = re_cigar.exec(m[1])) != null) { - var l = parseInt(mc[2]); - cigar += mc[2] + mc[1]; - if (mc[1] == 'I') n_ins += l; - else if (mc[1] == 'D') n_del += l; - else if (mc[1] == 'M') n_match += l; - } - if (n_ins + n_match != qe - qb || n_del + n_match != parseInt(t[4]) - parseInt(t[3]) + 1) - throw Error("Inconsistent CIGAR at line "+lineno); - } else { // ungapped alignment - var tb = parseInt(t[3]) - 1, te = parseInt(t[4]); - if (te - tb != qe - qb) { - warn("WARNING: line "+lineno+" should contain gaps, but lacks Gap. Skipped.\n"); - } else cigar = (qe - qb) + 'M'; - } - if (clip3) cigar += clip3 + 'S'; - if ((m = /num_mismatch=(\d+)/.exec(t[8])) != null) - NM = parseInt(m[1]) + n_ins + n_del; - var out = [qname, flag, t[0], t[3], 255, cigar, '*', 0, 0, '*', '*']; - if (NM != null) out.push('NM:i:' + NM); - print(out.join("\t")); - } - buf.destroy(); - file.close(); -} - -/********************* - *** Main function *** - *********************/ - -function main(args) -{ - if (args.length == 0) { - print("\nUsage: k8 bwa-helper.js [arguments]\n"); - print("Commands: sam2pas convert SAM to pairwise alignment summary format (PAS)"); - print(" pas2reg extract covered regions"); - print(" reg2cut regions to extract for the 2nd round bwa-mem"); - print(" markovlp identify bi-directional overlaps"); - print(" gff2sam convert GFF3 alignment to SAM"); - print(" shortname shorten sequence name after subseq (PacBio read names only)"); - print(""); - exit(1); - } - - var cmd = args.shift(); - if (cmd == 'sam2pas') bwa_sam2pas(args); - else if (cmd == 'gff2sam') bwa_gff2sam(args); - else if (cmd == 'markovlp') bwa_markOvlp(args); - else if (cmd == 'pas2reg') bwa_pas2reg(args); - else if (cmd == 'reg2cut') bwa_reg2cut(args); - else if (cmd == 'shortname') bwa_shortname(args); - else warn("Unrecognized command"); -} - -main(arguments); diff --git a/bwa-postalt.js b/bwa-postalt.js index 6cf3dc7..5717045 100644 --- a/bwa-postalt.js +++ b/bwa-postalt.js @@ -52,31 +52,6 @@ var getopt = function(args, ostr) { return optopt; } -// print an object in a format similar to JSON. For debugging only. -function obj2str(o) -{ - if (typeof(o) != 'object') { - return o.toString(); - } else if (o == null) { - return "null"; - } else if (Array.isArray(o)) { - var s = "["; - for (var i = 0; i < o.length; ++i) { - if (i) s += ','; - s += obj2str(o[i]); - } - return s + "]"; - } else { - var i = 0, s = "{"; - for (var key in o) { - if (i++) s += ','; - s += key + ":"; - s += obj2str(o[key]); - } - return s + "}"; - } -} - // reverse a string Bytes.prototype.reverse = function() { @@ -203,7 +178,7 @@ function parse_hit(s, opt) return h; } -function print_buffer(buf2, fp_hla, hla) +function print_buffer(buf2, fp_hla, hla) // output alignments { if (buf2.length == 0) return; for (var i = 0; i < buf2.length; ++i) @@ -217,34 +192,34 @@ function print_buffer(buf2, fp_hla, hla) } } +function collect_hla_hits(idx, ctg, start, end, hla) // collect reads hit to HLA genes +{ + var m, ofunc = idx[ctg]; + if (ofunc == null) return; + var ovlp_alt = ofunc(start, end); + for (var i = 0; i < ovlp_alt.length; ++i) + if ((m = /^(HLA-[^\s\*]+)\*\d+/.exec(ovlp_alt[i][2])) != null) + hla[m[1]] = true; +} + function bwa_postalt(args) { - var c, opt = { a:1, b:4, o:6, e:1, verbose:3, show_pri:false, update_mapq:true, min_mapq:10, min_sc:90, max_nm_sc:10, show_ev:false, min_pa_ratio:1 }; + var version = "r985"; + var c, opt = { a:1, b:4, o:6, e:1, min_mapq:10, min_sc:90, max_nm_sc:10, min_pa_ratio:1 }; - while ((c = getopt(args, 'Pqev:p:r:')) != null) { - if (c == 'v') opt.verbose = parseInt(getopt.arg); - else if (c == 'p') opt.pre = getopt.arg; - else if (c == 'P') opt.show_pri = true; - else if (c == 'q') opt.recover_maq = false; - else if (c == 'e') opt.show_ev = true; + while ((c = getopt(args, 'vp:r:')) != null) { + if (c == 'p') opt.pre = getopt.arg; else if (c == 'r') opt.min_pa_ratio = parseFloat(getopt.arg); + else if (c == 'v') { print(version); exit(0); } } if (opt.min_pa_ratio > 1.) opt.min_pa_ratio = 1.; - if (opt.show_ev && opt.pre == null) { - warn("ERROR: option '-p' must be specified if '-e' is applied."); - exit(1); - } - if (args.length == getopt.ind) { print(""); print("Usage: k8 bwa-postalt.js [options] [aln.sam]\n"); - print("Options: -p STR prefix of file(s) for additional information [null]"); - print(" PREFIX.ctw - weight of each ALT contig"); - print(" PREFIX.evi - reads supporting ALT contigs (effective with -e)"); - print(" -e show reads supporting ALT contigs into file PREFIX.evi"); + print("Options: -p STR prefix of output files containting sequences matching HLA genes [null]"); print(" -r FLOAT reduce mapQ to 0 if not overlapping lifted best and pa= 0) { var line = buf.toString(); if (line.charAt(0) == '@') continue; var t = line.split("\t"); if (t.length < 11) continue; // incomplete lines + is_alt[t[0]] = true; var pos = parseInt(t[3]) - 1; var flag = parseInt(t[1]); - if ((flag&4) || t[2] == '*') { - idx_un[t[0]] = true; - continue; - } + if ((flag&4) || t[2] == '*') continue; var m, cigar = [], l_qaln = 0, l_tlen = 0, l_qclip = 0; if ((m = /^(HLA-[^\s\*]+)\*\d+/.exec(t[0])) != null) { // read HLA contigs if (hla_ctg[m[1]] == null) hla_ctg[m[1]] = 0; ++hla_ctg[m[1]]; + hla_chr = t[2]; } while ((m = re_cigar.exec(t[5])) != null) { var l = parseInt(m[1]); @@ -296,10 +269,8 @@ function bwa_postalt(args) } file.close(); var idx_alt = {}, idx_pri = {}; - for (var ctg in intv_alt) - idx_alt[ctg] = intv_ovlp(intv_alt[ctg]); - for (var ctg in intv_pri) - idx_pri[ctg] = intv_ovlp(intv_pri[ctg]); + for (var ctg in intv_alt) idx_alt[ctg] = intv_ovlp(intv_alt[ctg]); + for (var ctg in intv_pri) idx_pri[ctg] = intv_ovlp(intv_pri[ctg]); // initialize the list of HLA contigs var fp_hla = null; @@ -309,18 +280,12 @@ function bwa_postalt(args) fp_hla[h] = new File(opt.pre + '.' + h + '.fq', "w"); } - // initialize the list of ALT contigs - var weight_alt = []; - for (var ctg in idx_alt) - weight_alt[ctg] = [0, 0, 0, 0, 0, 0, intv_alt[ctg][0][3], intv_alt[ctg][0][5], intv_alt[ctg][0][7]]; - for (var ctg in idx_un) - weight_alt[ctg] = [0, 0, 0, 0, 0, 0, '~', 0, 0]; - // process SAM var buf2 = [], hla = {}; file = args.length - getopt.ind >= 2? new File(args[getopt.ind+1]) : new File(); while (file.readline(buf) >= 0) { var m, line = buf.toString(); + if (line.charAt(0) == '@') { // print and then skip the header line print(line); continue; @@ -345,6 +310,8 @@ function bwa_postalt(args) var NM = (m = /\tNM:i:(\d+)/.exec(line)) == null? '0' : m[1]; var flag = t[1]; var h = parse_hit([t[2], ((flag&16)?'-':'+') + t[3], t[5], NM], opt); + if (t[2] == hla_chr) collect_hla_hits(idx_pri, h.ctg, h.start, h.end, hla); + if (h.hard) { // the following does not work with hard clipped alignments buf2.push(t); continue; @@ -362,7 +329,7 @@ function bwa_postalt(args) // check if there are ALT hits var has_alt = false; for (var i = 0; i < hits.length; ++i) - if (weight_alt[hits[i].ctg] != null) { + if (is_alt[hits[i].ctg] != null) { has_alt = true; break; } @@ -429,7 +396,7 @@ function bwa_postalt(args) if (hits[i].g == reported_g) ++n_group0; } else { - if (weight_alt[hits[0].ctg] == null) { // no need to go through the following if the single hit is non-ALT + if (is_alt[hits[0].ctg] == null) { // no need to go through the following if the single hit is non-ALT buf2.push(t); continue; } @@ -455,8 +422,8 @@ function bwa_postalt(args) else mapQ = mapQ > ori_mapQ? mapQ : ori_mapQ; } else mapQ = t[4]; - var pri_ofunc = idx_pri[hits[reported_i].pctg], ovlp_alt = []; - if (pri_ofunc != null) { + // find out whether the read is overlapping HLA genes + if (hits[reported_i].pctg == hla_chr) { var rpt_start = 1<<30, rpt_end = 0; for (var i = 0; i < hits.length; ++i) { var h = hits[i]; @@ -465,70 +432,26 @@ function bwa_postalt(args) rpt_end = rpt_end > h.pend ? rpt_end : h.pend; } } - ovlp_alt = pri_ofunc(rpt_start, rpt_end); - for (var i = 0; i < ovlp_alt.length; ++i) - if ((m = /^(HLA-[^\s\*]+)\*\d+/.exec(ovlp_alt[i][2])) != null) - hla[m[1]] = true; + collect_hla_hits(idx_pri, hla_chr, rpt_start, rpt_end, hla); } - // ALT genotyping - if (mapQ >= opt.min_mapq && hits[reported_i].score >= opt.min_sc) { - // collect all overlapping ALT contigs - var alts = {}; - for (var i = 0; i < hits.length; ++i) { - var h = hits[i]; - if (h.g == reported_g && weight_alt[h.ctg] != null) - alts[h.ctg] = [h.score, h.NM]; - } - if (ovlp_alt.length > 0) { // add other unreported hits - for (var i = 0; i < ovlp_alt.length; ++i) - if (ovlp_alt[i][0] <= rpt_start && rpt_end <= ovlp_alt[i][1] && alts[ovlp_alt[i][2]] == null) - alts[ovlp_alt[i][2]] = [0, 0]; - } - - // add weight to each ALT contig - var alt_arr = [], max_sc = -1, max_i = -1, sum = 0, min_sc = 1<<30, max_nm = -1; - for (var ctg in alts) - alt_arr.push([ctg, alts[ctg][0], 0, alts[ctg][1]]); - for (var i = 0; i < alt_arr.length; ++i) { - if (max_sc < alt_arr[i][1]) - max_sc = alt_arr[i][1], max_i = i; - min_sc = min_sc < alt_arr[i][1]? min_sc : alt_arr[i][1]; - var nm = alt_arr[i][1] > 0? alt_arr[i][3] : opt.max_nm_sc; - max_nm = max_nm > nm? max_nm : nm; - } - if (max_nm > opt.max_nm_sc) max_nm = opt.max_nm_sc; - if (max_sc > 0 && (alt_arr.length == 1 || min_sc < max_sc)) { - for (var i = 0; i < alt_arr.length; ++i) - sum += (alt_arr[i][2] = Math.pow(10, .6 * (alt_arr[i][1] - max_sc))); - for (var i = 0; i < alt_arr.length; ++i) alt_arr[i][2] /= sum; - for (var i = 0; i < alt_arr.length; ++i) { - var e = [alt_arr[i][0], 1, alt_arr[max_i][2], alt_arr[i][2], max_nm, max_nm * alt_arr[max_i][2], max_nm * alt_arr[i][2]]; - var w = weight_alt[e[0]]; - for (var j = 0; j < 6; ++j) w[j] += e[j+1]; - if (fp_evi) { - e[2] = e[2].toFixed(3); e[3] = e[3].toFixed(3); - e[5] = e[5].toFixed(3); e[6] = e[6].toFixed(3); - fp_evi.write(t[0] + '/' + (t[1]>>6&3) + '\t' + e.join("\t") + '\n'); - } - } - } - } - - // check if the reported hit overlaps a hit to the primary assembly; if so, don't reduce mapping quality - if (opt.update_mapq && mapQ > 0 && n_rpt_lifted <= 1) { + // adjust the mapQ of the primary hits + if (n_rpt_lifted <= 1) { var l = n_rpt_lifted == 1? rpt_lifted : null; for (var i = 0; i < buf2.length; ++i) { var s = buf2[i], is_ovlp = true; if (l != null) { if (l[0] != s[2]) is_ovlp = false; // different chr - if (((s[1]&16) != 0) != l[1]) is_ovlp = false; // different strand - var start = s[3] - 1, end = start; - while ((m = re_cigar.exec(t[5])) != null) - if (m[2] == 'M' || m[2] == 'D' || m[2] == 'N') - end += parseInt(m[1]); - if (!(start < l[3] && l[2] < end)) is_ovlp = false; // no overlap + else if (((s[1]&16) != 0) != l[1]) is_ovlp = false; // different strand + else { + var start = s[3] - 1, end = start; + while ((m = re_cigar.exec(t[5])) != null) + if (m[2] == 'M' || m[2] == 'D' || m[2] == 'N') + end += parseInt(m[1]); + if (!(start < l[3] && l[2] < end)) is_ovlp = false; // no overlap + } } else is_ovlp = false; + // get the "pa" tag if present var om = -1, pa = 10.; for (var j = 11; j < s.length; ++j) if ((m = /^om:i:(\d+)/.exec(s[j])) != null) @@ -555,20 +478,6 @@ function bwa_postalt(args) } } - // generate reversed quality and reverse-complemented sequence if necessary - var rs = null, rq = null; // reversed quality and reverse complement sequence - var need_rev = false; - for (var i = 0; i < hits.length; ++i) { - if (hits[i].g != reported_g || i == reported_i) continue; - if (hits[i].rev != hits[reported_i].rev) - need_rev = true; - } - if (need_rev) { // reverse and reverse complement - aux.length = 0; - aux.set(t[9], 0); aux.revcomp(); rs = aux.toString(); - aux.set(t[10],0); aux.reverse(); rq = aux.toString(); - } - // stage the reported hit t[4] = mapQ; if (n_group0 > 1) t.push("om:i:"+ori_mapQ); @@ -576,18 +485,23 @@ function bwa_postalt(args) buf2.push(t); // stage the hits generated from the XA tag - var cnt = 0; + var cnt = 0, rs = null, rq = null; // rq: reverse quality; rs: reverse complement sequence var rg = (m = /\t(RG:Z:\S+)/.exec(line)) != null? m[1] : null; for (var i = 0; i < hits.length; ++i) { - if (opt.verbose >= 5) print(obj2str(hits[i])); if (hits[i].g != reported_g || i == reported_i) continue; - if (!opt.show_pri && idx_alt[hits[i].ctg] == null) continue; + if (idx_alt[hits[i].ctg] == null) continue; var s = [t[0], 0, hits[i].ctg, hits[i].start+1, mapQ, hits[i].cigar, t[6], t[7], t[8]]; + if (t[6] == '=' && s[2] != t[2]) s[6] = t[2]; // print sequence/quality and set the rev flag if (hits[i].rev == hits[reported_i].rev) { s.push(t[9], t[10]); s[1] = flag | 0x800; - } else { + } else { // we need to write the reverse sequence + if (rs == null || rq == null) { + aux.length = 0; + aux.set(t[9], 0); aux.revcomp(); rs = aux.toString(); + aux.set(t[10],0); aux.reverse(); rq = aux.toString(); + } s.push(rs, rq); s[1] = (flag ^ 0x10) | 0x800; } @@ -599,35 +513,12 @@ function bwa_postalt(args) } print_buffer(buf2, fp_hla, hla); file.close(); - if (fp_evi != null) fp_evi.close(); if (fp_hla != null) for (var h in fp_hla) fp_hla[h].close(); buf.destroy(); aux.destroy(); - - // print weight of each contig - if (opt.pre != null) { - var fpout = new File(opt.pre + '.ctw', "w"); - var weight_arr = []; - var weight_hla = []; - for (var ctg in weight_alt) { - var w = weight_alt[ctg]; - weight_arr.push([ctg, w[6], w[7], w[8], - w[0], w[1].toFixed(3), w[2].toFixed(3), w[1] > 0? (w[2]/w[1]).toFixed(3) : '0.000', - w[3], w[4].toFixed(3), w[5].toFixed(3), w[4] > 0? (w[5]/w[4]).toFixed(3) : '0.000']); - weight_hla.push([ctg, w[4] > 0? w[5]/w[4] : 0]); - } - weight_arr.sort(function(a,b) { - return a[1] < b[1]? -1 : a[1] > b[1]? 1 : a[2] != b[2]? a[2] - b[2] : a[0] < b[0]? -1 : a[0] > b[0]? 1 : 0; - }); - for (var i = 0; i < weight_arr.length; ++i) { - if (weight_arr[i][1] == '~') weight_arr[i][1] = '*'; - fpout.write(weight_arr[i].join("\t") + '\n'); - } - fpout.close(); - } } bwa_postalt(arguments); diff --git a/bwa.1 b/bwa.1 index ba8bc9e..0d556be 100644 --- a/bwa.1 +++ b/bwa.1 @@ -1,4 +1,4 @@ -.TH bwa 1 "19 May 2014" "bwa-0.7.9-r783" "Bioinformatics tools" +.TH bwa 1 "18 November 2014" "bwa-0.7.11-r999" "Bioinformatics tools" .SH NAME .PP bwa - Burrows-Wheeler Alignment Tool @@ -233,8 +233,9 @@ more aggressive read pair. [17] .B INPUT/OUTPUT OPTIONS: .TP .B -p -Assume the first input query file is interleaved paired-end FASTA/Q. See the -command description for details. +Smart pairing. If two adjacent reads have the same name, they are considered +to form a read pair. This way, paired-end and single-end reads can be mixed +in a single FASTA/Q stream. .TP .BI -R \ STR Complete read group header line. '\\t' can be used in diff --git a/bwa.c b/bwa.c index 80a4b8f..053025e 100644 --- a/bwa.c +++ b/bwa.c @@ -7,6 +7,7 @@ #include "ksw.h" #include "utils.h" #include "kstring.h" +#include "kvec.h" #ifdef USE_MALLOC_WRAPPERS # include "malloc_wrap.h" @@ -55,10 +56,12 @@ bseq1_t *bseq_read(int chunk_size, int *n_, void *ks1_, void *ks2_) } trim_readno(&ks->name); kseq2bseq1(ks, &seqs[n]); + seqs[n].id = n; size += seqs[n++].l_seq; if (ks2) { trim_readno(&ks2->name); kseq2bseq1(ks2, &seqs[n]); + seqs[n].id = n; size += seqs[n++].l_seq; } if (size >= chunk_size && (n&1) == 0) break; @@ -71,6 +74,24 @@ bseq1_t *bseq_read(int chunk_size, int *n_, void *ks1_, void *ks2_) return seqs; } +void bseq_classify(int n, bseq1_t *seqs, int m[2], bseq1_t *sep[2]) +{ + int i, has_last; + kvec_t(bseq1_t) a[2] = {{0,0,0}, {0,0,0}}; + for (i = 1, has_last = 1; i < n; ++i) { + if (has_last) { + if (strcmp(seqs[i].name, seqs[i-1].name) == 0) { + kv_push(bseq1_t, a[1], seqs[i-1]); + kv_push(bseq1_t, a[1], seqs[i]); + has_last = 0; + } else kv_push(bseq1_t, a[0], seqs[i-1]); + } else has_last = 1; + } + if (has_last) kv_push(bseq1_t, a[0], seqs[i-1]); + sep[0] = a[0].a, m[0] = a[0].n; + sep[1] = a[1].a, m[1] = a[1].n; +} + /***************** * CIGAR related * *****************/ diff --git a/bwa.h b/bwa.h index 6fcc82e..830bc3c 100644 --- a/bwa.h +++ b/bwa.h @@ -23,7 +23,7 @@ typedef struct { } bwaidx_t; typedef struct { - int l_seq; + int l_seq, id; char *name, *comment, *seq, *qual, *sam; } bseq1_t; @@ -35,6 +35,7 @@ extern "C" { #endif bseq1_t *bseq_read(int chunk_size, int *n_, void *ks1_, void *ks2_); + void bseq_classify(int n, bseq1_t *seqs, int m[2], bseq1_t *sep[2]); void bwa_fill_scmat(int a, int b, int8_t mat[25]); uint32_t *bwa_gen_cigar(const int8_t mat[25], int q, int r, int w_, int64_t l_pac, const uint8_t *pac, int l_query, uint8_t *query, int64_t rb, int64_t re, int *score, int *n_cigar, int *NM); diff --git a/bwamem.h b/bwamem.h index 1746ab5..8ffe729 100644 --- a/bwamem.h +++ b/bwamem.h @@ -20,6 +20,7 @@ typedef struct __smem_i smem_i; #define MEM_F_ALN_REG 0x80 #define MEM_F_REF_HDR 0x100 #define MEM_F_SOFTCLIP 0x200 +#define MEM_F_SMARTPE 0x400 typedef struct { int a, b; // match score and mismatch penalty diff --git a/bwamem_pair.c b/bwamem_pair.c index e405423..f8b5d46 100644 --- a/bwamem_pair.c +++ b/bwamem_pair.c @@ -362,8 +362,13 @@ int mem_sam_pe(const mem_opt_t *opt, const bntseq_t *bns, const uint8_t *pac, co no_pairing: for (i = 0; i < 2; ++i) { - if (a[i].n && a[i].a[0].score >= opt->T) - h[i] = mem_reg2aln(opt, bns, pac, s[i].l_seq, s[i].seq, &a[i].a[0]); + int which = -1; + if (a[i].n) { + if (a[i].a[0].score >= opt->T) which = 0; + else if (n_pri[i] < a[i].n && a[i].a[n_pri[i]].score >= opt->T) + which = n_pri[i]; + } + if (which >= 0) h[i] = mem_reg2aln(opt, bns, pac, s[i].l_seq, s[i].seq, &a[i].a[which]); else h[i] = mem_reg2aln(opt, bns, pac, s[i].l_seq, s[i].seq, 0); } if (!(opt->flag & MEM_F_NOPAIRING) && h[0].rid == h[1].rid && h[0].rid >= 0) { // if the top hits from the two ends constitute a proper pair, flag it. diff --git a/extras/alt-demo.graffle b/extras/alt-demo.graffle new file mode 100644 index 0000000..32a8f5f --- /dev/null +++ b/extras/alt-demo.graffle @@ -0,0 +1,330 @@ + + + + + ActiveLayerIndex + 0 + ApplicationVersion + + com.omnigroup.OmniGraffle + 139.18.0.187838 + + AutoAdjust + + BackgroundGraphic + + Bounds + {{0, 0}, {576, 733}} + Class + SolidGraphic + ID + 2 + Style + + shadow + + Draws + NO + + stroke + + Draws + NO + + + + BaseZoom + 0 + CanvasOrigin + {0, 0} + ColumnAlign + 1 + ColumnSpacing + 36 + CreationDate + 2014-11-17 16:51:42 +0000 + Creator + Heng Li + DisplayScale + 1 0/72 in = 1 0/72 in + GraphDocumentVersion + 8 + GraphicsList + + + Bounds + {{35.699992179870605, 151.89999580383301}, {476, 224}} + Class + ShapedGraphic + FitText + YES + Flow + Resize + FontInfo + + Font + AndaleMono + Size + 12 + + ID + 28 + Shape + Rectangle + Style + + fill + + Draws + NO + + shadow + + Draws + NO + + stroke + + Draws + NO + + + Text + + Align + 0 + Pad + 0 + Text + {\rtf1\ansi\ansicpg1252\cocoartf1265\cocoasubrtf210 +\cocoascreenfonts1{\fonttbl\f0\fnil\fcharset0 Consolas;\f1\fnil\fcharset0 Consolas-Bold;} +{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red127\green127\blue127;\red255\green0\blue0; +\red204\green204\blue204;\red0\green0\blue255;\red0\green128\blue0;\red255\green128\blue0;} +\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural + +\f0\fs24 \cf2 Read: A\cf0 TCAGCATC\ +\cf2 \ + ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATGGTCA +\f1\b \cf4 ATCAGCATC +\f0\b0 \cf3 GAACTAGTCACAT\cf2 \ + \cf3 |||||\cf5 (high div) \cf3 ||||||\cf5 (novel ins)\cf3 ||||||||||\cf2 \ +Chromosome:\cf3 GCGTACATGATACGA +\f1\b \cf6 ATCgGCATC +\f0\b0 \cf3 ATGGTC-------------CTAGTCACATCGTAATC\ +\cf2 \cf3 |||||||||||| ||||||||||\cf5 (novel ins) \cf3 ||||||||||\ +\cf2 ALT ctg 2:\cf3 TGATACGA +\f1\b \cf7 ATCgcCATC +\f0\b0 \cf3 ATGGTCA +\f1\b \cf8 ATCgcCAgC +\f0\b0 \cf3 GAACTAGTCACAT\ +\ +\cf2 4 potential hits: +\f1\b \cf4 ATCAGCATC +\f0\b0 \cf0 > +\f1\b \cf6 ATCgGCATC +\f0\b0 \cf0 > +\f1\b \cf7 ATCgcCATC +\f0\b0 \cf2 > +\f1\b \cf8 ATCgcCAgC\ +\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural + +\f0\b0 \cf0 2 hit groups: \{ +\f1\b \cf4 ATCAGCATC +\f0\b0 \cf0 , +\f1\b \cf8 ATCgcCAgC +\f0\b0 \cf2 \} and\cf0 \{ +\f1\b \cf6 ATCgGCATC +\f0\b0 \cf2 , +\f1\b \cf7 ATCgcCATC +\f0\b0 \cf2 \}\ +\cf0 Hits considered in mapQ: +\f1\b \cf4 ATCAGCATC +\f0\b0 \cf0 and +\f1\b \cf6 ATCgGCATC +\f0\b0 \cf2 (best from each group) +\f1\b \cf6 \ +\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural + +\f0\b0 \cf3 \ +\cf2 In the output SAM: +\f1\b \cf6 ATCgGCATC +\f0\b0 \cf2 as the primary SAM line with mapQ=0\ + +\f1\b \cf4 ATCAGCATC +\f0\b0 \cf2 as a supplementary line with mapQ>0\ + +\f1\b \cf8 ATCgcCAgC +\f0\b0 \cf2 as a supplementary line with mapQ>0\ + +\f1\b \cf7 ATCgcCATC +\f0\b0 \cf2 in an XA tag, not as a separate line} + VerticalPad + 0 + + Wrap + NO + + + GridInfo + + GridSpacing + 7.1999998092651367 + MajorGridSpacing + 10 + SnapsToGrid + YES + + GuidesLocked + NO + GuidesVisible + YES + HPages + 1 + ImageCounter + 1 + KeepToScale + + Layers + + + Lock + NO + Name + Layer 1 + Print + YES + View + YES + + + LayoutInfo + + Animate + NO + circoMinDist + 18 + circoSeparation + 0.0 + layoutEngine + dot + neatoSeparation + 0.0 + twopiSeparation + 0.0 + + LinksVisible + NO + MagnetsVisible + NO + MasterSheets + + ModificationDate + 2014-11-17 18:28:10 +0000 + Modifier + Heng Li + NotesVisible + NO + Orientation + 2 + OriginVisible + NO + PageBreaks + YES + PrintInfo + + NSBottomMargin + + float + 41 + + NSHorizonalPagination + + coded + BAtzdHJlYW10eXBlZIHoA4QBQISEhAhOU051bWJlcgCEhAdOU1ZhbHVlAISECE5TT2JqZWN0AIWEASqEhAFxlwCG + + NSLeftMargin + + float + 18 + + NSPaperSize + + size + {612, 792} + + NSPrintReverseOrientation + + int + 0 + + NSRightMargin + + float + 18 + + NSTopMargin + + float + 18 + + + PrintOnePage + + ReadOnly + NO + RowAlign + 1 + RowSpacing + 36 + SheetTitle + Canvas 1 + SmartAlignmentGuidesActive + YES + SmartDistanceGuidesActive + YES + UniqueID + 1 + UseEntirePage + + VPages + 1 + WindowInfo + + CurrentSheet + 0 + ExpandedCanvases + + + name + Canvas 1 + + + Frame + {{367, 6}, {710, 872}} + ListView + + OutlineWidth + 142 + RightSidebar + + ShowRuler + + Sidebar + + SidebarWidth + 120 + VisibleRegion + {{0, 0}, {575, 733}} + Zoom + 1 + ZoomValues + + + Canvas 1 + 1 + 1 + + + + + diff --git a/extras/alt-demo.png b/extras/alt-demo.png new file mode 100644 index 0000000..71f4976 Binary files /dev/null and b/extras/alt-demo.png differ diff --git a/extras/run-HLA b/extras/run-HLA new file mode 100755 index 0000000..4fee16f --- /dev/null +++ b/extras/run-HLA @@ -0,0 +1,20 @@ +#!/bin/bash + +ctg_opt="" +if [ $# -gt 1 ] && [ $1 == '-A' ]; then + ctg_opt="-A" + shift +fi + +if [ $# -eq 0 ]; then + echo "Usage: $0 " + exit 1 +fi + +for f in $1.HLA-*.fq; do + gene=`echo $f | perl -pe 's/^.*(HLA-[A-Z]+[0-9]*).*fq$/$1/'` + echo -e "\n*** Processing gene $gene...\n" >&2 + `dirname $0`/typeHLA.sh $ctg_opt $1 $gene +done + +ls $1.HLA-*.gt | xargs -i echo grep ^GT {} \| head -1 | sh | sed "s,^GT,$1," diff --git a/extras/run-bwamem b/extras/run-bwamem new file mode 100755 index 0000000..9d5fa19 --- /dev/null +++ b/extras/run-bwamem @@ -0,0 +1,160 @@ +#!/usr/bin/env perl + +use strict; +use warnings; +use Getopt::Std; + +my %opts = (t=>1); +getopts("SadskHo:R:x:t:", \%opts); + +die(' +Usage: run-bwamem [options] [file2] + +Options: -o STR prefix for output files [inferred from input] + -R STR read group header line such as \'@RG\tID:foo\tSM:bar\' [null] + -x STR read type: pacbio, ont2d or intractg [default] + intractg: intra-species contig (kb query, highly similar) + pacbio: pacbio subreads (~10kb query, high error rate) + ont2d: Oxford Nanopore reads (~10kb query, higher error rate) + -t INT number of threads [1] + + -a trim HiSeq2000/2500 PE resequencing adapters (via trimadap) + -d mark duplicate (via samblaster) + -S for SAM/BAM input, don\'t shuffle + -s sort the output alignment (requring more RAM) + -H skip HLA typing + -k keep temporary files generated by typeHLA + +Examples: + + * Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing: + + run-bwamem -o prefix -t8 -R"@RG\tID:foo\tSM:bar" hs38d6.fa read1.fq.gz read2.fq.gz + + * Remap coordinate-sorted BAM, trim Illumina PE adapters and sort the output. The BAM + may contain single-end or paired-end reads, or a mixture of the two types. Read groups + are not transferred to the output BAM, though. + + run-bwamem -sao prefix hs38d6.fa old-srt.bam + + * Remap name-grouped BAM and mark duplicates. Note that in this case, all reads from + a single library should be aligned at the same time. Paired-end only. + + run-bwamem -Sdo prefix hs38d6.fa old-unsrt.bam + +Output files: + + {-o}.aln.bam - final alignment + {-o}.hla.top - best genotypes for the 6 classical HLA genes (if there are HLA-* contigs) + {-o}.hla.all - additional HLA genotypes consistent with data + {-o}.log.* - log files + +') if @ARGV < 2; + +warn("WARNING: many programs require read groups. Please specify with -R if you can.\n") unless defined($opts{R}); + +my $idx = $ARGV[0]; + +my $exepath = $0 =~/^\S+\/[^\/\s]+/? $0 : &which($0); +my $root = $0 =~/^(\S+)\/[^\/\s]+/? $1 : undef; +die "ERROR: failed to locate the 'bwa.kit' directory\n" if !defined($root); + +die("ERROR: failed to locate the BWA index. Please run '$root/bwa index -p $idx ref.fa'.\n") + unless (-f "$idx.bwt" && -f "$idx.pac" && -f "$idx.sa" && -f "$idx.ann" && -f "$idx.amb"); + +if (@ARGV >= 3 && $ARGV[1] =~ /\.(bam|sam|sam\.gz)$/) { + warn("WARNING: for SAM/BAM input, only the first sequence file is used.\n"); + @ARGV = 2; +} + +if (defined($opts{p}) && @ARGV >= 3) { + warn("WARNING: option -P is ignored as there are two input sequence files.\n"); + delete $opts{p}; +} + +my $prefix; +if (defined $opts{o}) { + $prefix = $opts{o}; +} elsif (@ARGV >= 3) { + my $len = length($ARGV[1]) < length($ARGV[2])? length($ARGV[1]) : length($ARGV[2]); + my $i; + for ($i = 0; $i < $len; ++$i) { + last if substr($ARGV[1], $i, 1) ne substr($ARGV[2], $i, 1) + } + $prefix = substr($ARGV[1], 0, $i) if $i > 0; +} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|sam|sam\.gz|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) { + $prefix = $1; +} +die("ERROR: failed to identify the prefix for output. Please specify -p.\n") unless defined($prefix); + +my $size = 0; +my $comp_ratio = 3.; +for my $f (@ARGV[1..$#ARGV]) { + my @a = stat($f); + my $s = $a[7]; + die("ERROR: failed to read file $f\n") if !defined($s); + $s *= $comp_ratio if $f =~ /\.(gz|bam)$/; + $size += int($s) + 1; +} + +my $is_pe = (defined($opts{p}) || @ARGV >= 3)? 1 : 0; +my $is_sam = $ARGV[1] =~ /\.(sam|sam\.gz)$/? 1 : 0; +my $is_bam = $ARGV[1] =~ /\.bam$/? 1 : 0; + +if (defined($opts{x})) { + delete($opts{d}); delete($opts{a}); delete $opts{p}; +} + +my $cmd = ''; +if ($is_sam || $is_bam) { + my $cmd_sam2bam = $is_sam? "$root/htsbox samview -uS $ARGV[1] \\\n" : "cat $ARGV[1] \\\n"; + my $ntmps = int($size / 4e9) + 1; + my $cmd_shuf = ($is_bam || $is_sam) && !defined($opts{S})? " | $root/htsbox bamshuf -uOn$ntmps - $prefix.shuf \\\n" : ""; + my $cmd_bam2fq = " | $root/htsbox bam2fq -O - \\\n"; + $cmd = $cmd_sam2bam . $cmd_shuf . $cmd_bam2fq; +} elsif (@ARGV >= 3) { + $cmd = "$root/seqtk mergepe $ARGV[1] $ARGV[2] \\\n"; +} else { + $cmd = "cat $ARGV[1] \\\n"; +} + +my $bwa_opts = "-p " . ($opts{t} > 1? "-t$opts{t} " : "") . (defined($opts{x})? "-x $opts{x} " : "") . (defined($opts{R})? "-R'$opts{R}' " : ""); + +$cmd .= " | $root/trimadap 2> $prefix.log.trim \\\n" if defined($opts{a}); +$cmd .= " | $root/bwa mem $bwa_opts$ARGV[0] - 2> $prefix.log.bwamem \\\n"; +$cmd .= " | $root/samblaster 2> $prefix.log.dedup \\\n" if defined($opts{d}); + +my $has_hla = 0; +if (-f "$ARGV[0].alt") { + my $fh; + open($fh, "$ARGV[0].alt") || die; + while (<$fh>) { + $has_hla = 1 if /^HLA-[^\s\*]+\*\d+/; + } + close($fh); + my $hla_pre = $has_hla? "-p $prefix.hla " : ""; + $cmd .= " | $root/k8 $root/bwa-postalt.js $hla_pre$ARGV[0].alt \\\n"; +} + +my $t_sort = $opts{t} < 4? $opts{t} : 4; +$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n"; + +if ($has_hla && !defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) { + $cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n"; + $cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n"; + $cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k}); +} + +print $cmd; + +sub which +{ + my $file = shift; + my $path = (@_)? shift : $ENV{PATH}; + return if (!defined($path)); + foreach my $x (split(":", $path)) { + $x =~ s/\/$//; + return "$x/$file" if (-x "$x/$file"); + } + return; +} diff --git a/extras/run-gen-ref b/extras/run-gen-ref new file mode 100755 index 0000000..3be3325 --- /dev/null +++ b/extras/run-gen-ref @@ -0,0 +1,30 @@ +#!/bin/bash + +root=`dirname $0` + +url38="ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz" +url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz" + +if [ $# -eq 0 ]; then + echo "Usage: $0 " + exit 1; +fi + +if [ $1 == "hs38d6" ]; then + (wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38d6-extra.fa) > $1.fa + [ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38d6.fa.alt $1.fa.alt +elif [ $1 == "hs38a" ]; then + wget -O- $url38 | gzip -dc > $1.fa + [ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38d6.fa.alt > $1.fa.alt +elif [ $1 == "hs38" ]; then + wget -O- $url38 | gzip -dc | awk '/^>/{f=/_alt/?0:1}f' > $1.fa +elif [ $1 == "hs37d5" ]; then + wget -O- $url37d5 | gzip -dc > $1.fa 2>/dev/null +elif [ $1 == "hs37" ]; then + wget -O- $url37d5 | gzip -dc 2>/dev/null | awk '/^>/{f=/>hs37d5/?0:1}f' > $1.fa +else + echo "ERROR: unknown genome build" +fi + +[ ! -f $1.fa.bwt ] && echo -e "\nPlease run 'bwa index $1.fa'...\n" + diff --git a/extras/typeHLA-selctg.js b/extras/typeHLA-selctg.js new file mode 100644 index 0000000..0e02a65 --- /dev/null +++ b/extras/typeHLA-selctg.js @@ -0,0 +1,62 @@ +var min_ovlp = 30; + +if (arguments.length < 3) { + print("Usage: k8 selctg.js [min_ovlp="+min_ovlp+"]"); + exit(1); +} + +if (arguments.length >= 4) min_ovlp = parseInt(arguments[3]); +var gene = arguments[0]; + +var buf = new Bytes(); + +var h = {}; +var file = new File(arguments[1]); +while (file.readline(buf) >= 0) { + var t = buf.toString().split("\t"); + if (t[3] != gene) continue; + if (h[t[0]] == null) h[t[0]] = []; + h[t[0]].push([parseInt(t[1]), parseInt(t[2])]); +} +file.close(); + +var s = {}, re = /(\d+)([MIDSHN])/g; +file = new File(arguments[2]); +while (file.readline(buf) >= 0) { + var line = buf.toString(); + var m, t = line.split("\t"); + var x = h[t[2]]; + if (x == null) continue; + + var start = parseInt(t[3]) - 1, end = start; + while ((m = re.exec(t[5])) != null) // parse CIGAR to get the end position + if (m[2] == 'M' || m[2] == 'D') + end += parseInt(m[1]); + + var max_ovlp = 0; + for (var i = 0; i < x.length; ++i) { + var max_left = x[i][0] > start? x[i][0] : start; + var min_rght = x[i][1] < end ? x[i][1] : end; + max_ovlp = max_ovlp > min_rght - max_left? max_ovlp : min_rght - max_left; + } + + var AS = null, XS = null; + if ((m = /AS:i:(\d+)/.exec(line)) != null) AS = parseInt(m[1]); + if ((m = /XS:i:(\d+)/.exec(line)) != null) XS = parseInt(m[1]); + + if (s[t[0]] == null) s[t[0]] = []; + s[t[0]].push([AS, XS, max_ovlp]); +} +file.close(); + +buf.destroy(); + +for (var x in s) { + var is_rejected = false, y = s[x]; + y.sort(function(a,b) {return b[0]-a[0]}); + for (var i = 0; i < y.length && y[i][0] == y[0][0]; ++i) + if (y[0][2] < min_ovlp || y[i][0] == y[i][1]) + is_rejected = true; + if (is_rejected) continue; + print(x); +} diff --git a/extras/typeHLA.js b/extras/typeHLA.js new file mode 100644 index 0000000..b265d07 --- /dev/null +++ b/extras/typeHLA.js @@ -0,0 +1,496 @@ +/***************************************************************** + * The K8 Javascript interpreter is required to run this script. * + * * + * Source code: https://github.com/attractivechaos/k8 * + * Binary: http://sourceforge.net/projects/lh3/files/k8/ * + *****************************************************************/ + +var getopt = function(args, ostr) { + var oli; // option letter list index + if (typeof(getopt.place) == 'undefined') + getopt.ind = 0, getopt.arg = null, getopt.place = -1; + if (getopt.place == -1) { // update scanning pointer + if (getopt.ind >= args.length || args[getopt.ind].charAt(getopt.place = 0) != '-') { + getopt.place = -1; + return null; + } + if (getopt.place + 1 < args[getopt.ind].length && args[getopt.ind].charAt(++getopt.place) == '-') { // found "--" + ++getopt.ind; + getopt.place = -1; + return null; + } + } + var optopt = args[getopt.ind].charAt(getopt.place++); // character checked for validity + if (optopt == ':' || (oli = ostr.indexOf(optopt)) < 0) { + if (optopt == '-') return null; // if the user didn't specify '-' as an option, assume it means null. + if (getopt.place < 0) ++getopt.ind; + return '?'; + } + if (oli+1 >= ostr.length || ostr.charAt(++oli) != ':') { // don't need argument + getopt.arg = null; + if (getopt.place < 0 || getopt.place >= args[getopt.ind].length) ++getopt.ind, getopt.place = -1; + } else { // need an argument + if (getopt.place >= 0 && getopt.place < args[getopt.ind].length) + getopt.arg = args[getopt.ind].substr(getopt.place); + else if (args.length <= ++getopt.ind) { // no arg + getopt.place = -1; + if (ostr.length > 0 && ostr.charAt(0) == ':') return ':'; + return '?'; + } else getopt.arg = args[getopt.ind]; // white space + getopt.place = -1; + ++getopt.ind; + } + return optopt; +} + +/************************ + * Command line parsing * + ************************/ + +var ver = "r19"; +var c, thres_len = 50, thres_ratio = .8, thres_nm = 5, thres_frac = .33, dbg = false; + +// parse command line options +while ((c = getopt(arguments, "vdl:n:f:")) != null) { + if (c == 'l') thres_len = parseInt(getopt.arg); + else if (c == 'n') thres_nm = parseInt(getopt.arg); + else if (c == 'd') dbg = true; + else if (c == 'f') thres_frac = parseFloat(getopt.arg); + else if (c == 'v') { print(ver); exit(0); } +} +if (arguments.length == getopt.ind) { + print(""); + print("Usage: k8 typeHLA.js [options] \n"); + print("Options: -n INT drop a contig if the edit distance to the closest gene is >INT ["+thres_nm+"]"); + print(" -l INT drop a contig if its match too short ["+thres_len+"]"); + print(" -f FLOAT drop inconsistent contigs if their length = 0) { + var m, mm, line = buf.toString(); + var t = line.split("\t"); + var flag = parseInt(t[1]); + // SAM header + if (t[0].charAt(0) == '@') { + if (t[0] == '@SQ' && (m = /LN:(\d+)/.exec(line)) != null && (mm = /SN:(\S+)/.exec(line)) != null) + len[mm[1]] = parseInt(m[1]); + continue; + } + // parse gene name and exon number + var gene = null, exon = null; + if ((m = /^(HLA-[^\s_]+)_(\d+)/.exec(t[0])) != null) { + gene = m[1], exon = parseInt(m[2]) - 1; + if (gcnt[exon] == null) gcnt[exon] = {}; + gcnt[exon][gene] = true; + } + if (gene == null || exon == null || t[2] == '*') continue; + // parse clipping and aligned length + var x = 0, ts = parseInt(t[3]) - 1, te = ts, clip = [0, 0]; + while ((m = re_cigar.exec(t[5])) != null) { + var l = parseInt(m[1]); + if (m[2] == 'M') x += l, te += l; + else if (m[2] == 'I') x += l; + else if (m[2] == 'D') te += l; + else if (m[2] == 'S' || m[2] == 'H') clip[x==0?0:1] = l; + } + var tl = len[t[2]]; + var left = ts < clip[0]? ts : clip[0]; + var right = tl - te < clip[1]? tl - te : clip[1]; + var qs, qe, ql = clip[0] + x + clip[1]; + if (flag & 16) qs = clip[1], qe = ql - clip[0]; + else qs = clip[0], qe = ql - clip[1]; + var nm = (m = /\tNM:i:(\d+)/.exec(line)) != null? parseInt(m[1]) : 0; + list.push([t[2], gene, exon, ts, te, nm, left + right, qs, qe, ql]); // left+right should be 0 given a prefix-suffix alignment +} + +buf.destroy(); +file.close(); + +/************************************** + * Prepare data structures for typing * + **************************************/ + +// identify the primary exons, the exons associated with most genes +var pri_exon = [], n_pri_exons; +{ + var cnt = [], max = 0; + // count the number of genes per exon and track the max + for (var e = 0; e < gcnt.length; ++e) { + if (gcnt[e] != null) { + var c = 0, h = gcnt[e]; + for (var x in h) ++c; + cnt[e] = c; + max = max > c? max : c; + } else cnt[e] = 0; + } + warn("- Number of genes for each exon: [" +cnt.join(",") + "]"); + // find primary exons + var pri_list = []; + for (var e = 0; e < cnt.length; ++e) { + if (cnt[e] == max) pri_list.push(e + 1); + pri_exon[e] = cnt[e] == max? 1 : 0; + } + warn("- List of primary exon(s): ["+pri_list.join(",")+"]"); + n_pri_exons = pri_list.length; +} + +// convert strings to integers (for performance) +var ghash = {}, glist = [], chash = {}, clist = [], elist = []; +for (var i = 0; i < list.length; ++i) { + if (ghash[list[i][1]] == null) { + ghash[list[i][1]] = glist.length; + glist.push(list[i][1]); + } + if (chash[list[i][0]] == null) { + chash[list[i][0]] = clist.length; + clist.push(list[i][0]); + } + var g = ghash[list[i][1]]; + if (elist[g] == null) elist[g] = {}; + elist[g][list[i][2]] = true; +} + +// extract the 3rd and 4th digits +var gsub = [], gsuf = []; +for (var i = 0; i < glist.length; ++i) { + var m = /^HLA-[^*\s]+\*\d+:(\d+).*([A-Z]?)$/.exec(glist[i]); + gsub[i] = parseInt(m[1]); + gsuf[i] = /[A-Z]$/.test(glist[i])? 1 : 0; +} + +/************************************************* + * Collect genes with perfect matches on primary * + *************************************************/ + +// collect exons with fully covered by perfect match(es) +var perf_exons = []; + +function push_perf_exons(matches, last) +{ + matches.sort(function(a, b) { return a[0]-b[0]; }); + var cov = 0, start = 0, end = 0; + for (var i = 0; i < matches.length; ++i) { + if (matches[i][3] > 0) continue; + if (matches[i][0] <= end) + end = end > matches[i][1]? end : matches[i][1]; + else cov += end - start, start = matches[i][0], end = matches[i][1]; + } + cov += end - start; + if (matches[0][2] == cov) { + if (perf_exons[last[1]] == null) perf_exons[last[1]] = []; + //print(last[0], last[1], ghash[last[0]]); + perf_exons[last[1]].push(ghash[last[0]]); + } +} + +var last = [null, -1], matches = []; +for (var i = 0; i < list.length; ++i) { + var li = list[i]; + if (last[0] != li[1] || last[1] != li[2]) { + if (matches.length) push_perf_exons(matches, last); + matches = []; + last = [li[1], li[2]]; + } + matches.push([li[7], li[8], li[9], li[5]+li[6]]); +} +if (matches.length) push_perf_exons(matches, last); + +// for each gene, count how many primary exons are perfect +var pg_aux_cnt = {}; +for (var e = 0; e < perf_exons.length; ++e) { + if (!pri_exon[e]) continue; + var pe = perf_exons[e]; + var n = pe? pe.length : 0; + for (var i = 0; i < n; ++i) { + var g = pe[i]; + if (pg_aux_cnt[g] == null) pg_aux_cnt[g] = 1; + else ++pg_aux_cnt[g]; + } +} + +// find genes with perfect matches on the primary exons +var perf_genes = []; +for (var g in pg_aux_cnt) + if (pg_aux_cnt[g] == n_pri_exons) + perf_genes.push(parseInt(g)); +warn("- Found " +perf_genes.length+ " genes fully covered by perfect matches on the primary exon(s)"); + +var h_perf_genes = {}; +for (var i = 0; i < perf_genes.length; ++i) { + if (dbg) print("PG", glist[perf_genes[i]]); + h_perf_genes[perf_genes[i]] = true; +} + +/******************* + * Filter hit list * + *******************/ + +// reorganize hits to exons +function list2exons(list, flt_flag, perf_hash) +{ + var exons = []; + for (var i = 0; i < list.length; ++i) { + var li = list[i], c = chash[li[0]], g = ghash[li[1]]; + if (flt_flag != null && flt_flag[c] == 1) continue; + if (perf_hash != null && !perf_hash[g]) continue; + if (exons[li[2]] == null) exons[li[2]] = []; + exons[li[2]].push([c, g, li[5] + li[6], li[4] - li[3]]); + } + return exons; +} + +var exons = list2exons(list), flt_flag = [], ovlp_len = []; +for (var c = 0; c < clist.length; ++c) flt_flag[c] = ovlp_len[c] = 0; +for (var e = 0; e < exons.length; ++e) { + if (!pri_exon[e]) continue; + var ee = exons[e]; + var max_len = []; + for (var c = 0; c < clist.length; ++c) max_len[c] = 0; + for (var i = 0; i < ee.length; ++i) { + var l = ee[i][3] - ee[i][2]; + if (l < 1) l = 1; + if (max_len[ee[i][0]] < l) max_len[ee[i][0]] = l; + } + for (var c = 0; c < clist.length; ++c) ovlp_len[c] += max_len[c]; + for (var i = 0; i < ee.length; ++i) + flt_flag[ee[i][0]] |= (!h_perf_genes[ee[i][1]] || ee[i][2])? 1 : 1<<1; +} + +var l_cons = 0, l_incons = 0; +for (var c = 0; c < clist.length; ++c) + if (flt_flag[c]&2) l_cons += ovlp_len[c]; + else if (flt_flag[c] == 1) l_incons += ovlp_len[c]; + +warn("- Total length of contigs consistent/inconsistent with perfect genes: " +l_cons+ "/" +l_incons); +var attempt_perf = (l_incons/(l_cons+l_incons) < thres_frac); + +/******************************** + * Core function for genotyping * + ********************************/ + +function type_gene(perf_mode) +{ + if (perf_mode) { + var flt_list = []; + for (var c = 0; c < clist.length; ++c) + if (flt_flag[c] == 1) flt_list.push(clist[c]); + warn(" - Filtered " +flt_list.length+ " inconsistent contig(s): [" +flt_list.join(",")+ "]"); + exons = list2exons(list, flt_flag, h_perf_genes); + } else exons = list2exons(list); + + /*********************** + * Score each genotype * + ***********************/ + + // initialize genotype scores + var pair = []; + for (var i = 0; i < glist.length; ++i) { + pair[i] = []; + for (var j = 0; j <= i; ++j) + pair[i][j] = 0; + } + + // these two arrays are used to output debugging information + var score = [], ctg = []; + + function type_exon(e, gt_list) + { + function update_pair(x, m, is_pri) + { + var y, z; + y = (x>>14&0xff) + m < 0xff? (x>>14&0xff) + m : 0xff; + if (is_pri) z = (x>>22) + m < 0xff? (x>>22) + m : 0xff; + else z = x>>22; + return z<<22 | y<<14 | ((x&0x3fff) + (1<<6|is_pri)); + } + + score[e] = []; ctg[e] = []; + if (exons[e] == null) return; + var ee = exons[e], is_pri = pri_exon[e]? 1 : 0; + // find contigs and genes associated with the current exon + var ch = {}, gh = {}; + for (var i = 0; i < ee.length; ++i) + if (elist[ee[i][1]][e] != null) + ch[ee[i][0]] = true, gh[ee[i][1]] = true; + var ga = [], ca = ctg[e]; + for (var c in ch) ca.push(parseInt(c)); + for (var g in gh) ga.push(parseInt(g)); + var named_ca = []; + for (var i = 0; i < ca.length; ++i) named_ca.push(clist[ca[i]]); + warn(" - Processing exon "+(e+1)+" (" +ga.length+ " genes; " +ca.length+ " contigs: [" +named_ca.join(", ")+ "])..."); + // set unmapped entries to high mismatch + var sc = score[e]; + for (var k = 0; k < ga.length; ++k) { + var g = ga[k]; + if (sc[g] == null) sc[g] = []; + for (var i = 0; i < ca.length; ++i) + sc[g][ca[i]] = 0xff; + } + // convert representation again and compute max_len[] + var max_len = []; + for (var i = 0; i < ee.length; ++i) { + var c = ee[i][0], g = ee[i][1]; + if (gh[g] == null || ch[c] == null) continue; + sc[g][c] = sc[g][c] < ee[i][2]? sc[g][c] : ee[i][2]; + if (max_len[c] == null) max_len[c] = 0; + max_len[c] = max_len[c] > ee[i][3]? max_len[c] : ee[i][3]; + } + // drop mismapped contigs + var max_max_len = 0; + for (var k = 0; k < ca.length; ++k) + max_max_len = max_max_len > max_len[ca[k]]? max_max_len : max_len[ca[k]]; + var dropped = []; + for (var k = 0; k < ca.length; ++k) { + var min = 0x7fffffff, c = ca[k]; + for (var i = 0; i < ga.length; ++i) { + var g = ga[i]; + min = min < sc[g][c]? min : sc[g][c]; + } + dropped[c] = min > thres_nm? true : false; + if (max_len[c] < thres_len && max_len[c] < thres_ratio * max_max_len) dropped[c] = true; + if (dropped[c]) warn(" . Dropped low-quality contig " +clist[c]+ " (minNM=" +min+ "; maxLen=" +max_len[c]+ ")"); + } + // fill the pair array + if (gt_list == null) { + for (var i = 0; i < ga.length; ++i) { + var m = 0, gi = ga[i], g1 = sc[gi]; + // homozygous + for (var k = 0; k < ca.length; ++k) { + var c = ca[k]; + if (!dropped[c]) m += g1[c]; + } + pair[gi][gi] = update_pair(pair[gi][gi], m, is_pri); + // heterozygous + for (var j = i + 1; j < ga.length; ++j) { + var gj = ga[j], g2 = sc[gj], m = 0, a = [0, 0]; + for (var k = 0; k < ca.length; ++k) { + var c = ca[k]; + if (!dropped[c]) { + m += g1[c] < g2[c]? g1[c] : g2[c]; + ++a[g1[c]>22? min_nm_pri : pair[i][j]>>22; + + var gt_list = []; + for (var i = 0; i < glist.length; ++i) + for (var j = 0; j <= i; ++j) + if ((pair[i][j]&63) == n_pri_exons && pair[i][j]>>22 == min_nm_pri) + gt_list.push([i, j]); + + warn(" - Collected " +gt_list.length+ " top genotypes on the primary exon(s); minimal edit distance: " +min_nm_pri); + + // type other exons + warn(" - Processing other exon(s)..."); + for (var e = 0; e < exons.length; ++e) + if (!pri_exon[e]) type_exon(e, gt_list); + + /***************************** + * Choose the best genotypes * + *****************************/ + + // genotyping + var min_nm = 0x7fffffff; + for (var i = 0; i < glist.length; ++i) + for (var j = 0; j <= i; ++j) + if ((pair[i][j]&63) == n_pri_exons) + min_nm = min_nm < pair[i][j]>>14? min_nm : pair[i][j]>>14; + + var out = []; + for (var i = 0; i < glist.length; ++i) + for (var j = 0; j <= i; ++j) + if ((pair[i][j]&63) == n_pri_exons && pair[i][j]>>14 <= min_nm + 1) + out.push([pair[i][j]>>14, pair[i][j]>>6&0xff, i, j, (gsuf[i] + gsuf[j])<<16|(gsub[i] + gsub[j])]); + + out.sort(function(a, b) { return a[0]!=b[0]? a[0]-b[0] : a[1]!=b[1]? b[1]-a[1] : a[4]!=b[4]? a[4]-b[4] : a[2]!=b[2]? a[2]-b[2] : a[3]-b[3]}); + + return out; +} + +/********************** + * Perform genotyping * + **********************/ + +warn("- Typing in the imperfect mode..."); +var rst = type_gene(false); +if (attempt_perf) { + warn("- Typing in the perfect mode..."); + var rst_perf = type_gene(true); + warn("- Imperfect vs perfect mode: [" +(rst[0][0]>>8&0xff)+ "," +(rst[0][0]&0xff)+ "] vs [" +(rst_perf[0][0]>>8&0xff)+ "," +(rst_perf[0][0]&0xff)+ "]"); + if (rst_perf[0][0] < rst[0][0]) { + warn("- Chose the result from the perfect mode"); + rst = rst_perf; + } else warn("- Chose the result from the imperfect mode"); +} else warn("- Perfect mode is not attempted"); + +/********** + * Output * + **********/ + +for (var i = 0; i < rst.length; ++i) + print("GT", glist[rst[i][3]], glist[rst[i][2]], rst[i][0]>>8&0xff, rst[i][0]&0xff, rst[i][1]); diff --git a/extras/typeHLA.sh b/extras/typeHLA.sh new file mode 100755 index 0000000..b5a2b4e --- /dev/null +++ b/extras/typeHLA.sh @@ -0,0 +1,48 @@ +#!/bin/bash + +is_ctg=0 + +if [ $# -gt 1 ] && [ $1 == '-A' ]; then + is_ctg=1 + shift +fi + +if [ $# -lt 2 ]; then + echo "Usage: $0 [-A] " + exit 1 +fi + +preres="resource-human-HLA" +root=`dirname $0` +pre=$1.$2 + +if [ ! -s $pre.fq ]; then + echo '** Empty input file. Abort!' >&2 + exit 0 +fi + +if [ $is_ctg -eq 0 ]; then + echo "** De novo assembling..." >&2 + len=`$root/seqtk comp $pre.fq | awk '{++x;y+=$2}END{printf("%.0f\n", y/x)}'` + $root/fermi2.pl unitig -f $root/fermi2 -r $root/ropebwt2 -t2 -l$len -p $pre.tmp $pre.fq > $pre.tmp.mak + make -f $pre.tmp.mak >&2 + cp $pre.tmp.mag.gz $pre.mag.gz +else + rm -f $pre.tmp.mag.gz + ln -s $pre.fq $pre.tmp.mag.gz +fi + +echo "** Selecting contigs overlapping target exons..." >&2 +(ls $root/$preres/HLA-ALT-idx/*.fa.bwt | sed s,.bwt,, | xargs -i $root/bwa mem -t2 -B1 -O1 -E1 {} $pre.tmp.mag.gz 2>/dev/null) | grep -v ^@ | sort -k3,3 -k4,4n | gzip > $pre.tmp.ALT.sam.gz +$root/k8 $root/typeHLA-selctg.js $2 $root/$preres/HLA-ALT-exons.bed $pre.tmp.ALT.sam.gz | $root/seqtk subseq $pre.tmp.mag.gz - | gzip -1 > $pre.tmp.fq.gz + +echo "** Mapping exons to de novo contigs..." >&2 +$root/bwa index -p $pre.tmp $pre.tmp.fq.gz 2>/dev/null +$root/seqtk comp $root/$preres/HLA-CDS.fa | cut -f1 | grep ^$2 | $root/seqtk subseq $root/$preres/HLA-CDS.fa - | $root/bwa mem -aD.1 -t2 $pre.tmp - 2>/dev/null | gzip -1 > $pre.sam.gz + +echo "** Typing..." >&2 +$root/k8 $root/typeHLA.js $pre.sam.gz > $pre.gt + +# delete temporary files +rm -f $pre.tmp.* +[ $is_ctg -eq 1 ] && rm -f $pre.mag.gz diff --git a/fastmap.c b/fastmap.c index 18f3b0e..988b6ed 100644 --- a/fastmap.c +++ b/fastmap.c @@ -55,7 +55,7 @@ int main_mem(int argc, char *argv[]) opt = mem_opt_init(); memset(&opt0, 0, sizeof(mem_opt_t)); - while ((c = getopt(argc, argv, "epaFMCSPHVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:W:x:G:h:y:K:")) >= 0) { + while ((c = getopt(argc, argv, "epaFMCSPHVYjk:c:v:s:r:t:R:A:B:O:E:U:w:L:d:T:Q:D:m:I:N:W:x:G:h:y:K:X:")) >= 0) { if (c == 'k') opt->min_seed_len = atoi(optarg), opt0.min_seed_len = 1; else if (c == 'x') mode = optarg; else if (c == 'w') opt->w = atoi(optarg), opt0.w = 1; @@ -66,7 +66,7 @@ int main_mem(int argc, char *argv[]) else if (c == 't') opt->n_threads = atoi(optarg), opt->n_threads = opt->n_threads > 1? opt->n_threads : 1; else if (c == 'P') opt->flag |= MEM_F_NOPAIRING; else if (c == 'a') opt->flag |= MEM_F_ALL; - else if (c == 'p') opt->flag |= MEM_F_PE; + else if (c == 'p') opt->flag |= MEM_F_PE | MEM_F_SMARTPE; else if (c == 'M') opt->flag |= MEM_F_NO_MULTI; else if (c == 'S') opt->flag |= MEM_F_NO_RESCUE; else if (c == 'e') opt->flag |= MEM_F_SELF_OVLP; @@ -87,6 +87,7 @@ int main_mem(int argc, char *argv[]) else if (c == 'y') opt->max_mem_intv = atol(optarg), opt0.max_mem_intv = 1; else if (c == 'C') copy_comment = 1; else if (c == 'K') fixed_chunk_size = atoi(optarg); + else if (c == 'X') opt->mask_level = atof(optarg); else if (c == 'h') { opt0.max_XA_hits = opt0.max_XA_hits_alt = 1; opt->max_XA_hits = opt->max_XA_hits_alt = strtol(optarg, &p, 10); @@ -166,7 +167,7 @@ int main_mem(int argc, char *argv[]) fprintf(stderr, " intractg: -B9 -O16 -L5 (intra-species contigs to ref)\n"); // fprintf(stderr, " pbread: -k13 -W40 -c1000 -r10 -A1 -B1 -O1 -E1 -N25 -FeaD.001\n"); fprintf(stderr, "\nInput/output options:\n\n"); - fprintf(stderr, " -p first query file consists of interleaved paired-end sequences\n"); + fprintf(stderr, " -p smart pairing (ignoring in2.fq)\n"); fprintf(stderr, " -R STR read group header line such as '@RG\\tID:foo\\tSM:bar' [null]\n"); fprintf(stderr, " -j ignore ALT contigs\n"); fprintf(stderr, "\n"); @@ -247,7 +248,7 @@ int main_mem(int argc, char *argv[]) if (optind + 2 < argc) { if (opt->flag&MEM_F_PE) { if (bwa_verbose >= 2) - fprintf(stderr, "[W::%s] when '-p' is in use, the second query file will be ignored.\n", __func__); + fprintf(stderr, "[W::%s] when '-p' is in use, the second query file is ignored.\n", __func__); } else { ko2 = kopen(argv[optind + 2], &fd2); if (ko2 == 0) { @@ -264,11 +265,6 @@ int main_mem(int argc, char *argv[]) actual_chunk_size = fixed_chunk_size > 0? fixed_chunk_size : opt->chunk_size * opt->n_threads; while ((seqs = bseq_read(actual_chunk_size, &n, ks, ks2)) != 0) { int64_t size = 0; - if ((opt->flag & MEM_F_PE) && (n&1) == 1) { - if (bwa_verbose >= 2) - fprintf(stderr, "[W::%s] odd number of reads in the PE mode; last read dropped\n", __func__); - n = n>>1<<1; - } if (!copy_comment) for (i = 0; i < n; ++i) { free(seqs[i].comment); seqs[i].comment = 0; @@ -276,7 +272,27 @@ int main_mem(int argc, char *argv[]) for (i = 0; i < n; ++i) size += seqs[i].l_seq; if (bwa_verbose >= 3) fprintf(stderr, "[M::%s] read %d sequences (%ld bp)...\n", __func__, n, (long)size); - mem_process_seqs(opt, idx->bwt, idx->bns, idx->pac, n_processed, n, seqs, pes0); + if (opt->flag & MEM_F_SMARTPE) { + bseq1_t *sep[2]; + int i, n_sep[2]; + mem_opt_t tmp_opt = *opt; + bseq_classify(n, seqs, n_sep, sep); + if (bwa_verbose >= 3) + fprintf(stderr, "[M::%s] %d single-end sequences; %d paired-end sequences\n", __func__, n_sep[0], n_sep[1]); + if (n_sep[0]) { + tmp_opt.flag &= ~MEM_F_PE; + mem_process_seqs(&tmp_opt, idx->bwt, idx->bns, idx->pac, n_processed, n_sep[0], sep[0], 0); + for (i = 0; i < n_sep[0]; ++i) + seqs[sep[0][i].id].sam = sep[0][i].sam; + } + if (n_sep[1]) { + tmp_opt.flag |= MEM_F_PE; + mem_process_seqs(&tmp_opt, idx->bwt, idx->bns, idx->pac, n_processed + n_sep[0], n_sep[1], sep[1], pes0); + for (i = 0; i < n_sep[1]; ++i) + seqs[sep[1][i].id].sam = sep[1][i].sam; + } + free(sep[0]); free(sep[1]); + } else mem_process_seqs(opt, idx->bwt, idx->bns, idx->pac, n_processed, n, seqs, pes0); n_processed += n; for (i = 0; i < n; ++i) { if (seqs[i].sam) err_fputs(seqs[i].sam, stdout); diff --git a/kseq.h b/kseq.h index 642cd33..f3862c6 100644 --- a/kseq.h +++ b/kseq.h @@ -54,9 +54,9 @@ #define __KS_BASIC(type_t, __bufsize) \ static inline kstream_t *ks_init(type_t f) \ { \ - kstream_t *ks = (kstream_t*)calloc(1, sizeof(kstream_t)); \ + kstream_t *ks = (kstream_t*)calloc(1, sizeof(kstream_t)); \ ks->f = f; \ - ks->buf = (unsigned char*)malloc(__bufsize); \ + ks->buf = (unsigned char*)malloc(__bufsize); \ return ks; \ } \ static inline void ks_destroy(kstream_t *ks) \ @@ -74,8 +74,7 @@ if (ks->begin >= ks->end) { \ ks->begin = 0; \ ks->end = __read(ks->f, ks->buf, __bufsize); \ - if (ks->end < __bufsize) ks->is_eof = 1; \ - if (ks->end == 0) return -1; \ + if (ks->end == 0) { ks->is_eof = 1; return -1;} \ } \ return (int)ks->buf[ks->begin++]; \ } @@ -95,17 +94,16 @@ typedef struct __kstring_t { #define __KS_GETUNTIL(__read, __bufsize) \ static int ks_getuntil2(kstream_t *ks, int delimiter, kstring_t *str, int *dret, int append) \ { \ + int gotany = 0; \ if (dret) *dret = 0; \ str->l = append? str->l : 0; \ - if (ks->begin >= ks->end && ks->is_eof) return -1; \ for (;;) { \ int i; \ if (ks->begin >= ks->end) { \ if (!ks->is_eof) { \ ks->begin = 0; \ ks->end = __read(ks->f, ks->buf, __bufsize); \ - if (ks->end < __bufsize) ks->is_eof = 1; \ - if (ks->end == 0) break; \ + if (ks->end == 0) { ks->is_eof = 1; break; } \ } else break; \ } \ if (delimiter == KS_SEP_LINE) { \ @@ -124,8 +122,9 @@ typedef struct __kstring_t { if (str->m - str->l < (size_t)(i - ks->begin + 1)) { \ str->m = str->l + (i - ks->begin) + 1; \ kroundup32(str->m); \ - str->s = (char*)realloc(str->s, str->m); \ + str->s = (char*)realloc(str->s, str->m); \ } \ + gotany = 1; \ memcpy(str->s + str->l, ks->buf + ks->begin, i - ks->begin); \ str->l = str->l + (i - ks->begin); \ ks->begin = i + 1; \ @@ -134,6 +133,7 @@ typedef struct __kstring_t { break; \ } \ } \ + if (!gotany && ks_eof(ks)) return -1; \ if (str->s == 0) { \ str->m = 1; \ str->s = (char*)calloc(1, 1); \ @@ -155,7 +155,7 @@ typedef struct __kstring_t { #define __KSEQ_BASIC(SCOPE, type_t) \ SCOPE kseq_t *kseq_init(type_t fd) \ { \ - kseq_t *s = (kseq_t*)calloc(1, sizeof(kseq_t)); \ + kseq_t *s = (kseq_t*)calloc(1, sizeof(kseq_t)); \ s->f = ks_init(fd); \ return s; \ } \ diff --git a/main.c b/main.c index f8f6995..5a2de61 100644 --- a/main.c +++ b/main.c @@ -4,7 +4,7 @@ #include "utils.h" #ifndef PACKAGE_VERSION -#define PACKAGE_VERSION "0.7.10-r960-dirty" +#define PACKAGE_VERSION "0.7.10-r998-dirty" #endif int bwa_fa2pac(int argc, char *argv[]);