Merge branch 'dev'

Heng Li 2014-12-23 15:29:34 -05:00
commit b8e189c1e5
9 changed files with 151 additions and 57 deletions

38
NEWS.md

@ -1,34 +1,37 @@
Release 0.7.11 (XX November, 2014) Release 0.7.11 (23 December, 2014)
----------------------------------- ----------------------------------
A major change to BWA-MEM is the support of mapping to ALT contigs in addition A major change to BWA-MEM is the support of mapping to ALT contigs in addition
to the primary assembly. Part of the ALT mapping strategy is implemented in to the primary assembly. Part of the ALT mapping strategy is implemented in
BWA-MEM and the rest in a postprocessing script for now. Due to the extra BWA-MEM and the rest in a postprocessing script for now. Due to the extra
layer of complexity on generating the reference genome and on the two-step layer of complexity on generating the reference genome and on the two-step
mapping, we start to provide a wrapper script and precompiled binaries since mapping, we start to provide a wrapper script and precompiled binaries since
this release. Please check README-alt.md for details. this release. The package may be more convenient to some specific use cases.
For general uses, the single BWA binary still works like the old way.
Another major addition to BWA-MEM is HLA typing, which is made possible with the Another major addition to BWA-MEM is HLA typing, which is made possible with the
new ALT mapping strategy. Necessary data and programs are included in the new ALT mapping strategy. Necessary data and programs are included in the
binary release. The wrapper script also performs HLA typing when HLA genes are binary release. The wrapper script also optionally performs HLA typing when HLA
also included in the reference genome as additional ALT contigs. genes are included in the reference genome as additional ALT contigs.
Other notable changes to BWA-MEM: Other notable changes to BWA-MEM:
* Added option `-b` to `bwa index`. This option tunes the batch size used in * Added option `-b` to `bwa index`. This option tunes the batch size used in
the construction of BWT. It is advised to use large `-b` for huge reference the construction of BWT. It is advised to use large `-b` for huge reference
sequences such as the *nt* database. sequences such as the BLAST *nt* database.
* Optimized for PacBio data. This includes a change to the scoring based on a * Optimized for PacBio data. This includes a change to scoring based on a
mini-study done by Aaron Quinlan and a heuristic speedup. Further speedup is study done by Aaron Quinlan and a heuristic speedup. Further speedup is
possible, but needs more careful investigation. possible, but needs more careful investigation.
* Dropped PacBio read-to-read alignment for now. BWA-MEM is only good at * Dropped PacBio read-to-read alignment for now. BWA-MEM is good for finding
finding the best hit, not all hits. Option `-x pbread` is still available, the best hit, but is not very sensitive to suboptimal hits. Option `-x pbread`
but hidden on the command line. is still available, but hidden on the command line. This may be removed in
future releases.
* Added a new pre-setting for Oxford Nanopore 2D reads. LAST is still a little * Added a new pre-setting for Oxford Nanopore 2D reads. LAST is still a little
more sensitive on bacterial data, but bwa-mem is times faster on human data. more sensitive on older bacterial data, but bwa-mem is as good on more
recent data and is times faster for mapping against mammalian genomes.
* Added LAST-like seeding. This improves the accuracy for longer reads. * Added LAST-like seeding. This improves the accuracy for longer reads.
@ -44,7 +47,16 @@ Other notable changes to BWA-MEM:
writing SAM. This saves significant wall-clock time when reading from writing SAM. This saves significant wall-clock time when reading from
or writing to a slow Unix pipe. or writing to a slow Unix pipe.
(0.7.11: XX November 2014, rXXX) With the new release, the recommended way to map Illumina reads to GRCh38 is to
use the bwakit binary package:
bwa.kit/run-gen-ref hs38DH
bwa.kit/bwa index hs38DH.fa
bwa.kit/run-bwamem -t8 -H -o out-prefix hs38DH.fa read1.fq.gz read2.fq.gz | sh
Please check bwa.kit/README.md for details and command line options.
(0.7.11: 23 December 2014, r1034)


@ -5,10 +5,10 @@
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \ wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
| gzip -dc | tar xf - | gzip -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index # Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
bwa.kit/bwa index hs38d6.fa # create BWA index bwa.kit/bwa index hs38DH.fa # create BWA index
# mapping # mapping
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
``` ```
This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA
@ -94,11 +94,11 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or
For a more complete reference genome, we compiled a new set of decoy sequences For a more complete reference genome, we compiled a new set of decoy sequences
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples. from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
The sequences are included in `hs38d6-extra.fa` from the [BWA binary The sequences are included in `hs38DH-extra.fa` from the [BWA binary
package][res]. package][res].
In addition to decoy, we also put multiple alleles of HLA genes in In addition to decoy, we also put multiple alleles of HLA genes in
`hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb], `hs38DH-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
version 3.18.0 and are used to collect reads sequenced from these genes. version 3.18.0 and are used to collect reads sequenced from these genes.
### HLA typing ### HLA typing
@ -125,26 +125,26 @@ most of them are distributed under restrictive licenses.
To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878 To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
(hs38d6), and called small variants from the alignment. CHM1 is haploid. (hs38DH), and called small variants from the alignment. CHM1 is haploid.
Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
true positive (TP) heterozygous calls from NA12878 are approximately equal true positive (TP) heterozygous calls from NA12878 are approximately equal
to the difference between NA12878 and CHM1 heterozygous calls. A better assembly to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
should yield higher TP and lower FP. The following table shows the numbers for should yield higher TP and lower FP. The following table shows the numbers for
these assemblies: these assemblies:
|Assembly|hs37 |hs38 |hs38d6|CHM1_1.1| huref| |Assembly|hs37 |hs38 |hs38DH|CHM1_1.1| huref|
|:------:|------:|------:|------:|------:|------:| |:------:|------:|------:|------:|------:|------:|
|FP | 255706| 168068| 142516|307172 | 575634| |FP | 255706| 168068| 142516|307172 | 575634|
|TP |2142260|2163113|2150844|2167235|2137053| |TP |2142260|2163113|2150844|2167235|2137053|
With this measurement, hs38 is clearly better than hs37. Genome hs38d6 reduces With this measurement, hs38 is clearly better than hs37. Genome hs38DH reduces
FP by ~25k but also reduces TP by ~12k. We manually inspected variants called FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
from hs38 only and found the majority of them are associated with excessive read from hs38 only and found the majority of them are associated with excessive read
depth, clustered variants or weak alignment. We believe most hs38-only calls are depth, clustered variants or weak alignment. We believe most hs38-only calls are
problematic. In addition, if we compare two NA12878 replicates from HiSeq X10 problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
with nearly identical library construction, the difference is ~140k, an order with nearly identical library construction, the difference is ~140k, an order
of magnitude higher than the difference between hs38 and hs38d6. ALT contigs, of magnitude higher than the difference between hs38 and hs38DH. ALT contigs,
decoy and HLA genes in hs38d6 improve variant calling and enable the analyses of decoy and HLA genes in hs38DH improve variant calling and enable the analyses of
ALT contigs and HLA typing at little cost. ALT contigs and HLA typing at little cost.
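
The "~25k" and "~12k" figures quoted above follow directly from the hs38 and hs38DH columns of the table. A minimal check of that arithmetic, using only the numbers printed in the table (the variable names are illustrative):

```sh
# FP/TP counts copied from the hs38 and hs38DH columns of the table above.
fp_hs38=168068;   tp_hs38=2163113
fp_hs38DH=142516; tp_hs38DH=2150844

fp_drop=$((fp_hs38 - fp_hs38DH))   # false positives removed by hs38DH
tp_drop=$((tp_hs38 - tp_hs38DH))   # true positives lost with hs38DH
echo "FP reduced by $fp_drop, TP reduced by $tp_drop"
# prints: FP reduced by 25552, TP reduced by 12269
```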
## Problems and Future Development ## Problems and Future Development

53
bwa.1

@ -1,4 +1,4 @@
.TH bwa 1 "18 November 2014" "bwa-0.7.11-r999" "Bioinformatics tools" .TH bwa 1 "23 December 2014" "bwa-0.7.11-r1034" "Bioinformatics tools"
.SH NAME .SH NAME
.PP .PP
bwa - Burrows-Wheeler Alignment Tool bwa - Burrows-Wheeler Alignment Tool
@ -75,7 +75,7 @@ appropriate algorithm will be chosen automatically.
.TP .TP
.B mem .B mem
.B bwa mem .B bwa mem
.RB [ -aCHMpP ] .RB [ -aCHjMpP ]
.RB [ -t .RB [ -t
.IR nThreads ] .IR nThreads ]
.RB [ -k .RB [ -k
@ -88,6 +88,12 @@ appropriate algorithm will be chosen automatically.
.IR seedSplitRatio ] .IR seedSplitRatio ]
.RB [ -c .RB [ -c
.IR maxOcc ] .IR maxOcc ]
.RB [ -D
.IR chainShadow ]
.RB [ -m
.IR maxMateSW ]
.RB [ -W
.IR minSeedMatch ]
.RB [ -A .RB [ -A
.IR matchScore ] .IR matchScore ]
.RB [ -B .RB [ -B
@ -102,6 +108,8 @@ appropriate algorithm will be chosen automatically.
.IR unpairPen ] .IR unpairPen ]
.RB [ -R .RB [ -R
.IR RGline ] .IR RGline ]
.RB [ -H
.IR HDlines ]
.RB [ -v .RB [ -v
.IR verboseLevel ] .IR verboseLevel ]
.I db.prefix .I db.prefix
@ -193,9 +201,28 @@ Discard a MEM if it has more than
.I INT .I INT
occurrences in the genome. This is an insensitive parameter. [500] occurrences in the genome. This is an insensitive parameter. [500]
.TP .TP
.BI -D \ FLOAT
Drop chains shorter than
.I FLOAT
fraction of the longest overlapping chain [0.5]
.TP
.BI -m \ INT
Perform at most
.I INT
rounds of mate-SW [50]
.TP
.BI -W \ INT
Drop a chain if the number of bases in seeds is smaller than
.IR INT .
This option is primarily used for longer contigs/reads. When positive, it also
affects seed filtering. [0]
.TP
.B -P .B -P
In the paired-end mode, perform SW to rescue missing hits only but do not try to find In the paired-end mode, perform SW to rescue missing hits only but do not try to find
hits that fit a proper pair. hits that fit a proper pair.
.TP
.B SCORING OPTIONS:
.TP .TP
.BI -A \ INT .BI -A \ INT
Matching score. [1] Matching score. [1]
@ -244,15 +271,30 @@ and will be converted to a TAB in the output SAM. The read group ID will be
attached to every read in the output. An example is '@RG\\tID:foo\\tSM:bar'. attached to every read in the output. An example is '@RG\\tID:foo\\tSM:bar'.
[null] [null]
.TP .TP
.BI -H \ ARG
If ARG starts with @, it is interpreted as a string and gets inserted into the
output SAM header; otherwise, ARG is interpreted as a file with all lines
starting with @ in the file inserted into the SAM header. [null]
.TP
.BI -T \ INT .BI -T \ INT
Don't output alignment with score lower than Don't output alignment with score lower than
.IR INT . .IR INT .
This option affects output and occasionally SAM flag 2. [30] This option affects output and occasionally SAM flag 2. [30]
.TP .TP
.BI -h \ INT .BI -j
Treat ALT contigs as part of the primary assembly (i.e. ignore the
.I db.prefix.alt
file).
.TP
.BI -h \ INT[,INT2]
If a query has not more than If a query has not more than
.I INT .I INT
hits with score higher than 80% of the best hit, output them all in the XA tag [5] hits with score higher than 80% of the best hit, output them all in the XA tag.
If
.I INT2
is specified, BWA-MEM outputs up to
.I INT2
hits if the list contains a hit to an ALT contig. [5,200]
.TP .TP
.B -a .B -a
Output all found alignments for single-end or unpaired paired-end reads. These Output all found alignments for single-end or unpaired paired-end reads. These
@ -572,6 +614,7 @@ R 0x0020 strand of the mate
s 0x0100 the alignment is not primary s 0x0100 the alignment is not primary
f 0x0200 QC failure f 0x0200 QC failure
d 0x0400 optical or PCR duplicate d 0x0400 optical or PCR duplicate
S 0x0800 supplementary alignment
.TE .TE
.PP .PP
@ -605,8 +648,6 @@ _
XS Suboptimal alignment score XS Suboptimal alignment score
XF Support from forward/reverse alignment XF Support from forward/reverse alignment
XE Number of supporting seeds XE Number of supporting seeds
_
XP Alt primary hits; format: /(chr,pos,CIGAR,mapQ,NM;)+/
.TE .TE
.PP .PP
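
The rule documented for `-H ARG` above (a literal header string if ARG starts with `@`, otherwise a file whose `@` lines are used) can be sketched in shell. This only mirrors the branching described in the man page and is not how bwa implements it; `header_lines` is a hypothetical helper name, and the `@CO` lines are invented sample data.

```sh
# Hypothetical helper (not part of bwa): print the header lines that
# "-H ARG" would contribute, following the rule in the man page text.
header_lines() {
    case "$1" in
        @*) printf '%s\n' "$1" ;;   # ARG starts with @: treat it as a literal header string
        *)  grep '^@' "$1" ;;       # otherwise ARG is a file: keep only its lines starting with @
    esac
}

# Demo with a scratch file holding two header lines and one other line.
hdr_file=$(mktemp)
printf '@CO\tfirst comment\nnot a header line\n@CO\tsecond comment\n' > "$hdr_file"
header_lines "$(printf '@CO\tinline comment')"
header_lines "$hdr_file"
rm -f "$hdr_file"
```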


@ -17,14 +17,14 @@ other programs or use data in `bwa.kit`. The following shows an example about
how to use bwakit: how to use bwakit:
```sh ```sh
# Download bwakit (or from <http://sourceforge.net/projects/bio-bwa/files/bwakit/> manually) # Download the bwa-0.7.11 binary package (download link may change)
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \ wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
| gzip -dc | tar xf - | gzip -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index # Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
bwa.kit/bwa index hs38d6.fa # create BWA index bwa.kit/bwa index hs38DH.fa # create BWA index
# mapping # mapping
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh
``` ```
The last mapping command line will generate the following files: The last mapping command line will generate the following files:
@ -44,13 +44,37 @@ Bwakit can be [downloaded here][res]. It is only available to x86_64-linux. The
scripts in the package are available in the [bwa/bwakit][kit] directory. scripts in the package are available in the [bwa/bwakit][kit] directory.
Packaging is done manually for now. Packaging is done manually for now.
## Contents ## Limitations
* HLA typing only works for high-coverage human data. The typing accuracy can
still be improved. We encourage researchers to develop better HLA typing tools
based on the intermediate output of bwakit (for each HLA gene included in the
index, bwakit writes all reads matching it in a separate file).
* Duplicate marking only works when all reads from a single paired-end library
are provided as the input. This limitation is the necessary tradeoff of fast
MarkDuplicate provided by samblaster.
* The adapter trimmer is chosen as it is fast, pipe friendly and does not
discard reads. However, it is conservative and suboptimal. If this is a
concern, it is recommended to preprocess input reads with a more sophisticated
adapter trimmer. We also hope existing trimmers can be modified to operate on
an interleaved FASTQ stream. We will replace trimadap once a better trimmer
meets our needs.
* Bwakit can be memory demanding, depending on the functionality invoked. For 30X
human data, bwa-mem takes about 11GB RAM with 32 threads, samblaster uses
close to 10GB and BAM shuffling (if the input is sorted BAM) uses several GB.
In the current setting, sorting uses about 10GB.
## Package Contents
``` ```
bwa.kit bwa.kit
|-- README.md This README file. |-- README.md This README file.
|-- run-bwamem *Entry script* for the entire mapping pipeline. |-- run-bwamem *Entry script* for the entire mapping pipeline.
|-- bwa *BWA binary* |-- bwa *BWA binary*
|-- k8 Interpreter for *.js scripts. |-- k8 Interpreter for *.js scripts.
|-- bwa-postalt.js Post-process alignments to ALT contigs/decoys/HLA genes. |-- bwa-postalt.js Post-process alignments to ALT contigs/decoys/HLA genes.
|-- htsbox Used by run-bwamem for shuffling BAMs and BAM=>FASTQ. |-- htsbox Used by run-bwamem for shuffling BAMs and BAM=>FASTQ.
|-- samblaster MarkDuplicates for reads from the same library. v0.1.20 |-- samblaster MarkDuplicates for reads from the same library. v0.1.20
@ -60,10 +84,8 @@ bwa.kit
| |
|-- run-gen-ref *Entry script* for generating human reference genomes. |-- run-gen-ref *Entry script* for generating human reference genomes.
|-- resource-GRCh38 Resources for generating GRCh38 |-- resource-GRCh38 Resources for generating GRCh38
| |-- hs38d6-decoy.nt.anno Top decoy-to-nt hits. Not used by any scripts. | |-- hs38DH-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref.
| |-- hs38d6-decoy.rm.out RepeatMasker report. Not used. | `-- hs38DH.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref.
| |-- hs38d6-extra.fa Decoy and HLA gene sequences. Used by run-gen-ref.
| `-- hs38d6.fa.alt ALT-to-GRCh38 alignment. Used by run-gen-ref.
| |
|-- run-HLA HLA typing for sequences extracted by bwa-postalt.js. |-- run-HLA HLA typing for sequences extracted by bwa-postalt.js.
|-- typeHLA.sh Type one HLA-gene. Called by run-HLA. |-- typeHLA.sh Type one HLA-gene. Called by run-HLA.


@ -18,29 +18,38 @@ Options: -o STR prefix for output files [inferred from
ont2d: Oxford Nanopore reads (~10kb query, higher error rate) ont2d: Oxford Nanopore reads (~10kb query, higher error rate)
-t INT number of threads [1] -t INT number of threads [1]
-H apply HLA typing
-a trim HiSeq2000/2500 PE resequencing adapters (via trimadap) -a trim HiSeq2000/2500 PE resequencing adapters (via trimadap)
-d mark duplicate (via samblaster) -d mark duplicate (via samblaster)
-S for SAM/BAM input, don\'t shuffle -S for BAM input, don\'t shuffle
-s sort the output alignment (requiring more RAM) -s sort the output alignment (via samtools; requiring more RAM)
-H skip HLA typing
-k keep temporary files generated by typeHLA -k keep temporary files generated by typeHLA
Examples: Examples:
* Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing: * Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing:
run-bwamem -o prefix -t8 -R"@RG\tID:foo\tSM:bar" hs38d6.fa read1.fq.gz read2.fq.gz run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38DH.fa read1.fq.gz read2.fq.gz
Note: HLA typing is only effective for high-coverage data. The typing accuracy varies
with the quality of input. It is intended for research purposes only, not for diagnostic use.
* Remap coordinate-sorted BAM, transfer read group tags, trim Illumina PE adapters and * Remap coordinate-sorted BAM, transfer read group tags, trim Illumina PE adapters and
sort the output. The BAM may contain single-end or paired-end reads, or a mixture of sort the output. The BAM may contain single-end or paired-end reads, or a mixture of
the two types. Specifying -R stops read group transfer. the two types. Specifying -R stops read group transfer.
run-bwamem -sao prefix hs38d6.fa old-srt.bam run-bwamem -sao prefix hs38DH.fa old-srt.bam
* Remap name-grouped BAM and mark duplicates. Note that in this case, all reads from Note: the adaptor trimmer included in bwa.kit is chosen because it fits the current
a single library should be aligned at the same time. Paired-end only. mapping pipeline better. It is conservative and suboptimal. A more sophisticated
trimmer is recommended if this becomes a concern.
run-bwamem -Sdo prefix hs38d6.fa old-unsrt.bam * Remap name-grouped BAM and mark duplicates:
run-bwamem -Sdo prefix hs38DH.fa old-unsrt.bam
Note: streamed duplicate marking requires all reads from a single paired-end library
to be aligned at the same time.
Output files: Output files:
@ -84,7 +93,7 @@ if (defined $opts{o}) {
} elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) { } elsif ($ARGV[1] =~ /^(\S+)\.(fastq|fq|fasta|fa|mag|mag\.gz|fasta\.gz|fa\.gz|fastq\.gz|fq\.gz|bam)$/) {
$prefix = $1; $prefix = $1;
} }
die("ERROR: failed to identify the prefix for output. Please specify -p.\n") unless defined($prefix); die("ERROR: failed to identify the prefix for output. Please specify -o.\n") unless defined($prefix);
my $size = 0; my $size = 0;
my $comp_ratio = 3.; my $comp_ratio = 3.;
@ -156,9 +165,9 @@ if (-f "$ARGV[0].alt") {
my $t_sort = $opts{t} < 4? $opts{t} : 4; my $t_sort = $opts{t} < 4? $opts{t} : 4;
$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n"; $cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";
if ($has_hla && !defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) { if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n"; $cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
$cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n"; $cmd .= "touch $prefix.hla.HLA-dummy.gt; cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
$cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k}); $cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k});
} }
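
The `touch $prefix.hla.HLA-dummy.gt` added in the last hunk presumably guards against the case where no `*.gt` file was written, so that the `cat $prefix.hla.HLA*.gt` glob always matches something. A small sketch of that shell behaviour under default globbing settings (no `nullglob` or `failglob`), run in a scratch directory; the `out` prefix is illustrative:

```sh
# Work in a scratch directory so the glob only sees files created here.
dir=$(mktemp -d)
prefix="$dir/out"

# No *.gt file exists yet: the pattern is passed to cat unexpanded,
# and cat exits with a non-zero status.
if cat "$prefix".hla.HLA*.gt >/dev/null 2>&1; then before=ok; else before=failed; fi

# After touching an empty placeholder, the glob matches at least one file.
touch "$prefix.hla.HLA-dummy.gt"
if cat "$prefix".hla.HLA*.gt >/dev/null 2>&1; then after=ok; else after=failed; fi

echo "before touch: $before; after touch: $after"
rm -rf "$dir"
```

Because the placeholder is empty, it contributes no `GT` lines to the concatenated output, and its name matches the `$prefix.hla.HLA*` pattern used by the later `rm -f` cleanup.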


@ -6,16 +6,25 @@ url38="ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals
url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz" url37d5="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
if [ $# -eq 0 ]; then if [ $# -eq 0 ]; then
echo "Usage: $0 <hs38|hs38a|hs38d6|hs37|hs37d5>" echo "Usage: $0 <hs38|hs38a|hs38DH|hs37|hs37d5>"
echo "Analysis sets:"
echo " hs38 primary assembly of GRCh38 (incl. chromosomes, unplaced and unlocalized contigs) and EBV"
echo " hs38a hs38 plus ALT contigs"
echo " hs38DH hs38a plus decoy contigs and HLA genes (recommended for GRCh38 mapping)"
echo " hs37 primary assembly of GRCh37 (used by 1000g phase 1) plus the EBV genome"
echo " hs37d5 hs37 plus decoy contigs (used by 1000g phase 3)"
echo ""
echo "Note: This script downloads human reference genomes. For hs38a and hs38DH, it needs additional"
echo " sequences and ALT-to-REF mapping included in the bwa.kit package."
exit 1; exit 1;
fi fi
if [ $1 == "hs38d6" ]; then if [ $1 == "hs38DH" ]; then
(wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38d6-extra.fa) > $1.fa (wget -O- $url38 | gzip -dc; cat $root/resource-GRCh38/hs38DH-extra.fa) > $1.fa
[ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38d6.fa.alt $1.fa.alt [ ! -f $1.fa.alt ] && cp $root/resource-GRCh38/hs38DH.fa.alt $1.fa.alt
elif [ $1 == "hs38a" ]; then elif [ $1 == "hs38a" ]; then
wget -O- $url38 | gzip -dc > $1.fa wget -O- $url38 | gzip -dc > $1.fa
[ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38d6.fa.alt > $1.fa.alt [ ! -f $1.fa.alt ] && grep _alt $root/resource-GRCh38/hs38DH.fa.alt > $1.fa.alt
elif [ $1 == "hs38" ]; then elif [ $1 == "hs38" ]; then
wget -O- $url38 | gzip -dc | awk '/^>/{f=/_alt/?0:1}f' > $1.fa wget -O- $url38 | gzip -dc | awk '/^>/{f=/_alt/?0:1}f' > $1.fa
elif [ $1 == "hs37d5" ]; then elif [ $1 == "hs37d5" ]; then
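
The `hs38` branch above drops ALT contigs with `awk '/^>/{f=/_alt/?0:1}f'`: on each FASTA header line the flag `f` is reset to 0 if the name contains `_alt` and to 1 otherwise, and a line is printed only while `f` is non-zero. A toy run of the same filter on an invented three-record FASTA (the contig names and sequences are made up for illustration):

```sh
# Toy FASTA; only the "_alt" substring in the header matters to the filter.
primary=$(printf '>chr1\nACGT\n>chr1_demo_alt\nTTTT\nGGGG\n>chrEBV\nCCCC\n' |
    awk '/^>/{f=/_alt/?0:1}f')
echo "$primary"
# prints the chr1 and chrEBV records; the _alt record and its sequence lines are skipped
```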


@ -15,6 +15,7 @@ fi
preres="resource-human-HLA" preres="resource-human-HLA"
root=`dirname $0` root=`dirname $0`
pre=$1.$2 pre=$1.$2
touch $pre.gt
if [ ! -s $pre.fq ]; then if [ ! -s $pre.fq ]; then
echo '** Empty input file. Abort!' >&2 echo '** Empty input file. Abort!' >&2


@ -268,7 +268,7 @@ int main_mem(int argc, char *argv[])
fprintf(stderr, " -p smart pairing (ignoring in2.fq)\n"); fprintf(stderr, " -p smart pairing (ignoring in2.fq)\n");
fprintf(stderr, " -R STR read group header line such as '@RG\\tID:foo\\tSM:bar' [null]\n"); fprintf(stderr, " -R STR read group header line such as '@RG\\tID:foo\\tSM:bar' [null]\n");
fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n"); fprintf(stderr, " -H STR/FILE insert STR to header if it starts with @; or insert lines in FILE [null]\n");
fprintf(stderr, " -j ignore ALT contigs\n"); fprintf(stderr, " -j treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)\n");
fprintf(stderr, "\n"); fprintf(stderr, "\n");
fprintf(stderr, " -v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose); fprintf(stderr, " -v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [%d]\n", bwa_verbose);
fprintf(stderr, " -T INT minimum score to output [%d]\n", opt->T); fprintf(stderr, " -T INT minimum score to output [%d]\n", opt->T);

2
main.c

@ -4,7 +4,7 @@
#include "utils.h" #include "utils.h"
#ifndef PACKAGE_VERSION #ifndef PACKAGE_VERSION
#define PACKAGE_VERSION "0.7.10-r1027-dirty" #define PACKAGE_VERSION "0.7.11-r1034"
#endif #endif
int bwa_fa2pac(int argc, char *argv[]); int bwa_fa2pac(int argc, char *argv[]);