Commit Graph

294 Commits (main)

Author SHA1 Message Date
Heng Li a41afe4c97 These files were committed on a wrong branch 2014-09-18 10:49:35 -04:00
Heng Li c982443210 r854: improved the calculation of pa
and build pa filtering into BWA-MEM
2014-09-17 16:26:28 -04:00
Heng Li 825ae92e58 r849: the pa tag now gives a number
... which is the ratio of this hit to the best ALT hit.
2014-09-17 13:05:35 -04:00
Heng Li 6f37c14f26 r848: tag alignments with primary ALT 2014-09-16 18:52:49 -04:00
Heng Li 4b6eeb34c8 r830: optionally fixed chunk size 2014-09-15 23:42:24 -04:00
Heng Li 624687b072 r829: killed a harmless gcc warning 2014-09-15 23:33:22 -04:00
Heng Li b07587f806 r827: an alt hit as good as a pri hit as supp 2014-09-15 16:07:51 -04:00
Heng Li aee53f1334 r824: ALT mapping seems working 2014-09-15 00:29:05 -04:00
Heng Li 015ab3f6c3 r823: towards ALT support 2014-09-14 16:41:14 -04:00
Heng Li 8116bcc786 Merge branch 'dev' into alt 2014-09-14 15:40:52 -04:00
Heng Li 8d2b93156b r821: more relax on containing seeds 2014-09-12 10:35:49 -04:00
Heng Li 6739b713dd Merge branch 'hotfix-utgaln' into dev
Conflicts:
	main.c
2014-09-08 12:44:42 -04:00
Heng Li f4aedddee6 r819: bugfix - added too many sub-SMEMs 2014-09-08 11:32:48 -04:00
Heng Li ca61fe3ad5 code backup 2014-09-08 08:52:02 -04:00
Heng Li 1934f0cf24 code backup 2014-09-05 13:20:52 -04:00
Heng Li 35ac99b4f7 r815: optionally output ref fasta header
Also fixed a bug in reading .ann files
2014-08-29 10:51:23 -04:00
Heng Li b5cba257c1 r809: new strategy for the -a mode 2014-08-25 11:59:27 -04:00
Heng Li 7fd6a11569 r788: segfault when the last ref is "weird"
mem_patch_reg() did not check if two hits are on the same strand, which may
lead to an alignment bridging the forward-backward boundary.
2014-07-10 10:53:56 -04:00
Heng Li cffff4338f r787: use mem_seed_sw() also for non-PacBio reads
In the previous version, mem_seed_sw() is only used for PacBio reads to filter
bad seeds. For non-PacBio long queries, bwa-mem uses mem_chain2aln_short() for
a similar purpose. However, it turns out that mem_chain2aln_short() is not
effective given long near-tandem repeats. Bwa-mem still wastes a lot of time
of futile ref substring and extensions.

In this commit, mem_chain2aln_short() has been removed. mem_seed_sw() is used
if the query sequence is long enough (~700bp). For shorter reads, the results
should be almost identical to the previous version.
2014-07-10 10:30:22 -04:00
Heng Li e4752b321b Release bwa-0.7.9-r782 2014-05-19 09:08:07 -04:00
Heng Li f00cc94e1d r779: fixed a memory leak in SE 2014-05-16 00:06:34 -04:00
Heng Li a5ad0cff7f r778: reduced the number of alloc() calls a bit 2014-05-15 23:23:04 -04:00
Heng Li 061c63f36a r766: removed useless code 2014-05-13 13:09:29 -04:00
Heng Li 39a6cd5bb0 r762: cleanup for the new release; unfinished
It will take to make the documentation ready.
2014-05-11 15:15:44 -04:00
Heng Li cfe6996173 r760: removed commented code
It is slow and is not very effective. And I hate useless code.
2014-05-09 14:59:07 -04:00
Heng Li 43b498a37e r759: bugfix - frac_rep not working
Also added commented code for a 3rd round seeding. Not used.
2014-05-09 14:56:59 -04:00
Heng Li c9b33502f3 r758: fixed a typo
mostly negligible in practice
2014-05-07 15:07:29 -04:00
Heng Li ce3c198245 r749: max_hits tunable on CMD; default to 5 2014-05-04 10:17:03 -04:00
Heng Li f21d6498bc r748: reduced the default -m to 50 2014-05-02 16:49:19 -04:00
Heng Li e8f28cb529 r747: fixed a minor issue in the last (mis)commit 2014-05-02 16:17:50 -04:00
Heng Li 6db761e269 r746: tuned heuristic for GRCh38
Reduced -c to 500 by default. As a compensation, we choose up to 1000 positions
if a seed has 500 or more occurrences. In addition, a read with big portion
from such seeds will have lower mapping quality.
2014-05-02 16:06:27 -04:00
Heng Li fa20c71920 r742: further control the max bandwidth
I am looking at 6kb bandwidth...
2014-05-01 14:27:38 -04:00
Heng Li 4b2441069f r740: don't attempt merge if bandwidth too large
Sometimes the bandwidth can be >10k.
2014-05-01 11:01:52 -04:00
Heng Li c6c943f9d7 r738: output multi-map in the XA tag (SE only)
... PE support coming soon
2014-04-30 16:46:05 -04:00
Heng Li 88f89be60e r736: improved in low-complexity regions
Example: GGAGGGGAAGGGTGGGCTGGAGGGGACGGGTGGGCTGGAGGGGAAGGGTGTGCTGGAGGGAAAAGGTGGACTGGAGGGGAAGGGTGGGCTGGAGGGGAAGG

This read has 5 chains, two of which are:

weight=80  26;26;0,4591439948(10:-3095894)  23;23;27,4591439957(10:-3095888)  31;31;70,4591439964(10:-3095873)
weight=50  45;45;51,4591440017(10:-3095806) 50;50;51,4591440017(10:-3095801)  31;31;70,4591440090(10:-3095747)

Extension from the 26bp seed in the 1st chain gives an alignment [0,101) <=> [4591439948,4591440067), which
contains the 50bp seed in the second chain. However, if we extend the 50bp seed, it yields a better alignment
[0,101) <=> [4591439966,4591440067) with a different starting position. The 26bp seed is wrong. This commit
adds a heuristic to fix this issue.
2014-04-30 14:14:20 -04:00
Heng Li b603fed39c r733: bugfix - seed score unset when no -W 2014-04-29 14:58:53 -04:00
Heng Li dadd5d6281 r730: more permissive about merging overlapping 2014-04-28 10:01:54 -04:00
Heng Li 76bb49e01b r729: halved band width; doubled patch band width 2014-04-24 16:06:01 -04:00
Heng Li 6052d3015b r728: sorting the end in mem_sort_dedup_patch()
The older version does this, which is correct.
2014-04-24 15:44:59 -04:00
Heng Li df65893fb5 r727: extend seeds with SW 2014-04-24 14:28:40 -04:00
Heng Li b92bbb47e5 Merge branch '0.7.7-softclip' into layout
Conflicts:
	Makefile
	bwamem.h
	fastmap.c
	main.c
2014-04-24 12:24:49 -04:00
Heng Li 8c12ec4a4b r725: optionally disable hard clipping
as is reqested by the cancer group
2014-04-24 11:56:43 -04:00
Heng Li b93fca2b2e r723: merge adjacent hits 2014-04-16 16:38:50 -04:00
Heng Li 48847af2fc code backup 2014-04-16 12:00:13 -04:00
Heng Li 00a07f61bf r721: merge overlapping hits by default 2014-04-15 16:16:04 -04:00
Heng Li 45f24b4ae8 r720: improved overlap hit merging 2014-04-15 16:09:42 -04:00
Heng Li bdb7b000cd r719: more stringent overlap merge
Will consider to make it the default
2014-04-15 14:52:17 -04:00
Heng Li 4e22270eba r718: merge alnregs overlapping on both query/ref 2014-04-14 17:01:17 -04:00
Heng Li f02cd42679 dev-473: added a few assertions
to make sure the new change works as is expected
2014-04-10 21:03:13 -04:00
Heng Li 8638cfadc8 dev-472: get rid of bwa_fix_xref()
This function causes all kinds of problems when the reference genome consists
of many short reads/contigs/chromsomes. Some of the problems are nearly
unfixable at the point where bwa_fix_xref() gets called. This commit attempts
to fix the problem at the root. It disallows chains spanning multiple contigs
and never retrieves sequences bridging two adjacent contigs. Thus all the
chaining, extension, SW and global alignments are confined to on contig only.

This commit brings many changes. I have tested it on a couple examples
including Peter Field's PacBio example. It works well so far.
2014-04-10 20:54:27 -04:00
Heng Li 23e0e99ec0 dev-471: fixed a compiling error from last commit 2014-04-10 11:54:17 -04:00
Heng Li ccbbe48c4f dev-470: don't stop on bwa_fix_xref2() failures
Peter Field has sent me an example caused by an alignment bridging three
adjacent chromosomes/contigs. Bwa-mem always aligns the query to the contig
covering the middle point of the alignment. In this example, it chooses the
middle contig, which should not be aligned. This leads to weird things failing
bwa_fix_xref2(), which cannot be fixed unless we build the contig boundaries
into the FM-index.

In the old code, bwa-mem halts when bwa_fix_xref2() fails. With this commit,
bwa-mem will give a warning instead of halting.
2014-04-10 11:43:17 -04:00
Heng Li 99f6f9a0d1 dev-467: limit the max #chains to extend 2014-04-08 21:45:49 -04:00
Heng Li c0a308a8b6 dev-466: simplified chain filtering 2014-04-08 17:33:07 -04:00
Heng Li f12dfae772 dev-465: a new output format for read overlap
Also moved a few functions to bwamem_extra.c. File bwamem.c is becoming far too
long.
2014-04-08 16:29:36 -04:00
Heng Li 172ba83241 dev-463: added option -x to change multiple params
I hate to copy-paste long command line options.
2014-04-07 11:29:36 -04:00
Heng Li 114901b005 dev-r462: refined setting for PacBio; weight flt
The recommended setting in the last commit is wrong. If we can extend a random
seed hit to the full length, we will force the read aligned through break
points, which is wrong. The new setting is better but it may lead to a small
fraction of fragmented alignments.

In addition, I added a filter on the minimum chain weight and tied
min_HSP_score to this filter. It doubles the mapping speed.
2014-04-04 17:01:04 -04:00
Heng Li 41f720dfa7 dev-461: added a heuristic for PacBio data
See the comment above mem_test_chain_sw() for details.
2014-04-04 16:05:41 -04:00
Heng Li b6bd33b26c dev-459: don't hard code the drop ratio
In the old code, if a secondary alignment is 50% worse, it won't be outputted.
2014-04-03 18:58:49 -04:00
Heng Li b3225581be dev-458: simplified the smem iterator
simpler but less powful.
2014-04-03 15:23:48 -04:00
Heng Li acfe7613db dev-457: separated interval collection and seeding 2014-04-03 15:10:50 -04:00
Heng Li 9a5705289c added more debugging infomation
I can see a bug, but I do not know where it comes from.
2014-04-03 13:38:08 -04:00
Heng Li 9ce50a4e5e dev-450: support diff ins/del penalties. NO TEST!! 2014-03-28 14:54:06 -04:00
Heng Li 8f9aeef4ec Merge branch 'master' into dev
Conflicts:
	main.c
2014-03-17 00:03:52 -04:00
Heng Li e6931bec03 r445: unnecessarily large bandwidth in global 2014-03-17 00:01:00 -04:00
Heng Li 7d63e76245 r444: more debugging output in CIGAR generation
Also found a potential issue which should not affect accuracy but may hurt
speed. Will investigate later.
2014-03-16 23:25:04 -04:00
Heng Li 8929bd1c25 r443: more verbose debugging information 2014-03-16 15:18:58 -04:00
Heng Li 2e9463ebf1 dev-r442: suppress exact full-length matches 2014-02-26 22:04:19 -05:00
Heng Li 52391a9855 r437: print timing for each batch of reads 2014-02-19 10:54:26 -05:00
Heng Li f524c7d3d8 r431: added the MD tag to bwa-mem 2014-01-29 12:05:11 -05:00
Heng Li ea3dc2f003 r430: fix a bug producing incorrect alignment
Ksw uses two rounds of SSE2-SW to find the boundaries of an alignment. If the
second round gives a different score from the first round, it will fail. The
fix checks if this happens, though I have not dig into an example to understand
why this may happen in the first place.
2014-01-29 10:51:02 -05:00
Heng Li 10cb6b0507 r428: allow to change the default chain_drop_ratio 2013-12-30 16:18:45 -05:00
Heng Li 3afcdc7746 debugging code only: print seeds 2013-12-30 16:05:43 -05:00
Heng Li 74a1a53499 print debugging msg to stdout 2013-12-30 15:49:41 -05:00
Heng Li 4219e58623 r423: bugfix - SE hits not random 2013-11-23 09:36:26 -05:00
Heng Li ff4762f3c7 r421: bw doubling in the final alignment
In some cases, the band width used in the final alignment needs to be larger
than the band width in extension.
2013-11-20 10:04:16 -05:00
Heng Li 6e3fa0515a r420: inferred bandwidth is not used in the final 2013-11-20 09:50:46 -05:00
Heng Li deb19593aa r418: use the new mapQ estimator by default 2013-11-02 12:25:53 -04:00
Heng Li 19d33faa30 use kthread for multi-threading
Bwa-mem should have better performance with many CPU cores.
2013-11-02 12:13:11 -04:00
Heng Li 7144a0cefc r415: bug in the new (optional) mapQ computation
I may use the new method as the default. Testing needed.
2013-09-09 17:51:05 -04:00
Heng Li ebb7b02e9b r414: fixed a bug caused by the last commit 2013-09-09 16:57:55 -04:00
Heng Li b51a66e4c1 r413: fixed an issue causing redundant alignment
I have seen a fosmid aligned to the same position but with two slightly
different CIGARs: 30000M and 29900M50D100M, possibly caused by tandem repeats.
0.7.5a will regard them as two distinct alignments and generates a very small
mapping quality. However, these two are essentially the same. Although there is
ambiguity in aligning the end of the fosmid, we should not penalize the entire
alignment with a small mapQ. This commit fixes this issue. More testing is
needed, though.
2013-09-09 11:36:50 -04:00
Heng Li 1e2cff20ba more conservative mapQ 2013-09-09 08:57:45 -04:00
Heng Li 1346f03ff1 use the old mapQ by default
the new mapQ overestimate
2013-09-06 14:04:41 -04:00
Heng Li 451d60f3be slight modification 2013-09-06 12:37:38 -04:00
Heng Li 623da055e1 alternative way to estimate mapQ
the old mapQ estimate is too conservative
2013-09-06 12:31:47 -04:00
Heng Li 3b84c03c1e r406: allow to use diff clipping penalties
for 5'-end or for 3'-end
2013-08-28 15:59:05 -04:00
Heng Li bde5005f39 r396: er... the new tag is named SA not SP 2013-05-23 12:48:18 -04:00
Heng Li 3d2450ed97 r395: bugfix - hard clipping not applied on revaln 2013-05-23 12:45:14 -04:00
Heng Li 9441bb7f2a r394: added future plan 2013-05-22 20:02:53 -04:00
Heng Li 0e759bc1f5 removed a redundant flag 2013-05-22 19:55:07 -04:00
Heng Li 9735d7a31a conform to the latest (unpublished) SAM spec
for chimeric alignments
2013-05-22 19:45:16 -04:00
Heng Li 9a6abe51b6 r391: better method to resolve xref alignment
The old method does not work when the alignment bridges three chr. This may
actually happen often. The new method does not work all the time, either, but
should be better than the old one. It is also simpler, arguably.
2013-05-22 18:57:51 -04:00
Rob Davies 96e445d9e4 Reduce dependency on utils.h - new malloc wrapping scheme.
Remove xmalloc, xcalloc, xrealloc and xstrdup from utils.h and revert calls
to the normal malloc, calloc, realloc, strdup.  Add new files malloc_wrap.[ch]
with the wrapper functions.  malloc_wrap.h #defines malloc etc. to the
wrapper, but only if USE_MALLOC_WRAPPERS has been defined.

Put #include "malloc_wrap.h" in any file that uses *alloc or strdup.  This
is also in a #ifdef USE_MALLOC_WRAPPERS ... #endif block to make using the
wrappers optional.  Add -DUSE_MALLOC_WRAPPERS into the makefile so they
should normally get added.

This is an improvement on the previous method as we now don't need to
worry about stray function calls that were not changed to the wrapped version
and the code will still work even if the wrapping is disabled.

Other possible methods of doing this are using malloc_hook (glibc-specific),
adding -include malloc_wrap.h to the gcc command-line (somewhat
gcc-specific) or making our own malloc function and using dlopen (scary).
This way is probably the most portable.
2013-05-02 15:12:01 +01:00
Rob Davies e88529687f Merge branch 'master' into master_fixes. Merged up to r389.
Conflicts:
	bwamem.c
	kopen.c
2013-04-29 12:09:30 +01:00
Heng Li 19cb7cd7ed r388: cleanup mem_process_seqs() interface
Print output outside the function and allow to feed insert size distribution.
2013-04-26 12:31:18 -04:00
Rob Davies 4cb5110d03 Merge branch 'master' into master_fixes 2013-04-22 09:51:07 +01:00
Heng Li 2087dc162f r377: increased unpaired penalty from 9 to 17
This leads to more aggressive pairing - more properly paired reads. I have
found a few cases where, for example, read1 is umambiguously mapped to chr20
while its 100bp mate has a perfect match to another chr but has 3 mismatches
and 1 deletion when it is paired with read1 on chr20. With longer reads, it
seems that the chr20 hit is correct, although it is not obvious how this
happened in evolution.
2013-04-17 16:50:20 -04:00
Rob Davies 3dd10bd7db Merge branch 'master' into master_fixes 2013-04-12 16:20:13 +01:00
Rob Davies 90ecd344ba Merge branch 'master' into master_fixes. Merged up to master r375.
Conflicts:
	bwt.c
2013-04-11 11:15:39 +01:00