Commit Graph

294 Commits (main)

Author SHA1 Message Date
Heng Li 23e0e99ec0 dev-471: fixed a compiling error from last commit 2014-04-10 11:54:17 -04:00
Heng Li ccbbe48c4f dev-470: don't stop on bwa_fix_xref2() failures
Peter Field has sent me an example caused by an alignment bridging three
adjacent chromosomes/contigs. Bwa-mem always aligns the query to the contig
covering the middle point of the alignment. In this example, it chooses the
middle contig, which should not be aligned. This leads to weird things failing
bwa_fix_xref2(), which cannot be fixed unless we build the contig boundaries
into the FM-index.

In the old code, bwa-mem halts when bwa_fix_xref2() fails. With this commit,
bwa-mem will give a warning instead of halting.
2014-04-10 11:43:17 -04:00
Heng Li 99f6f9a0d1 dev-467: limit the max #chains to extend 2014-04-08 21:45:49 -04:00
Heng Li c0a308a8b6 dev-466: simplified chain filtering 2014-04-08 17:33:07 -04:00
Heng Li f12dfae772 dev-465: a new output format for read overlap
Also moved a few functions to bwamem_extra.c. File bwamem.c is becoming far too
long.
2014-04-08 16:29:36 -04:00
Heng Li 172ba83241 dev-463: added option -x to change multiple params
I hate to copy-paste long command line options.
2014-04-07 11:29:36 -04:00
Heng Li 114901b005 dev-r462: refined setting for PacBio; weight flt
The recommended setting in the last commit is wrong. If we can extend a random
seed hit to the full length, we will force the read aligned through break
points, which is wrong. The new setting is better but it may lead to a small
fraction of fragmented alignments.

In addition, I added a filter on the minimum chain weight and tied
min_HSP_score to this filter. It doubles the mapping speed.
2014-04-04 17:01:04 -04:00
Heng Li 41f720dfa7 dev-461: added a heuristic for PacBio data
See the comment above mem_test_chain_sw() for details.
2014-04-04 16:05:41 -04:00
Heng Li b6bd33b26c dev-459: don't hard code the drop ratio
In the old code, if a secondary alignment is 50% worse, it won't be outputted.
2014-04-03 18:58:49 -04:00
Heng Li b3225581be dev-458: simplified the smem iterator
simpler but less powful.
2014-04-03 15:23:48 -04:00
Heng Li acfe7613db dev-457: separated interval collection and seeding 2014-04-03 15:10:50 -04:00
Heng Li 9a5705289c added more debugging infomation
I can see a bug, but I do not know where it comes from.
2014-04-03 13:38:08 -04:00
Heng Li 9ce50a4e5e dev-450: support diff ins/del penalties. NO TEST!! 2014-03-28 14:54:06 -04:00
Heng Li 8f9aeef4ec Merge branch 'master' into dev
Conflicts:
	main.c
2014-03-17 00:03:52 -04:00
Heng Li e6931bec03 r445: unnecessarily large bandwidth in global 2014-03-17 00:01:00 -04:00
Heng Li 7d63e76245 r444: more debugging output in CIGAR generation
Also found a potential issue which should not affect accuracy but may hurt
speed. Will investigate later.
2014-03-16 23:25:04 -04:00
Heng Li 8929bd1c25 r443: more verbose debugging information 2014-03-16 15:18:58 -04:00
Heng Li 2e9463ebf1 dev-r442: suppress exact full-length matches 2014-02-26 22:04:19 -05:00
Heng Li 52391a9855 r437: print timing for each batch of reads 2014-02-19 10:54:26 -05:00
Heng Li f524c7d3d8 r431: added the MD tag to bwa-mem 2014-01-29 12:05:11 -05:00
Heng Li ea3dc2f003 r430: fix a bug producing incorrect alignment
Ksw uses two rounds of SSE2-SW to find the boundaries of an alignment. If the
second round gives a different score from the first round, it will fail. The
fix checks if this happens, though I have not dig into an example to understand
why this may happen in the first place.
2014-01-29 10:51:02 -05:00
Heng Li 10cb6b0507 r428: allow to change the default chain_drop_ratio 2013-12-30 16:18:45 -05:00
Heng Li 3afcdc7746 debugging code only: print seeds 2013-12-30 16:05:43 -05:00
Heng Li 74a1a53499 print debugging msg to stdout 2013-12-30 15:49:41 -05:00
Heng Li 4219e58623 r423: bugfix - SE hits not random 2013-11-23 09:36:26 -05:00
Heng Li ff4762f3c7 r421: bw doubling in the final alignment
In some cases, the band width used in the final alignment needs to be larger
than the band width in extension.
2013-11-20 10:04:16 -05:00
Heng Li 6e3fa0515a r420: inferred bandwidth is not used in the final 2013-11-20 09:50:46 -05:00
Heng Li deb19593aa r418: use the new mapQ estimator by default 2013-11-02 12:25:53 -04:00
Heng Li 19d33faa30 use kthread for multi-threading
Bwa-mem should have better performance with many CPU cores.
2013-11-02 12:13:11 -04:00
Heng Li 7144a0cefc r415: bug in the new (optional) mapQ computation
I may use the new method as the default. Testing needed.
2013-09-09 17:51:05 -04:00
Heng Li ebb7b02e9b r414: fixed a bug caused by the last commit 2013-09-09 16:57:55 -04:00
Heng Li b51a66e4c1 r413: fixed an issue causing redundant alignment
I have seen a fosmid aligned to the same position but with two slightly
different CIGARs: 30000M and 29900M50D100M, possibly caused by tandem repeats.
0.7.5a will regard them as two distinct alignments and generates a very small
mapping quality. However, these two are essentially the same. Although there is
ambiguity in aligning the end of the fosmid, we should not penalize the entire
alignment with a small mapQ. This commit fixes this issue. More testing is
needed, though.
2013-09-09 11:36:50 -04:00
Heng Li 1e2cff20ba more conservative mapQ 2013-09-09 08:57:45 -04:00
Heng Li 1346f03ff1 use the old mapQ by default
the new mapQ overestimate
2013-09-06 14:04:41 -04:00
Heng Li 451d60f3be slight modification 2013-09-06 12:37:38 -04:00
Heng Li 623da055e1 alternative way to estimate mapQ
the old mapQ estimate is too conservative
2013-09-06 12:31:47 -04:00
Heng Li 3b84c03c1e r406: allow to use diff clipping penalties
for 5'-end or for 3'-end
2013-08-28 15:59:05 -04:00
Heng Li bde5005f39 r396: er... the new tag is named SA not SP 2013-05-23 12:48:18 -04:00
Heng Li 3d2450ed97 r395: bugfix - hard clipping not applied on revaln 2013-05-23 12:45:14 -04:00
Heng Li 9441bb7f2a r394: added future plan 2013-05-22 20:02:53 -04:00
Heng Li 0e759bc1f5 removed a redundant flag 2013-05-22 19:55:07 -04:00
Heng Li 9735d7a31a conform to the latest (unpublished) SAM spec
for chimeric alignments
2013-05-22 19:45:16 -04:00
Heng Li 9a6abe51b6 r391: better method to resolve xref alignment
The old method does not work when the alignment bridges three chr. This may
actually happen often. The new method does not work all the time, either, but
should be better than the old one. It is also simpler, arguably.
2013-05-22 18:57:51 -04:00
Rob Davies 96e445d9e4 Reduce dependency on utils.h - new malloc wrapping scheme.
Remove xmalloc, xcalloc, xrealloc and xstrdup from utils.h and revert calls
to the normal malloc, calloc, realloc, strdup.  Add new files malloc_wrap.[ch]
with the wrapper functions.  malloc_wrap.h #defines malloc etc. to the
wrapper, but only if USE_MALLOC_WRAPPERS has been defined.

Put #include "malloc_wrap.h" in any file that uses *alloc or strdup.  This
is also in a #ifdef USE_MALLOC_WRAPPERS ... #endif block to make using the
wrappers optional.  Add -DUSE_MALLOC_WRAPPERS into the makefile so they
should normally get added.

This is an improvement on the previous method as we now don't need to
worry about stray function calls that were not changed to the wrapped version
and the code will still work even if the wrapping is disabled.

Other possible methods of doing this are using malloc_hook (glibc-specific),
adding -include malloc_wrap.h to the gcc command-line (somewhat
gcc-specific) or making our own malloc function and using dlopen (scary).
This way is probably the most portable.
2013-05-02 15:12:01 +01:00
Rob Davies e88529687f Merge branch 'master' into master_fixes. Merged up to r389.
Conflicts:
	bwamem.c
	kopen.c
2013-04-29 12:09:30 +01:00
Heng Li 19cb7cd7ed r388: cleanup mem_process_seqs() interface
Print output outside the function and allow to feed insert size distribution.
2013-04-26 12:31:18 -04:00
Rob Davies 4cb5110d03 Merge branch 'master' into master_fixes 2013-04-22 09:51:07 +01:00
Heng Li 2087dc162f r377: increased unpaired penalty from 9 to 17
This leads to more aggressive pairing - more properly paired reads. I have
found a few cases where, for example, read1 is umambiguously mapped to chr20
while its 100bp mate has a perfect match to another chr but has 3 mismatches
and 1 deletion when it is paired with read1 on chr20. With longer reads, it
seems that the chr20 hit is correct, although it is not obvious how this
happened in evolution.
2013-04-17 16:50:20 -04:00
Rob Davies 3dd10bd7db Merge branch 'master' into master_fixes 2013-04-12 16:20:13 +01:00
Rob Davies 90ecd344ba Merge branch 'master' into master_fixes. Merged up to master r375.
Conflicts:
	bwt.c
2013-04-11 11:15:39 +01:00
Heng Li 499cf4c00d r376: reduce wasteful seed extension
mainly for contig alignment
2013-04-10 12:18:56 -04:00
Heng Li 3d8a8c1e37 r374: fix - clipping penalty not always working
This only happens to gaps where mem underestimates the bandwidth without
considering the clipping penalty.
2013-04-10 01:09:37 -04:00
Heng Li d7ca0885eb r371: extend overlapping seeds
to avoid misalignment in tandem repeats
2013-04-04 00:43:43 -04:00
Heng Li 1e118e0823 r370: suppress "D" at the end of a cigar
This is caused by seeds in tandem repeats, in which case, bwa-mem may not
extend the true seed. The change in this commit is only a temporary cure.
2013-04-03 23:57:19 -04:00
Rob Davies c89756e2b0 Merge branch 'master' into master_fixes 2013-03-19 12:11:51 +00:00
Heng Li 8437cd4edd r369: bugfix - segfault caused by the last change
Sigh... Even the simplest change can lead to new bugs.
2013-03-19 01:04:57 -04:00
Heng Li 1e3cadbfc2 r368: bugfix - wrong CIGAR when bridging 3 contigs
In this case, bwa_fix_xref() will return insane coordinates. The old version
did not check the return status and write wrong CIGAR. This bug only happen to
very short assembly contigs.
2013-03-18 20:49:32 -04:00
Rob Davies c862a1a396 Merge branch 'master' into master_fixes 2013-03-18 13:35:12 +00:00
Heng Li 9346acde1b Release bwa-0.7.3a-r367
In 0.7.3, the wrong CIGAR bug was only fixed in one scenario, but not fixed
in another corner case.
2013-03-15 21:26:37 -04:00
Heng Li dd51177837 r365: bugfix - wrong alignment (right mapping)
The bug only happens when there is a 1bp del and 1bp ins which are close to the
end and there are no other substitutions or indels. In this case, bwa mem gave
a wrong band width.
2013-03-15 11:59:05 -04:00
Rob Davies cca27c1ef5 Merge branch 'master' into master_fixes
Conflicts:
	bwamem.c
	bwamem_pair.c
	example.c
2013-03-13 12:12:28 +00:00
Heng Li bdf34f6ce7 r363: XA=>XP; output mapQ in XP
In BWA, XA gives hits "shadowed" by the primary hit. In BWA-MEM, we output
primary hits only. Primary hits may have non-zero mapping quality.
2013-03-12 09:56:04 -04:00
Heng Li c29b176cb6 r362: bugfix - occasionally wrong TLEN
Use the 0.7.2 way to compute TLEN
2013-03-12 00:14:36 -04:00
Heng Li dab5b17c1a r360: output alternative primary alignments in XA 2013-03-11 23:43:58 -04:00
Heng Li 6c665189ad r359: identical output to 0.7.2 (without -a) 2013-03-11 23:16:18 -04:00
Heng Li 0f88103d2a SAM almost identical to 0.7.2 2013-03-11 23:01:51 -04:00
Heng Li 26f4c704ed drop the old SAM writer 2013-03-11 22:24:54 -04:00
Heng Li ebb45dc42e new code works for SE 2013-03-11 21:59:15 -04:00
Heng Li c7edaa8e84 to test the new sam writer... 2013-03-11 21:55:52 -04:00
Heng Li 47952b6f3f drop an unnecessary member from mem_aln_t 2013-03-11 21:35:32 -04:00
Heng Li 8f0d439913 prepare to replace the SAM printing code
This move is dangerous as SAM printing is very complex, but it will benefit in
the long run. The planned change will reduce the redundancy, improves clarity
and most importantly makes it much easier to output multiple primary hits in an
optional tag.
2013-03-11 21:25:17 -04:00
Rob Davies 9228e48efd Merge branch 'master' into master_fixes
Conflicts:
	Makefile
2013-03-11 13:50:49 +00:00
Heng Li 9ea7f83974 Emergent bugfix: wrong TLEN sign
It is interesting that Picard did not find the issue.
2013-03-09 18:03:15 -05:00
Heng Li 66c9783daf r345: bugfix in mem - wrong mate strand for unmap
Received a clean bill from Picard
2013-03-08 13:15:43 -05:00
Heng Li af7b4d8980 gcc wrongly thinks a variable may be uninitialized
It should always be initialized. To avoid a warning, made a change.
2013-03-08 12:45:50 -05:00
Heng Li 274c0ac96c r343: bugfix in mem - wrong mate info for unmap
SAM generation is always among the nastiest bits. I would need to refactor at
some point (hardly happening).
2013-03-08 12:40:31 -05:00
Rob Davies aabd990e8f Merge branch 'master' into master_fixes
Conflicts:
	Makefile
	bwape.c
	bwase.c
	bwtsw2_aux.c
	stdaln.c
2013-03-08 16:46:45 +00:00
Heng Li 5fbd454682 r332: added output threshold
Otherwise there are far too many short hits
2013-03-05 22:49:38 -05:00
Heng Li 07921659cf move mem_fill_scmat() to bwa.{h,c} 2013-03-05 09:38:12 -05:00
Rob Davies 8a078cc16d Merge branch 'master' into master_fixes
Conflicts:
	bntseq.c
	bwamem.c
2013-03-05 10:21:07 +00:00
Heng Li efd9769b07 r324: a little code cleanup
The changes after r317 aim to improve the performance and accuracy for very
long query alignment. The short-read alignment should not be affected. The
changes include:

1) Z-dropoff. This is a variant of blast's X-dropoff. I orginally thought this
   heuristic only improves speed, but now I realize it also reduces poor
   alignment with long good flanking alignments. The difference from blast's
   X-dropoff is that Z-dropoff allows big gaps, but X-dropoff does not.

2) Band width doubling. When band width is too small, we will get a poor
   alignment in the middle. Sometimes such alignments cannot be fully excluded
   with Z-dropoff. Band width doubling is an alternative heuristic. It is based
   on the observation that the existing of close-to-boundary high score
   possibly implies inadequate band width. When we see such a signal, we double
   the band width.
2013-03-05 00:57:16 -05:00
Heng Li e0991d6a45 r323: added Z-dropoff, a variant of blast's X-drop 2013-03-05 00:34:33 -05:00
Heng Li d6096c3f99 bugfix: caused by the latest change 2013-03-04 18:41:57 -05:00
Heng Li 59bc9341f6 code backup; more changes coming later 2013-03-04 17:29:07 -05:00
Heng Li 733410b50d r320: speed up very long sequence alignment
100-200bp read alignment should not be affected at all.
2013-03-04 14:43:49 -05:00
Heng Li 40f1214736 change to debugging code only 2013-03-04 11:52:11 -05:00
Heng Li 7e00dbcac5 r317: bugfix - out-of-range extension
This happens when target region crosses the forward-reverse boundary. This will
almost never happen to short-read alignment.
2013-03-04 11:35:23 -05:00
Heng Li 3e4a178e08 r314: cleanup bwamem API
Don't modify input sequences; more documentations
2013-03-01 11:14:51 -05:00
Rob Davies 6beab5f765 Merge branch 'master' into master_fixes
Merge changes to commit c5434ac (0.7.0 release)

Conflicts:
	Makefile
	bwamem.c
2013-03-01 10:22:49 +00:00
Rob Davies 3d33ab063e Merge branch 'master' into master_fixes
Merged to master version b621d3a

Conflicts:
	Makefile
	bntseq.c
	bwa.c
	bwase.c
	bwaseqio.c
	bwtaln.c
	bwtindex.c
	bwtio.c
	bwtmisc.c
	bwtsw2_aux.c
	cs2nt.c
	fastmap.c
	khash.h
	kseq.h
	ksw.c
	kvec.h
	simple_dp.c
	utils.c
	utils.h
2013-03-01 09:37:46 +00:00
Heng Li f3cff1c609 r311: even tighter bw for CIGAR 2013-02-27 23:59:50 -05:00
Heng Li a33b9c0633 tighter bw for cigar SW 2013-02-27 23:40:46 -05:00
Heng Li 6a4d8c79d8 r309: bugfix - soft clipping missing in example.c 2013-02-27 22:45:18 -05:00
Heng Li df7c3f0000 r308: added a new API to convert region to CIGAR
and an example program demonstrating how to do single-end alignment in <50
lines of C code.
2013-02-27 22:28:29 -05:00
Heng Li 4bb0bdddca r306: introduce clipping penalty
More clipping leads to more severe reference bias. We should not clip the
alignment unless necessary.
2013-02-27 21:13:39 -05:00
Heng Li 65e099df34 r300: fixed an out-of-boundary bug in rare case 2013-02-27 00:37:17 -05:00
Heng Li 0b533385ef r299: better way to exclude seed 2013-02-27 00:29:11 -05:00
Heng Li ee80fb8bd0 Test each seed to see if extension is needed
The old version wastefully extends many seeds contained in an aligned region
found before. While this wastes little time for short reads, it becomes a
serious defect for long query sequences.

This is an attempt to fix this problem, but more tuning are needed.
2013-02-26 22:55:44 -05:00
Heng Li acd1ab607b r297: reduce wasteful SW extension
This is particularly important for long sequences
2013-02-26 16:26:46 -05:00
Heng Li 98787f0ae0 r295: generate NM 2013-02-26 13:36:01 -05:00
Heng Li 32f2d60a2e r294: bugfix - -M not working 2013-02-26 13:14:33 -05:00
Heng Li 619ac4f93d r293: bugfix - wrong RG type in SAM output 2013-02-26 13:03:35 -05:00
Heng Li e70c7c2a71 r284: amend cross-reference hit
I really hate this: complex and twisted logic for a nasty scenario that almost
never happens to short reads - but it may become serious when the reference
genome consists of many contigs.

On toy examples, the code seems to work. Don't know if it really works...
2013-02-26 00:03:49 -05:00
Heng Li 77b5b586ad r282: set min split_len to read length 2013-02-25 17:29:35 -05:00
Heng Li d19e834d84 r280: align two ends in the same thread
Otherwise odd-number threads may be of different speed from even-number threads.
2013-02-25 15:40:15 -05:00
Heng Li 20aa848b3c r279: for PE mapq, consider the number of pairs
If there are a lot of proper pairs, it is more likely that the best pair is
wrong.
2013-02-25 13:00:35 -05:00
Heng Li 9957e04590 r278: don't perform too many mate-sw 2013-02-25 11:56:02 -05:00
Heng Li 5ead86acd3 optionally mark split hit as secondary 2013-02-25 11:18:35 -05:00
Heng Li 514563bd0a no poor hits with -a; reduce mapq for 2nd primary 2013-02-25 10:54:12 -05:00
Heng Li 29e41b592c bugfix: isize is off by 1 2013-02-24 23:00:51 -05:00
Heng Li 85775c3384 output multiple hits 2013-02-24 13:23:43 -05:00
Heng Li 6bdccf2a8a added a bit documentation 2013-02-24 13:09:29 -05:00
Heng Li ee59a13109 simplified bwamem.h
Hide mem_seed_t and mem_chain_t. Don't expose unnecessary routines.
2013-02-24 12:17:29 -05:00
Heng Li cda85be059 fixed a couple bugs identified by gcc
Recent gcc is better.
2013-02-23 17:15:07 -05:00
Heng Li b4c38bcc1c append fasta/q comment 2013-02-23 16:57:34 -05:00
Heng Li ee4540c394 support read group in bwa-mem 2013-02-23 16:41:44 -05:00
Heng Li 67543f19a1 code refactoring 2013-02-23 15:55:55 -05:00
Heng Li e613195e17 moved some common code to bwa.{c,h} 2013-02-23 15:30:46 -05:00
Heng Li d460f2ec9e bugfix in multi-threaded bwa-mem 2013-02-23 14:48:54 -05:00
Heng Li 904c3205c0 removed a few unused variables
These variables have been assigned but never actually used. Reported by
gcc-4.7. Lower version cannot give such warnings.
2013-02-23 13:26:50 -05:00
Heng Li 17c123d65a pring paired-end SAM 2013-02-22 16:38:48 -05:00
Heng Li ba15b787cb rework PE mapq; don't know if better 2013-02-22 14:47:57 -05:00
Heng Li c5ce72f593 scoring pairs by score, not by errors
This is important for bwa-mem which does local alignment. A short exact match
is worse than a long inexact match. Also fixed a bug in approximating mapping
quality.
2013-02-22 12:10:20 -05:00
Heng Li d4cf6d97a6 bugfix: memory leak 2013-02-21 15:04:31 -05:00
Heng Li a578688fa8 generate multiple alignments from one chain 2013-02-21 14:58:51 -05:00
Heng Li cfbc4c89e3 perform extension when there are, say, 20bp tandem 2013-02-21 14:34:10 -05:00
Heng Li 54da54ffd4 extend more seeds (and thus slower...) 2013-02-21 12:52:00 -05:00
Heng Li f8829318cf weakened the chain filter 2013-02-21 12:25:20 -05:00
Heng Li 84a328764a bugfix: mis-chaining caused by integer overflow
I really need to rewrite kbtree some time.
2013-02-21 11:42:30 -05:00
Heng Li ea8f4f4d34 clean bill from valgrind 2013-02-20 20:26:57 -05:00
Heng Li 5626fe29b7 Well, at least output sth 2013-02-20 19:11:44 -05:00
Heng Li a7d574d125 backup comments 2013-02-20 01:11:38 -05:00
Heng Li 688872fb1b code backup 2013-02-19 00:50:39 -05:00
Heng Li 66585b7982 code backup 2013-02-18 16:33:06 -05:00
Heng Li ea9fc7df48 keep the number of SW performed 2013-02-16 11:03:27 -05:00
Heng Li 5f8c6efbc3 forbid x-bounary bns_get_seq(); code backup 2013-02-16 09:48:44 -05:00
Heng Li 604e3d8da1 code backup; to upgrade ksw.{c,h} 2013-02-12 16:15:26 -05:00
Heng Li 325ba8213b move mark primary to worker1() 2013-02-12 15:54:55 -05:00
Heng Li cd0969332f keep track of the "parent" of a secondary 2013-02-12 15:52:23 -05:00
Heng Li 22b79b3475 mark primary, instead of dropping secondary 2013-02-12 15:34:44 -05:00
Heng Li 2fc469d0c9 code backup 2013-02-12 12:09:36 -05:00
Heng Li 95d18449b3 merge bseq.{h,c} to utils.{h,c}
I do not like many small files.
2013-02-12 10:36:15 -05:00
Heng Li 13288e2dcd code backup 2013-02-12 09:22:47 -05:00
Heng Li 99907c98fb separated and improved SAM printing code
This is for the PE mode. The routines may also be useful for bwa-sw, but
probably I won't change the old code.
2013-02-11 15:29:03 -05:00
Heng Li 987d4b4205 fixed a stupid bug in fastq reading 2013-02-11 11:27:35 -05:00
Heng Li 59eaf650ac code backup 2013-02-11 10:59:38 -05:00
Heng Li f4c0672800 move sort_and_dedup() to worker1() 2013-02-10 12:55:19 -05:00
Heng Li c310fb7424 a little refactoring for PE support 2013-02-10 12:24:33 -05:00
Heng Li 829664d6b5 missing identical hits; improved sub_n 2013-02-08 17:55:35 -05:00
Heng Li b2c7148dc9 consider the number of suboptimal hits 2013-02-08 17:20:44 -05:00