fast-bwa

Commit Graph

Author	SHA1	Message	Date
Heng Li	a41afe4c97	These files were committed on a wrong branch	2014-09-18 10:49:35 -04:00
Heng Li	c982443210	r854: improved the calculation of pa and build pa filtering into BWA-MEM	2014-09-17 16:26:28 -04:00
Heng Li	825ae92e58	r849: the pa tag now gives a number ... which is the ratio of this hit to the best ALT hit.	2014-09-17 13:05:35 -04:00
Heng Li	6f37c14f26	r848: tag alignments with primary ALT	2014-09-16 18:52:49 -04:00
Heng Li	4b6eeb34c8	r830: optionally fixed chunk size	2014-09-15 23:42:24 -04:00
Heng Li	624687b072	r829: killed a harmless gcc warning	2014-09-15 23:33:22 -04:00
Heng Li	b07587f806	r827: an alt hit as good as a pri hit as supp	2014-09-15 16:07:51 -04:00
Heng Li	aee53f1334	r824: ALT mapping seems working	2014-09-15 00:29:05 -04:00
Heng Li	015ab3f6c3	r823: towards ALT support	2014-09-14 16:41:14 -04:00
Heng Li	8116bcc786	Merge branch 'dev' into alt	2014-09-14 15:40:52 -04:00
Heng Li	8d2b93156b	r821: more relax on containing seeds	2014-09-12 10:35:49 -04:00
Heng Li	6739b713dd	Merge branch 'hotfix-utgaln' into dev Conflicts: main.c	2014-09-08 12:44:42 -04:00
Heng Li	f4aedddee6	r819: bugfix - added too many sub-SMEMs	2014-09-08 11:32:48 -04:00
Heng Li	ca61fe3ad5	code backup	2014-09-08 08:52:02 -04:00
Heng Li	1934f0cf24	code backup	2014-09-05 13:20:52 -04:00
Heng Li	35ac99b4f7	r815: optionally output ref fasta header Also fixed a bug in reading .ann files	2014-08-29 10:51:23 -04:00
Heng Li	b5cba257c1	r809: new strategy for the -a mode	2014-08-25 11:59:27 -04:00
Heng Li	7fd6a11569	r788: segfault when the last ref is "weird" mem_patch_reg() did not check if two hits are on the same strand, which may lead to an alignment bridging the forward-backward boundary.	2014-07-10 10:53:56 -04:00
Heng Li	cffff4338f	r787: use mem_seed_sw() also for non-PacBio reads In the previous version, mem_seed_sw() is only used for PacBio reads to filter bad seeds. For non-PacBio long queries, bwa-mem uses mem_chain2aln_short() for a similar purpose. However, it turns out that mem_chain2aln_short() is not effective given long near-tandem repeats. Bwa-mem still wastes a lot of time of futile ref substring and extensions. In this commit, mem_chain2aln_short() has been removed. mem_seed_sw() is used if the query sequence is long enough (~700bp). For shorter reads, the results should be almost identical to the previous version.	2014-07-10 10:30:22 -04:00
Heng Li	e4752b321b	Release bwa-0.7.9-r782	2014-05-19 09:08:07 -04:00
Heng Li	f00cc94e1d	r779: fixed a memory leak in SE	2014-05-16 00:06:34 -04:00
Heng Li	a5ad0cff7f	r778: reduced the number of alloc() calls a bit	2014-05-15 23:23:04 -04:00
Heng Li	061c63f36a	r766: removed useless code	2014-05-13 13:09:29 -04:00
Heng Li	39a6cd5bb0	r762: cleanup for the new release; unfinished It will take to make the documentation ready.	2014-05-11 15:15:44 -04:00
Heng Li	cfe6996173	r760: removed commented code It is slow and is not very effective. And I hate useless code.	2014-05-09 14:59:07 -04:00
Heng Li	43b498a37e	r759: bugfix - frac_rep not working Also added commented code for a 3rd round seeding. Not used.	2014-05-09 14:56:59 -04:00
Heng Li	c9b33502f3	r758: fixed a typo mostly negligible in practice	2014-05-07 15:07:29 -04:00
Heng Li	ce3c198245	r749: max_hits tunable on CMD; default to 5	2014-05-04 10:17:03 -04:00
Heng Li	f21d6498bc	r748: reduced the default -m to 50	2014-05-02 16:49:19 -04:00
Heng Li	e8f28cb529	r747: fixed a minor issue in the last (mis)commit	2014-05-02 16:17:50 -04:00
Heng Li	6db761e269	r746: tuned heuristic for GRCh38 Reduced -c to 500 by default. As a compensation, we choose up to 1000 positions if a seed has 500 or more occurrences. In addition, a read with big portion from such seeds will have lower mapping quality.	2014-05-02 16:06:27 -04:00
Heng Li	fa20c71920	r742: further control the max bandwidth I am looking at 6kb bandwidth...	2014-05-01 14:27:38 -04:00
Heng Li	4b2441069f	r740: don't attempt merge if bandwidth too large Sometimes the bandwidth can be >10k.	2014-05-01 11:01:52 -04:00
Heng Li	c6c943f9d7	r738: output multi-map in the XA tag (SE only) ... PE support coming soon	2014-04-30 16:46:05 -04:00
Heng Li	88f89be60e	r736: improved in low-complexity regions Example: GGAGGGGAAGGGTGGGCTGGAGGGGACGGGTGGGCTGGAGGGGAAGGGTGTGCTGGAGGGAAAAGGTGGACTGGAGGGGAAGGGTGGGCTGGAGGGGAAGG This read has 5 chains, two of which are: weight=80 26;26;0,4591439948(10:-3095894) 23;23;27,4591439957(10:-3095888) 31;31;70,4591439964(10:-3095873) weight=50 45;45;51,4591440017(10:-3095806) 50;50;51,4591440017(10:-3095801) 31;31;70,4591440090(10:-3095747) Extension from the 26bp seed in the 1st chain gives an alignment [0,101) <=> [4591439948,4591440067), which contains the 50bp seed in the second chain. However, if we extend the 50bp seed, it yields a better alignment [0,101) <=> [4591439966,4591440067) with a different starting position. The 26bp seed is wrong. This commit adds a heuristic to fix this issue.	2014-04-30 14:14:20 -04:00
Heng Li	b603fed39c	r733: bugfix - seed score unset when no -W	2014-04-29 14:58:53 -04:00
Heng Li	dadd5d6281	r730: more permissive about merging overlapping	2014-04-28 10:01:54 -04:00
Heng Li	76bb49e01b	r729: halved band width; doubled patch band width	2014-04-24 16:06:01 -04:00
Heng Li	6052d3015b	r728: sorting the end in mem_sort_dedup_patch() The older version does this, which is correct.	2014-04-24 15:44:59 -04:00
Heng Li	df65893fb5	r727: extend seeds with SW	2014-04-24 14:28:40 -04:00
Heng Li	b92bbb47e5	Merge branch '0.7.7-softclip' into layout Conflicts: Makefile bwamem.h fastmap.c main.c	2014-04-24 12:24:49 -04:00
Heng Li	8c12ec4a4b	r725: optionally disable hard clipping as is requested by the cancer group	2014-04-24 11:56:43 -04:00
Heng Li	b93fca2b2e	r723: merge adjacent hits	2014-04-16 16:38:50 -04:00
Heng Li	48847af2fc	code backup	2014-04-16 12:00:13 -04:00
Heng Li	00a07f61bf	r721: merge overlapping hits by default	2014-04-15 16:16:04 -04:00
Heng Li	45f24b4ae8	r720: improved overlap hit merging	2014-04-15 16:09:42 -04:00
Heng Li	bdb7b000cd	r719: more stringent overlap merge Will consider to make it the default	2014-04-15 14:52:17 -04:00
Heng Li	4e22270eba	r718: merge alnregs overlapping on both query/ref	2014-04-14 17:01:17 -04:00
Heng Li	f02cd42679	dev-473: added a few assertions to make sure the new change works as is expected	2014-04-10 21:03:13 -04:00
Heng Li	8638cfadc8	dev-472: get rid of bwa_fix_xref() This function causes all kinds of problems when the reference genome consists of many short reads/contigs/chromosomes. Some of the problems are nearly unfixable at the point where bwa_fix_xref() gets called. This commit attempts to fix the problem at the root. It disallows chains spanning multiple contigs and never retrieves sequences bridging two adjacent contigs. Thus all the chaining, extension, SW and global alignments are confined to one contig only. This commit brings many changes. I have tested it on a couple examples including Peter Field's PacBio example. It works well so far.	2014-04-10 20:54:27 -04:00
Heng Li	23e0e99ec0	dev-471: fixed a compiling error from last commit	2014-04-10 11:54:17 -04:00
Heng Li	ccbbe48c4f	dev-470: don't stop on bwa_fix_xref2() failures Peter Field has sent me an example caused by an alignment bridging three adjacent chromosomes/contigs. Bwa-mem always aligns the query to the contig covering the middle point of the alignment. In this example, it chooses the middle contig, which should not be aligned. This leads to weird things failing bwa_fix_xref2(), which cannot be fixed unless we build the contig boundaries into the FM-index. In the old code, bwa-mem halts when bwa_fix_xref2() fails. With this commit, bwa-mem will give a warning instead of halting.	2014-04-10 11:43:17 -04:00
Heng Li	99f6f9a0d1	dev-467: limit the max #chains to extend	2014-04-08 21:45:49 -04:00
Heng Li	c0a308a8b6	dev-466: simplified chain filtering	2014-04-08 17:33:07 -04:00
Heng Li	f12dfae772	dev-465: a new output format for read overlap Also moved a few functions to bwamem_extra.c. File bwamem.c is becoming far too long.	2014-04-08 16:29:36 -04:00
Heng Li	172ba83241	dev-463: added option -x to change multiple params I hate to copy-paste long command line options.	2014-04-07 11:29:36 -04:00
Heng Li	114901b005	dev-r462: refined setting for PacBio; weight flt The recommended setting in the last commit is wrong. If we can extend a random seed hit to the full length, we will force the read aligned through break points, which is wrong. The new setting is better but it may lead to a small fraction of fragmented alignments. In addition, I added a filter on the minimum chain weight and tied min_HSP_score to this filter. It doubles the mapping speed.	2014-04-04 17:01:04 -04:00
Heng Li	41f720dfa7	dev-461: added a heuristic for PacBio data See the comment above mem_test_chain_sw() for details.	2014-04-04 16:05:41 -04:00
Heng Li	b6bd33b26c	dev-459: don't hard code the drop ratio In the old code, if a secondary alignment is 50% worse, it won't be outputted.	2014-04-03 18:58:49 -04:00
Heng Li	b3225581be	dev-458: simplified the smem iterator simpler but less powerful.	2014-04-03 15:23:48 -04:00
Heng Li	acfe7613db	dev-457: separated interval collection and seeding	2014-04-03 15:10:50 -04:00
Heng Li	9a5705289c	added more debugging information I can see a bug, but I do not know where it comes from.	2014-04-03 13:38:08 -04:00
Heng Li	9ce50a4e5e	dev-450: support diff ins/del penalties. NO TEST!!	2014-03-28 14:54:06 -04:00
Heng Li	8f9aeef4ec	Merge branch 'master' into dev Conflicts: main.c	2014-03-17 00:03:52 -04:00
Heng Li	e6931bec03	r445: unnecessarily large bandwidth in global	2014-03-17 00:01:00 -04:00
Heng Li	7d63e76245	r444: more debugging output in CIGAR generation Also found a potential issue which should not affect accuracy but may hurt speed. Will investigate later.	2014-03-16 23:25:04 -04:00
Heng Li	8929bd1c25	r443: more verbose debugging information	2014-03-16 15:18:58 -04:00
Heng Li	2e9463ebf1	dev-r442: suppress exact full-length matches	2014-02-26 22:04:19 -05:00
Heng Li	52391a9855	r437: print timing for each batch of reads	2014-02-19 10:54:26 -05:00
Heng Li	f524c7d3d8	r431: added the MD tag to bwa-mem	2014-01-29 12:05:11 -05:00
Heng Li	ea3dc2f003	r430: fix a bug producing incorrect alignment Ksw uses two rounds of SSE2-SW to find the boundaries of an alignment. If the second round gives a different score from the first round, it will fail. The fix checks if this happens, though I have not dug into an example to understand why this may happen in the first place.	2014-01-29 10:51:02 -05:00
Heng Li	10cb6b0507	r428: allow to change the default chain_drop_ratio	2013-12-30 16:18:45 -05:00
Heng Li	3afcdc7746	debugging code only: print seeds	2013-12-30 16:05:43 -05:00
Heng Li	74a1a53499	print debugging msg to stdout	2013-12-30 15:49:41 -05:00
Heng Li	4219e58623	r423: bugfix - SE hits not random	2013-11-23 09:36:26 -05:00
Heng Li	ff4762f3c7	r421: bw doubling in the final alignment In some cases, the band width used in the final alignment needs to be larger than the band width in extension.	2013-11-20 10:04:16 -05:00
Heng Li	6e3fa0515a	r420: inferred bandwidth is not used in the final	2013-11-20 09:50:46 -05:00
Heng Li	deb19593aa	r418: use the new mapQ estimator by default	2013-11-02 12:25:53 -04:00
Heng Li	19d33faa30	use kthread for multi-threading Bwa-mem should have better performance with many CPU cores.	2013-11-02 12:13:11 -04:00
Heng Li	7144a0cefc	r415: bug in the new (optional) mapQ computation I may use the new method as the default. Testing needed.	2013-09-09 17:51:05 -04:00
Heng Li	ebb7b02e9b	r414: fixed a bug caused by the last commit	2013-09-09 16:57:55 -04:00
Heng Li	b51a66e4c1	r413: fixed an issue causing redundant alignment I have seen a fosmid aligned to the same position but with two slightly different CIGARs: 30000M and 29900M50D100M, possibly caused by tandem repeats. 0.7.5a will regard them as two distinct alignments and generates a very small mapping quality. However, these two are essentially the same. Although there is ambiguity in aligning the end of the fosmid, we should not penalize the entire alignment with a small mapQ. This commit fixes this issue. More testing is needed, though.	2013-09-09 11:36:50 -04:00
Heng Li	1e2cff20ba	more conservative mapQ	2013-09-09 08:57:45 -04:00
Heng Li	1346f03ff1	use the old mapQ by default the new mapQ overestimates	2013-09-06 14:04:41 -04:00
Heng Li	451d60f3be	slight modification	2013-09-06 12:37:38 -04:00
Heng Li	623da055e1	alternative way to estimate mapQ the old mapQ estimate is too conservative	2013-09-06 12:31:47 -04:00
Heng Li	3b84c03c1e	r406: allow to use diff clipping penalties for 5'-end or for 3'-end	2013-08-28 15:59:05 -04:00
Heng Li	bde5005f39	r396: er... the new tag is named SA not SP	2013-05-23 12:48:18 -04:00
Heng Li	3d2450ed97	r395: bugfix - hard clipping not applied on revaln	2013-05-23 12:45:14 -04:00
Heng Li	9441bb7f2a	r394: added future plan	2013-05-22 20:02:53 -04:00
Heng Li	0e759bc1f5	removed a redundant flag	2013-05-22 19:55:07 -04:00
Heng Li	9735d7a31a	conform to the latest (unpublished) SAM spec for chimeric alignments	2013-05-22 19:45:16 -04:00
Heng Li	9a6abe51b6	r391: better method to resolve xref alignment The old method does not work when the alignment bridges three chr. This may actually happen often. The new method does not work all the time, either, but should be better than the old one. It is also simpler, arguably.	2013-05-22 18:57:51 -04:00
Rob Davies	96e445d9e4	Reduce dependency on utils.h - new malloc wrapping scheme. Remove xmalloc, xcalloc, xrealloc and xstrdup from utils.h and revert calls to the normal malloc, calloc, realloc, strdup. Add new files malloc_wrap.[ch] with the wrapper functions. malloc_wrap.h #defines malloc etc. to the wrapper, but only if USE_MALLOC_WRAPPERS has been defined. Put #include "malloc_wrap.h" in any file that uses *alloc or strdup. This is also in a #ifdef USE_MALLOC_WRAPPERS ... #endif block to make using the wrappers optional. Add -DUSE_MALLOC_WRAPPERS into the makefile so they should normally get added. This is an improvement on the previous method as we now don't need to worry about stray function calls that were not changed to the wrapped version and the code will still work even if the wrapping is disabled. Other possible methods of doing this are using malloc_hook (glibc-specific), adding -include malloc_wrap.h to the gcc command-line (somewhat gcc-specific) or making our own malloc function and using dlopen (scary). This way is probably the most portable.	2013-05-02 15:12:01 +01:00
Rob Davies	e88529687f	Merge branch 'master' into master_fixes. Merged up to r389. Conflicts: bwamem.c kopen.c	2013-04-29 12:09:30 +01:00
Heng Li	19cb7cd7ed	r388: cleanup mem_process_seqs() interface Print output outside the function and allow to feed insert size distribution.	2013-04-26 12:31:18 -04:00
Rob Davies	4cb5110d03	Merge branch 'master' into master_fixes	2013-04-22 09:51:07 +01:00
Heng Li	2087dc162f	r377: increased unpaired penalty from 9 to 17 This leads to more aggressive pairing - more properly paired reads. I have found a few cases where, for example, read1 is unambiguously mapped to chr20 while its 100bp mate has a perfect match to another chr but has 3 mismatches and 1 deletion when it is paired with read1 on chr20. With longer reads, it seems that the chr20 hit is correct, although it is not obvious how this happened in evolution.	2013-04-17 16:50:20 -04:00
Rob Davies	3dd10bd7db	Merge branch 'master' into master_fixes	2013-04-12 16:20:13 +01:00
Rob Davies	90ecd344ba	Merge branch 'master' into master_fixes. Merged up to master r375. Conflicts: bwt.c	2013-04-11 11:15:39 +01:00

294 Commits (main)