In the bwa.c and bwase.c calls, rlen is an int64_t returned from
bns_get_seq() and is the number of reference bases covered by the
alignment; l_query/len is an int and the query length of the alignment;
and the result is an int given to an int parameter of ksw_global[2]().
As even the result is int and as rlen is effectively bounded by the
maximum length of a reference sequence, we maintain the status quo in
this code and simply cast rlen to int to silence Clang's "use llabs()"
(llabs() would not be a great answer given an int64_t anyway).
The bwtsw2_pair.c call needs to remain fabs() so both divisions are
done in floating point; cast to double to prevent Clang suggesting
changing the call to integer abs().
This function causes all kinds of problems when the reference genome consists
of many short reads/contigs/chromsomes. Some of the problems are nearly
unfixable at the point where bwa_fix_xref() gets called. This commit attempts
to fix the problem at the root. It disallows chains spanning multiple contigs
and never retrieves sequences bridging two adjacent contigs. Thus all the
chaining, extension, SW and global alignments are confined to on contig only.
This commit brings many changes. I have tested it on a couple examples
including Peter Field's PacBio example. It works well so far.
Peter Field has sent me an example caused by an alignment bridging three
adjacent chromosomes/contigs. Bwa-mem always aligns the query to the contig
covering the middle point of the alignment. In this example, it chooses the
middle contig, which should not be aligned. This leads to weird things failing
bwa_fix_xref2(), which cannot be fixed unless we build the contig boundaries
into the FM-index.
In the old code, bwa-mem halts when bwa_fix_xref2() fails. With this commit,
bwa-mem will give a warning instead of halting.
The old method does not work when the alignment bridges three chr. This may
actually happen often. The new method does not work all the time, either, but
should be better than the old one. It is also simpler, arguably.
Remove xmalloc, xcalloc, xrealloc and xstrdup from utils.h and revert calls
to the normal malloc, calloc, realloc, strdup. Add new files malloc_wrap.[ch]
with the wrapper functions. malloc_wrap.h #defines malloc etc. to the
wrapper, but only if USE_MALLOC_WRAPPERS has been defined.
Put #include "malloc_wrap.h" in any file that uses *alloc or strdup. This
is also in a #ifdef USE_MALLOC_WRAPPERS ... #endif block to make using the
wrappers optional. Add -DUSE_MALLOC_WRAPPERS into the makefile so they
should normally get added.
This is an improvement on the previous method as we now don't need to
worry about stray function calls that were not changed to the wrapped version
and the code will still work even if the wrapping is disabled.
Other possible methods of doing this are using malloc_hook (glibc-specific),
adding -include malloc_wrap.h to the gcc command-line (somewhat
gcc-specific) or making our own malloc function and using dlopen (scary).
This way is probably the most portable.
In this case, bwa_fix_xref() will return insane coordinates. The old version
did not check the return status and write wrong CIGAR. This bug only happen to
very short assembly contigs.
I really hate this: complex and twisted logic for a nasty scenario that almost
never happens to short reads - but it may become serious when the reference
genome consists of many contigs.
On toy examples, the code seems to work. Don't know if it really works...