Define MM_CIGAR_STR to the full string of CIGAR operators (including
the 'B' operator as well) and use it throughout the C code.
It would be possible to use it from the Cython code too, but it's easier
to keep that as a Cython string literal to avoid adding extra runtime
code to handle locale conversion.
We may use a large --end-bonus to mimic end-to-end alignment. In the short-read
mode, the candidate alignment region may be out of the band, which leads to
truncated alignment.
Fix the logic that calculates the number of CIGAR entries when
match "M" entries are expanded into "=" and "X". The number
of entries depends not on the number of mismatches but rather
on the number of transitions between "=" to "X".
When aligning long reads, we would prefer to align through low-quality
regions. This requires a large Z-drop threshold. However, to find small
inversions, we need to use a small Z-drop. This commit address this
conflict with two Z-drop thresholds. When Z-drop exceeds the smaller
threshold, we perform a local alignment to check if there is a potential
inversion. If there is one, we break the alignment; otherwise we break
the alignment only if Z-drop excess the larger threshold.
This commit also fixes a bug that reported wrong coordinates when the
inversion is on the forward strand (#112).
[PMID:18688272] shows that the base following GT tends to be A or G (i.e. R) in
both human and yeast, and that the base preceeding AG tends to be C or T (i.e.
Y). In the new model, we pay no cost to GTr..yAG, but we pay half of the cost
if there is no r or y. This improves the junction accuracy when mapping to
human and mouse and decreases the accuacy when mapping to SIRV. My guess is
that SIRV does not honor this trend. Need to investigate in future.
Also in this commit, --cost-non-gt-ag is aliased to -C. The default is changed
to 9 instead of 5. I also added --splice-flank to enable the above model. This
may become the default once I confirm my hypothesis on SIRV.
It happened when the query HPC minimizer is longer than the reference HPC
minimizer close to the beginning of a contig. We may get a negative coordinate,
which causes an assertion failure.