r539: use --splice-flank=yes by default

In human/mouse, the GTr..yAG pattern occurs in 91%/92% of all GT-AG introns.
Modeling r..y clearly leads to higher accuracy. However, in SIRV, this
percentage is reduced to ~60%, so the default "--splice --splice-flank=yes"
leads to lower accuracy there. If someone benchmarks minimap2 on SIRV, this
would be bad, but minimap2 is developed for practical applications, not for
benchmarks. I will live with that.
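
To make the GTr..yAG pattern above concrete, here is a minimal sketch of how one might classify a single forward-strand intron. It is not minimap2's scoring code: `classify_intron` and its return codes are hypothetical names introduced only for illustration, and the reverse-strand case (CT..AC) is deliberately left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper for illustration only; not part of minimap2.
 * `intron` points to the first base of a forward-strand intron of
 * length `len`.
 * Returns 0 if the intron is not GT..AG,
 *         1 if it is GT..AG but lacks the full r..y flank,
 *         2 if it matches GTr..yAG (r = A/G after GT, y = C/T before AG). */
static int classify_intron(const char *intron, size_t len)
{
    assert(intron != NULL);
    if (len < 6) return 0; /* need room for GTr and yAG without overlap */

    /* Upper-case the six bases we inspect so soft-masked input works too. */
    int d0 = toupper((unsigned char)intron[0]);       /* donor G    */
    int d1 = toupper((unsigned char)intron[1]);       /* donor T    */
    int r  = toupper((unsigned char)intron[2]);       /* base after GT  */
    int y  = toupper((unsigned char)intron[len - 3]); /* base before AG */
    int a0 = toupper((unsigned char)intron[len - 2]); /* acceptor A */
    int a1 = toupper((unsigned char)intron[len - 1]); /* acceptor G */

    if (d0 != 'G' || d1 != 'T' || a0 != 'A' || a1 != 'G') return 0;

    int is_r = (r == 'A' || r == 'G'); /* purine after the donor GT       */
    int is_y = (y == 'C' || y == 'T'); /* pyrimidine before the acceptor AG */
    return (is_r && is_y) ? 2 : 1;
}
```

Counting how many GT-AG introns in an annotation return 2 rather than 1 would give the kind of percentage quoted above (91%/92% in human/mouse, ~60% in SIRV).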
Heng Li 2017-10-28 22:29:55 -04:00
parent f22a94e868
commit 192217a10c
3 changed files with 27 additions and 7 deletions

main.c

@@ -6,7 +6,7 @@
#include "mmpriv.h"
#include "getopt.h"
-#define MM_VERSION "2.3-r538-dirty"
+#define MM_VERSION "2.3-r539-dirty"
#ifdef __linux__
#include <sys/resource.h>

map.c

@@ -105,7 +105,7 @@ int mm_set_opt(const char *preset, mm_idxopt_t *io, mm_mapopt_t *mo)
mo->mini_batch_size = 50000000;
} else if (strcmp(preset, "splice") == 0 || strcmp(preset, "cdna") == 0) {
io->is_hpc = 0, io->k = 15, io->w = 5;
-mo->flag |= MM_F_SPLICE | MM_F_SPLICE_FOR | MM_F_SPLICE_REV;
+mo->flag |= MM_F_SPLICE | MM_F_SPLICE_FOR | MM_F_SPLICE_REV | MM_F_SPLICE_FLANK;
mo->max_gap = 2000, mo->max_gap_ref = mo->bw = 200000;
mo->a = 1, mo->b = 2, mo->q = 2, mo->e = 1, mo->q2 = 32, mo->e2 = 0;
mo->noncan = 9;


@@ -220,7 +220,9 @@ costs
In the splice mode, the second gap penalties are not used.
.TP
.BI -C \ INT
-Cost for a non-canonical GT-AG splicing [0]
+Cost for a non-canonical GT-AG splicing (effective with
+.BR --splice )
+[0]
.TP
.BI -z \ INT
Break an alignment if the running score drops too quickly along the diagonal of
@@ -243,7 +245,25 @@ both strands;
no attempt to match GT-AG [n]
.TP
.BI --end-bonus \ INT
-Score bonus when alignment extends to the end of the query sequence [10].
+Score bonus when alignment extends to the end of the query sequence [0].
+.TP
+.BR --splice-flank [= yes | no ]
+Assume the base following a
+.B GT
+donor site tends to be A/G (91% in human and 92% in mouse) and the base
+preceding an
+.B AG
+acceptor tends to be C/T [yes with
+.BR --splice ].
+This trend is evolutionarily conserved, all the way to S. cerevisiae
+(PMID:18688272). Specifying this option generally leads to higher junction
+accuracy by several percent, so it is applied by default with
+.BR --splice .
+However, the SIRV control does not honor this trend
+(only ~60%), so this option reduces accuracy there. If you are benchmarking minimap2
+on SIRV data, please add
+.B --splice-flank=no
+to the command line.
.SS Input/output options
.TP 10
.B -a
@@ -261,7 +281,7 @@ the real CIGAR in memory.
.TP
.BI -R \ STR
SAM read group line in a format like
-.RB @RG\\\\tID:foo\\\\tSM:bar
+.B @RG\\\\tID:foo\\\\tSM:bar
[].
.TP
.B -c
@@ -371,8 +391,8 @@ is that this preset is not using HPC minimizers.
.B splice
Long-read spliced alignment
.RB ( -k15
-.B -w5 --splice -g2000 -G200k -A1 -B2 -O2,32 -E1,0 -C9 -z200
-.BR -ub ).
+.B -w5 --splice -g2000 -G200k -A1 -B2 -O2,32 -E1,0 -C9 -z200 -ub
+.BR --splice-flank=yes ).
In the splice mode, 1) long deletions are taken as introns and represented as
the
.RB ` N '