Commit Graph

208 Commits (9ab95be1bb36d2c1925f3f31b92ea5fbf2bf9d25)

Author SHA1 Message Date
Heng Li 192217a10c r539: use --splice-flank=yes by default
In human/mouse, the GTr..yAG pattern occurs to 91/92% of all GT-AG introns.
Modeling r..y clearly leads to higher accuracy. However, in SIRV, this
percentage is reduced to ~60%. The default "--splice --splice-flank=yes"
leads to lower accuracy. If someone benchmark minimap2 on SIRV, this would be
bad, but minimap2 is developed for practical applications, not for benchmarks.
I will live with that.
2017-10-28 22:29:55 -04:00
Heng Li 79b0caca95 r537: model the next base to GT/AG
[PMID:18688272] shows that the base following GT tends to be A or G (i.e. R) in
both human and yeast, and that the base preceeding AG tends to be C or T (i.e.
Y). In the new model, we pay no cost to GTr..yAG, but we pay half of the cost
if there is no r or y. This improves the junction accuracy when mapping to
human and mouse and decreases the accuacy when mapping to SIRV. My guess is
that SIRV does not honor this trend. Need to investigate in future.

Also in this commit, --cost-non-gt-ag is aliased to -C. The default is changed
to 9 instead of 5. I also added --splice-flank to enable the above model. This
may become the default once I confirm my hypothesis on SIRV.
2017-10-28 00:25:01 -04:00
Heng Li d4b5dfc297 r533: added --no-pairing
to prevent the use of any pairing information for paired-end reads.
2017-10-23 14:09:32 -04:00
Heng Li 306e4541f8 Released minimap2-2.3 (r531) 2017-10-22 23:13:35 -04:00
Heng Li bd04372873 r524: reverted to bwa-mem end bonus
and reduced the cost of clipping when filtering by identity
2017-10-20 16:57:31 -04:00
Heng Li 04cf4ebf5e r518: increased the default -K to 500M
This helps multi-thread performance for ultra-long reads.
2017-10-17 13:21:29 -04:00
Heng Li e6f525edaf r512: option to filter poorly aligned reads 2017-10-16 10:38:22 -04:00
Heng Li 858213d513 r511: fixed wrong primary sam record 2017-10-12 23:02:18 -04:00
Heng Li 7c555f9b7e r508: use two I/O threads for mapping
-x sr applies this option by default
2017-10-12 14:56:01 -04:00
Heng Li 7345621759 r499: end bonus working; DP region needs improve! 2017-10-11 00:14:25 -04:00
Heng Li 9396d9e11b r452: typo in the last commit 2017-10-09 10:05:32 -04:00
Heng Li 198849a716 r491: an ambiguous base costs the same as gap ext 2017-10-09 09:59:42 -04:00
Heng Li 9fea4d16b3 r490: improved short-read extension heuristic
Now we find the best scoring ungapped seeded segment and then extend from it.
There is no gap filling for short reads.
2017-10-08 21:36:34 -04:00
Heng Li 61e56c941d r488: parameter to control max fragment length 2017-10-07 23:54:32 -04:00
Heng Li c6384ed2c8 r482: increased short-read bandwidth to 100
This has very minor effect on speed.
2017-10-06 10:20:32 -04:00
Heng Li e0baf1ad54 r479: a bit code cleanup 2017-10-05 16:15:14 -04:00
Heng Li f266092699 r478: simplied useless code, a tiny bit 2017-10-05 15:56:00 -04:00
Heng Li 9c5767f9ed r477: renamed multi_seg to frag_mode 2017-10-05 15:48:17 -04:00
Heng Li ae2adf04d4 r476: multi-file fragment mode working 2017-10-05 15:39:26 -04:00
Heng Li 5ab99eb26e more accurate SAM flag 2017-10-05 10:59:38 -04:00
Heng Li 7cc4f6f965 r469: first step towards PE SAM 2017-10-05 10:38:09 -04:00
Heng Li 7d50e646dd r466: detect multi-part index more smartly
though it might not work in an extremely rare case: the end of a sequence ends
at X*16384 and it is the last sequence in a batch. This can be resolved by
never letting the kstream_t buffer empty.
2017-10-04 17:32:58 -04:00
Heng Li 1554149158 r465: apply option -x before other options 2017-10-04 13:52:28 -04:00
Heng Li 2581c44a21 r463: optionally disable secondary hits 2017-10-04 13:24:41 -04:00
Heng Li 2a1e738a94 r461: randomize repetitive hits 2017-10-04 13:05:18 -04:00
Heng Li cf55c84056 r460: added option --no-long-join 2017-10-04 12:08:44 -04:00
Heng Li ee9b2773a8 r456: min chain score should >k-mer length
or chain_dp() wastes time on unnecessarily sorting chains with one k-mer.
2017-09-29 22:33:55 -04:00
Heng Li 04fb2c2ec0 r454: rechain with higher max_occ if no good chain 2017-09-29 19:24:32 -04:00
Heng Li 0d4ecd19ee r453: avoid duplicated strcmp() for ava 2017-09-28 15:52:05 -04:00
Heng Li 9541052564 r447: paired-end mapping quality
not as good as I would hope...
2017-09-27 15:39:25 -04:00
Heng Li 7e0d70bfd3 r445: pair coordinate adjustment working
Next: mapq adjustment, which will be tricky...
2017-09-27 15:38:18 -04:00
Heng Li a349d85280 r444: changed the way orientation is specified
The old model doesn't work with RF or RR orientation. The new model only works
with paired-end reads. For >2 segments, only FF is supported.
2017-09-27 12:33:10 -04:00
Heng Li f611edf6f2 r443: don't filter small cm for split seg 2017-09-26 16:17:58 -04:00
Heng Li 1b1dd0cd57 r442: default max_gap to 200 in the sr mode 2017-09-26 13:31:01 -04:00
Heng Li 55d1e4f638 r440: better chain filtering for PE reads 2017-09-26 11:03:36 -04:00
Heng Li 8f25cfa36e r437: fixed uninialized memory on rep_len 2017-09-25 14:22:45 -04:00
Heng Li 81008dd371 r436: working on short reads
The result is mixed - lots of room for tuning
2017-09-25 14:06:29 -04:00
Heng Li 3bb66e1ed3 multi-seg working on toy examples 2017-09-25 13:42:04 -04:00
Heng Li a742f10164 get multi-seg code ready; probably not working yet 2017-09-24 15:17:17 -04:00
Heng Li f0951141a1 allow to read multiple files interleaved 2017-09-24 14:33:05 -04:00
Heng Li 19d8eca3a1 moved array shrinking into chain_dp() 2017-09-20 14:58:57 -04:00
Heng Li 9943e5fdd0 backup 2017-09-20 14:35:46 -04:00
Heng Li 5b39a1b34b Merge branch 'master' into sr 2017-09-20 12:24:08 -04:00
Heng Li e3b5802b2e r424: reduce memory for long query seqs 2017-09-20 12:22:13 -04:00
Heng Li 03d6894517 backup 2017-09-20 11:47:46 -04:00
Heng Li 645db3350e Merge branch 'master' into sr 2017-09-20 11:15:14 -04:00
Heng Li 75e6bbc9f6 r421: removed the MM_F_SPLICE_BOTH mode
In the default splice mode, minimap2 applies two rounds of spliced alignment:
first assuming GT-AG to be the splice signal across all splicing sites and then
assuming CT-AC to be the signal. This is the idea strategy.

In the MM_F_SPLICE_BOTH mode, minimap2 applies one round of spliced alignment,
assuming GT-AG and CT-AC to be the splice signals AT THE SAME TIME. This will
be faster but less accurate. I don't think anyone would like to run minimap2 in
this mode, so I am removing it for clarity.
2017-09-20 11:11:53 -04:00
Heng Li 7a9b4db874 replaced --approx-ext with --sr
--sr disables Z-drop and may come with other heurstics
2017-09-20 10:51:18 -04:00
Heng Li fd14618e61 no effective changes 2017-09-20 10:11:05 -04:00
Heng Li 56014ba3db avoid assertion failure given 0-length reads 2017-09-19 22:30:32 -04:00
Heng Li b99c22840f r414: avoid assertion failure for 0-length reads 2017-09-19 22:21:27 -04:00
Heng Li c04420698e fixed an uninitialized value 2017-09-19 16:21:21 -04:00
Heng Li fb1bcc0084 early exploration 2017-09-19 16:18:28 -04:00
Heng Li e2823d4aee r367: index reader optionally writes index 2017-09-14 21:18:13 -04:00
Heng Li eb00521d9b redesigned indexing and option APIs 2017-09-14 17:02:01 -04:00
Heng Li 4d3768bf26 r364: improved the mapq heuristics
* use repetitive seed lengths, not counts
* compute n_sub to higher accuracy
* use bwa-mem mapq heuristic as a backup

For short single-end reads, minimap2's ROC is not as good as bwa-mem's, but is
close.
2017-09-14 12:37:03 -04:00
Heng Li 6a82a21dee r361: improved mapq for short reads 2017-09-13 15:32:39 -04:00
Heng Li 3c91d652dd r360: allow to set integer max occ 2017-09-13 11:37:00 -04:00
Heng Li c7c3585531 r347: merged mm_map_frag() into mm_map()
mm_map_frag() was separated due to an earlier design that has been rejected.
2017-09-10 15:02:55 -04:00
Heng Li 59c822b722 removed some commented code
which *might* return at some time later
2017-09-09 08:38:39 -04:00
Heng Li f422175e4e r344: avoid unnecessary refName retrieval 2017-09-08 22:44:14 -04:00
Heng Li 101b8bb97d r335: report an error if query can't be opened 2017-09-03 11:54:38 -04:00
Heng Li 0fe1a224ab r309: improved SAM header output 2017-08-25 10:35:58 +08:00
Heng Li 993a2bb521 r301: separate introns from deletions
When an intron is adjacent to a deletion, the old code count both as introns,
which lead to an inaccurate exon boundary.
2017-08-18 15:31:15 +08:00
Heng Li 64c1389e1a Merge branch 'master' into splice 2017-08-17 23:39:27 +08:00
Heng Li bbb37d95f2 support inserting RG lines 2017-08-17 23:34:09 +08:00
Heng Li 2cde8d257c r297: bidirectional RNA alignment 2017-08-17 06:02:44 -04:00
Heng Li d240318741 r287: refined CLI options and manpage 2017-08-12 12:26:04 -04:00
Heng Li 0f4c823b0c r286: ignore introns when computing max seg score 2017-08-12 10:58:16 -04:00
Heng Li c59b0781bc r280: output introns as "N" in the cdna mode 2017-08-09 11:45:02 -04:00
Heng Li 1a7d782131 r273: cdna mapping mode for testing
Differences from the typical mapping mode:

* banded alignment disabled
* log gap cost during chaining
* zero long-gap extension during alignment
* up to 100kb (by default) reference gap
* bad seeding not filtered (to tune later)
2017-08-08 11:31:49 -04:00
Heng Li 4c0713ee14 r235: optionally output tag cs in PAF
cs encodes the query, the reference sequence and CIGAR.
2017-07-31 12:06:49 -04:00
Heng Li 19d6ec885e r224: inversion alignment around Z-drop break 2017-07-29 13:09:10 -04:00
Heng Li 2179e9e24b r221: output SA in the SAM output 2017-07-28 23:08:39 -04:00
Heng Li 254280b8af r216: a bit cleanup; identical output to r215 2017-07-28 11:54:18 -04:00
Heng Li b927838495 r212: better heuristic to fix wrong seeding
but not good enough. Will explore more.
2017-07-27 11:24:51 -04:00
Heng Li e9dc1ce2b6 r205: when computing mapq, consider min_chain_sc
Not doing this was a mistake.
2017-07-26 11:34:14 -04:00
Heng Li 00c6db5073 r203: check more subopt aln if score small 2017-07-25 20:02:44 -04:00
Heng Li 71c988f6ab r188: renamed bseq* to mm_bseq*
to avoid naming collisions between minimap2 and bwa/fermi-lite/etc
2017-07-19 09:26:46 -04:00
Heng Li 71e2a97a4c r180: changed -x asm5 settings 2017-07-18 00:00:36 -04:00
Heng Li b4280d186f r176: removed seedcov_ratio; changed default opt
min_seedcov_ratio is not used
2017-07-12 12:47:46 -04:00
Heng Li 52caf79395 r175: halved max-chain-skip in the ava mode 2017-07-12 10:42:19 -04:00
Heng Li eeeb2ffb68 r174: make max-chain-skip work
The max-chain-skip heuristics did not work due to a bug. Without this
heuristics, chaining is too slow for long-read overlap.
2017-07-12 10:08:06 -04:00
Heng Li 33451aba45 r173: changed the debugging output format 2017-07-11 15:23:28 -04:00
Heng Li 826c8ba892 r170: added a debugging flag
something wrong with chaining
2017-07-11 14:47:35 -04:00
Heng Li 1ac48556ae r167: long join threshold depends on gap
also caught a bug for reverse strand join
2017-07-09 10:38:51 -04:00
Heng Li 42846ce65d r163: reduced long join score requirement
because the chaining score is generally smaller with the last few commits.
2017-07-08 15:51:52 -04:00
Heng Li 38b2830e18 r161: filter bad seeds; changed default -g/-r 2017-07-08 13:31:27 -04:00
Heng Li cc554aee43 r159: use two-piece gap penalty 2017-07-08 10:26:00 -04:00
Heng Li 9823317e8f r158: optionally ignore base quality 2017-07-05 18:23:50 -04:00
Heng Li e07daad7ad r153: sam primary record not set sometimes 2017-07-03 13:18:57 -04:00
Heng Li b625247300 r150: mm_sync_regs() doesn't work with negative id 2017-07-03 11:36:34 -04:00
Heng Li 53c4bf5e4f r149: introduced debugging flags on CLI 2017-07-03 11:02:32 -04:00
Heng Li 2e4fd9f1d0 r148: revamped regs handling after cigar 2017-07-03 10:44:26 -04:00
Heng Li 51cfb60520 r145: changed default -p from 2 to 0.8
For long reads, secondary alignments can be very information.
2017-07-02 22:51:45 -04:00
Heng Li 74d306a596 fixed bug when retaining 2ndary aln; still buggy 2017-07-02 19:08:30 -04:00
Heng Li 41efd03d7a r129: fixed memory leak caused by qualities 2017-06-30 23:48:00 -04:00
Heng Li 426c2975f6 r126: filter by fraction of seed coverage
otherwise we may get too many poor overlap mappings.
2017-06-30 22:15:45 -04:00
Heng Li 646a746cdc r122: filter contained aln after DP extension 2017-06-30 15:23:30 -04:00
Heng Li fce87ce7bd r121: output QUAL and unmapped to SAM 2017-06-30 14:40:54 -04:00