156 Commits (c41518ae85ce39dcb8c9e2e8cd483d00a3d98eae)

Author SHA1 Message Date
Heng Li 154d2caf5b r784: support the =/X CIGAR operators (#156) 2018-05-30 16:11:22 -04:00
Heng Li 734ac379bb r770: matching N bases not working properly (#155) 2018-04-30 19:55:23 -04:00
Heng Li ee4cd089f7 r763: fine control long join flank len (#128) 2018-03-29 14:16:58 -04:00
Heng Li 08bd2123b6 r752: option to copy comments to output (#136) 2018-03-23 10:04:33 -04:00
Heng Li 8766d286df r751: optionally output MD (#118) 2018-03-22 14:15:33 -04:00
Heng Li bdc615c1d4 r741: added --min-occ-floor to improve #107 2018-03-12 14:32:27 -04:00
Heng Li 24a4808826 r718: retrieve sequence from the index 2018-02-23 10:18:26 -05:00
Heng Li 1372977a37 r708: implemented double Z-drop thresholds (#112)
When aligning long reads, we would prefer to align through low-quality
regions. This requires a large Z-drop threshold. However, to find small
inversions, we need to use a small Z-drop. This commit addresses this
conflict with two Z-drop thresholds. When Z-drop exceeds the smaller
threshold, we perform a local alignment to check if there is a potential
inversion. If there is one, we break the alignment; otherwise we break
the alignment only if Z-drop exceeds the larger threshold.

This commit also fixes a bug that reported wrong coordinates when the
inversion is on the forward strand (#112).
2018-02-15 10:50:49 -05:00
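
The decision rule described in r708 can be sketched as follows. This is an illustrative Python sketch, not minimap2's C implementation: the function name `should_break`, the parameter names, and the threshold values in the usage note are hypothetical, and the local alignment that looks for an inversion is represented by a caller-supplied callable.

```python
def should_break(zdrop, small_threshold, large_threshold, has_inversion):
    """Decide whether to break an alignment under two Z-drop thresholds.

    zdrop            -- the current Z-drop value
    small_threshold  -- the smaller threshold, used to look for inversions
    large_threshold  -- the larger threshold, used to align through
                        low-quality regions
    has_inversion    -- hypothetical callable standing in for the local
                        alignment that checks for a potential inversion
    """
    if zdrop <= small_threshold:
        # Below both thresholds: keep extending.
        return False
    if has_inversion():
        # Z-drop exceeded the smaller threshold and an inversion was found.
        return True
    # No inversion: break only if the larger threshold is exceeded too.
    return zdrop > large_threshold
```

With hypothetical thresholds of 100 and 400, a Z-drop of 150 breaks the alignment only when the inversion check succeeds, while a Z-drop of 500 breaks it either way.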
Heng Li 7ef5490884 r703: added --max-clip-ratio
still testing the option
2018-02-12 13:29:18 -05:00
Heng Li 29b4a1786c r685: tune end seed filter again 2018-02-05 11:48:22 -05:00
Heng Li dbf284b2d9 r684: separate end score from min_chain_score 2018-02-05 11:40:38 -05:00
Heng Li da6947cfa3 r671: cleanup command line options 2018-01-31 13:59:52 -05:00
Heng Li 46d6349af4 r670: added PE support to mappy
and minor code cleanup
2018-01-31 11:33:08 -05:00
Heng Li 123bc1d91d put option operations in another file 2018-01-26 08:38:37 -05:00
Heng Li 33f8157961 r655: options to map to one strand of the ref #91 2018-01-16 10:34:30 -05:00
Heng Li e420b17496 r629: API to construct index from strings 2017-12-18 22:29:46 -05:00
Heng Li ab345e600b r626: function to check incorrect scoring system 2017-12-13 12:23:43 -05:00
Heng Li 98a6e52c06 r618: heuristics to avoid tiny terminal exons 2017-12-11 00:57:55 -05:00
Heng Li 704ff9f4c6 r607: estimate sequence divergence
Currently using the simplest method. There may be a more accurate estimate.
2017-12-06 16:14:39 -05:00
Heng Li 2f463b1db0 r573: prepare to generalize index 2017-11-11 19:54:06 -05:00
mvdbeek 1cb0bf4bef Implement -Y for soft clipping of supp. alignments
I tried to base this on bwa-mem and it seems to work for sam alignments.
2017-11-09 19:22:36 +01:00
Heng Li b24d68ae9f r557: fixed another mapq underestimate
When a chain is split during base-level alignment, its chaining score is
reduced. However, the chaining score of its suboptimal chain remains the same.
This leads to underestimated mapping quality.
2017-11-07 23:20:49 -05:00
Heng Li fa5a645ca5 r552: fixed a tiny typo on struct packing
The old packing wastes memory, though very little.
2017-11-05 08:27:26 -05:00
Heng Li cd24dc8834 r545: removed option -i, not working well 2017-10-31 22:23:27 -04:00
Heng Li 79b0caca95 r537: model the next base to GT/AG
[PMID:18688272] shows that the base following GT tends to be A or G (i.e. R) in
both human and yeast, and that the base preceding AG tends to be C or T (i.e.
Y). In the new model, we pay no cost to GTr..yAG, but we pay half of the cost
if there is no r or y. This improves the junction accuracy when mapping to
human and mouse and decreases the accuracy when mapping to SIRV. My guess is
that SIRV does not honor this trend. Need to investigate in the future.

Also in this commit, --cost-non-gt-ag is aliased to -C. The default is changed
to 9 instead of 5. I also added --splice-flank to enable the above model. This
may become the default once I confirm my hypothesis on SIRV.
2017-10-28 00:25:01 -04:00
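
One possible reading of the r537 scoring rule is sketched below in Python. It is not the actual minimap2 code. The function names are hypothetical, the donor and acceptor sides are penalized independently (an assumption, since the message does not say how the two sides combine), and only the forward-strand GT..AG signal is handled. The default `cost=9` comes from the new `-C` default named in the message.

```python
def donor_penalty(intron_start, cost=9):
    """Penalty for the first three intron bases, e.g. 'GTA'.

    No penalty for GT followed by A or G (R), half the cost for GT
    followed by another base, and the full cost for a non-GT donor.
    """
    bases = intron_start.upper()
    if bases[:2] != "GT":
        return cost
    return 0 if bases[2] in "AG" else cost / 2


def acceptor_penalty(intron_end, cost=9):
    """Penalty for the last three intron bases, e.g. 'CAG'.

    No penalty for C or T (Y) followed by AG, half the cost for AG
    preceded by another base, and the full cost for a non-AG acceptor.
    """
    bases = intron_end.upper()
    if bases[1:] != "AG":
        return cost
    return 0 if bases[0] in "CT" else cost / 2
```

Under these assumptions an intron that starts with `GTA` and ends with `CAG` pays nothing, while one that starts with `GTC` pays half of the non-GT-AG cost on the donor side.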
Heng Li d4b5dfc297 r533: added --no-pairing
to prevent the use of any pairing information for paired-end reads.
2017-10-23 14:09:32 -04:00
Heng Li 306e4541f8 Released minimap2-2.3 (r531) 2017-10-22 23:13:35 -04:00
Heng Li 4683da2455 r520: added option -L to write long cigar to CG 2017-10-17 17:32:44 -04:00
Heng Li adf6cd7f52 r513: merged pre- and post-cigar blen and mlen
This saves a bit of memory and is cleaner.
2017-10-16 10:55:18 -04:00
Heng Li e6f525edaf r512: option to filter poorly aligned reads 2017-10-16 10:38:22 -04:00
Heng Li 7c555f9b7e r508: use two I/O threads for mapping
-x sr applies this option by default
2017-10-12 14:56:01 -04:00
Heng Li 7345621759 r499: end bonus working; DP region needs improvement! 2017-10-11 00:14:25 -04:00
Heng Li 61e56c941d r488: parameter to control max fragment length 2017-10-07 23:54:32 -04:00
Heng Li 9c5767f9ed r477: renamed multi_seg to frag_mode 2017-10-05 15:48:17 -04:00
Heng Li ae2adf04d4 r476: multi-file fragment mode working 2017-10-05 15:39:26 -04:00
Heng Li f4a5d3a692 r474: replaced -S and --cs-no-equal with --cs 2017-10-05 15:03:03 -04:00
Heng Li 5ab99eb26e more accurate SAM flag 2017-10-05 10:59:38 -04:00
Heng Li 9aba11769c r467: added : (equal length) and ^ (intron) ops 2017-10-04 21:55:37 -04:00
Heng Li 7d50e646dd r466: detect multi-part index more smartly
though it might not work in an extremely rare case: the end of a sequence ends
at X*16384 and it is the last sequence in a batch. This can be resolved by
never letting the kstream_t buffer empty.
2017-10-04 17:32:58 -04:00
Heng Li 2581c44a21 r463: optionally disable secondary hits 2017-10-04 13:24:41 -04:00
Heng Li 2a1e738a94 r461: randomize repetitive hits 2017-10-04 13:05:18 -04:00
Heng Li cf55c84056 r460: added option --no-long-join 2017-10-04 12:08:44 -04:00
Heng Li 04fb2c2ec0 r454: rechain with higher max_occ if no good chain 2017-09-29 19:24:32 -04:00
Heng Li 7e0d70bfd3 r445: pair coordinate adjustment working
Next: mapq adjustment, which will be tricky...
2017-09-27 15:38:18 -04:00
Heng Li a349d85280 r444: changed the way orientation is specified
The old model doesn't work with RF or RR orientation. The new model only works
with paired-end reads. For >2 segments, only FF is supported.
2017-09-27 12:33:10 -04:00
Heng Li f611edf6f2 r443: don't filter small cm for split seg 2017-09-26 16:17:58 -04:00
Heng Li 3bb66e1ed3 multi-seg working on toy examples 2017-09-25 13:42:04 -04:00
Heng Li f0951141a1 allow to read multiple files interleaved 2017-09-24 14:33:05 -04:00
Heng Li 645db3350e Merge branch 'master' into sr 2017-09-20 11:15:14 -04:00
Heng Li 75e6bbc9f6 r421: removed the MM_F_SPLICE_BOTH mode
In the default splice mode, minimap2 applies two rounds of spliced alignment:
first assuming GT-AG to be the splice signal across all splicing sites and then
assuming CT-AC to be the signal. This is the ideal strategy.

In the MM_F_SPLICE_BOTH mode, minimap2 applies one round of spliced alignment,
assuming GT-AG and CT-AC to be the splice signals AT THE SAME TIME. This will
be faster but less accurate. I don't think anyone would like to run minimap2 in
this mode, so I am removing it for clarity.
2017-09-20 11:11:53 -04:00
Heng Li 7a9b4db874 replaced --approx-ext with --sr
--sr disables Z-drop and may come with other heuristics
2017-09-20 10:51:18 -04:00
Heng Li fb1bcc0084 early exploration 2017-09-19 16:18:28 -04:00
Heng Li 75ff7ceec5 r368: API documentation 2017-09-14 22:23:04 -04:00
Heng Li e2823d4aee r367: index reader optionally writes index 2017-09-14 21:18:13 -04:00
Heng Li eb00521d9b redesigned indexing and option APIs 2017-09-14 17:02:01 -04:00
Heng Li 0f7455cefa r365: documented the "sr" preset 2017-09-14 12:57:21 -04:00
Heng Li 3c91d652dd r360: allow to set integer max occ 2017-09-13 11:37:00 -04:00
Heng Li d7f2ac1d4f better parameters for short reads
It turns out the key problem is not the minimizer density. It is the max
occurrence that tends to affect results more, especially sensitivity. There is
still lots of work to do, but for now, it seems a good start.
2017-09-12 16:11:23 -04:00
Heng Li 0fe1a224ab r309: improved SAM header output 2017-08-25 10:35:58 +08:00
Heng Li 2cde8d257c r297: bidirectional RNA alignment 2017-08-17 06:02:44 -04:00
Heng Li b5f5929bf9 r296: expose splicing related options to CLI 2017-08-13 21:37:51 -04:00
Heng Li 43506edbc5 backup: preliminary boundary alignment 2017-08-12 23:10:14 -04:00
Heng Li d240318741 r287: refined CLI options and manpage 2017-08-12 12:26:04 -04:00
Heng Li 1a7d782131 r273: cdna mapping mode for testing
Differences from the typical mapping mode:

* banded alignment disabled
* log gap cost during chaining
* zero long-gap extension during alignment
* up to 100kb (by default) reference gap
* bad seeding not filtered (to tune later)
2017-08-08 11:31:49 -04:00
Heng Li 4c0713ee14 r235: optionally output tag cs in PAF
cs encodes the query, the reference sequence and CIGAR.
2017-07-31 12:06:49 -04:00
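
To illustrate how a cs string carries the CIGAR, the Python sketch below converts a cs value to a CIGAR string. It assumes the cs syntax documented for later minimap2 releases (`:n` or `=SEQ` for matches, `*xy` for a substitution, `+seq` for an insertion, `-seq` for a deletion, `~xxNyy` for an intron); the encoding at r235 may have differed. The function name `cs_to_cigar` is hypothetical.

```python
import re

# One cs operation, per the syntax assumed in the paragraph above.
_CS_TOKEN = re.compile(
    r":[0-9]+|\*[a-z]{2}|[=+-][A-Za-z]+|~[a-z]{2}[0-9]+[a-z]{2}"
)


def cs_to_cigar(cs):
    """Convert a cs string to a CIGAR string using M, I, D and N."""
    ops = []  # list of [length, op] pairs, merged as we go
    pos = 0
    for match in _CS_TOKEN.finditer(cs):
        if match.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        pos = match.end()
        token = match.group()
        kind, body = token[0], token[1:]
        if kind == ":":
            length, op = int(body), "M"       # run of identical bases
        elif kind == "=":
            length, op = len(body), "M"       # identical bases spelled out
        elif kind == "*":
            length, op = 1, "M"               # single-base substitution
        elif kind == "+":
            length, op = len(body), "I"       # insertion to the reference
        elif kind == "-":
            length, op = len(body), "D"       # deletion from the reference
        else:
            length, op = int(body[2:-2]), "N"  # intron: ~ + 2 bases + len + 2 bases
        if ops and ops[-1][1] == op:
            ops[-1][0] += length              # merge adjacent identical ops
        else:
            ops.append([length, op])
    if pos != len(cs):
        raise ValueError(f"unexpected text at offset {pos}")
    return "".join(f"{length}{op}" for length, op in ops)
```

For example, `cs_to_cigar(":6-ata:10+gtc:4*at:3")` yields `6M3D10M3I8M`, because the substitution and the matches on either side of it collapse into a single `M` run.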
Heng Li 19d6ec885e r224: inversion alignment around Z-drop break 2017-07-29 13:09:10 -04:00
Heng Li f81f37fef1 r197: allocate index seq names from kalloc
to reduce malloc() overhead.
2017-07-24 19:36:05 -04:00
Heng Li 5c4d040b13 r191: warning if CLI index opt diff from prebuilt
Also added index testing API (moved from main.c to index.c)
2017-07-19 10:25:11 -04:00
Heng Li 71c988f6ab r188: renamed bseq* to mm_bseq*
to avoid naming collisions between minimap2 and bwa/fermi-lite/etc
2017-07-19 09:26:46 -04:00
Heng Li b4280d186f r176: removed seedcov_ratio; changed default opt
min_seedcov_ratio is not used
2017-07-12 12:47:46 -04:00
Heng Li 801bc84b01 r169: output more accurate col. 10&11 to PAF
In r168, col.10 is smaller than what it should be. This confuses miniasm.
2017-07-11 14:09:51 -04:00
Heng Li cc554aee43 r159: use two-piece gap penalty 2017-07-08 10:26:00 -04:00
Heng Li 9823317e8f r158: optionally ignore base quality 2017-07-05 18:23:50 -04:00
Heng Li 53c4bf5e4f r149: introduced debugging flags on CLI 2017-07-03 11:02:32 -04:00
Heng Li 632b8638d2 r144: adjust primary aln after cigar 2017-07-02 22:43:02 -04:00
Heng Li 74d306a596 fixed bug when retaining 2ndary aln; still buggy 2017-07-02 19:08:30 -04:00
Heng Li 426c2975f6 r126: filter by fraction of seed coverage
otherwise we may get too many poor overlap mappings.
2017-06-30 22:15:45 -04:00
Heng Li d11049eb32 r120: use max-scoring seg to control output
much better now
2017-06-30 14:21:44 -04:00
Heng Li 52b4d8e2c9 r115: set primary tag; still buggy 2017-06-29 23:48:35 -04:00
Heng Li 11167f511b r112: output z-drop 2017-06-29 22:08:46 -04:00
Heng Li c8d122bcdb backup 2017-06-29 11:11:15 -04:00
Heng Li bcd9b1c621 r93: fixed various small issues 2017-06-28 10:35:21 -04:00
Heng Li fa80177e58 r89: added minimal number of minimizer counts 2017-06-27 18:43:15 -04:00
Heng Li 640b1a1727 command-line option to control CIGAR output 2017-06-26 11:41:09 -04:00
Heng Li b1077ff14c sam output 2017-06-25 22:05:20 -04:00
Heng Li aa5881e7bb backup 2017-06-24 22:51:31 -04:00
Heng Li 35b84f88c6 backup 2017-06-23 22:42:15 -04:00
Heng Li 4fea3d778a backup 2017-06-23 18:57:00 -04:00
Heng Li 6c8368c24c get the left-extension sequence correctly 2017-06-23 18:25:47 -04:00
Heng Li 990f7b0b71 backup 2017-06-23 15:13:53 -04:00
Heng Li 4ae0b46972 min_ksw_len 2017-06-23 14:38:28 -04:00
Heng Li 9cd313eae1 sequence retrieval working 2017-06-23 14:11:56 -04:00
Heng Li 326d91deb0 backup 2017-06-23 14:06:00 -04:00
Heng Li 44cdd18de0 start to work on alignment 2017-06-23 13:44:45 -04:00
Heng Li b04e4b9215 r36: bring back primary; don't output all mappings 2017-06-08 15:28:19 -04:00
Heng Li 19e43571c1 r34: removed a bit of unused code 2017-06-07 14:35:57 -04:00
Heng Li 8ad5cfde42 output PAF 2017-06-07 14:18:32 -04:00
Heng Li 6d4348db44 dp chaining mostly works, but fails sometimes
which means there are bugs that need to be fixed
2017-06-06 14:19:50 -04:00
Heng Li 1a9fc04cf0 backup 2017-06-06 10:16:33 -04:00
Heng Li acc7382a30 backup 2017-06-04 16:09:45 -04:00