added algorithm overview

Heng Li 2017-07-27 18:50:39 -04:00
parent 2c79580649
commit 667b32a516
2 changed files with 52 additions and 9 deletions


@ -40,11 +40,56 @@ will run a little slower. At present, minimap2 does not work with non-x86 CPUs
or ancient CPUs that do not support SSE2. SSE2 is critical to the performance
of minimap2.
## Algorithm Overview
In the following, minimap2 command line options are prefixed with a dash and
highlighted in bold.
1. Read **-I** [=*4G*] reference bases, extract (**-k**,**-w**)-minimizers and
index them in a hash table.
2. Read **-K** [=*200M*] query bases. For each query sequence, do steps 3
through 7:
3. For each (**-k**,**-w**)-minimizer on the query, check against the reference
index. If a reference minimizer is not among the top **-f** [=*2e-4*] most
frequent, collect its occurrences in the reference, which are called
*seeds*.
4. Sort seeds by position in the reference. Chain them with dynamic
programming. Each chain represents a potential mapping. For read
overlapping, report all chains and then go to step 8. For reference mapping,
do steps 5 through 7:
5. Let *P* be the set of primary mappings, which is an empty set initially. For
each chain from the best to the worst according to their chaining scores: if
on the query, the chain overlaps with a chain in *P* by **--mask-level**
[=*0.5*] or higher fraction of the shorter chain, mark the chain as
*secondary* to the chain in *P*; otherwise, add the chain to *P*.
6. Retain all primary mappings. Also retain up to **-N** [=*5*] top secondary
mappings if their chaining scores are higher than **-p** [=*0.8*] of their
corresponding primary mappings.
7. If alignment is requested, filter out an internal seed if it potentially
leads to both a long insertion and a long deletion. Extend from the
left-most seed. Perform global alignments between internal seeds. Split the
chain if the cumulative score along the global alignment drops by **-z**
[=*400*], disregarding long gaps. Extend from the right-most seed. Output
chains and their alignments.
8. If there are more query sequences in the input, go to step 2 until no more
queries are left.
9. If there are more reference sequences, reopen the query file from the start
and go to step 1; otherwise stop.
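
Steps 5 and 6 above can be sketched in a few lines of Python. This is only an
illustration of the selection rule as described, not minimap2's actual C code:
the `Chain` class, the `select_mappings` function and its parameter names are
hypothetical, and applying the **-N** limit per primary mapping (rather than
over all secondaries) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chain:
    """Hypothetical chain record: a query interval plus a chaining score."""
    qstart: int                        # start on the query (inclusive)
    qend: int                          # end on the query (exclusive)
    score: float                       # chaining score
    parent: Optional["Chain"] = None   # primary chain this one is secondary to


def query_overlap(a: Chain, b: Chain) -> int:
    """Number of query bases shared by two chains."""
    return max(0, min(a.qend, b.qend) - max(a.qstart, b.qstart))


def select_mappings(
    chains: List[Chain],
    mask_level: float = 0.5,   # --mask-level
    max_secondary: int = 5,    # -N (assumed to apply per primary)
    min_ratio: float = 0.8,    # -p
) -> Tuple[List[Chain], List[Chain]]:
    primaries: List[Chain] = []    # the set P from step 5
    secondaries: List[Chain] = []

    # Step 5: visit chains from the best to the worst chaining score.
    for c in sorted(chains, key=lambda x: x.score, reverse=True):
        for p in primaries:
            shorter = min(c.qend - c.qstart, p.qend - p.qstart)
            if shorter > 0 and query_overlap(c, p) >= mask_level * shorter:
                c.parent = p           # secondary to the chain in P
                secondaries.append(c)
                break
        else:
            primaries.append(c)        # no sufficient overlap: add to P

    # Step 6: keep the top secondaries that score well relative to their primary.
    kept: List[Chain] = []
    n_kept = {}
    for c in secondaries:              # already in descending score order
        key = id(c.parent)
        if c.score > min_ratio * c.parent.score and n_kept.get(key, 0) < max_secondary:
            kept.append(c)
            n_kept[key] = n_kept.get(key, 0) + 1
    return primaries, kept
```

For example, given chains on query intervals [0,100) with score 100 and
[10,90) with score 90, the second chain overlaps the first over its whole
length, so it becomes a secondary mapping; it is retained because 90 exceeds
0.8 times 100.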
## Limitations
* At the alignment phase, minimap2 performs global alignments between minimizer
hits. If the positions of these minimizer hits are incorrect, the final
alignment may be suboptimal or unnecessarily fragmented.
alignment may be suboptimal or unnecessarily fragmented. This should happen
rarely with the latest version.
* Minimap2 may produce poor alignments that may need post-filtering. We are
still exploring a reliable and consistent way to report good alignments.
@ -54,9 +99,9 @@ of minimap2.
* Minimap2 requires SSE2 instructions to compile. It is possible to add
non-SSE2 support, but it would make minimap2 slower by several times.
In general, minimap2 is a young project with most code written since June,
2017. It may have bugs and room for improvements. Bug reports and suggestions
are warmly welcomed.
In general, minimap2 is a young project with most code written since June, 2017.
It may have bugs and room for improvement. Bug reports and suggestions are
warmly welcome.


@ -1,4 +1,4 @@
.TH minimap2 1 "19 July 2017" "minimap2-2.0-r190-dirty" "Bioinformatics tools"
.TH minimap2 1 "27 July 2017" "minimap2-2.0-r213-dirty" "Bioinformatics tools"
.SH NAME
.PP
minimap2 - mapping and alignment between collections of DNA sequences
@ -137,10 +137,8 @@ number of minimizers [3]
.BI -m \ INT
Discard chains with chaining score
.RI < INT
[40]. Chaining score equals the approximate number of matching bases (exact if
not using
.BR -H )
minus base-2 logarithm gap penalty. It is computed with dynamic programming.
[40]. Chaining score equals the approximate number of matching bases minus a
linear gap penalty. It is computed with dynamic programming.
.TP
.B -X
Perform all-vs-all mapping. In this mode, if the query sequence name is