diff --git a/README.md b/README.md
index 982a4b9..a64021d 100644
--- a/README.md
+++ b/README.md
@@ -40,11 +40,56 @@ will run a little slower. At present, minimap2 does not work with non-x86 CPUs
 or ancient CPUs that do not support SSE2. SSE2 is critical to the performance
 of minimap2.
 
+## Algorithm Overview
+
+In the following, minimap2 command line options have a dash ahead and are
+highlighted in bold.
+
+1. Read **-I** [=*4G*] reference bases, extract (**-k**,**-w**)-minimizers and
+   index them in a hash table.
+
+2. Read **-K** [=*200M*] query bases. For each query sequence, do steps 3
+   through 7:
+
+3. For each (**-k**,**-w**)-minimizer on the query, check against the reference
+   index. If a reference minimizer is not among the top **-f** [=*2e-4*] most
+   frequent, collect its occurrences in the reference, which are called
+   *seeds*.
+
+4. Sort seeds by position in the reference. Chain them with dynamic
+   programming. Each chain represents a potential mapping. For read
+   overlapping, report all chains and then go to step 8. For reference mapping,
+   do steps 5 through 7:
+
+5. Let *P* be the set of primary mappings, which is an empty set initially. For
+   each chain from the best to the worst according to their chaining scores: if
+   on the query, the chain overlaps with a chain in *P* by **--mask-level**
+   [=*0.5*] or higher fraction of the shorter chain, mark the chain as
+   *secondary* to the chain in *P*; otherwise, add the chain to *P*.
+
+6. Retain all primary mappings. Also retain up to **-N** [=*5*] top secondary
+   mappings if their chaining scores are higher than **-p** [=*0.8*] of their
+   corresponding primary mappings.
+
+7. If alignment is requested, filter out an internal seed if it potentially
+   leads to both a long insertion and a long deletion. Extend from the
+   left-most seed. Perform global alignments between internal seeds. Split the
+   chain if the accumulative score along the global alignment drops by **-z**
+   [=*400*], disregarding long gaps. Extend from the right-most seed. Output
+   chains and their alignments.
+
+8. If there are more query sequences in the input, go to step 2 until no more
+   queries are left.
+
+9. If there are more reference sequences, reopen the query file from the start
+   and go to step 1; otherwise stop.
+
 ## Limitations
 
 * At the alignment phase, minimap2 performs global alignments between minimizer
   hits. If the positions of these minimizer hits are incorrect, the final
-  alignment may be suboptimal or unnecessarily fragmented.
+  alignment may be suboptimal or unnecessarily fragmented. This should happen
+  rarely with the latest version.
 
 * Minimap2 may produce poor alignments that may need post-filtering. We are
   still exploring a reliable and consistent way to report good alignments.
@@ -54,9 +99,9 @@ of minimap2.
 * Minimap2 requires SSE2 instructions to compile. It is possible to add
   non-SSE2 support, but it would make minimap2 slower by several times.
 
-In general, minimap2 is a young project with most code written since June,
-2017. It may have bugs and room for improvements. Bug reports and suggestions
-are warmly welcomed.
+In general, minimap2 is a young project with most code written since June, 2017.
+It may have bugs and room for improvements. Bug reports and suggestions are
+warmly welcomed.
diff --git a/minimap2.1 b/minimap2.1
index 8d76e49..ceb2c6c 100644
--- a/minimap2.1
+++ b/minimap2.1
@@ -1,4 +1,4 @@
-.TH minimap2 1 "19 July 2017" "minimap2-2.0-r190-dirty" "Bioinformatics tools"
+.TH minimap2 1 "27 July 2017" "minimap2-2.0-r213-dirty" "Bioinformatics tools"
 .SH NAME
 .PP
 minimap2 - mapping and alignment between collections of DNA sequences
@@ -137,10 +137,8 @@ number of minimizers [3]
 .BI -m \ INT
 Discard chains with chaining score
 .RI < INT
-[40]. Chaining score equals the approximate number of matching bases (exact if
-not using
-.BR -H )
-minus base-2 logarithm gap penalty. It is computed with dynamic programming.
+[40]. Chaining score equals the approximate number of matching bases minus a
+linear gap penalty. It is computed with dynamic programming.
 .TP
 .B -X
 Perform all-vs-all mapping. In this mode, if the query sequence name is