===============================
Mmappy: Minimap2 Python Binding
===============================

`Minimap2 <https://github.com/lh3/minimap2>`_ is a fast and accurate pairwise
aligner for genomic and transcribed nucleotide sequences. This module wraps
minimap2 and provides a convenient interface for calling minimap2 from Python.

Installation
------------

The mmappy module can be installed directly with:

.. code:: shell

    git clone https://github.com/lh3/minimap2
    cd minimap2
    python setup.py install

or with `pip <https://en.wikipedia.org/wiki/Pip_(package_manager)>`_:

.. code:: shell

    pip install --user mmappy

Usage
-----

The following Python program shows the key functionality of this module:

.. code:: python

    import mmappy as mm

    a = mm.Aligner("test/MT-human.fa")  # load or build index
    if not a:
        raise Exception("ERROR: failed to load/build index")
    for hit in a.map("GGTTAAATACAGACCAAGAGCCTTCAAAGCCCTCAGTAAGTTGCAATACTTAATTTCTGT"):
        print("{}\t{}\t{}\t{}".format(hit.ctg, hit.r_st, hit.r_en, hit.cigar_str))

It builds an index from the specified sequence file (or loads an index if a
pre-built index is specified), aligns a sequence against it, then iterates over
the hits and prints each one.

APIs
----

Class mmappy.Aligner
~~~~~~~~~~~~~~~~~~~~

.. code:: python

    Aligner(fn_idx_in, preset=None, ...)

Arguments:

* **fn_idx_in**: index or sequence file name. Minimap2 automatically tests the
  file type. If a sequence file is provided, minimap2 builds an index. The
  sequence file can optionally be gzip-compressed.

* **preset**: minimap2 preset. Currently, minimap2 supports the following
  presets: **sr** for single-end short reads; **map-pb** for PacBio
  read-to-reference mapping; **map-ont** for Oxford Nanopore read mapping;
  **splice** for long-read spliced alignment; **asm5** for assembly-to-assembly
  alignment; **asm10** for full genome alignment of closely related species. Note
  that the Python module does not support all-vs-all read overlapping.

* **k**: k-mer length, no larger than 28

* **w**: minimizer window size, no larger than 255

* **min_cnt**: minimum number of minimizers on a chain

* **min_chain_score**: minimum chaining score

* **bw**: chaining and alignment band width

* **best_n**: max number of alignments to return

* **n_threads**: number of indexing threads; 3 by default

* **fn_idx_out**: name of file to which the index is written

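
The documented limits on **k** and **w** can be checked before an index is
built. The helper below is a minimal sketch and not part of mmappy:
:code:`aligner_kwargs` and :code:`KNOWN_PRESETS` are hypothetical names, and the
preset list is simply copied from the description above.

.. code:: python

    # Hypothetical helper; not part of mmappy.
    KNOWN_PRESETS = ("sr", "map-pb", "map-ont", "splice", "asm5", "asm10")


    def aligner_kwargs(preset=None, k=None, w=None, **other):
        """Validate the documented limits and return keyword arguments."""
        if preset is not None and preset not in KNOWN_PRESETS:
            raise ValueError("unknown preset: {!r}".format(preset))
        if k is not None and k > 28:
            raise ValueError("k must be no larger than 28")
        if w is not None and w > 255:
            raise ValueError("w must be no larger than 255")
        kwargs = dict(other, preset=preset, k=k, w=w)
        # Drop unset options so that Aligner falls back to its own defaults.
        return {name: value for name, value in kwargs.items() if value is not None}

The returned dictionary could then be unpacked into the constructor, for
example :code:`mm.Aligner("test/MT-human.fa", **aligner_kwargs(preset="map-ont"))`.
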

.. code:: python

    map(seq)

This method maps :code:`seq` against the index. It returns a generator that
yields a series of :code:`Alignment` objects.

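
A Python generator can be consumed only once, so hits that are needed again
have to be collected first. The function below is a small sketch of that
pattern; :code:`primary_hits` and its :code:`min_mapq` argument are hypothetical
and not part of mmappy. It only relies on the :code:`map` method and on the
:code:`is_primary` and :code:`mapq` properties of each hit.

.. code:: python

    def primary_hits(aligner, seq, min_mapq=0):
        """Return the primary alignments of seq as a list."""
        # Collect the generator into a list so the hits can be reused.
        return [hit for hit in aligner.map(seq)
                if hit.is_primary and hit.mapq >= min_mapq]

Any object with a compatible :code:`map` method works here, which also makes
the function easy to exercise without building an index.
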

Class mmappy.Alignment
~~~~~~~~~~~~~~~~~~~~~~

This class has the following properties:

* **ctg**: name of the reference sequence the query is mapped to

* **ctg_len**: total length of the reference sequence

* **r_st** and **r_en**: start and end positions on the reference

* **q_st** and **q_en**: start and end positions on the query

* **strand**: +1 if on the forward strand; -1 if on the reverse strand

* **mapq**: mapping quality

* **NM**: number of mismatches and gaps in the alignment

* **blen**: length of the alignment, including both alignment matches and gaps

* **trans_strand**: transcript strand. +1 if on the forward strand; -1 if on the
  reverse strand; 0 if unknown

* **is_primary**: whether the alignment is primary (typically the best
  alignment and the first one generated)

* **cigar_str**: CIGAR string

* **cigar**: CIGAR returned as an array of shape :code:`(n_cigar,2)`. The two
  numbers give the length and the operator of each CIGAR operation.

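
The relationship between :code:`cigar` and :code:`cigar_str` can be sketched as
follows. This is an illustration only: it assumes that operator codes index
into the string :code:`MIDNSH` (the order used by the SAM/BAM format), which
should be checked against the installed version. :code:`cigar_to_str` and
:code:`ref_span` are hypothetical helpers, not part of mmappy.

.. code:: python

    CIGAR_OPS = "MIDNSH"  # assumed operator order, not confirmed by this README


    def cigar_to_str(cigar):
        """Format (length, operator) pairs as a CIGAR string."""
        return "".join("{}{}".format(length, CIGAR_OPS[op])
                       for length, op in cigar)


    def ref_span(cigar):
        """Count reference bases covered; M, D and N consume the reference."""
        return sum(length for length, op in cigar if CIGAR_OPS[op] in "MDN")

Under that assumption, :code:`ref_span(hit.cigar)` would be expected to equal
:code:`hit.r_en - hit.r_st` for a hit that carries a CIGAR.
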

An :code:`Alignment` object can be converted to a string in the following format:

::

    q_st q_en strand ctg ctg_len r_st r_en blen-NM blen mapq cg:Z:cigar_str
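
A line in this format can be split back into named fields. The sketch below
assumes whitespace-separated columns in the order shown and keeps the strand
column as raw text; :code:`HitRecord`, :code:`parse_hit` and :code:`identity`
are hypothetical names, not part of mmappy.

.. code:: python

    from collections import namedtuple

    # Hypothetical record type; "matches" holds the blen-NM column.
    HitRecord = namedtuple(
        "HitRecord",
        "q_st q_en strand ctg ctg_len r_st r_en matches blen mapq cigar_str")


    def parse_hit(line):
        """Parse one formatted alignment line into a HitRecord."""
        f = line.split()  # assumes whitespace-separated columns
        cigar_str = f[10][5:] if f[10].startswith("cg:Z:") else f[10]
        return HitRecord(int(f[0]), int(f[1]), f[2], f[3], int(f[4]),
                         int(f[5]), int(f[6]), int(f[7]), int(f[8]),
                         int(f[9]), cigar_str)


    def identity(rec):
        """Fraction of alignment columns that match: (blen - NM) / blen."""
        return rec.matches / float(rec.blen)

Since **NM** counts mismatches and gaps, the eighth column divided by **blen**
gives a simple per-alignment identity estimate.
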