2017-09-17 12:05:30 +08:00
|
|
|
==============================
|
|
|
|
|
Mappy: Minimap2 Python Binding
|
|
|
|
|
==============================
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
Mappy provides a convenient interface to `minimap2
|
|
|
|
|
<https://github.com/lh3/minimap2>`_, a fast and accurate C program to align
|
|
|
|
|
genomic and transcribe nucleotide sequences.
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:29:52 +08:00
|
|
|
Installation
|
|
|
|
|
------------
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
Mappy depends on `zlib <http://zlib.net>`_. It can be installed with `pip
|
|
|
|
|
<https://en.wikipedia.org/wiki/Pip_(package_manager)>`_:
|
2017-09-17 10:29:52 +08:00
|
|
|
|
|
|
|
|
.. code:: shell
|
|
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
pip install --user mappy
|
2017-09-17 10:29:52 +08:00
|
|
|
|
2017-09-18 08:08:47 +08:00
|
|
|
or from the minimap2 github repo (`Cython <http://cython.org>`_ required):
|
2017-09-17 10:29:52 +08:00
|
|
|
|
|
|
|
|
.. code:: shell
|
|
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
git clone https://github.com/lh3/minimap2
|
|
|
|
|
cd minimap2
|
|
|
|
|
python setup.py install
|
2017-09-17 10:29:52 +08:00
|
|
|
|
|
|
|
|
Usage
|
|
|
|
|
-----
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-18 08:08:47 +08:00
|
|
|
The following Python script demonstrates the key functionality of mappy:
|
2017-09-17 10:29:52 +08:00
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
2017-09-17 12:05:30 +08:00
|
|
|
import mappy as mp
|
|
|
|
|
a = mp.Aligner("test/MT-human.fa") # load or build index
|
2017-09-17 10:29:52 +08:00
|
|
|
if not a: raise Exception("ERROR: failed to load/build index")
|
2018-02-23 23:18:26 +08:00
|
|
|
s = a.seq("MT_human", 100, 200) # retrieve a subsequence from the index
|
|
|
|
|
print(mp.revcomp(s)) # reverse complement
|
2017-09-18 02:41:59 +08:00
|
|
|
for name, seq, qual in mp.fastx_read("test/MT-orang.fa"): # read a fasta/q sequence
|
|
|
|
|
for hit in a.map(seq): # traverse alignments
|
|
|
|
|
print("{}\t{}\t{}\t{}".format(hit.ctg, hit.r_st, hit.r_en, hit.cigar_str))
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:29:52 +08:00
|
|
|
APIs
|
|
|
|
|
----
|
|
|
|
|
|
2018-07-25 11:29:55 +08:00
|
|
|
Mappy implements two classes and two global function.
|
2017-09-18 02:41:59 +08:00
|
|
|
|
2017-09-17 12:05:30 +08:00
|
|
|
Class mappy.Aligner
|
2017-09-18 02:41:59 +08:00
|
|
|
~~~~~~~~~~~~~~~~~~~
|
2017-09-17 10:29:52 +08:00
|
|
|
|
|
|
|
|
.. code:: python
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2018-08-06 09:32:05 +08:00
|
|
|
mappy.Aligner(fn_idx_in=None, preset=None, ...)
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
This constructor accepts the following arguments:
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **fn_idx_in**: index or sequence file name. Minimap2 automatically tests the
|
2017-09-17 07:50:52 +08:00
|
|
|
file type. If a sequence file is provided, minimap2 builds an index. The
|
2018-08-06 09:32:05 +08:00
|
|
|
sequence file can be optionally gzip'd. This option has no effect if **seq**
|
|
|
|
|
is set.
|
|
|
|
|
|
|
|
|
|
* **seq**: a single sequence to index. The sequence name will be set to
|
|
|
|
|
:code:`N/A`.
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **preset**: minimap2 preset. Currently, minimap2 supports the following
|
|
|
|
|
presets: **sr** for single-end short reads; **map-pb** for PacBio
|
|
|
|
|
read-to-reference mapping; **map-ont** for Oxford Nanopore read mapping;
|
|
|
|
|
**splice** for long-read spliced alignment; **asm5** for assembly-to-assembly
|
|
|
|
|
alignment; **asm10** for full genome alignment of closely related species. Note
|
2017-09-17 07:50:52 +08:00
|
|
|
that the Python module does not support all-vs-all read overlapping.
|
|
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **k**: k-mer length, no larger than 28
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **w**: minimizer window size, no larger than 255
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **min_cnt**: mininum number of minimizers on a chain
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **min_chain_score**: minimum chaing score
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **bw**: chaining and alignment band width
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **best_n**: max number of alignments to return
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **n_threads**: number of indexing threads; 3 by default
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2018-08-06 09:32:05 +08:00
|
|
|
* **extra_flags**: additional flags defined in minimap.h
|
|
|
|
|
|
|
|
|
|
* **fn_idx_out**: name of file to which the index is written. This parameter
|
|
|
|
|
has no effect if **seq** is set.
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:29:52 +08:00
|
|
|
.. code:: python
|
|
|
|
|
|
2018-07-25 11:29:55 +08:00
|
|
|
mappy.Aligner.map(seq, seq2=None, cs=False, MD=False)
|
2017-09-17 10:29:52 +08:00
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
This method aligns :code:`seq` against the index. It is a generator, *yielding*
|
2018-02-01 00:33:08 +08:00
|
|
|
a series of :code:`mappy.Alignment` objects. If :code:`seq2` is present, mappy
|
|
|
|
|
performs paired-end alignment, assuming the two ends are in the FR orientation.
|
|
|
|
|
Alignments of the two ends can be distinguished by the :code:`read_num` field
|
2018-07-25 11:29:55 +08:00
|
|
|
(see Class mappy.Alignment below). Argument :code:`cs` asks mappy to generate
|
|
|
|
|
the :code:`cs` tag; :code:`MD` is similar. These two arguments might slightly
|
|
|
|
|
degrade performance and are not enabled by default.
|
2018-02-23 23:18:26 +08:00
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
|
|
mappy.Aligner.seq(name, start=0, end=0x7fffffff)
|
|
|
|
|
|
|
|
|
|
This method retrieves a (sub)sequence from the index and returns it as a Python
|
|
|
|
|
string. :code:`None` is returned if :code:`name` is not present in the index or
|
|
|
|
|
the start/end coordinates are invalid.
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 12:05:30 +08:00
|
|
|
Class mappy.Alignment
|
2017-09-18 02:41:59 +08:00
|
|
|
~~~~~~~~~~~~~~~~~~~~~
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
This class describes an alignment. An object of this class has the following
|
|
|
|
|
properties:
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **ctg**: name of the reference sequence the query is mapped to
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **ctg_len**: total length of the reference sequence
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **r_st** and **r_en**: start and end positions on the reference
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **q_st** and **q_en**: start and end positions on the query
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **strand**: +1 if on the forward strand; -1 if on the reverse strand
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **mapq**: mapping quality
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **blen**: length of the alignment, including both alignment matches and gaps
|
2017-10-16 23:15:07 +08:00
|
|
|
but excluding ambiguous bases.
|
|
|
|
|
|
|
|
|
|
* **mlen**: length of the matching bases in the alignment, excluding ambiguous
|
|
|
|
|
base matches.
|
|
|
|
|
|
|
|
|
|
* **NM**: number of mismatches, gaps and ambiguous poistions in the alignment
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **trans_strand**: transcript strand. +1 if on the forward strand; -1 if on the
|
2017-09-17 07:50:52 +08:00
|
|
|
reverse strand; 0 if unknown
|
|
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **is_primary**: if the alignment is primary (typically the best and the first
|
2017-09-17 07:50:52 +08:00
|
|
|
to generate)
|
|
|
|
|
|
2018-02-01 00:33:08 +08:00
|
|
|
* **read_num**: read number that the alignment corresponds to; 1 for the first
|
|
|
|
|
read and 2 for the second read
|
|
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **cigar_str**: CIGAR string
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:36:19 +08:00
|
|
|
* **cigar**: CIGAR returned as an array of shape :code:`(n_cigar,2)`. The two
|
|
|
|
|
numbers give the length and the operator of each CIGAR operation.
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2018-07-25 11:29:55 +08:00
|
|
|
* **MD**: the :code:`MD` tag as in the SAM format. It is an empty string unless
|
|
|
|
|
the :code:`MD` argument is applied when calling :code:`mappy.Aligner.map()`.
|
|
|
|
|
|
|
|
|
|
* **cs**: the :code:`cs` tag.
|
|
|
|
|
|
2017-09-18 02:41:59 +08:00
|
|
|
An :code:`Alignment` object can be converted to a string with :code:`str()` in
|
|
|
|
|
the following format:
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-09-17 10:29:52 +08:00
|
|
|
::
|
2017-09-17 07:50:52 +08:00
|
|
|
|
2017-10-16 23:15:07 +08:00
|
|
|
q_st q_en strand ctg ctg_len r_st r_en mlen blen mapq cg:Z:cigar_str
|
2017-09-18 02:41:59 +08:00
|
|
|
|
|
|
|
|
It is effectively the PAF format without the QueryName and QueryLength columns
|
|
|
|
|
(the first two columns in PAF).
|
|
|
|
|
|
2018-02-20 22:41:25 +08:00
|
|
|
Miscellaneous Functions
|
|
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~
|
2017-09-18 02:41:59 +08:00
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
2018-02-23 23:18:26 +08:00
|
|
|
mappy.fastx_read(fn, read_comment=False)
|
2017-09-18 02:41:59 +08:00
|
|
|
|
|
|
|
|
This generator function opens a FASTA/FASTQ file and *yields* a
|
|
|
|
|
:code:`(name,seq,qual)` tuple for each sequence entry. The input file may be
|
2018-02-23 23:18:26 +08:00
|
|
|
optionally gzip'd. If :code:`read_comment` is True, this generator yields
|
|
|
|
|
a :code:`(name,seq,qual,comment)` tuple instead.
|
2018-02-20 22:41:25 +08:00
|
|
|
|
|
|
|
|
.. code:: python
|
|
|
|
|
|
|
|
|
|
mappy.revcomp(seq)
|
|
|
|
|
|
|
|
|
|
Return the reverse complement of DNA string :code:`seq`. This function
|
|
|
|
|
recognizes IUB code and preserves the letter cases. Uracil :code:`U` is
|
|
|
|
|
complemented to :code:`A`.
|