various minor improvements

2017-08-25 17:56:16 +08:00 · 2017-08-25 17:56:16 +08:00 · 2641613686
parent 5f96d851a8
commit 2641613686
1 changed files with 24 additions and 23 deletions
--- a/tex/minimap2.tex
+++ b/tex/minimap2.tex
@ -177,8 +177,8 @@ based on Eq.~(\ref{eq:ae86}) can achieve 16-way parallelization for short
 sequences, but only 4-way parallelization when the peak alignment score reaches
 32767. Long sequence alignment may exceed this threshold. Inspired by
 \citet{Wu:1996aa} and the following work, \citet{Suzuki:2016} proposed a
-difference-based formulation that lifted this limitation. In case of 2-piece
-gap cost, define
+difference-based formulation that lifted this limitation.
+In case of 2-piece gap cost, define
 \[
 \left\{\begin{array}{ll}
 u_{ij}\triangleq H_{ij}-H_{i-1,j} & v_{ij}\triangleq H_{ij}-H_{i,j-1} \\
@ -199,9 +199,10 @@ y_{ij}&=&\max\{0,y_{i,j-1}+u_{i,j-1}-z_{ij}+q\}-q-e\\
 \tilde{y}_{ij}&=&\max\{0,\tilde{y}_{i,j-1}+u_{i,j-1}-z_{ij}+\tilde{q}\}-\tilde{q}-\tilde{e}
 \end{array}\right.
 \end{equation}
-where $z_{ij}$ is a temporary variable that does not need to be stored.  An
-important property of Eq.~(\ref{eq:suzuki}) is that all values are bounded. To
-see that,
+where $z_{ij}$ is a temporary variable that does not need to be stored.
+
+An important property of Eq.~(\ref{eq:suzuki}) is that all values are bounded
+by scoring parameters. To see that,
 \[
 x_{ij}=E_{i+1,j}-H_{ij}=\max\{-q,E_{ij}-H_{ij}\}-e
 \]
@ -245,8 +246,8 @@ each other. This allows us to fully vectorize the computation of all cells on
 the same anti-diagonal in one inner loop. It also simplifies banded alignment,
 which would be difficult with striped vectorization~\citep{Farrar:2007hs}.

-On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the boundary
-condition of the equation above is
+On the condition that $q+e<\tilde{q}+\tilde{e}$ and $e>\tilde{e}$, the initial
+values in the diagonal-antidiagonal formuation is
 \[
 \left\{\begin{array}{l}
 x_{r-1,-1}=y_{r-1,r}=-q-e\\
@ -263,7 +264,7 @@ r\cdot(e-\tilde{e})-(\tilde{q}-q)-\tilde{e} & (r=\lceil\frac{\tilde{q}-q}{e-\til
 -\tilde{e} & (r>\lceil\frac{\tilde{q}-q}{e-\tilde{e}}-1\rceil)
 \end{array}\right.
 \]
-These can be derived from the initial conditions of Eq.~(\ref{eq:ae86}).
+These can be derived from the initial values for Eq.~(\ref{eq:ae86}).

 In practice, our 16-way vectorized implementation of global alignment is three
 times as fast as Parasail's 4-way vectorization~\citep{Daily:2016aa}.  Without
@ -285,9 +286,9 @@ $j'<j$, such that
 S(i',j')-S(i,j)>Z+e\cdot|(i-i')-(j-j')|
 \]
 where $e$ is the gap extension cost and $Z$ is an arbitrary threshold.
-This strategy is similar to X-drop employed in BLAST~\citep{Altschul:1997vn}.
-However, unlike X-drop, it would not break the alignment in the presence of a
-single long gap.
+This strategy is first used in BWA-MEM. It is similar to X-drop employed in
+BLAST~\citep{Altschul:1997vn}, but unlike X-drop, it would not break the
+alignment in the presence of a single long gap.

 When minimap2 breaks a global alignment between two anchors, it performs local
 alignment between the two subsequences involved in the global alignment, but
@ -326,7 +327,7 @@ F_{i,j+1}= \max\{H_{ij}-q,F_{ij}\}-e\\
 \end{array}\right.
 \end{equation}
 Let $T$ be the reference sequence. $d(i)$ is the cost of a non-canonical donor
-site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a postive number $p$
+site, which takes 0 if $T[i+1,i+2]={\tt GT}$, or a positive number $p$
 otherwise. Similarly, $a(i)$ is the cost of a non-canonical acceptor site, which
 takes 0 if $T[i-1,i]={\tt AG}$, or $p$ otherwise. Eq.~(\ref{eq:splice}) is
 almost equivalent to the equation used by EXALIN~\citep{Zhang:2006aa} except
@ -361,18 +362,18 @@ alignment.
 \centering
 \includegraphics[width=.5\textwidth]{roc-color.pdf}
 \caption{Evaluation on simulated SMRT reads aligned against human genome
-GRCh38. (a) ROC-like curve. Alignments are sorted by mapping quality in the
-descending order. For each mapping quality threshold, the fraction of
-alignments with mapping quality above the threshold and their error rate
-are plotted. (b) Accumulative mapping error rate as a function of mapping
-quality. 33,088 $\ge$1000bp reads were simulated using pbsim~\citep{Ono:2013aa}
+GRCh38.  33,088 $\ge$1000bp reads were simulated using pbsim~\citep{Ono:2013aa}
 with error profile sampled from file `m131017\_060208\_42213\_*.1.*' downloaded
 at \href{http://bit.ly/chm1p5c3}{http://bit.ly/chm1p5c3}. The N50 read length
 is 11,628. A read is considered correctly mapped if the true position overlaps
 with the best mapping position by 10\% of the read length. All aligners were
-run under the default setting for SMRT reads. Kart outputted all alignments at
-mapping quality 60, so is not shown in the figure. It mapped nearly all reads
-with 4.1\% of alignments being wrong, less accurate than others.}\label{fig:eval}
+run under the default setting for SMRT reads.  (a) ROC-like curve. Alignments
+are sorted by mapping quality in the descending order. For each mapping quality
+threshold, the fraction of alignments with mapping quality above the threshold
+and their error rate are plotted.  Kart outputted all alignments at mapping
+quality 60, so is not shown in the figure. It mapped nearly all reads with
+4.1\% of alignments being wrong, less accurate than others.  (b) Accumulative
+mapping error rate as a function of mapping quality.}\label{fig:eval}
 \end{figure}

 As a sanity check, we evaluated minimap2 on simulated human reads along with
@ -383,7 +384,7 @@ Kart~(v2.2.5; \citealp{Lin:2017aa}),
 minialign~(v0.5.3; \citealp{Suzuki:2016}) and
 NGMLR~(v0.2.5; \citealp{Sedlazeck169557}). We excluded rHAT~\citep{Liu:2016ab}
 and LAMSA~\citep{Liu:2017aa} because they either
-crashed or produced malformatted output. In this evaluation, Minimap2 has
+crashed or produced malformatted output. In this evaluation, minimap2 has
 higher power to distinguish unique and repetitive hits, and achieves overall
 higher mapping accuracy (Fig.~\ref{fig:eval}a). It is still the most accurate
 even if we skip DP-based alignment (data not shown), confirming chaining alone
@ -500,8 +501,8 @@ necessary to justify the use of minimap2 for such applications.
 We owe a debt of gratitude to Hajime Suzuki for releasing his masterpiece and
 insightful notes before formal publication. We thank M. Schatz, P. Rescheneder
 and F.  Sedlazeck for pointing out the limitation of BWA-MEM. We are also
-grateful to early minimap2 testers who have greatly helped to fix various
-issues.
+grateful to early minimap2 testers who have greatly helped to suggest features
+and to fix various issues.

 \bibliography{minimap2}