a new section on estimating sequence divergence

2017-12-24 12:57:38 -05:00 · 2017-12-24 12:57:38 -05:00 · 99879e9e75
parent c969d1a1ce
commit 99879e9e75
1 changed files with 28 additions and 1 deletions
--- a/tex/minimap2.tex
+++ b/tex/minimap2.tex
@ -157,6 +157,33 @@ add the chain to $Q$. In the end, $Q$ contains all the primary chains. We did
 not choose a more sophisticated data structure (e.g. range tree or k-d tree)
 because this step is not the performance bottleneck.

+\subsubsection{Estimating per-base sequence divergence}
+Suppose a query sequence harbors $n$ seeds of length $k$, $m$ of which are
+present in a chain. We want to estimate the sequence divergence $\epsilon$
+between the query and the reference sequences in the chain. This is useful
+when base-level alignment is too expensive to perform.
+
+If we model substitutions with a homogeneous Poisson process along the query
+sequence, the probablity of seeing $k$ consecutive bases without substitutions
+is $e^{-k\epsilon}$. On the assumption that all $k$-mers are independent of
+each other, the likelihood function of $\epsilon$ is
+\[
+\mathcal{L}(\epsilon|n,m,k)=e^{-m\cdot k\epsilon}(1-e^{-k\epsilon})^{n-m}
+\]
+The maximum likelihood estimate of $\epsilon$ is
+\[
+\hat{\epsilon}=\frac{1}{k}\log\frac{n}{m}
+\]
+In reality, sequencing errors are sometimes clustered and $k$-mers are not
+independent of each other, especially when we take minimizers as seeds. These
+violate the assumptions in the derivation above. As a result, $\hat{\epsilon}$
+is only approximate and can be biased. It also ignores long deletions from the
+reference sequence. In practice, fortunately, $\hat{\epsilon}$ is often close
+to and strongly correlated with the sequence divergence estimated from
+base-level alignments. On the several datasets used in
+Section~\ref{sec:long-genomic}, the Spearman correlation coefficient is around
+$0.9$.
+
 \subsection{Aligning genomic DNA}\label{sec:genomic}

 \subsubsection{Alignment with 2-piece affine gap cost}
@ -397,7 +424,7 @@ consistent paired-end alignments.

 \section{Results}

-\subsection{Aligning long genomic reads}
+\subsection{Aligning long genomic reads}\label{sec:long-genomic}

 \begin{figure}[!tb]
 \centering