new section on HPC k-mers

2017-12-24 19:11:23 -05:00 · 2017-12-24 19:11:23 -05:00 · f159e1c2d3
parent 3a375d3436
commit f159e1c2d3
2 changed files with 28 additions and 0 deletions
--- a/tex/minimap2.bib
+++ b/tex/minimap2.bib
@@ -313,3 +313,11 @@
 	note = {doi:10.1101/223297},
 	journal = {bioRxiv}
 }
+
+@article{Berlin:2015xy,
+	author = {Berlin, Konstantin and others},
+	journal = {Nat Biotechnol},
+	pages = {623--630},
+	title = {Assembling large genomes with single-molecule sequencing and locality-sensitive hashing},
+	volume = {33},
+	year = {2015}}
--- a/tex/minimap2.tex
+++ b/tex/minimap2.tex
@@ -184,6 +184,26 @@ base-level alignments. On the several datasets used in
 Section~\ref{sec:long-genomic}, the Spearman correlation coefficient is around
 $0.9$.

+\subsubsection{Indexing with homopolymer compressed $k$-mers}
+SmartDenovo
+(\href{https://github.com/ruanjue/smartdenovo}{https://github.com/ruanjue/smartdenovo};
+J Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
+$k$-mers and finds the strategy improves overlap sensitivity for SMRT reads.
+Minimap2 adopts the same heuristic.
+
+The HPC string of a string $s$, denoted by ${\rm HPC}(s)$, is constructed by
+contracting homopolymers in $s$ to a single base.  An HPC $k$-mer of $s$ is a
+$k$-long substring of ${\rm HPC}(s)$. For example, if $s={\tt GGATTTTCCA}$, then
+${\rm HPC}(s)={\tt GATCA}$ and the first HPC 4-mer is ${\tt GATC}$.
+
+To demonstrate the effectiveness of HPC $k$-mers, we performed read overlapping
+for the example {\it E. coli} SMRT reads from PBcR~\citep{Berlin:2015xy}, using
+different types of $k$-mers. With normal 15bp minimizers per 5bp window,
+minimap2 finds 90.9\% of $\ge$2kb overlaps inferred from the read-to-reference
+alignment. With HPC 19-mers, minimap2 finds 97.4\% of overlaps. It achieves this
+higher sensitivity while indexing 1/3 fewer minimizers, which further helps
+performance. However, HPC-based indexing reduces sensitivity for ONT reads.
+
 \subsection{Aligning genomic DNA}\label{sec:genomic}

 \subsubsection{Alignment with 2-piece affine gap cost}
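
The HPC definition added in this commit can be illustrated with a short Python sketch. It is only an illustration of the definition in the text, not minimap2's implementation (minimap2 itself is written in C); the function names hpc and hpc_kmers are hypothetical and exist only for this example.

```python
from itertools import groupby


def hpc(s: str) -> str:
    """Collapse each run of identical bases in s to a single base."""
    # groupby yields one (base, run) pair per maximal run of equal characters
    return "".join(base for base, _run in groupby(s))


def hpc_kmers(s: str, k: int) -> list[str]:
    """Return every k-long substring of HPC(s), from left to right."""
    c = hpc(s)
    # The range is empty when HPC(s) is shorter than k, giving an empty list
    return [c[i:i + k] for i in range(len(c) - k + 1)]


# The example string from the new subsubsection
s = "GGATTTTCCA"
print(hpc(s))           # GATCA
print(hpc_kmers(s, 4))  # ['GATC', 'ATCA']
```

Because homopolymer runs are collapsed before k-mers are taken, an HPC k-mer generally covers more than k bases of the original read, and two reads that differ only in homopolymer run lengths yield identical HPC k-mers.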