From f159e1c2d35686e95d40f9153540f45743069a76 Mon Sep 17 00:00:00 2001
From: Heng Li
Date: Sun, 24 Dec 2017 19:11:23 -0500
Subject: [PATCH] new section on HPC k-mers

---
 tex/minimap2.bib |  8 ++++++++
 tex/minimap2.tex | 20 ++++++++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/tex/minimap2.bib b/tex/minimap2.bib
index 5e2a0f1..49b4c13 100644
--- a/tex/minimap2.bib
+++ b/tex/minimap2.bib
@@ -313,3 +313,11 @@
 	note = {doi:10.1101/223297},
 	journal = {bioRxiv}
 }
+
+@article{Berlin:2015xy,
+	Author = {Berlin, Konstantin and others},
+	Journal = {Nat Biotechnol},
+	Pages = {623-30},
+	Title = {Assembling large genomes with single-molecule sequencing and locality-sensitive hashing},
+	Volume = {33},
+	Year = {2015}}
diff --git a/tex/minimap2.tex b/tex/minimap2.tex
index eac1886..689a328 100644
--- a/tex/minimap2.tex
+++ b/tex/minimap2.tex
@@ -184,6 +184,26 @@
 base-level alignments. On the several datasets used in
 Section~\ref{sec:long-genomic}, the Spearman correlation coefficient is around
 $0.9$.
 
+\subsubsection{Indexing with homopolymer-compressed $k$-mers}
+SmartDenovo
+(\href{https://github.com/ruanjue/smartdenovo}{https://github.com/ruanjue/smartdenovo};
+J Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
+$k$-mers and finds that the strategy improves overlap sensitivity for SMRT reads.
+Minimap2 adopts the same heuristic.
+
+The HPC string of a string $s$, denoted by ${\rm HPC}(s)$, is constructed by
+contracting each homopolymer in $s$ to a single base. An HPC $k$-mer of $s$ is a
+$k$-long substring of ${\rm HPC}(s)$. For example, if $s={\tt GGATTTTCCA}$, then
+${\rm HPC}(s)={\tt GATCA}$ and the first HPC 4-mer is ${\tt GATC}$.
+
+To demonstrate the effectiveness of HPC $k$-mers, we performed read overlapping
+for the example {\it E. coli} SMRT reads from PBcR~\citep{Berlin:2015xy}, using
+different types of $k$-mers. With normal 15bp minimizers per 5bp window,
+minimap2 finds 90.9\% of $\ge$2kb overlaps inferred from the read-to-reference
+alignment. With HPC 19-mers, minimap2 finds 97.4\% of overlaps. It achieves this
+higher sensitivity while indexing 1/3 fewer minimizers, which also improves
+performance. However, HPC-based indexing reduces sensitivity for ONT reads.
+
 \subsection{Aligning genomic DNA}\label{sec:genomic}
 
 \subsubsection{Alignment with 2-piece affine gap cost}
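
The HPC definition in the new subsection can be illustrated with a minimal Python sketch. This is not minimap2's implementation, which is written in C and may compute HPC $k$-mers differently; the function names `hpc` and `hpc_kmers` are hypothetical and exist only for this illustration. The sketch follows the definition given in the text and reproduces its example string.

```python
from itertools import groupby


def hpc(s: str) -> str:
    """Collapse every run of identical bases in s to a single base."""
    # groupby yields one (base, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(s))


def hpc_kmers(s: str, k: int) -> list[str]:
    """Return all k-long substrings of HPC(s), from left to right.

    Hypothetical helper for illustration only; returns an empty list
    when HPC(s) is shorter than k.
    """
    compressed = hpc(s)
    return [compressed[i:i + k] for i in range(len(compressed) - k + 1)]


if __name__ == "__main__":
    s = "GGATTTTCCA"
    print(hpc(s))           # GATCA
    print(hpc_kmers(s, 4))  # ['GATC', 'ATCA']
```

Because the sketch compresses the whole string first, an HPC $k$-mer generally spans more than $k$ bases of the original read, which is consistent with the text pairing HPC 19-mers against ordinary 15-mers.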