new section on HPC k-mers

Heng Li 2017-12-24 19:11:23 -05:00
parent 3a375d3436
commit f159e1c2d3
2 changed files with 28 additions and 0 deletions


@ -313,3 +313,11 @@
note = {doi:10.1101/223297},
journal = {bioRxiv}
}
@article{Berlin:2015xy,
Author = {Berlin, Konstantin and others},
Journal = {Nat Biotechnol},
Pages = {623--630},
Title = {Assembling large genomes with single-molecule sequencing and locality-sensitive hashing},
Volume = {33},
Year = {2015}}


@ -184,6 +184,26 @@ base-level alignments. On the several datasets used in
Section~\ref{sec:long-genomic}, the Spearman correlation coefficient is around
$0.9$.
\subsubsection{Indexing with homopolymer-compressed $k$-mers}
SmartDenovo
(\href{https://github.com/ruanjue/smartdenovo}{https://github.com/ruanjue/smartdenovo};
J Ruan, personal communication) indexes reads with homopolymer-compressed (HPC)
$k$-mers and finds that this strategy improves overlap sensitivity for SMRT reads.
Minimap2 adopts the same heuristic.
The HPC string of a string $s$, denoted by ${\rm HPC}(s)$, is constructed by
contracting homopolymers in $s$ to a single base. An HPC $k$-mer of $s$ is a
$k$-long substring of ${\rm HPC}(s)$. For example, if $s={\tt GGATTTTCCA}$, then
${\rm HPC}(s)={\tt GATCA}$ and the first HPC 4-mer is ${\tt GATC}$.
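The definition above can be illustrated with a minimal Python sketch. It is
not the minimap2 implementation; the function names {\tt hpc} and
{\tt hpc\_kmers} are hypothetical and chosen only for this example.

```python
from itertools import groupby


def hpc(s: str) -> str:
    """Contract every homopolymer run in s to a single base."""
    # groupby yields one (base, run) pair per maximal run of identical bases.
    return "".join(base for base, _ in groupby(s))


def hpc_kmers(s: str, k: int) -> list[str]:
    """Return all k-long substrings of HPC(s), from left to right."""
    c = hpc(s)
    return [c[i:i + k] for i in range(len(c) - k + 1)]


if __name__ == "__main__":
    s = "GGATTTTCCA"
    print(hpc(s))           # GATCA
    print(hpc_kmers(s, 4))  # ['GATC', 'ATCA']
```

Because runs are contracted before $k$-mers are extracted, two reads that
differ only in homopolymer lengths yield identical HPC $k$-mers.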
To demonstrate the effectiveness of HPC $k$-mers, we performed read overlapping
for the example {\it E. coli} SMRT reads from PBcR~\citep{Berlin:2015xy}, using
different types of $k$-mers. With ordinary (uncompressed) 15bp minimizers
selected in 5bp windows, minimap2 finds 90.9\% of $\ge$2kb overlaps inferred
from the read-to-reference alignment. With HPC 19-mers, minimap2 finds 97.4\%
of these overlaps. It achieves this higher sensitivity while indexing one third
fewer minimizers, which also improves performance. However, HPC-based indexing
reduces sensitivity for ONT reads.
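To make the combination of HPC and minimizers concrete, the following Python
sketch selects minimizers from the HPC string of a read. It rests on several
simplifying assumptions that are not taken from minimap2: $k$-mers are ordered
lexicographically rather than by a hash, only one strand is considered, the
window size {\tt w} counts consecutive $k$-mers, and reported positions refer
to the compressed string. The names {\tt hpc} and {\tt minimizers} are
hypothetical.

```python
from itertools import groupby


def hpc(s):
    """Contract every homopolymer run in s to a single base."""
    return "".join(base for base, _ in groupby(s))


def minimizers(s, k, w):
    """Return sorted (position, k-mer) minimizers of s.

    For each window of w consecutive k-mers, keep the lexicographically
    smallest k-mer, breaking ties by the leftmost position.
    """
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


if __name__ == "__main__":
    read = "GGATTTTCCAGGGTTAC"
    compressed = hpc(read)                   # GATCAGTAC
    print(minimizers(compressed, k=4, w=3))  # [(1, 'ATCA'), (4, 'AGTA')]
```

The compressed string is never longer than the read, so it contains at most as
many $k$-mers from which minimizers are drawn.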
\subsection{Aligning genomic DNA}\label{sec:genomic}
\subsubsection{Alignment with 2-piece affine gap cost}