updated manuscript

2021-10-01 11:42:36 -04:00 · 2021-10-01 11:42:36 -04:00 · 7ee62dae1d
parent 05a8a45d44
commit 7ee62dae1d
3 changed files with 43 additions and 20 deletions
--- a/misc/paftools.js
+++ b/misc/paftools.js
@ -2951,7 +2951,7 @@ function paf_pafcmp(args)
 {
 	var c, opt = { min_len:5000, min_mapq:10, min_ovlp:0.5 };
 	while ((c = getopt(args, "q:")) != null) {
-		if (c == 'q') opt.min_mapq = parseInt(opt.arg);
+		if (c == 'q') opt.min_mapq = parseInt(getopt.arg);
 	}

 	var buf = new Bytes();
--- a/tex/minimap2.bib
+++ b/tex/minimap2.bib
@ -448,3 +448,13 @@
 	Title = {A synthetic-diploid benchmark for accurate variant-calling evaluation},
 	Volume = {15},
 	Year = {2018}}
+
+@article{Gu:1995wt,
+	author = {Gu, X and Li, W H},
+	journal = {J Mol Evol},
+	month = {Apr},
+	number = {4},
+	pages = {464-73},
+	title = {The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment},
+	volume = {40},
+	year = {1995}}
--- a/tex/mm2-update.tex
+++ b/tex/mm2-update.tex
@ -57,7 +57,7 @@ in v2.19 through v2.22 to improve mapping results.
 \begin{methods}
 \section{Methods}

-\subsection{Rescuing high-occurrence $k$-mers}
+\subsection{Rescuing high-occurrence $k$-mers}\label{sec:high-occ}
 Minimap2 keeps all $k$-mer minimizers~\citep{Roberts:2004fv} during indexing. Its original
 implementation only selected low-occurrence minimizers during mapping. The
 cutoff is a few hundred for mapping long reads against a human genome. If a
@ -66,9 +66,10 @@ chaining due to insufficient anchors.

 To resolve this issue, we implemented a new heuristic to add additional
 minimizers. Suppose we are looking at two adjacent low-occurence $k$-mers
-located at position $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge500$,
-minimap2 v2.22 additionally selects $\lfloor|x_1-x_2|/500\rfloor$ minimizers
-of the lowest occurrence among minimizers between $x_1$ and $x_2$.
+located at position $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge L$,
+minimap2 v2.22 additionally selects $\lfloor|x_1-x_2|/L\rfloor$ minimizers
+of the lowest occurrence among minimizers between $x_1$ and $x_2$. Here
+parameter $L$ controls the frequency of sampling. It defaults to 500.
 This strategy adds necessary anchors at the cost of increasing total alignment
 time by a few percent on real data.

@ -106,7 +107,7 @@ the best scoring alignment is sometimes not the correct alignment.
 \citet{Jain2020.11.01.363887} resolved this dilemma by altering the mapping
 algorithm.

-In our view, this problem is rooted in impropriate scoring: affine-gap penalty
+In our view, this problem is rooted in inapropriate scoring: affine-gap penalty
 over-penalizes a long INDEL that was often evolutionarily created in one event.
 We should not penalize a SV by a function linear in the SV length. Minimap2 v2.22 instead rescores
 an alignment with the following scoring function. Suppose an alignment consists
@ -122,7 +123,7 @@ $$
 It approximates per-base sequence divergence except with the smallest value set
 to 2\%. As an analogy to affine-gap scoring, the matching score in our scheme
 is 1, the mismatch and gap open penalties are both $1/2d$ and the gap extension
-penalty is a logarithm function of the gap length. Our scoring gives a long SV
+penalty is a logarithm function of the gap length~\citep{Gu:1995wt}. Our scoring gives a long SV
 a much milder penalty. In terms of time complexity, scoring an alignment is
 linear in the length of the alignment. The time spent on rescoring is negligible in
 practice.
@ -144,13 +145,15 @@ practice.
 \toprule
 $[$Benchmark$]$ Metric & v2.22 & v2.18 & Winno & lra \\
 \midrule
-$[$sim-map$]$ \% mapped reads at Q10      & 97.9 & 97.6 & {\bf 99.0} & 97.3 \\
-$[$sim-map$]$ err. rate at Q10 (phredQ)   & {\bf 52}   & {\bf 52}   & 38   & 24 \\
-$[$winno-cmp$]$  rate of diff. (phredQ)   & {\bf 41}   & 37   & N/A  & 18 \\
-$[$sim-sv$]$  \% false negative rate      & {\bf 0.5}  & 2.0  & {\bf 0.5}  & 1.4  \\
-$[$sim-sv$]$  \% false discovery rate     & {\bf 0.0}  & 0.1  & {\bf 0.0}  & 0.1  \\
-$[$real-sv-1k$]$ \% false negative rate   & {\bf 7.3}  & 20.0 & 13.0 & N/A \\
-$[$real-sv-1k$]$ \% false discovery rate  & 2.7  & {\bf 2.4}  & 2.7  & N/A \\
+$[$sim-map$]$ \% mapped reads at Q10      & 97.9       & 97.6      & {\bf 99.0}& 97.3 \\
+$[$sim-map$]$ err. rate at Q10 (phredQ)   & {\bf 52}   & {\bf 52}  & 38        & 24 \\
+$[$winno-cmp$]$ rate of diff. (phredQ)    & {\bf 41}   & 37        & truth     & 18 \\
+$[$winno-cmp$]$ CPU time (hour)           & {\bf 5.0}  & 5.3       & 71.8      & 13.1 \\
+$[$winno-cmp$]$ peak RAM (Gb)             & 17.1       & 14.4      & {\bf 9.6} & 12.4 \\
+$[$sim-sv$]$  \% false negative rate      & {\bf 0.5}  & 2.0       & {\bf 0.5} & 1.4  \\
+$[$sim-sv$]$  \% false discovery rate     & {\bf 0.0}  & 0.1       & {\bf 0.0} & 0.1  \\
+$[$real-sv-1k$]$ \% false negative rate   & {\bf 7.3}  & 20.0      & 13.0      & N/A \\
+$[$real-sv-1k$]$ \% false discovery rate  & 2.7        & {\bf 2.4} & 2.7       & N/A \\
 \botrule
 \end{tabular}}
 {In $[$sim-map$]$, 152,713 reads were simulated from the CHM13 telomere-to-telomere assembly v1.1
@ -172,7 +175,8 @@ GCA\_018852615.1) and compared to the GIAB truth~\citep{Zook:2020aa} using ``tru
 \end{table}

 We evaluated minimap2 v2.22 along with v2.18, Winnowmap2 v2.03 and lra v1.3.2
-(Table~\ref{tab:1}). Both versions of minimap2 achieved high mapping accuracy on
+(Table~\ref{tab:1}), using the default setting of each mapper according to the input data types.
+Both versions of minimap2 achieved high mapping accuracy on
 simulated Nanopore reads (sim-map). Winnowmap2 aligned more reads at mapping
 quality 10 or higher (mapQ10). However, it may occasionally assign a high mapping
 quality to a read with multiple identical best alignments. This reduced its
@ -184,9 +188,19 @@ alignments by Winnowmap2, minimap2 v2.22 could map all of them. 118 reads, less
 than 0.01\% of all reads, were mapped differently by v2.22. 51 of them have
 multiple identical best alignments. We believe these are more likely to be
 Winnowmap2 errors. Most of the remaining 67 (=118-51) reads have multiple
-highly similar but not identical alignments. We are not sure how many are real
-mapping errors. Minimap2 v2.18 is less consistent with Winnowmap2. Most of the
-differences are located in highly repetitive regions.
+highly similar but not identical alignments.
+Minimap2 v2.18 is less consistent with 275 differences including 30 unmapped
+reads mappable by both Winnowmap2 and v2.22.
+
+For the minimizer rescuing parameter $L$ in Section~\ref{sec:high-occ},
+we set its default to 500 such that v2.22 has comparable performance to v2.18 given simulated PacBio and Nanopore human reads.
+To see the effect of this parameter on real data, we tried several different $L$ values.
+v2.22 gave 99 mapping differences at $L=200$,
+118 at $L=500$ (default), 167 at $L=750$ and 224 differences at $L=1000$ in comparison to Winnowmap2.
+$L=200$ is 28\% slower than the default while $L=1000$ is 9\% faster.
+Changing the default minimizer window size (option ``-w'')
+and the initial minimizer occurrence cutoff (option ``-f'')
+also affects performance and accuracy to a similar magnitude.

 The two benchmarks above only evaluate read mappings when there are no variations between the reads and the reference.
 To measure the mapping accuracy in the presence of SVs (sim-sv), we reproduced
@ -206,8 +220,7 @@ remains a question.
 To see if minimap2 v2.22 could improve long INDEL alignment, we ran dipcall on
 contig-to-reference alignments and focused on INDELs longer than 1kb
 (real-sv-1k). v2.22 is more sensitive at comparable specificity, confirming its
-advantage in more contiguous alignment. lra is supposed to handle long INDELs
-well, too. However, we could not get dipcall to work well with lra,
+advantage in more contiguous alignment. We could not get dipcall to work well with lra,
 so did not report the numbers.

 Minimap2 spends most computing time on base alignment. As recent improvements