finished results

2021-08-07 00:06:28 -04:00 · 2021-08-07 00:06:28 -04:00 · e37f5ffe39
parent 8a1d52bcbe
commit e37f5ffe39
2 changed files with 84 additions and 20 deletions
--- a/tex/minimap2.bib
+++ b/tex/minimap2.bib
@@ -431,3 +431,21 @@
 	Title = {A robust benchmark for detection of germline large deletions and insertions},
 	Volume = {38},
 	Year = {2020}}
@article{Harpak:2017aa,
 	Author = {Harpak, Arbel and others},
 	Journal = {Proc Natl Acad Sci U S A},
 	Pages = {12779-12784},
 	Title = {Frequent nonallelic gene conversion on the human lineage and its effect on the divergence of gene duplicates},
 	Volume = {114},
 	Year = {2017}}
@article{Li:2018aa,
 	Author = {Li, Heng and others},
 	Journal = {Nat Methods},
 	Month = {Aug},
 	Number = {8},
 	Pages = {595-597},
 	Title = {A synthetic-diploid benchmark for accurate variant-calling evaluation},
 	Volume = {15},
 	Year = {2018}}
--- a/tex/mm2-update.tex
+++ b/tex/mm2-update.tex
@@ -41,18 +41,18 @@ minimap2 v2.18 or earlier.
 \end{abstract}
 \section{Introduction}
-Minimap2~\citep{Li:2018ab} is a widely used aligner for maping long sequence
+Minimap2~\citep{Li:2018ab} is widely used for mapping long sequence
-reads and assembly contigs. \citet{Jain:2020aa} found minimap2 occasionally
+reads and assembly contigs. \citet{Jain:2020aa} found minimap2 v2.18 or earlier occasionally
 misaligned reads from highly repetitive regions as minimap2 ignored seeds of
 high occurrence. They also noticed minimap2 may misplace reads with structural
 variations (SVs) in such regions~\citep{Jain2020.11.01.363887}. These
 misalignments have become a pressing issue with the advent of
-temolere-to-telomore human assembly~\citep{Miga:2020aa}. Meanwhile, minimap2
+telomere-to-telomere human assembly~\citep{Miga:2020aa}. Meanwhile, old minimap2
 was unable to efficiently align long insertions/deletions (INDELs) and often
 broke an alignment around variable-number tandem repeats (VNTRs). This has
 inspired new chaining algorithms~\citep{Li:2020aa,Ren:2021aa} which are not
-integrated into minimap2. Here we will describe recent improvements implemented
+integrated into minimap2. Here we will describe recent efforts implemented
-in v2.19 through v2.22.
+in v2.19 through v2.22 to improve mapping results.
 \begin{methods}
 \section{Methods}
@@ -66,9 +66,10 @@ chaining due to insufficient anchors.
 To resolve this issue, we implemented a new heuristic to add additional
 minimizers. Suppose we are looking at two adjacent low-occurrence $k$-mers
-located at position $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge500$, the
+located at positions $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge500$,
-new minimap2 adds $\lfloor|x_1-x_2|/500\rfloor$ minimizers among
+minimap2 v2.22 additionally selects $\lfloor|x_1-x_2|/500\rfloor$ minimizers
-high-occurrence minimizers between $x_1$ and $x_2$. We use a binary heap data
+of the lowest occurrence among minimizers between $x_1$ and $x_2$.
 We use a binary heap data
 structure to select minimizers of the lowest occurrence in this interval.
 This strategy adds necessary anchors at the cost of increasing total alignment
 time by a few percent on real data.
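
The selection rule described in this hunk can be sketched as follows. This is a minimal illustration, not minimap2's C implementation: the function name `rescue_minimizers`, the `(position, occurrence)` tuple layout and the example numbers are all assumptions made for the sketch. Only the 500bp spacing and the use of a heap to pick the lowest-occurrence minimizers come from the text.

```python
import heapq

def rescue_minimizers(between, x1, x2, spacing=500):
    """Pick extra minimizers between two adjacent low-occurrence k-mers.

    between: list of (position, occurrence) for the minimizers lying
             between x1 and x2 (hypothetical layout for this sketch).
    Returns floor(|x1 - x2| / spacing) minimizers of the lowest
    occurrence, sorted by position, or [] if the gap is below spacing.
    """
    dist = abs(x1 - x2)
    if dist < spacing:
        return []
    n_extra = dist // spacing
    # heapq.nsmallest uses a heap to keep the n_extra entries with the
    # smallest occurrence without sorting the whole interval.
    picked = heapq.nsmallest(n_extra, between, key=lambda m: m[1])
    return sorted(picked)

# Hypothetical interval: low-occurrence k-mers at 1000 and 2200.
candidates = [(1100, 900), (1300, 250), (1700, 4000), (1900, 300), (2050, 120)]
print(rescue_minimizers(candidates, 1000, 2200))
```

Here the gap is 1,200bp, so two additional minimizers are selected: the two candidates with the lowest occurrence.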
@@ -76,11 +77,11 @@ time by a few percent on real data.
 \subsection{Aligning through longer INDELs}
 The original minimap2 may fail to align long INDELs due to its chaining
 heuristics. Briefly, minimap2 applies dynamic programming (DP) to chain
-minimizer anchors. It is a quadratic algorithm, which is slow for chaining
+minimizer anchors. This is a quadratic algorithm, which is slow for chaining
 contigs. For acceptable performance, the original minimap2 uses a 500bp band by
-default. If there is an INDEL longer than 500bp and the two chains around it
+default. If there is an INDEL longer than 500bp and the two chains around the INDEL
 have no overlaps on either the query or the reference sequence, minimap2 may
-join the two short chains later as a postprocessing step. We call it the
+join the two short chains at a later step. We call it the
 long-join heuristic. This heuristic may fail around VNTRs because short chains
 often have overlaps in VNTRs. More subtly, minimap2 may escape the inner DP
 loop early, again for performance, if the chaining result is not improved for
@@ -91,10 +92,10 @@ specify a large band.
 In minigraph~\citep{Li:2020aa}, we developed a new chaining algorithm that
 finds short INDELs with DP-based chaining and goes through long INDELs with a
 subquadratic algorithm~\citep{DBLP:conf/wabi/AbouelhodaO03}. We ported the same
-algorithm to minimap2 for contig mapping. For read mapping, the minigraph
+algorithm to minimap2 for contig mapping. For long-read mapping, the minigraph
-algorithm is slower. The updated minimap2 still uses the DP-based algorithm to
+algorithm is slower. Minimap2 v2.22 still uses the DP-based algorithm to
-find short chains and then uses the minigraph algorithm to rechain anchors in
+find short chains and then invokes the minigraph algorithm to rechain anchors in
-these short chains. The rechaining steps achieves the same goal as long-join
+these short chains. The rechaining step achieves the same goal as long-join
 but is more reliable as it can resolve overlaps between short chains. The old
 long-join heuristic has since been removed.
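
To illustrate why a fixed band breaks chains at long INDELs, here is a deliberately simplified sketch of quadratic DP chaining. It is not minimap2's scoring: the function `chain_anchors`, the gap-free score `min(dr, dq, k)`, the value `k=15` and the example anchors are assumptions for illustration. Only the quadratic DP and the 500bp default band are taken from the text.

```python
def chain_anchors(anchors, band=500, k=15):
    """Quadratic DP chaining over anchors sorted by reference position.

    anchors: list of (ref_pos, query_pos) tuples.
    Two anchors may be chained only if the implied gap size,
    |ref distance - query distance|, does not exceed `band`.
    Returns the indices of the best-scoring chain.
    """
    n = len(anchors)
    score = [k] * n   # each anchor alone scores k (assumed k-mer length)
    prev = [-1] * n
    for i in range(n):
        ri, qi = anchors[i]
        for j in range(i):
            rj, qj = anchors[j]
            dr, dq = ri - rj, qi - qj
            if dr <= 0 or dq <= 0:
                continue          # anchors must be colinear
            if abs(dr - dq) > band:
                continue          # gap exceeds the band: cannot chain
            s = score[j] + min(dr, dq, k)
            if s > score[i]:
                score[i], prev[i] = s, j
    best = max(range(n), key=score.__getitem__)
    chain = []
    while best >= 0:
        chain.append(best)
        best = prev[best]
    return chain[::-1]

# Hypothetical anchors with an 800bp deletion after the third anchor.
anchors = [(100, 100), (200, 200), (300, 300), (1200, 400), (1300, 500)]
print(chain_anchors(anchors, band=500))
print(chain_anchors(anchors, band=1000))
```

With the 500bp band the chain stops before the deletion and the last two anchors form a separate short chain. With a wider band all five anchors are chained, at the cost of more DP work per anchor on real data.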
@@ -107,7 +108,7 @@ algorithm.
 In our view, this problem is rooted in inappropriate scoring: affine-gap penalty
 over-penalizes a long INDEL that was often evolutionarily created in one event.
-We should not penalize a SV linearly in its length. The new minimap2 rescores
+We should not penalize an SV linearly in its length. Minimap2 v2.22 rescores
 an alignment with the following scoring function. Suppose an alignment consists
 of $M$ matching bases, $N$ substitutions and $G$ gap opens, we empirically
 score the alignment with
@@ -139,7 +140,7 @@ practice.
 \begin{table}
 \processtable{Evaluation of minimap2 v2.22}
-{\footnotesize\begin{tabular}{p{4.2cm}rrrr}
+{\footnotesize\label{tab:1}\begin{tabular}{p{4.2cm}rrrr}
 \toprule
 $[$Benchmark$]$ Metric & v2.22 & v2.18 & Winno & lra \\
 \midrule
@@ -165,14 +166,59 @@ minimap2 and lra with ``paftools.js pafcmp''. $[$sim-sv$]$ simulated 1,000
 reads at 30 folds with the same pbsim2 command line. SVs were called with
 ``sniffles -q 10''~\citep{Sedlazeck:2018ab} and compared to the simulated truth with ``SURVIVOR eval
 call.vcf truth.bed 50''. In $[$real-sv-1k$]$, small and long variants were
-called by dipcall-0.3 for HG002 assemblies (AC: GCA\_018852605.1 and
+called by dipcall-0.3~\citep{Li:2018aa} for HG002 assemblies (AC: GCA\_018852605.1 and
 GCA\_018852615.1) and compared to the GIAB truth~\citep{Zook:2020aa} using ``truvari -r 2000 -s
 1000 -S 400 -{}-multimatch -{}-passonly'' which sets the minimum INDEL size to 1kb in evaluation. }
 \end{table}
-\section*{Acknowledgements}
+We evaluated minimap2 v2.22 along with v2.18, Winnowmap2 v2.03 and lra v1.3.2
 (Table~\ref{tab:1}). Both versions of minimap2 achieved high mapping accuracy on
 simulated Nanopore reads (sim-map). Winnowmap2 aligned more reads at mapping
 quality 10 or higher (mapQ10). However, it may occasionally assign a high mapping
 quality to a read with multiple identical best alignments. This reduced its
 mapping accuracy.
-\paragraph{Funding\textcolon} NHGRI R01HG010040
+Lacking ground truth for real data, we took Winnowmap2 mapping as ground
 truth to evaluate other mappers (winno-cmp). Out of 1,378,092 reads with mapQ10
 alignments by Winnowmap2, minimap2 v2.22 could map all of them. 118 reads, less
 than 0.01\% of all reads, were mapped differently by v2.22. 51 of them have
 multiple identical best alignments. We believe these are more likely to be
 Winnowmap2 errors.
 The two benchmarks above only evaluate read mappings without variations.
 To measure the mapping accuracy in the presence of SVs (sim-sv), we reproduced
 the results of~\citet{Jain2020.11.01.363887}. Minimap2 v2.22 is as good as
 Winnowmap2 now. Note that we set the Sniffles mapping quality
 threshold to 10 to be consistent with the benchmarks above. If we used the
 default threshold of 20, v2.22 would miss an additional 0.5\% of SVs, suggesting
 minimap2 v2.22 could map variant reads correctly but with conservative mapping
 quality. This observation is more about the interaction between mappers and
 callers. Furthermore, the simulation here only considers a simple scenario in
 evolution. Non-allelic gene conversions, which happen often in segmental
 duplications~\citep{Harpak:2017aa}, would obscure the optimal mapping
 strategies. How much such simple SV simulation informs real-world SV calling
 remains a question.
 To see if minimap2 v2.22 could improve long INDEL alignment, we ran dipcall on
 contig-to-reference alignments and focused on INDELs longer than 1kb
 (real-sv-1k). v2.22 is more sensitive at comparable specificity, confirming its
 advantage in more contiguous alignment. lra is supposed to handle long INDELs
 better, too. However, we could not get lra to work well with dipcall, so we did
 not report the numbers.
 Minimap2 spends most computing time on base alignment. As recent improvements
 in v2.22 do not change the base alignment algorithm, the new version has similar
 performance to older versions. Minimap2 is consistently several times faster
 than Winnowmap2.
 \section*{Acknowledgements}
 We thank Arang Rhie and Chirag Jain for providing motivating examples where
 older minimap2 underperforms.
 \paragraph{Funding\textcolon} This work is funded by NHGRI grant R01HG010040.
 ~\\*
 \bibliography{minimap2}