added an evaluation section

Heng Li 2014-11-17 14:54:24 -05:00
parent cee8149b12
commit eb664c2fe8
3 changed files with 34 additions and 7 deletions


@@ -135,6 +135,31 @@ other programs for typing such as [Warren et al (2012)][hla4], [Liu et al
(2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though
most of them are distributed under restrictive licenses.

## Preliminary Evaluation

To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
(hs38d6), and called small variants from the alignments. CHM1 is haploid, so
ideally every heterozygous call from it is a false positive (FP). NA12878 is
diploid, and the number of its true positive (TP) heterozygous calls is
approximately the difference between the NA12878 and CHM1 heterozygous call
counts. A better assembly should yield higher TP and lower FP. The following
table shows the numbers for these assemblies, along with CHM1_1.1 and huref:

|Assembly|hs37 |hs38 |hs38d6|CHM1_1.1| huref|
|:------:|------:|------:|------:|------:|------:|
|FP | 255706| 168068| 142516|307172 | 575634|
|TP |2142260|2163113|2150844|2167235|2137053|

By this measure, hs38 is clearly better than hs37: FP drops by ~88k while TP
rises by ~21k. Genome hs38d6 reduces FP by a further ~25k but also reduces TP
by ~12k. We manually inspected variants called from hs38 only and found that
the majority are associated with excessive read depth, clustered variants or
weak alignment, so we believe most hs38-only calls are problematic. In
addition, if we compare two NA12878 replicates from HiSeq X10 with nearly
identical library construction, the difference is ~140k, an order of magnitude
higher than the difference between hs38 and hs38d6. Overall, the ALT contigs,
decoy and HLA genes in hs38d6 improve variant calling at little cost.
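
The arithmetic behind these comparisons can be checked with a short sketch.
The counts are copied from the table above. The variable names, the helper
`delta` and the reconstruction of the raw NA12878 heterozygous counts as
TP + FP are illustrative assumptions that follow from the definitions in the
text, not part of the evaluation pipeline itself.

```python
# Counts copied from the evaluation table (hs37, hs38, hs38d6 columns only).
FP = {"hs37": 255706, "hs38": 168068, "hs38d6": 142516}
TP = {"hs37": 2142260, "hs38": 2163113, "hs38d6": 2150844}


def delta(counts, old, new):
    """Signed change in a count when moving from assembly `old` to `new`."""
    return counts[new] - counts[old]


# By definition TP ~= het(NA12878) - het(CHM1) and FP = het(CHM1), so the raw
# NA12878 heterozygous count implied by the table is TP + FP (an inference).
na12878_het = {name: TP[name] + FP[name] for name in FP}

# hs37 -> hs38: fewer false positives and more true positives.
fp_37_to_38 = delta(FP, "hs37", "hs38")
tp_37_to_38 = delta(TP, "hs37", "hs38")

# hs38 -> hs38d6: fewer false positives, at the cost of some true positives.
fp_38_to_38d6 = delta(FP, "hs38", "hs38d6")
tp_38_to_38d6 = delta(TP, "hs38", "hs38d6")

# The replicate-to-replicate difference quoted in the text (~140k) is an
# approximate figure; compare it with the TP lost on hs38d6.
REPLICATE_DIFF = 140_000
ratio = REPLICATE_DIFF / abs(tp_38_to_38d6)

print(f"hs37 -> hs38:   FP {fp_37_to_38:+d}, TP {tp_37_to_38:+d}")
print(f"hs38 -> hs38d6: FP {fp_38_to_38d6:+d}, TP {tp_38_to_38d6:+d}")
print(f"replicate difference / TP lost on hs38d6: {ratio:.1f}x")
```

The last line puts the ~12k TP difference in context: it is roughly a tenth of
the variation seen between two replicates of the same sample.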

## Problems and Future Development

There are some uncertainties about ALT mappings - we are not sure whether they


@@ -104,17 +104,17 @@
\f0\fs24 \cf2 Read: A\cf0 TCAGCATC\
\cf2 \
ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATCA
ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATGGTCA
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf3 GAACTAGTCACAT\cf2 \
\cf3 |||||\cf5 (high div) \cf3 |||\cf5 (novel ins)\cf3 ||||||||||\cf2 \
\cf3 |||||\cf5 (high div) \cf3 ||||||\cf5 (novel ins)\cf3 ||||||||||\cf2 \
Chromosome:\cf3 GCGTACATGATACGA
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf3 ATC-------------CTAGTCACATCGTAATCGA\
\cf2 \cf3 |||||||||||| |||||||\cf5 (novel ins) \cf3 ||||||||||\
\f0\b0 \cf3 ATGGTC-------------CTAGTCACATCGTAATC\
\cf2 \cf3 |||||||||||| ||||||||||\cf5 (novel ins) \cf3 ||||||||||\
\cf2 ALT ctg 2:\cf3 TGATACGA
\f1\b \cf7 ATCgcCATC
\f0\b0 \cf3 ATCA
\f0\b0 \cf3 ATGGTCA
\f1\b \cf8 ATCgcCAgC
\f0\b0 \cf3 GAACTAGTCACAT\
\
@@ -140,7 +140,9 @@ Chromosome:\cf3 GCGTACATGATACGA
\cf0 Hits considered in mapQ:
\f1\b \cf4 ATCAGCATC
\f0\b0 \cf0 and
\f1\b \cf6 ATCgGCATC\
\f1\b \cf6 ATCgGCATC
\f0\b0 \cf2 (best from each group)
\f1\b \cf6 \
\pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural
\f0\b0 \cf3 \
@@ -217,7 +219,7 @@ Chromosome:\cf3 GCGTACATGATACGA
<key>MasterSheets</key>
<array/>
<key>ModificationDate</key>
<string>2014-11-17 18:01:49 +0000</string>
<string>2014-11-17 18:28:10 +0000</string>
<key>Modifier</key>
<string>Heng Li</string>
<key>NotesVisible</key>

Binary file not shown.

Before: 45 KiB

After: 47 KiB