diff --git a/README-alt.md b/README-alt.md
index e05a225..14c5d58 100644
--- a/README-alt.md
+++ b/README-alt.md
@@ -135,6 +135,31 @@ other programs for typing such as [Warren et al (2012)][hla4], [Liu et al
 (2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though
 most of them are distributed under restrictive licenses.

+## Preliminary Evaluation
+
+To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
+unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
+(hs38d6), and called small variants from the alignment. CHM1 is haploid.
+Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
+true positive (TP) heterozygous calls from NA12878 are approximately equal
+to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
+should yield higher TP and lower FP. The following table shows the numbers for
+these assemblies:
+
+|Assembly|hs37 |hs38 |hs38d6|CHM1_1.1| huref|
+|:------:|------:|------:|------:|------:|------:|
+|FP | 255706| 168068| 142516|307172 | 575634|
+|TP |2142260|2163113|2150844|2167235|2137053|
+
+With this measurement, hs38 is clearly better than hs37. Genome hs38d6 reduces
+FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
+from hs38 only and found the majority of them are associated with excessive read
+depth, clustered variants or weak alignment. We believe most hs38-only calls are
+problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
+with nearly identical library construction, the difference is ~140k, an order
+of magnitude higher than the difference between hs38 and hs38d6. ALT contigs,
+decoy and HLA genes in hs38d6 improve variant calling at little cost.
+
 ## Problems and Future Development

 There are some uncertainties about ALT mappings - we are not sure whether they
diff --git a/extras/alt-demo.graffle b/extras/alt-demo.graffle
index ff47a30..32a8f5f 100644
--- a/extras/alt-demo.graffle
+++ b/extras/alt-demo.graffle
@@ -104,17 +104,17 @@

 \f0\fs24 \cf2 Read: A\cf0 TCAGCATC\
 \cf2 \
- ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATCA
+ ALT ctg 1: \cf3 TGA\cf3 AA---CGAATGCAAATGGTCA
 \f1\b \cf4 ATCAGCATC
 \f0\b0 \cf3 GAACTAGTCACAT\cf2 \
- \cf3 |||||\cf5 (high div) \cf3 |||\cf5 (novel ins)\cf3 ||||||||||\cf2 \
+ \cf3 |||||\cf5 (high div) \cf3 ||||||\cf5 (novel ins)\cf3 ||||||||||\cf2 \
 Chromosome:\cf3 GCGTACATGATACGA
 \f1\b \cf6 ATCgGCATC
-\f0\b0 \cf3 ATC-------------CTAGTCACATCGTAATCGA\
-\cf2 \cf3 |||||||||||| |||||||\cf5 (novel ins) \cf3 ||||||||||\
+\f0\b0 \cf3 ATGGTC-------------CTAGTCACATCGTAATC\
+\cf2 \cf3 |||||||||||| ||||||||||\cf5 (novel ins) \cf3 ||||||||||\
 \cf2 ALT ctg 2:\cf3 TGATACGA
 \f1\b \cf7 ATCgcCATC
-\f0\b0 \cf3 ATCA
+\f0\b0 \cf3 ATGGTCA
 \f1\b \cf8 ATCgcCAgC
 \f0\b0 \cf3 GAACTAGTCACAT\
 \
@@ -140,7 +140,9 @@ Chromosome:\cf3 GCGTACATGATACGA
 \cf0 Hits considered in mapQ:
 \f1\b \cf4 ATCAGCATC
 \f0\b0 \cf0 and
-\f1\b \cf6 ATCgGCATC\
+\f1\b \cf6 ATCgGCATC
+\f0\b0 \cf2 (best from each group)
+\f1\b \cf6 \
 \pard\tx560\tx1120\tx1680\tx2240\tx2800\tx3360\tx3920\tx4480\tx5040\tx5600\tx6160\tx6720\pardirnatural

 \f0\b0 \cf3 \
@@ -217,7 +219,7 @@ Chromosome:\cf3 GCGTACATGATACGA
 MasterSheets
 ModificationDate
- 2014-11-17 18:01:49 +0000
+ 2014-11-17 18:28:10 +0000
 Modifier
 Heng Li
 NotesVisible
diff --git a/extras/alt-demo.png b/extras/alt-demo.png
index efd247c..71f4976 100644
Binary files a/extras/alt-demo.png and b/extras/alt-demo.png differ