From cc3df8f11a454cdabdd139b4e52f70447ce5756d Mon Sep 17 00:00:00 2001
From: Mauricio Carneiro
+ * Genotype and Validate is a tool to evaluate the quality of a dataset for calling SNPs
+ * and Indels given a secondary (validation) data source. The data sources are BAM or VCF
+ * files. You can use them interchangeably (i.e. a BAM to validate calls in a VCF or a VCF
+ * to validate calls on a BAM).
+ *
+ * The simplest scenario is when you have a VCF of hand-annotated SNPs and Indels, and you
+ * want to know how well a particular technology performs calling these SNPs. With a
+ * dataset (BAM file) generated by the technology under test, and the hand-annotated VCF, you
+ * can run GenotypeAndValidate to assess the accuracy of the calls with the new technology's
+ * dataset.
+ *
+ * Another option is to validate the calls on a VCF file, using a deep coverage BAM file
+ * that you trust the calls on. The GenotypeAndValidate walker will make calls using the
+ * reads in the BAM file and take them as truth, then compare to the calls in the VCF file
+ * and produce a truth table.
+ *
+ * Input
+ *
+ * A BAM file to make calls on and a VCF file to use as the truth validation dataset.
+ *
+ * You also have the option to invert the roles of the files using the command line options listed below.
+ *
+ * Output
+ *
+ * GenotypeAndValidate has two outputs: the truth table and an optional VCF file. The truth table is a
+ * 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true
+ * positive or a false positive). The table should look like this:
+ *
+ *                ALT                   REF                   Predictive Value
+ *   called alt   True Positive (TP)    False Positive (FP)   Positive PV
+ *   called ref   False Negative (FN)   True Negative (TN)    Negative PV
+ *
+ * The positive predictive value (PPV) is the proportion of subjects with positive test results
+ * who are correctly diagnosed.
+ *
+ * The negative predictive value (NPV) is the proportion of subjects with a negative test result
+ * who are correctly diagnosed.
+ *
+ * The VCF file will contain only the variants that were called or not called, excluding the ones that
+ * were uncovered or didn't pass the filters. This file is useful if you are trying to compare
+ * the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to
+ * apples).
+ *
+ * Here is an example of an annotated VCF file (info field clipped for clarity):
+ *
+ * #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878
+ * 1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1
+ * 1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./.
+ * 13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./.
+ * 1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./.
+ *
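+ * As a hedged illustration of the predictive values defined above (not code from this patch), the
+ * sketch below computes PPV = TP / (TP + FP) and NPV = TN / (TN + FN) from the four truth-table
+ * counts. The class and method names are hypothetical, and returning NaN for an empty row is an
+ * assumption made only for this illustration.

```java
/** Hypothetical helper, not part of GATK: predictive values from truth-table counts. */
final class PredictiveValues {
    private PredictiveValues() {}

    /** PPV = TP / (TP + FP); NaN when no site was called alt (assumed convention). */
    static double positivePredictiveValue(long truePositives, long falsePositives) {
        return ratio(truePositives, truePositives + falsePositives);
    }

    /** NPV = TN / (TN + FN); NaN when no site was called ref (assumed convention). */
    static double negativePredictiveValue(long trueNegatives, long falseNegatives) {
        return ratio(trueNegatives, trueNegatives + falseNegatives);
    }

    /** Divides as doubles, returning NaN instead of dividing by zero. */
    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? Double.NaN : (double) numerator / denominator;
    }
}
```

+ * For example, 90 true positives and 10 false positives give a PPV of 0.9.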
+ *
+ *
+ * java
+ *     -jar /GenomeAnalysisTK.jar
+ *     -T GenotypeAndValidate
+ *     -R human_g1k_v37.fasta
+ *     -I myNewTechReads.bam
+ *     -alleles handAnnotatedVCF.vcf
+ *     -BTI alleles
+ *
+ * java
+ *     -jar /GenomeAnalysisTK.jar
+ *     -T GenotypeAndValidate
+ *     -R human_g1k_v37.fasta
+ *     -I myTruthDataset.bam
+ *     -alleles callsToValidate.vcf
+ *     -BTI alleles
+ *     -bt
+ *     -o gav.vcf
+ *
+ * @author Mauricio Carneiro
+ * @since ${DATE}
+ */
+
+@Requires(value={DataSource.READS, DataSource.REFERENCE})
+@Allows(value={DataSource.READS, DataSource.REFERENCE})
+
+@By(DataSource.REFERENCE)
+@Reference(window=@Window(start=-200,stop=200))
+
+
+public class GenotypeAndValidateWalker extends RodWalker