From 2e35592295267f9832647a43658e6972a290289c Mon Sep 17 00:00:00 2001
From: Mark DePristo
+ * A very common question about a NGS set of reads is what areas of the genome are considered callable. The system
+ * considers the coverage at each locus and emits either a per base state or a summary interval BED file that
+ * partitions the genomic intervals into the following callable states:
+ *
+ *
+ *
+ * A BAM file containing exactly one sample. + *
+ * + *+ *
+ * -T CallableLociWalker \ + * -I my.bam \ + * -summary my.summary \ + * -o my.bed + *+ * + * would produce a BED file (my.bed) that looks like: + * + *
+ * 20 10000000 10000864 CALLABLE + * 20 10000865 10000985 POOR_MAPPING_QUALITY + * 20 10000986 10001138 CALLABLE + * 20 10001139 10001254 POOR_MAPPING_QUALITY + * 20 10001255 10012255 CALLABLE + * 20 10012256 10012259 POOR_MAPPING_QUALITY + * 20 10012260 10012263 CALLABLE + * 20 10012264 10012328 POOR_MAPPING_QUALITY + * 20 10012329 10012550 CALLABLE + * 20 10012551 10012551 LOW_COVERAGE + * 20 10012552 10012554 CALLABLE + * 20 10012555 10012557 LOW_COVERAGE + * 20 10012558 10012558 CALLABLE + * et cetera... + *+ * as well as a summary table that looks like: + * + *
+ * state nBases + * REF_N 0 + * CALLABLE 996046 + * NO_COVERAGE 121 + * LOW_COVERAGE 928 + * EXCESSIVE_COVERAGE 0 + * POOR_MAPPING_QUALITY 2906 + *+ * + * @author Mark DePristo + * @since May 7, 2010 */ @By(DataSource.REFERENCE) public class CallableLociWalker extends LocusWalker
+ * Genotype and Validate is a tool to evaluate the quality of a dataset for calling SNPs + * and Indels given a secondary (validation) data source. The data sources are BAM or VCF + * files. You can use them interchangeably (i.e. a BAM to validate calls in a VCF or a VCF + * to validate calls on a BAM). + *
+ * + *+ * The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you + * want to know how well a particular technology performs calling these snps. With a + * dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you + * can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's + * dataset. + *
+ * + *+ * Another option is to validate the calls on a VCF file, using a deep coverage BAM file + * that you trust the calls on. The GenotypeAndValidate walker will make calls using the + * reads in the BAM file and take them as truth, then compare to the calls in the VCF file + * and produce a truth table. + *
+ * + * + *+ * A BAM file to make calls on and a VCF file to use as truth validation dataset. + * + * You also have the option to invert the roles of the files using the command line options listed below. + *
+ * + *+ * GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a + * 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true + * positive or a false positive). The table should look like this: + *
+ *| + * | ALT | + *REF | + *Predictive Value | + *
|---|---|---|---|
| called alt | + *True Positive (TP) | + *False Positive (FP) | + *Positive PV | + *
| called ref | + *False Negative (FN) | + *True Negative (TN) | + *Negative PV | + *
+ * The positive predictive value (PPV) is the proportion of subjects with positive test results + * who are correctly diagnosed. + *
+ *+ * The negative predictive value (NPV) is the proportion of subjects with a negative test result + * who are correctly diagnosed. + *
+ *+ * The VCF file will contain only the variants that were called or not called, excluding the ones that + * were uncovered or didn't pass the filters. This file is useful if you are trying to compare + * the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to + * apples). + *
+ * + *+ * Here is an example of an annotated VCF file (info field clipped for clarity) + * + *
+ * #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 + * 1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1 + * 1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./. + * 13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./. + * 1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./. + *+ * + * + * + *
+ * java + * -jar /GenomeAnalysisTK.jar + * -T GenotypeAndValidate + * -R human_g1k_v37.fasta + * -I myNewTechReads.bam + * -alleles handAnnotatedVCF.vcf + * -BTI alleles + *+ * + *
+ * java + * -jar /GenomeAnalysisTK.jar + * -T GenotypeAndValidate + * -R human_g1k_v37.fasta + * -I myTruthDataset.bam + * -alleles callsToValidate.vcf + * -BTI alleles + * -bt + * -o gav.vcf + *+ * + * + * @author Mauricio Carneiro + * @since ${DATE} + */ + +@Requires(value={DataSource.READS, DataSource.REFERENCE}) +@Allows(value={DataSource.READS, DataSource.REFERENCE}) + +@By(DataSource.REFERENCE) +@Reference(window=@Window(start=-200,stop=200)) + + +public class GenotypeAndValidateWalker extends RodWalker
+ * After counting covariates in either the initial BAM File or again in the recalibrated BAM File, an analysis tool is available which + * reads the .csv file and outputs several PDF (and .dat) files for each read group in the given BAM. These PDF files graphically + * show the various metrics and characteristics of the reported quality scores (often in relation to the empirical qualities). + * In order to show that any biases in the reported quality scores have been generally fixed through recalibration one should run + * CountCovariates again on a bam file produced by TableRecalibration. In this way users can compare the analysis plots generated + * by pre-recalibration and post-recalibration .csv files. Our usual chain of commands that we use to generate plots of residual + * error is: CountCovariates, TableRecalibrate, samtools index on the recalibrated bam file, CountCovariates again on the recalibrated + * bam file, and then AnalyzeCovariates on both the before and after recal_data.csv files to see the improvement in recalibration. + * + *
+ * The color coding along with the RMSE is included in the plots to give some indication of the number of observations that went into + * each of the quality score estimates. It is defined as follows for N, the number of observations: + * + *
+ * NOTE: For those running this tool externally from the Broad, it is crucial to note that both the -Rscript and -resources options + * must be changed from the default. -Rscript needs to point to your installation of Rscript (this is the scripting version of R, + * not the interactive version) while -resources needs to point to the folder holding the R scripts that are used. For those using + * this tool as part of the Binary Distribution the -resources should point to the resources folder that is part of the tarball. + * For those using this tool by building from the git repository the -resources should point to the R/ subdirectory of the Sting checkout. + * + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration + * + *
+ * The recalibration table file in CSV format that was generated by the CountCovariates walker. + *
+ * + *+ * java -Xmx4g -jar AnalyzeCovariates.jar \ + * -recalFile /path/to/recal.table.csv \ + * -outputDir /path/to/output_dir/ \ + * -resources resources/ \ + * -ignoreQ 5 + *+ * */ +@DocumentedGATKFeature( + groupName = "AnalyzeCovariates", + summary = "Package to plot residual accuracy versus error covariates for the base quality score recalibrator") public class AnalyzeCovariates extends CommandLineProgram { ///////////////////////////// // Command Line Arguments ///////////////////////////// - + /** + * After the header, data records occur one per line until the end of the file. The first several items on a line are the + * values of the individual covariates and will change depending on which covariates were specified at runtime. The last + * three items are the data- that is, number of observations for this combination of covariates, number of reference mismatches, + * and the raw empirical quality score calculated by phred-scaling the mismatch rate. + */ @Input(fullName = "recal_file", shortName = "recalFile", doc = "The input recal csv file to analyze", required = false) private String RECAL_FILE = "output.recal_data.csv"; @Argument(fullName = "output_dir", shortName = "outputDir", doc = "The directory in which to output all the plots and intermediate data files", required = false) @@ -67,11 +121,20 @@ public class AnalyzeCovariates extends CommandLineProgram { private int IGNORE_QSCORES_LESS_THAN = 5; @Argument(fullName = "numRG", shortName = "numRG", doc = "Only process N read groups. Default value: -1 (process all read groups)", required = false) private int NUM_READ_GROUPS_TO_PROCESS = -1; // -1 means process all read groups + + /** + * Combinations of covariates in which there are zero mismatches technically have infinite quality. We get around this situation + * by capping at the specified value. We've found that Q40 is too low when using a more completely database of known variation like dbSNP build 132 or later. + */ @Argument(fullName="max_quality_score", shortName="maxQ", required = false, doc="The integer value at which to cap the quality scores, default is 50") private int MAX_QUALITY_SCORE = 50; + + /** + * This argument is useful for comparing before/after plots and you want the axes to match each other. + */ @Argument(fullName="max_histogram_value", shortName="maxHist", required = false, doc="If supplied, this value will be the max value of the histogram plots") private int MAX_HISTOGRAM_VALUE = 0; - @Argument(fullName="do_indel_quality", shortName="indels", required = false, doc="If supplied, this value will be the max value of the histogram plots") + @Argument(fullName="do_indel_quality", shortName="indels", required = false, doc="If supplied, do indel quality plotting") private boolean DO_INDEL_QUALITY = false; diff --git a/public/java/src/org/broadinstitute/sting/analyzecovariates/package-info.java b/public/java/src/org/broadinstitute/sting/analyzecovariates/package-info.java new file mode 100644 index 000000000..9350e4a66 --- /dev/null +++ b/public/java/src/org/broadinstitute/sting/analyzecovariates/package-info.java @@ -0,0 +1,4 @@ +/** + * Package to plot residual accuracy versus error covariates for the base quality score recalibrator. + */ +package org.broadinstitute.sting.analyzecovariates; \ No newline at end of file diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java index fc6b3daee..9b0824ed0 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java @@ -1,3 +1,28 @@ +/* + * Copyright (c) 2011 The Broad Institute + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, + * copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following + * conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES + * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT + * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR + * THE USE OR OTHER DEALINGS IN THE SOFTWARE. + */ + package org.broadinstitute.sting.gatk.walkers.recalibration; import org.broadinstitute.sting.commandline.Gatherer; @@ -12,11 +37,8 @@ import java.util.List; import java.util.regex.Pattern; /** - * Created by IntelliJ IDEA. * User: carneiro * Date: 3/29/11 - * Time: 3:54 PM - * To change this template use File | Settings | File Templates. */ diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java index b4739f366..5ffc61fe3 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java @@ -50,22 +50,47 @@ import java.util.List; import java.util.Map; /** - * First pass of the recalibration. Generates recalibration table based on various user-specified covariates (such as reported quality score, cycle, and dinucleotide). + * First pass of the base quality score recalibration -- Generates recalibration table based on various user-specified covariates (such as reported quality score, cycle, and dinucleotide). * - * This walker is designed to work as the first pass in a two-pass processing step. - * It does a by-locus traversal operating only at sites that are not in dbSNP. - * We assume that all reference mismatches we see are therefore errors and indicative of poor base quality. - * This walker generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and dinucleotide) - * Since there is a large amount of data one can then calculate an empirical probability of error - * given the particular covariates seen at this site, where p(error) = num mismatches / num observations - * The output file is a CSV list of (the several covariate values, num observations, num mismatches, empirical quality score) - * The first non-comment line of the output file gives the name of the covariates that were used for this calculation. + *
+ * This walker is designed to work as the first pass in a two-pass processing step. It does a by-locus traversal operating + * only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative + * of poor base quality. This walker generates tables based on various user-specified covariates (such as read group, + * reported quality score, cycle, and dinucleotide). Since there is a large amount of data one can then calculate an empirical + * probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. + * The output file is a CSV list of (the several covariate values, num observations, num mismatches, empirical quality score). + *
+ * Note: ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added for the user regardless of whether or not they were specified. * - * Note: ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added for the user regardless of whether or not they were specified - * Note: This walker is designed to be used in conjunction with TableRecalibrationWalker. + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration + * + *
+ * A database of known polymorphic sites to skip over. + *
+ * + *+ * A recalibration table file in CSV format that is used by the TableRecalibration walker. + *
+ * + *+ * java -Xmx4g -jar GenomeAnalysisTK.jar \ + * -R resources/Homo_sapiens_assembly18.fasta \ + * -knownSites bundle/hg18/dbsnp_132.hg18.vcf \ + * -knownSites another/optional/setOfSitesToMask.vcf \ + * -I my_reads.bam \ + * -T CountCovariates \ + * -cov ReadGroupCovariate \ + * -cov QualityScoreCovariate \ + * -cov CycleCovariate \ + * -cov DinucCovariate \ + * -recalFile my_reads.recal_data.csv + ** - * @author rpoplin - * @since Nov 3, 2009 */ @BAQMode(ApplicationTime = BAQ.ApplicationTime.FORBIDDEN) @@ -96,8 +121,13 @@ public class CountCovariatesWalker extends LocusWalker
+ * This walker is designed to work as the second pass in a two-pass processing step, doing a by-read traversal. For each + * base in each read this walker calculates various user-specified covariates (such as read group, reported quality score, + * cycle, and dinuc). Using these values as a key in a large hashmap the walker calculates an empirical base quality score + * and overwrites the quality score currently in the read. This walker then outputs a new bam file with these updated (recalibrated) reads. * - * For each base in each read this walker calculates various user-specified covariates (such as read group, reported quality score, cycle, and dinuc) - * Using these values as a key in a large hashmap the walker calculates an empirical base quality score and overwrites the quality score currently in the read. - * This walker then outputs a new bam file with these updated (recalibrated) reads. + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration * - * Note: This walker expects as input the recalibration table file generated previously by CovariateCounterWalker. - * Note: This walker is designed to be used in conjunction with CovariateCounterWalker. + *
+ * The recalibration table file in CSV format that was generated by the CountCovariates walker. + *
+ * + *+ * A bam file in which the quality scores in each read have been recalibrated. + *
+ * + *+ * java -Xmx4g -jar GenomeAnalysisTK.jar \ + * -R resources/Homo_sapiens_assembly18.fasta \ + * -I my_reads.bam \ + * -T TableRecalibration \ + * -o my_reads.recal.bam \ + * -recalFile my_reads.recal_data.csv + ** - * @author rpoplin - * @since Nov 3, 2009 */ @BAQMode(QualityMode = BAQ.QualityMode.ADD_TAG, ApplicationTime = BAQ.ApplicationTime.ON_OUTPUT) @@ -79,24 +98,54 @@ public class TableRecalibrationWalker extends ReadWalker
but rather just write them out unmodified in the recalibrated BAM file. This is useful + * because Solexa writes Q2 and Q3 bases when the machine has really gone wrong. This would be fine in and of itself, + * but when you select a subset of these reads based on their ability to align to the reference and their dinucleotide effect, + * your Q2 and Q3 bins can be elevated to Q8 or Q10, leading to issues downstream. With the default value of 5, all Q0-Q4 bases + * are unmodified during recalibration, so they don't get inappropriately evaluated. + */ + @Argument(fullName="preserve_qscores_less_than", shortName="pQ", doc="Bases with quality scores less than this threshold won't be recalibrated. In general it's unsafe to change qualities scores below < 5, since base callers use these values to indicate random or bad bases", required=false) private int PRESERVE_QSCORES_LESS_THAN = 5; - @Argument(fullName="smoothing", shortName="sm", required = false, doc="Number of imaginary counts to add to each bin in order to smooth out bins with few data points, default=1") + + /** + * By default TableRecalibration applies a Yates' correction to account for overfitting when it calculates the empirical + * quality score, in particular, ( # mismatches + 1 ) / ( # observations + 1 ). TableRecalibration accepts a --smoothing / -sm+ * argument which sets how many unobserved counts to add to every bin. Use --smoothing 0 to turn off all smoothing or, for example, + * --smoothing 15 for a large amount of smoothing. + */ + @Argument(fullName="smoothing", shortName="sm", required = false, doc="Number of imaginary counts to add to each bin in order to smooth out bins with few data points") private int SMOOTHING = 1; - @Argument(fullName="max_quality_score", shortName="maxQ", required = false, doc="The integer value at which to cap the quality scores, default=50") + + /** + * Combinations of covariates in which there are zero mismatches technically have infinite quality. We get around this situation + * by capping at the specified value. We've found that Q40 is too low when using a more completely database of known variation like dbSNP build 132 or later. + */ + @Argument(fullName="max_quality_score", shortName="maxQ", required = false, doc="The integer value at which to cap the quality scores") private int MAX_QUALITY_SCORE = 50; + + /** + * By default TableRecalibration emits the OQ field -- so you can go back and look at the original quality scores, rerun + * the system using the OQ flags, etc, on the output BAM files; to turn off emission of the OQ field use this flag. + */ @Argument(fullName="doNotWriteOriginalQuals", shortName="noOQs", required=false, doc="If true, we will not write the original quality (OQ) tag for each read") private boolean DO_NOT_WRITE_OQ = false; From a45498150a419d1e3a0fd8814e13793cdc5f7412 Mon Sep 17 00:00:00 2001 From: Eric Banks Date: Thu, 18 Aug 2011 11:18:29 -0400 Subject: [PATCH 13/13] Remove non-ascii char --- .../sting/gatk/walkers/indels/RealignerTargetCreator.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java index 08ed1af52..145d0327c 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java @@ -57,7 +57,7 @@ import java.util.List; * * The local realignment tool is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases * is minimized across all the reads. In general, a large percent of regions requiring local realignment are due to the presence of an insertion - * or deletion (indels) in the individualÕs genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching + * or deletion (indels) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching * the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, * it is impossible to place reads on the reference genome such at mismatches are minimized across all reads. Consequently, even when some reads are * correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect the true indel,