From 2e35592295267f9832647a43658e6972a290289c Mon Sep 17 00:00:00 2001
From: Mark DePristo
+ * A very common question about a NGS set of reads is what areas of the genome are considered callable. The system
+ * considers the coverage at each locus and emits either a per base state or a summary interval BED file that
+ * partitions the genomic intervals into the following callable states:
+ *
+ *
+ *
+ * A BAM file containing exactly one sample. + *
+ * + *+ *
+ * -T CallableLociWalker \ + * -I my.bam \ + * -summary my.summary \ + * -o my.bed + *+ * + * would produce a BED file (my.bed) that looks like: + * + *
+ * 20 10000000 10000864 CALLABLE + * 20 10000865 10000985 POOR_MAPPING_QUALITY + * 20 10000986 10001138 CALLABLE + * 20 10001139 10001254 POOR_MAPPING_QUALITY + * 20 10001255 10012255 CALLABLE + * 20 10012256 10012259 POOR_MAPPING_QUALITY + * 20 10012260 10012263 CALLABLE + * 20 10012264 10012328 POOR_MAPPING_QUALITY + * 20 10012329 10012550 CALLABLE + * 20 10012551 10012551 LOW_COVERAGE + * 20 10012552 10012554 CALLABLE + * 20 10012555 10012557 LOW_COVERAGE + * 20 10012558 10012558 CALLABLE + * et cetera... + *+ * as well as a summary table that looks like: + * + *
+ * state nBases + * REF_N 0 + * CALLABLE 996046 + * NO_COVERAGE 121 + * LOW_COVERAGE 928 + * EXCESSIVE_COVERAGE 0 + * POOR_MAPPING_QUALITY 2906 + *+ * + * @author Mark DePristo + * @since May 7, 2010 */ @By(DataSource.REFERENCE) public class CallableLociWalker extends LocusWalker
+ * Genotype and Validate is a tool to evaluate the quality of a dataset for calling SNPs + * and Indels given a secondary (validation) data source. The data sources are BAM or VCF + * files. You can use them interchangeably (i.e. a BAM to validate calls in a VCF or a VCF + * to validate calls on a BAM). + *
+ * + *+ * The simplest scenario is when you have a VCF of hand annotated SNPs and Indels, and you + * want to know how well a particular technology performs calling these snps. With a + * dataset (BAM file) generated by the technology in test, and the hand annotated VCF, you + * can run GenotypeAndValidate to asses the accuracy of the calls with the new technology's + * dataset. + *
+ * + *+ * Another option is to validate the calls on a VCF file, using a deep coverage BAM file + * that you trust the calls on. The GenotypeAndValidate walker will make calls using the + * reads in the BAM file and take them as truth, then compare to the calls in the VCF file + * and produce a truth table. + *
+ * + * + *+ * A BAM file to make calls on and a VCF file to use as truth validation dataset. + * + * You also have the option to invert the roles of the files using the command line options listed below. + *
+ * + *+ * GenotypeAndValidate has two outputs. The truth table and the optional VCF file. The truth table is a + * 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true + * positive or a false positive). The table should look like this: + *
+ *| + * | ALT | + *REF | + *Predictive Value | + *
|---|---|---|---|
| called alt | + *True Positive (TP) | + *False Positive (FP) | + *Positive PV | + *
| called ref | + *False Negative (FN) | + *True Negative (TN) | + *Negative PV | + *
+ * The positive predictive value (PPV) is the proportion of subjects with positive test results + * who are correctly diagnosed. + *
+ *+ * The negative predictive value (NPV) is the proportion of subjects with a negative test result + * who are correctly diagnosed. + *
+ *+ * The VCF file will contain only the variants that were called or not called, excluding the ones that + * were uncovered or didn't pass the filters. This file is useful if you are trying to compare + * the PPV and NPV of two different technologies on the exact same sites (so you can compare apples to + * apples). + *
+ * + *+ * Here is an example of an annotated VCF file (info field clipped for clarity) + * + *
+ * #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 + * 1 20568807 . C T 0 HapMapHet AC=1;AF=0.50;AN=2;DP=0;GV=T GT 0/1 + * 1 22359922 . T C 282 WG-CG-HiSeq AC=2;AF=0.50;GV=T;AN=4;DP=42 GT:AD:DP:GL:GQ 1/0 ./. 0/1:20,22:39:-72.79,-11.75,-67.94:99 ./. + * 13 102391461 . G A 341 Indel;SnpCluster AC=1;GV=F;AF=0.50;AN=2;DP=45 GT:AD:DP:GL:GQ ./. ./. 0/1:32,13:45:-50.99,-13.56,-112.17:99 ./. + * 1 175516757 . C G 655 SnpCluster,WG AC=1;AF=0.50;AN=2;GV=F;DP=74 GT:AD:DP:GL:GQ ./. ./. 0/1:52,22:67:-89.02,-20.20,-191.27:99 ./. + *+ * + * + * + *
+ * java + * -jar /GenomeAnalysisTK.jar + * -T GenotypeAndValidate + * -R human_g1k_v37.fasta + * -I myNewTechReads.bam + * -alleles handAnnotatedVCF.vcf + * -BTI alleles + *+ * + *
+ * java + * -jar /GenomeAnalysisTK.jar + * -T GenotypeAndValidate + * -R human_g1k_v37.fasta + * -I myTruthDataset.bam + * -alleles callsToValidate.vcf + * -BTI alleles + * -bt + * -o gav.vcf + *+ * + * + * @author Mauricio Carneiro + * @since ${DATE} + */ + +@Requires(value={DataSource.READS, DataSource.REFERENCE}) +@Allows(value={DataSource.READS, DataSource.REFERENCE}) + +@By(DataSource.REFERENCE) +@Reference(window=@Window(start=-200,stop=200)) + + +public class GenotypeAndValidateWalker extends RodWalker
+ * After counting covariates in either the initial BAM File or again in the recalibrated BAM File, an analysis tool is available which + * reads the .csv file and outputs several PDF (and .dat) files for each read group in the given BAM. These PDF files graphically + * show the various metrics and characteristics of the reported quality scores (often in relation to the empirical qualities). + * In order to show that any biases in the reported quality scores have been generally fixed through recalibration one should run + * CountCovariates again on a bam file produced by TableRecalibration. In this way users can compare the analysis plots generated + * by pre-recalibration and post-recalibration .csv files. Our usual chain of commands that we use to generate plots of residual + * error is: CountCovariates, TableRecalibrate, samtools index on the recalibrated bam file, CountCovariates again on the recalibrated + * bam file, and then AnalyzeCovariates on both the before and after recal_data.csv files to see the improvement in recalibration. + * + *
+ * The color coding along with the RMSE is included in the plots to give some indication of the number of observations that went into + * each of the quality score estimates. It is defined as follows for N, the number of observations: + * + *
+ * NOTE: For those running this tool externally from the Broad, it is crucial to note that both the -Rscript and -resources options + * must be changed from the default. -Rscript needs to point to your installation of Rscript (this is the scripting version of R, + * not the interactive version) while -resources needs to point to the folder holding the R scripts that are used. For those using + * this tool as part of the Binary Distribution the -resources should point to the resources folder that is part of the tarball. + * For those using this tool by building from the git repository the -resources should point to the R/ subdirectory of the Sting checkout. + * + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration + * + *
+ * The recalibration table file in CSV format that was generated by the CountCovariates walker. + *
+ * + *+ * java -Xmx4g -jar AnalyzeCovariates.jar \ + * -recalFile /path/to/recal.table.csv \ + * -outputDir /path/to/output_dir/ \ + * -resources resources/ \ + * -ignoreQ 5 + *+ * */ +@DocumentedGATKFeature( + groupName = "AnalyzeCovariates", + summary = "Package to plot residual accuracy versus error covariates for the base quality score recalibrator") public class AnalyzeCovariates extends CommandLineProgram { ///////////////////////////// // Command Line Arguments ///////////////////////////// - + /** + * After the header, data records occur one per line until the end of the file. The first several items on a line are the + * values of the individual covariates and will change depending on which covariates were specified at runtime. The last + * three items are the data- that is, number of observations for this combination of covariates, number of reference mismatches, + * and the raw empirical quality score calculated by phred-scaling the mismatch rate. + */ @Input(fullName = "recal_file", shortName = "recalFile", doc = "The input recal csv file to analyze", required = false) private String RECAL_FILE = "output.recal_data.csv"; @Argument(fullName = "output_dir", shortName = "outputDir", doc = "The directory in which to output all the plots and intermediate data files", required = false) @@ -67,11 +121,20 @@ public class AnalyzeCovariates extends CommandLineProgram { private int IGNORE_QSCORES_LESS_THAN = 5; @Argument(fullName = "numRG", shortName = "numRG", doc = "Only process N read groups. Default value: -1 (process all read groups)", required = false) private int NUM_READ_GROUPS_TO_PROCESS = -1; // -1 means process all read groups + + /** + * Combinations of covariates in which there are zero mismatches technically have infinite quality. We get around this situation + * by capping at the specified value. We've found that Q40 is too low when using a more completely database of known variation like dbSNP build 132 or later. + */ @Argument(fullName="max_quality_score", shortName="maxQ", required = false, doc="The integer value at which to cap the quality scores, default is 50") private int MAX_QUALITY_SCORE = 50; + + /** + * This argument is useful for comparing before/after plots and you want the axes to match each other. + */ @Argument(fullName="max_histogram_value", shortName="maxHist", required = false, doc="If supplied, this value will be the max value of the histogram plots") private int MAX_HISTOGRAM_VALUE = 0; - @Argument(fullName="do_indel_quality", shortName="indels", required = false, doc="If supplied, this value will be the max value of the histogram plots") + @Argument(fullName="do_indel_quality", shortName="indels", required = false, doc="If supplied, do indel quality plotting") private boolean DO_INDEL_QUALITY = false; diff --git a/public/java/src/org/broadinstitute/sting/analyzecovariates/package-info.java b/public/java/src/org/broadinstitute/sting/analyzecovariates/package-info.java new file mode 100644 index 000000000..9350e4a66 --- /dev/null +++ b/public/java/src/org/broadinstitute/sting/analyzecovariates/package-info.java @@ -0,0 +1,4 @@ +/** + * Package to plot residual accuracy versus error covariates for the base quality score recalibrator. + */ +package org.broadinstitute.sting.analyzecovariates; \ No newline at end of file diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java index fc6b3daee..9b0824ed0 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesGatherer.java @@ -1,3 +1,28 @@ +/* + * Copyright (c) 2011 The Broad Institute + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, + * copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following + * conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES + * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT + * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR + * THE USE OR OTHER DEALINGS IN THE SOFTWARE. + */ + package org.broadinstitute.sting.gatk.walkers.recalibration; import org.broadinstitute.sting.commandline.Gatherer; @@ -12,11 +37,8 @@ import java.util.List; import java.util.regex.Pattern; /** - * Created by IntelliJ IDEA. * User: carneiro * Date: 3/29/11 - * Time: 3:54 PM - * To change this template use File | Settings | File Templates. */ diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java index b4739f366..5ffc61fe3 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/CountCovariatesWalker.java @@ -50,22 +50,47 @@ import java.util.List; import java.util.Map; /** - * First pass of the recalibration. Generates recalibration table based on various user-specified covariates (such as reported quality score, cycle, and dinucleotide). + * First pass of the base quality score recalibration -- Generates recalibration table based on various user-specified covariates (such as reported quality score, cycle, and dinucleotide). * - * This walker is designed to work as the first pass in a two-pass processing step. - * It does a by-locus traversal operating only at sites that are not in dbSNP. - * We assume that all reference mismatches we see are therefore errors and indicative of poor base quality. - * This walker generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and dinucleotide) - * Since there is a large amount of data one can then calculate an empirical probability of error - * given the particular covariates seen at this site, where p(error) = num mismatches / num observations - * The output file is a CSV list of (the several covariate values, num observations, num mismatches, empirical quality score) - * The first non-comment line of the output file gives the name of the covariates that were used for this calculation. + *
+ * This walker is designed to work as the first pass in a two-pass processing step. It does a by-locus traversal operating + * only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative + * of poor base quality. This walker generates tables based on various user-specified covariates (such as read group, + * reported quality score, cycle, and dinucleotide). Since there is a large amount of data one can then calculate an empirical + * probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. + * The output file is a CSV list of (the several covariate values, num observations, num mismatches, empirical quality score). + *
+ * Note: ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added for the user regardless of whether or not they were specified. * - * Note: ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added for the user regardless of whether or not they were specified - * Note: This walker is designed to be used in conjunction with TableRecalibrationWalker. + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration + * + *
+ * A database of known polymorphic sites to skip over. + *
+ * + *+ * A recalibration table file in CSV format that is used by the TableRecalibration walker. + *
+ * + *+ * java -Xmx4g -jar GenomeAnalysisTK.jar \ + * -R resources/Homo_sapiens_assembly18.fasta \ + * -knownSites bundle/hg18/dbsnp_132.hg18.vcf \ + * -knownSites another/optional/setOfSitesToMask.vcf \ + * -I my_reads.bam \ + * -T CountCovariates \ + * -cov ReadGroupCovariate \ + * -cov QualityScoreCovariate \ + * -cov CycleCovariate \ + * -cov DinucCovariate \ + * -recalFile my_reads.recal_data.csv + ** - * @author rpoplin - * @since Nov 3, 2009 */ @BAQMode(ApplicationTime = BAQ.ApplicationTime.FORBIDDEN) @@ -96,8 +121,13 @@ public class CountCovariatesWalker extends LocusWalker
+ * This walker is designed to work as the second pass in a two-pass processing step, doing a by-read traversal. For each + * base in each read this walker calculates various user-specified covariates (such as read group, reported quality score, + * cycle, and dinuc). Using these values as a key in a large hashmap the walker calculates an empirical base quality score + * and overwrites the quality score currently in the read. This walker then outputs a new bam file with these updated (recalibrated) reads. * - * For each base in each read this walker calculates various user-specified covariates (such as read group, reported quality score, cycle, and dinuc) - * Using these values as a key in a large hashmap the walker calculates an empirical base quality score and overwrites the quality score currently in the read. - * This walker then outputs a new bam file with these updated (recalibrated) reads. + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration * - * Note: This walker expects as input the recalibration table file generated previously by CovariateCounterWalker. - * Note: This walker is designed to be used in conjunction with CovariateCounterWalker. + *
+ * The recalibration table file in CSV format that was generated by the CountCovariates walker. + *
+ * + *+ * A bam file in which the quality scores in each read have been recalibrated. + *
+ * + *+ * java -Xmx4g -jar GenomeAnalysisTK.jar \ + * -R resources/Homo_sapiens_assembly18.fasta \ + * -I my_reads.bam \ + * -T TableRecalibration \ + * -o my_reads.recal.bam \ + * -recalFile my_reads.recal_data.csv + ** - * @author rpoplin - * @since Nov 3, 2009 */ @BAQMode(QualityMode = BAQ.QualityMode.ADD_TAG, ApplicationTime = BAQ.ApplicationTime.ON_OUTPUT) @@ -79,24 +98,54 @@ public class TableRecalibrationWalker extends ReadWalker
but rather just write them out unmodified in the recalibrated BAM file. This is useful + * because Solexa writes Q2 and Q3 bases when the machine has really gone wrong. This would be fine in and of itself, + * but when you select a subset of these reads based on their ability to align to the reference and their dinucleotide effect, + * your Q2 and Q3 bins can be elevated to Q8 or Q10, leading to issues downstream. With the default value of 5, all Q0-Q4 bases + * are unmodified during recalibration, so they don't get inappropriately evaluated. + */ + @Argument(fullName="preserve_qscores_less_than", shortName="pQ", doc="Bases with quality scores less than this threshold won't be recalibrated. In general it's unsafe to change qualities scores below < 5, since base callers use these values to indicate random or bad bases", required=false) private int PRESERVE_QSCORES_LESS_THAN = 5; - @Argument(fullName="smoothing", shortName="sm", required = false, doc="Number of imaginary counts to add to each bin in order to smooth out bins with few data points, default=1") + + /** + * By default TableRecalibration applies a Yates' correction to account for overfitting when it calculates the empirical + * quality score, in particular, ( # mismatches + 1 ) / ( # observations + 1 ). TableRecalibration accepts a --smoothing / -sm+ * argument which sets how many unobserved counts to add to every bin. Use --smoothing 0 to turn off all smoothing or, for example, + * --smoothing 15 for a large amount of smoothing. + */ + @Argument(fullName="smoothing", shortName="sm", required = false, doc="Number of imaginary counts to add to each bin in order to smooth out bins with few data points") private int SMOOTHING = 1; - @Argument(fullName="max_quality_score", shortName="maxQ", required = false, doc="The integer value at which to cap the quality scores, default=50") + + /** + * Combinations of covariates in which there are zero mismatches technically have infinite quality. We get around this situation + * by capping at the specified value. We've found that Q40 is too low when using a more completely database of known variation like dbSNP build 132 or later. + */ + @Argument(fullName="max_quality_score", shortName="maxQ", required = false, doc="The integer value at which to cap the quality scores") private int MAX_QUALITY_SCORE = 50; + + /** + * By default TableRecalibration emits the OQ field -- so you can go back and look at the original quality scores, rerun + * the system using the OQ flags, etc, on the output BAM files; to turn off emission of the OQ field use this flag. + */ @Argument(fullName="doNotWriteOriginalQuals", shortName="noOQs", required=false, doc="If true, we will not write the original quality (OQ) tag for each read") private boolean DO_NOT_WRITE_OQ = false; From a45498150a419d1e3a0fd8814e13793cdc5f7412 Mon Sep 17 00:00:00 2001 From: Eric Banks Date: Thu, 18 Aug 2011 11:18:29 -0400 Subject: [PATCH 14/20] Remove non-ascii char --- .../sting/gatk/walkers/indels/RealignerTargetCreator.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java index 08ed1af52..145d0327c 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java @@ -57,7 +57,7 @@ import java.util.List; * * The local realignment tool is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases * is minimized across all the reads. In general, a large percent of regions requiring local realignment are due to the presence of an insertion - * or deletion (indels) in the individualÕs genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching + * or deletion (indels) in the individual's genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching * the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, * it is impossible to place reads on the reference genome such at mismatches are minimized across all reads. Consequently, even when some reads are * correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect the true indel, From f5d7cabb209352e57b7bc5f08cb2e1a97432f83b Mon Sep 17 00:00:00 2001 From: Mark DePristo
+ * + *Date: Thu, 18 Aug 2011 11:20:12 -0400 Subject: [PATCH 15/20] Fix for reintroducing an already solved problem. --- .../sting/gatk/CommandLineGATK.java | 2 +- .../help/DocumentedGATKFeatureHandler.java | 2 +- .../sting/utils/help/GATKDocUtils.java | 44 +++++++++++++++++++ .../sting/utils/help/HelpUtils.java | 15 ------- 4 files changed, 46 insertions(+), 17 deletions(-) create mode 100644 public/java/src/org/broadinstitute/sting/utils/help/GATKDocUtils.java diff --git a/public/java/src/org/broadinstitute/sting/gatk/CommandLineGATK.java b/public/java/src/org/broadinstitute/sting/gatk/CommandLineGATK.java index 7c567f511..8a13dadbf 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/CommandLineGATK.java +++ b/public/java/src/org/broadinstitute/sting/gatk/CommandLineGATK.java @@ -175,7 +175,7 @@ public class CommandLineGATK extends CommandLineExecutable { Formatter formatter = new Formatter(additionalHelp); formatter.format("For a full description of this walker, see its GATKdocs at:%n"); - formatter.format("%s%n", HelpUtils.helpLinksToGATKDocs(walkerType)); + formatter.format("%s%n", GATKDocUtils.helpLinksToGATKDocs(walkerType)); return additionalHelp.toString(); } diff --git a/public/java/src/org/broadinstitute/sting/utils/help/DocumentedGATKFeatureHandler.java b/public/java/src/org/broadinstitute/sting/utils/help/DocumentedGATKFeatureHandler.java index 214a1716a..ce03c8093 100644 --- a/public/java/src/org/broadinstitute/sting/utils/help/DocumentedGATKFeatureHandler.java +++ b/public/java/src/org/broadinstitute/sting/utils/help/DocumentedGATKFeatureHandler.java @@ -71,7 +71,7 @@ public abstract class DocumentedGATKFeatureHandler { * @return */ public String getDestinationFilename(ClassDoc doc, Class clazz) { - return HelpUtils.htmlFilenameForClass(clazz); + return GATKDocUtils.htmlFilenameForClass(clazz); } /** diff --git a/public/java/src/org/broadinstitute/sting/utils/help/GATKDocUtils.java b/public/java/src/org/broadinstitute/sting/utils/help/GATKDocUtils.java new file mode 100644 index 000000000..e2909cf15 --- /dev/null +++ b/public/java/src/org/broadinstitute/sting/utils/help/GATKDocUtils.java @@ -0,0 +1,44 @@ +/* + * Copyright (c) 2011, The Broad Institute + * + * Permission is hereby granted, free of charge, to any person + * obtaining a copy of this software and associated documentation + * files (the "Software"), to deal in the Software without + * restriction, including without limitation the rights to use, + * copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following + * conditions: + * + * The above copyright notice and this permission notice shall be + * included in all copies or substantial portions of the Software. + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES + * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT + * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, + * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING + * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + */ + +package org.broadinstitute.sting.utils.help; + +public class GATKDocUtils { + public final static String URL_ROOT_FOR_RELEASE_GATKDOCS = "http://www.broadinstitute.org/gsa/gatkdocs/release/"; + public final static String URL_ROOT_FOR_STABLE_GATKDOCS = "http://iwww.broadinstitute.org/gsa/gatkdocs/stable/"; + public final static String URL_ROOT_FOR_UNSTABLE_GATKDOCS = "http://iwww.broadinstitute.org/gsa/gatkdocs/unstable/"; + + public static String htmlFilenameForClass(Class c) { + return c.getName().replace(".", "_") + ".html"; + } + + public static String helpLinksToGATKDocs(Class c) { + String classPath = htmlFilenameForClass(c); + StringBuilder b = new StringBuilder(); + b.append("release version: ").append(URL_ROOT_FOR_RELEASE_GATKDOCS).append(classPath).append("\n"); + b.append("stable version: ").append(URL_ROOT_FOR_STABLE_GATKDOCS).append(classPath).append("\n"); + b.append("unstable version: ").append(URL_ROOT_FOR_UNSTABLE_GATKDOCS).append(classPath).append("\n"); + return b.toString(); + } +} \ No newline at end of file diff --git a/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java b/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java index da4e7bdaf..d72d2e83c 100644 --- a/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java +++ b/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java @@ -32,9 +32,6 @@ import org.broadinstitute.sting.utils.classloader.JVMUtils; import java.lang.reflect.Field; public class HelpUtils { - public final static String URL_ROOT_FOR_RELEASE_GATKDOCS = "http://www.broadinstitute.org/gsa/gatkdocs/release/"; - public final static String URL_ROOT_FOR_STABLE_GATKDOCS = "http://iwww.broadinstitute.org/gsa/gatkdocs/stable/"; - public final static String URL_ROOT_FOR_UNSTABLE_GATKDOCS = "http://iwww.broadinstitute.org/gsa/gatkdocs/unstable/"; protected static boolean implementsInterface(ProgramElementDoc classDoc, Class... interfaceClasses) { for (Class interfaceClass : interfaceClasses) @@ -79,16 +76,4 @@ public class HelpUtils { String.format("%s", doc.name()); } - public static String htmlFilenameForClass(Class c) { - return c.getName().replace(".", "_") + ".html"; - } - - public static String helpLinksToGATKDocs(Class c) { - String classPath = htmlFilenameForClass(c); - StringBuilder b = new StringBuilder(); - b.append("release version: ").append(URL_ROOT_FOR_RELEASE_GATKDOCS).append(classPath).append("\n"); - b.append("stable version: ").append(URL_ROOT_FOR_STABLE_GATKDOCS).append(classPath).append("\n"); - b.append("unstable version: ").append(URL_ROOT_FOR_UNSTABLE_GATKDOCS).append(classPath).append("\n"); - return b.toString(); - } } \ No newline at end of file From c797616c658297563cb95dfd267f5cf9b17b1947 Mon Sep 17 00:00:00 2001 From: Mark DePristo Date: Thu, 18 Aug 2011 11:51:53 -0400 Subject: [PATCH 16/20] If you have one sample in your BAM, getToolkit().getSamples().size() == 2 Also deleted double initializationm, where a line of code was duplicated in creating the GATK engine. --- .../org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java | 2 -- .../sting/gatk/walkers/coverage/CallableLociWalker.java | 4 +++- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java b/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java index b0c4e203b..da7eaf6e4 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java +++ b/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java @@ -689,8 +689,6 @@ public class GenomeAnalysisEngine { validateSuppliedReads(); readsDataSource = createReadsDataSource(argCollection,genomeLocParser,referenceDataSource.getReference()); - sampleDataSource = new SampleDataSource(getSAMFileHeader(), argCollection.sampleFiles); - for (ReadFilter filter : filters) filter.initialize(this); diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/CallableLociWalker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/CallableLociWalker.java index 3046ef925..98331ec1d 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/CallableLociWalker.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/CallableLociWalker.java @@ -223,8 +223,10 @@ public class CallableLociWalker extends LocusWalker 1 ) + if ( getToolkit().getSamples().size() != 2 ) { + // unbelievably there are actually two samples even when there's just one in the header. God I hate this Samples system throw new UserException.BadArgumentValue("-I", "CallableLoci only works for a single sample, but multiple samples were found in the provided BAM files: " + getToolkit().getSamples()); + } try { PrintStream summaryOut = new PrintStream(summaryFile); From 7c4ce6d9696e02c87ac37824d6885d666b19c98e Mon Sep 17 00:00:00 2001 From: Ryan Poplin Date: Thu, 18 Aug 2011 14:00:39 -0400 Subject: [PATCH 19/20] Added GATKDocs for the VQSR walkers. --- .../analyzecovariates/AnalyzeCovariates.java | 1 - .../TableRecalibrationWalker.java | 2 +- .../ApplyRecalibration.java | 47 +++++++++-- .../VariantRecalibrator.java | 82 +++++++++++++++---- ...VariantRecalibratorArgumentCollection.java | 8 +- 5 files changed, 112 insertions(+), 28 deletions(-) diff --git a/public/java/src/org/broadinstitute/sting/analyzecovariates/AnalyzeCovariates.java b/public/java/src/org/broadinstitute/sting/analyzecovariates/AnalyzeCovariates.java index 9b316f077..2ff8aa979 100755 --- a/public/java/src/org/broadinstitute/sting/analyzecovariates/AnalyzeCovariates.java +++ b/public/java/src/org/broadinstitute/sting/analyzecovariates/AnalyzeCovariates.java @@ -28,7 +28,6 @@ package org.broadinstitute.sting.analyzecovariates; import org.broadinstitute.sting.commandline.Argument; import org.broadinstitute.sting.commandline.CommandLineProgram; import org.broadinstitute.sting.commandline.Input; -import org.broadinstitute.sting.gatk.CommandLineGATK; import org.broadinstitute.sting.gatk.walkers.recalibration.Covariate; import org.broadinstitute.sting.gatk.walkers.recalibration.RecalDatum; import org.broadinstitute.sting.gatk.walkers.recalibration.RecalibrationArgumentCollection; diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/TableRecalibrationWalker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/TableRecalibrationWalker.java index af7148803..85166d30d 100644 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/TableRecalibrationWalker.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/recalibration/TableRecalibrationWalker.java @@ -52,7 +52,7 @@ import java.util.ResourceBundle; import java.util.regex.Pattern; /** - * Second pass of the recalibration -- Uses the table generated by CountCovariates to update the base quality scores of the input bam file using a sequential table calculation making the base quality scores more accurately reflect the actual quality of the bases as measured by reference mismatch rate. + * Second pass of the base quality score recalibration -- Uses the table generated by CountCovariates to update the base quality scores of the input bam file using a sequential table calculation making the base quality scores more accurately reflect the actual quality of the bases as measured by reference mismatch rate. * * * This walker is designed to work as the second pass in a two-pass processing step, doing a by-read traversal. For each diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java index abe27e483..16f1abf1b 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java @@ -45,10 +45,43 @@ import java.io.FileNotFoundException; import java.util.*; /** - * Applies cuts to the input vcf file (by adding filter lines) to achieve the desired novel FDR levels which were specified during VariantRecalibration + * Applies cuts to the input vcf file (by adding filter lines) to achieve the desired novel truth sensitivity levels which were specified during VariantRecalibration + * + *
+ * Using the tranche file generated by the previous step the ApplyRecalibration walker looks at each variant's VQSLOD value + * and decides which tranche it falls in. Variants in tranches that fall below the specified truth sensitivity filter level + * have their filter field annotated with its tranche level. This will result in a call set that simultaneously is filtered + * to the desired level but also has the information necessary to pull out more variants for a higher sensitivity but a + * slightly lower quality level. + * + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration + * + *
Input
+ *+ * The input raw variants to be recalibrated. + *
+ * The recalibration table file in CSV format that was generated by the VariantRecalibrator walker. + *
+ * The tranches file that was generated by the VariantRecalibrator walker. + * + *
Output
+ *+ * A recalibrated VCF file in which each variant is annotated with its VQSLOD and filtered if the score is below the desired quality level. + * + *
Examples
+ *+ * java -Xmx3g -jar GenomeAnalysisTK.jar \ + * -T ApplyRecalibration \ + * -R reference/human_g1k_v37.fasta \ + * -input NA12878.HiSeq.WGS.bwa.cleaned.raw.hg19.subset.vcf \ + * --ts_filter_level 99.0 \ + * -tranchesFile path/to/output.tranches \ + * -recalFile path/to/output.recal \ + * -o path/to/output.recalibrated.filtered.vcf + ** - * @author rpoplin - * @since Mar 14, 2011 */ public class ApplyRecalibration extends RodWalker{ @@ -57,11 +90,11 @@ public class ApplyRecalibration extends RodWalker { // Inputs ///////////////////////////// /** - * The raw input variants to be recalibrated. + * These calls should be unfiltered and annotated with the error covariates that are intended to use for modeling. */ @Input(fullName="input", shortName = "input", doc="The raw input variants to be recalibrated", required=true) public List > input; - @Input(fullName="recal_file", shortName="recalFile", doc="The output recal file used by ApplyRecalibration", required=true) + @Input(fullName="recal_file", shortName="recalFile", doc="The input recal file used by ApplyRecalibration", required=true) private File RECAL_FILE; @Input(fullName="tranches_file", shortName="tranchesFile", doc="The input tranches file describing where to cut the data", required=true) private File TRANCHES_FILE; @@ -69,7 +102,7 @@ public class ApplyRecalibration extends RodWalker { ///////////////////////////// // Outputs ///////////////////////////// - @Output( doc="The output filtered, recalibrated VCF file", required=true) + @Output( doc="The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value", required=true) private VCFWriter vcfWriter = null; ///////////////////////////// @@ -77,7 +110,7 @@ public class ApplyRecalibration extends RodWalker { ///////////////////////////// @Argument(fullName="ts_filter_level", shortName="ts_filter_level", doc="The truth sensitivity level at which to start filtering", required=false) private double TS_FILTER_LEVEL = 99.0; - @Argument(fullName="ignore_filter", shortName="ignoreFilter", doc="If specified the optimizer will use variants even if the specified filter name is marked in the input VCF file", required=false) + @Argument(fullName="ignore_filter", shortName="ignoreFilter", doc="If specified the variant recalibrator will use variants even if the specified filter name is marked in the input VCF file", required=false) private String[] IGNORE_INPUT_FILTERS = null; @Argument(fullName = "mode", shortName = "mode", doc = "Recalibration mode to employ: 1.) SNP for recalibrating only SNPs (emitting indels untouched in the output VCF); 2.) INDEL for indels; and 3.) BOTH for recalibrating both SNPs and indels simultaneously.", required = false) public VariantRecalibratorArgumentCollection.Mode MODE = VariantRecalibratorArgumentCollection.Mode.SNP; diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java index da9da936b..d81a57aad 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java @@ -45,10 +45,54 @@ import java.io.PrintStream; import java.util.*; /** - * Takes variant calls as .vcf files, learns a Gaussian mixture model over the variant annotations and evaluates the variant -- assigning an informative lod score + * Create a Gaussian mixture model by looking at the annotations values over a high quality subset of the input call set and then evaluate all input variants. + * + * + * This walker is the first pass in a two-stage processing step. This walker is designed to be used in conjunction with ApplyRecalibration walker. + * + *
+ * The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set. + * One can then create highly accurate call sets by filtering based on this single estimate for the accuracy of each call. + * The approach taken by variant quality score recalibration is to develop a continuous, covarying estimate of the relationship + * between SNP call annotations (QD, SB, HaplotypeScore, HRun, for example) and the the probability that a SNP is a true genetic + * variant versus a sequencing or data processing artifact. This model is determined adaptively based on "true sites" provided + * as input, typically HapMap 3 sites and those sites found to be polymorphic on the Omni 2.5M SNP chip array. This adaptive + * error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the + * probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is + * the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model. + * + *
+ * See the GATK wiki for a tutorial and example recalibration accuracy plots. + * http://www.broadinstitute.org/gsa/wiki/index.php/Variant_quality_score_recalibration + * + *
Input
+ *+ * The input raw variants to be recalibrated. + *
+ * Known, truth, and training sets to be used by the algorithm. How these various sets are used is described below. + * + *
Output
+ *+ * A recalibration table file in CSV format that is used by the ApplyRecalibration walker. + *
+ * A tranches file which shows various metrics of the recalibration callset as a function of making several slices through the data. + * + *
Examples
+ *+ * java -Xmx4g -jar GenomeAnalysisTK.jar \ + * -T VariantRecalibrator \ + * -R reference/human_g1k_v37.fasta \ + * -input NA12878.HiSeq.WGS.bwa.cleaned.raw.hg19.subset.vcf \ + * -truth:prior=15.0 hapmap_3.3.b37.sites.vcf \ + * -training:prior=15.0 hapmap_3.3.b37.sites.vcf \ + * -training:prior=12.0 1000G_omni2.5.b37.sites.vcf \ + * -known:prior=8.0 dbsnp_132.b37.vcf \ + * -an QD -an HaplotypeScore -an MQRankSum -an ReadPosRankSum -an FS -an MQ \ + * -recalFile path/to/output.recal \ + * -tranchesFile path/to/output.tranches \ + * -rscriptFile path/to/output.plots.R + ** - * User: rpoplin - * Date: 3/12/11 */ public class VariantRecalibrator extends RodWalker, ExpandingArrayList > implements TreeReducible > { @@ -62,36 +106,32 @@ public class VariantRecalibrator extends RodWalker > input; + /** - * A list of training variants used to train the Gaussian mixture model. - * * Input variants which are found to overlap with these training sites are used to build the Gaussian mixture model. */ @Input(fullName="training", shortName = "training", doc="A list of training variants used to train the Gaussian mixture model", required=true) public List > training; + /** - * A list of true variants to be used when deciding the truth sensitivity cut of the final callset. - * * When deciding where to set the cutoff in VQSLOD sensitivity to these truth sites is used. * Typically one might want to say I dropped my threshold until I got back 99% of HapMap sites, for example. */ @Input(fullName="truth", shortName = "truth", doc="A list of true variants to be used when deciding the truth sensitivity cut of the final callset", required=true) public List > truth; + /** - * A list of known variants to be used for metric comparison purposes. - * * The known / novel status of a variant isn't used by the algorithm itself and is only used for reporting / display purposes. * The output metrics are stratified by known status in order to aid in comparisons with other call sets. */ @Input(fullName="known", shortName = "known", doc="A list of known variants to be used for metric comparison purposes", required=false) public List > known = Collections.emptyList(); + /** - * A list of known bad variants used to supplement training the negative model. - * * In addition to using the worst 3% of variants as compared to the Gaussian mixture model, we can also supplement the list * with a database of known bad variants. Maybe these are loci which are frequently filtered out in many projects (centromere, for example). */ @@ -109,13 +149,25 @@ public class VariantRecalibrator extends RodWalker Date: Thu, 18 Aug 2011 15:28:35 -0400 Subject: [PATCH 20/20] dding docs for DepthOfCoverage and ValidationAmplicons --- .../coverage/DepthOfCoverageWalker.java | 46 ++++++++++-- .../validation/ValidationAmplicons.java | 72 ++++++++++++++++--- 2 files changed, 104 insertions(+), 14 deletions(-) diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java index 90036407f..7fe16c9df 100644 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java @@ -51,14 +51,48 @@ import java.io.PrintStream; import java.util.*; /** - * A parallelizable walker designed to quickly aggregate relevant coverage statistics across samples in the input - * file. Assesses the mean and median granular coverages of each sample, and generates part of a cumulative - * distribution of % bases and % targets covered for certain depths. The granularity of DOC can be set by command - * line arguments. + * Toolbox for assessing sequence coverage by a wide array of metrics, partitioned by sample, read group, or library * + * + * DepthOfCoverage processes a set of bam files to determine coverage at different levels of partitioning and + * aggregation. Coverage can be analyzed per locus, per interval, per gene, or in total; can be partitioned by + * sample, by read group, by technology, by center, or by library; and can be summarized by mean, median, quartiles, + * and/or percentage of bases covered to or beyond a threshold. + * Additionally, reads and bases can be filtered by mapping or base quality score. + * + *
Input
+ *+ * One or more bam files (with proper headers) to be analyzed for coverage statistics + * (Optional) A REFSEQ Rod to aggregate coverage to the gene level + *
+ * + *Output
+ *+ * Tables pertaining to different coverage summaries. Suffix on the table files declares the contents: + * - no suffix: per locus coverage + * - _summary: total, mean, median, quartiles, and threshold proportions, aggregated over all bases + * - _statistics: coverage histograms (# locus with X coverage), aggregated over all bases + * - _interval_summary: total, mean, median, quartiles, and threshold proportions, aggregated per interval + * - _interval_statistics: 2x2 table of # of intervals covered to >= X depth in >=Y samples + * - _gene_summary: total, mean, median, quartiles, and threshold proportions, aggregated per gene + * - _gene_statistics: 2x2 table of # of genes covered to >= X depth in >= Y samples + * - _cumulative_coverage_counts: coverage histograms (# locus with >= X coverage), aggregated over all bases + * - _cumulative_coverage_proportions: proprotions of loci with >= X coverage, aggregated over all bases + *
+ * + *Examples
+ *+ * java -Xmx2g -jar GenomeAnalysisTK.jar \ + * -R ref.fasta \ + * -T VariantEval \ + * -o file_name_base \ + * -I input_bams.list + * [-geneList refSeq.sorted.txt] \ + * [-pt readgroup] \ + * [-ct 4 -ct 6 -ct 10] \ + * [-L my_capture_genes.interval_list] + ** - * @Author chartl - * @Date Feb 22, 2010 */ // todo -- cache the map from sample names to means in the print functions, rather than regenerating each time // todo -- support for granular histograms for total depth; maybe n*[start,stop], bins*sqrt(n) diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java index 61149e5d9..cd2891874 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java @@ -30,21 +30,77 @@ import java.util.LinkedList; import java.util.List; /** - * Created by IntelliJ IDEA. - * User: chartl - * Date: 6/13/11 - * Time: 2:12 PM - * To change this template use File | Settings | File Templates. + * Creates FASTA sequences for use in Seqenom or PCR utilities for site amplification and subsequent validation + * + *+ * ValidationAmplicons consumes a VCF and an Interval list and produces FASTA sequences from which PCR primers or probe + * sequences can be designed. In addition, ValidationAmplicons uses BWA to check for specificity of tracts of bases within + * the output amplicon, lower-casing non-specific tracts, allows for users to provide sites to mask out, and specifies + * reasons why the site may fail validation (nearby variation, for example). + *
+ * + *Input
+ *+ * Requires a VCF containing alleles to design amplicons towards, a VCF of variants to mask out of the amplicons, and an + * interval list defining the size of the amplicons around the sites to be validated + *
+ * + *Output
+ *+ * Output is a FASTA-formatted file with some modifications at probe sites. For instance: + *
+ * >20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414 + * CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC + * >20:792122 Valid 20_792122 + * TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT + * >20:994145 Valid 20_994145 + * TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC + * >20:1074230 SITE_IS_FILTERED=1, 20_1074230 + * ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC + * >20:1084330 DELETION=1, 20_1084330 + * CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC + *+ * are amplicon sequences resulting from running the tool. The flags (preceding the sequence itself) can be: + * + * Valid // amplicon is valid + * SITE_IS_FILTERED=1 // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant") + * VARIANT_TOO_NEAR_PROBE=1 // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak + * MULTIPLE_PROBES=1, // multiple variants to be validated found inside the same amplicon + * DELETION=6,INSERTION=5, // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate + * DELETION=1, // deletion found inside the amplicon region, could shift mass-spec peak + * START_TOO_CLOSE, // variant is too close to the start of the amplicon region to give sequenom a good chance to find a suitable primer + * END_TOO_CLOSE, // variant is too close to the end of the amplicon region to give sequenom a good chance to find a suitable primer + * NO_VARIANTS_FOUND, // no variants found within the amplicon region + * INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. insertion directly preceding or postceding, or a deletion that spans the site itself) + *Examples
+ * + * java + * -jar GenomeAnalysisTK.jar + * -T ValidationAmplicons + * -R /humgen/1kg/reference/human_g1k_v37.fasta + * -BTI ProbeIntervals + * -ProbeIntervals:table interval_table.table + * -ValidateAlleles:vcf sites_to_validate.vcf + * -MaskAlleles:vcf mask_sites.vcf + * --virtualPrimerSize 30 + * -o probes.fasta + * + * + * @author chartl + * @since July 2011 */ @Requires(value={DataSource.REFERENCE}) public class ValidationAmplicons extends RodWalker{ - @Input(fullName = "ProbeIntervals", doc="Chris document me", required=true) + @Input(fullName = "ProbeIntervals", doc="A collection of intervals in table format with optional names that represent the "+ + "intervals surrounding the probe sites amplicons should be designed for", required=true) RodBinding probeIntervals; - @Input(fullName = "ValidateAlleles", doc="Chris document me", required=true) + @Input(fullName = "ValidateAlleles", doc="A VCF containing the sites and alleles you want to validate. Restricted to *BI-Allelic* sites", required=true) RodBinding validateAlleles; - @Input(fullName = "MaskAlleles", doc="Chris document me", required=true) + @Input(fullName = "MaskAlleles", doc="A VCF containing the sites you want to MASK from the designed amplicon (e.g. by Ns or lower-cased bases)", required=true) RodBinding maskAlleles;