diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java index 90036407f..7fe16c9df 100644 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverageWalker.java @@ -51,14 +51,48 @@ import java.io.PrintStream; import java.util.*; /** - * A parallelizable walker designed to quickly aggregate relevant coverage statistics across samples in the input - * file. Assesses the mean and median granular coverages of each sample, and generates part of a cumulative - * distribution of % bases and % targets covered for certain depths. The granularity of DOC can be set by command - * line arguments. + * Toolbox for assessing sequence coverage by a wide array of metrics, partitioned by sample, read group, or library * + *
+ * DepthOfCoverage processes a set of bam files to determine coverage at different levels of partitioning and + * aggregation. Coverage can be analyzed per locus, per interval, per gene, or in total; can be partitioned by + * sample, by read group, by technology, by center, or by library; and can be summarized by mean, median, quartiles, + * and/or percentage of bases covered to or beyond a threshold. + * Additionally, reads and bases can be filtered by mapping or base quality score. + * + *
+ * One or more bam files (with proper headers) to be analyzed for coverage statistics + * (Optional) A REFSEQ Rod to aggregate coverage to the gene level + *
+ * + *+ * Tables pertaining to different coverage summaries. Suffix on the table files declares the contents: + * - no suffix: per locus coverage + * - _summary: total, mean, median, quartiles, and threshold proportions, aggregated over all bases + * - _statistics: coverage histograms (# locus with X coverage), aggregated over all bases + * - _interval_summary: total, mean, median, quartiles, and threshold proportions, aggregated per interval + * - _interval_statistics: 2x2 table of # of intervals covered to >= X depth in >=Y samples + * - _gene_summary: total, mean, median, quartiles, and threshold proportions, aggregated per gene + * - _gene_statistics: 2x2 table of # of genes covered to >= X depth in >= Y samples + * - _cumulative_coverage_counts: coverage histograms (# locus with >= X coverage), aggregated over all bases + * - _cumulative_coverage_proportions: proprotions of loci with >= X coverage, aggregated over all bases + *
+ * + *+ * java -Xmx2g -jar GenomeAnalysisTK.jar \ + * -R ref.fasta \ + * -T VariantEval \ + * -o file_name_base \ + * -I input_bams.list + * [-geneList refSeq.sorted.txt] \ + * [-pt readgroup] \ + * [-ct 4 -ct 6 -ct 10] \ + * [-L my_capture_genes.interval_list] + ** - * @Author chartl - * @Date Feb 22, 2010 */ // todo -- cache the map from sample names to means in the print functions, rather than regenerating each time // todo -- support for granular histograms for total depth; maybe n*[start,stop], bins*sqrt(n) diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java index 61149e5d9..cd2891874 100755 --- a/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java +++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/validation/ValidationAmplicons.java @@ -30,21 +30,77 @@ import java.util.LinkedList; import java.util.List; /** - * Created by IntelliJ IDEA. - * User: chartl - * Date: 6/13/11 - * Time: 2:12 PM - * To change this template use File | Settings | File Templates. + * Creates FASTA sequences for use in Seqenom or PCR utilities for site amplification and subsequent validation + * + *
+ * ValidationAmplicons consumes a VCF and an Interval list and produces FASTA sequences from which PCR primers or probe + * sequences can be designed. In addition, ValidationAmplicons uses BWA to check for specificity of tracts of bases within + * the output amplicon, lower-casing non-specific tracts, allows for users to provide sites to mask out, and specifies + * reasons why the site may fail validation (nearby variation, for example). + *
+ * + *+ * Requires a VCF containing alleles to design amplicons towards, a VCF of variants to mask out of the amplicons, and an + * interval list defining the size of the amplicons around the sites to be validated + *
+ * + *+ * Output is a FASTA-formatted file with some modifications at probe sites. For instance: + *
+ * >20:207414 INSERTION=1,VARIANT_TOO_NEAR_PROBE=1, 20_207414 + * CCAACGTTAAGAAAGAGACATGCGACTGGGTgcggtggctcatgcctggaaccccagcactttgggaggccaaggtgggc[A/G*]gNNcacttgaggtcaggagtttgagaccagcctggccaacatggtgaaaccccgtctctactgaaaatacaaaagttagC + * >20:792122 Valid 20_792122 + * TTTTTTTTTagatggagtctcgctcttatcgcccaggcNggagtgggtggtgtgatcttggctNactgcaacttctgcct[-/CCC*]cccaggttcaagtgattNtcctgcctcagccacctgagtagctgggattacaggcatccgccaccatgcctggctaatTT + * >20:994145 Valid 20_994145 + * TCCATGGCCTCCCCCTGGCCCACGAAGTCCTCAGCCACCTCCTTCCTGGAGGGCTCAGCCAAAATCAGACTGAGGAAGAAG[AAG/-*]TGGTGGGCACCCACCTTCTGGCCTTCCTCAGCCCCTTATTCCTAGGACCAGTCCCCATCTAGGGGTCCTCACTGCCTCCC + * >20:1074230 SITE_IS_FILTERED=1, 20_1074230 + * ACCTGATTACCATCAATCAGAACTCATTTCTGTTCCTATCTTCCACCCACAATTGTAATGCCTTTTCCATTTTAACCAAG[T/C*]ACTTATTATAtactatggccataacttttgcagtttgaggtatgacagcaaaaTTAGCATACATTTCATTTTCCTTCTTC + * >20:1084330 DELETION=1, 20_1084330 + * CACGTTCGGcttgtgcagagcctcaaggtcatccagaggtgatAGTTTAGGGCCCTCTCAAGTCTTTCCNGTGCGCATGG[GT/AC*]CAGCCCTGGGCACCTGTNNNNNNNNNNNNNTGCTCATGGCCTTCTAGATTCCCAGGAAATGTCAGAGCTTTTCAAAGCCC + *+ * are amplicon sequences resulting from running the tool. The flags (preceding the sequence itself) can be: + * + * Valid // amplicon is valid + * SITE_IS_FILTERED=1 // validation site is not marked 'PASS' or '.' in its filter field ("you are trying to validate a filtered variant") + * VARIANT_TOO_NEAR_PROBE=1 // there is a variant too near to the variant to be validated, potentially shifting the mass-spec peak + * MULTIPLE_PROBES=1, // multiple variants to be validated found inside the same amplicon + * DELETION=6,INSERTION=5, // 6 deletions and 5 insertions found inside the amplicon region (from the "mask" VCF), will be potentially difficult to validate + * DELETION=1, // deletion found inside the amplicon region, could shift mass-spec peak + * START_TOO_CLOSE, // variant is too close to the start of the amplicon region to give sequenom a good chance to find a suitable primer + * END_TOO_CLOSE, // variant is too close to the end of the amplicon region to give sequenom a good chance to find a suitable primer + * NO_VARIANTS_FOUND, // no variants found within the amplicon region + * INDEL_OVERLAPS_VALIDATION_SITE, // an insertion or deletion interferes directly with the site to be validated (i.e. insertion directly preceding or postceding, or a deletion that spans the site itself) + * + * + *