();
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/missing/Metrics.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/missing/Metrics.java
new file mode 100644
index 000000000..5e3da5f4f
--- /dev/null
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/missing/Metrics.java
@@ -0,0 +1,110 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.diagnostics.missing;
+
+/**
+ * Short one line description of the walker.
+ *
+ *
+ * [Long description of the walker]
+ *
+ *
+ *
+ * Input
+ *
+ * [Description of the Input]
+ *
+ *
+ * Output
+ *
+ * [Description of the Output]
+ *
+ *
+ * Examples
+ *
+ * java
+ * -jar GenomeAnalysisTK.jar
+ * -T [walker name]
+ *
+ *
+ * @author Mauricio Carneiro
+ * @since 5/1/13
+ */
+final class Metrics {
+ private double gccontent;
+ private double baseQual;
+ private double mapQual;
+ private int reads;
+ private int refs;
+
+ void reads(int reads) {this.reads = reads;}
+ void refs(int refs) {this.refs = refs;}
+
+ void gccontent(double gccontent) {this.gccontent = gccontent;}
+ void baseQual(double baseQual) {this.baseQual = baseQual;}
+ void mapQual(double mapQual) {this.mapQual = mapQual;}
+
+ double gccontent() {return refs > 0 ? gccontent/refs : 0.0;}
+ double baseQual() {return reads > 0 ? baseQual/reads : 0.0;}
+ double mapQual() {return reads > 0 ? mapQual/reads : 0.0;}
+
+ /**
+ * Combines two metrics
+ *
+ * @param value the other metric to combine
+ * @return itself, for simple reduce
+ */
+ public Metrics combine(Metrics value) {
+ this.gccontent += value.gccontent;
+ this.baseQual += value.baseQual;
+ this.mapQual += value.mapQual;
+ this.reads += value.reads;
+ this.refs += value.refs;
+
+ return this;
+ }
+}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/missing/QualifyMissingIntervals.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/missing/QualifyMissingIntervals.java
new file mode 100644
index 000000000..d0db3ef98
--- /dev/null
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/missing/QualifyMissingIntervals.java
@@ -0,0 +1,228 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.diagnostics.missing;
+
+import org.broadinstitute.sting.commandline.Argument;
+import org.broadinstitute.sting.commandline.Output;
+import org.broadinstitute.sting.gatk.CommandLineGATK;
+import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
+import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
+import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
+import org.broadinstitute.sting.gatk.report.GATKReport;
+import org.broadinstitute.sting.gatk.walkers.By;
+import org.broadinstitute.sting.gatk.walkers.DataSource;
+import org.broadinstitute.sting.gatk.walkers.LocusWalker;
+import org.broadinstitute.sting.gatk.walkers.NanoSchedulable;
+import org.broadinstitute.sting.utils.GenomeLoc;
+import org.broadinstitute.sting.utils.GenomeLocParser;
+import org.broadinstitute.sting.utils.GenomeLocSortedSet;
+import org.broadinstitute.sting.utils.collections.Pair;
+import org.broadinstitute.sting.utils.exceptions.UserException;
+import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
+import org.broadinstitute.sting.utils.help.HelpConstants;
+import org.broadinstitute.sting.utils.pileup.ReadBackedPileup;
+import org.broadinstitute.sting.utils.text.XReadLines;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.PrintStream;
+import java.util.List;
+
+/**
+ * Walks along reference and calculates a few metrics for each interval.
+ *
+ * Metrics:
+ *
+ * - Average Base Quality
+ * - Average Mapping Quality
+ * - GC Content
+ * - Position in the target
+ * - Coding Sequence / Intron
+ * - Length of the uncovered area
+ *
+ *
+ * Input
+ *
+ * A reference file (for GC content), the input bam file (for base and mapping quality calculation), the missing intervals (in the -L), the baits/targets used to sequence (in the -targets) and a bed file with the coding sequence intervals of the genome (in the -cds)
+ *
+ *
+ * Output
+ *
+ * GC content calculations per interval.
+ *
+ *
+ * Example
+ *
+ * java -Xmx2g -jar GenomeAnalysisTK.jar \
+ * -T QualifyMissingIntervals \
+ * -R ref.fasta \
+ * -I input.bam \
+ * -o output.grp \
+ * -L input.intervals \
+ * -cds cds.intervals \
+ * -targets targets.intervals
+ *
+ *
+ */
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
+@By(DataSource.REFERENCE)
+public final class QualifyMissingIntervals extends LocusWalker implements NanoSchedulable {
+ @Output
+ protected PrintStream out;
+
+ @Argument(shortName = "targets", required = true)
+ public File targetsFile;
+
+ @Argument(shortName = "cds", required = false)
+ public File cdsFile = null;
+
+ GATKReport simpleReport;
+ GenomeLocSortedSet target;
+ GenomeLocSortedSet cds;
+
+ public boolean isReduceByInterval() {
+ return true;
+ }
+
+ public void initialize() {
+ simpleReport = GATKReport.newSimpleReport("QualifyMissingIntervals", "IN", "GC", "BQ", "MQ", "TP", "CD", "LN");
+ final GenomeLocParser parser = getToolkit().getGenomeLocParser();
+ target = new GenomeLocSortedSet(parser);
+ cds = new GenomeLocSortedSet(parser);
+ parseFile(targetsFile, target, parser);
+ if (cdsFile != null)
+ parseFile(cdsFile, cds, parser);
+ }
+
+ public Metrics reduceInit() {
+ return new Metrics();
+ }
+
+ public Metrics map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
+ if (tracker == null)
+ return null;
+
+ final Metrics metrics = new Metrics();
+ final byte baseIndex = ref.getBase();
+ final ReadBackedPileup pileup = context.getBasePileup();
+ final int nBases = pileup.getNumberOfElements();
+
+ double baseQual = 0.0;
+ for (byte qual : pileup.getQuals()) {
+ baseQual += qual;
+ }
+ double mapQual = 0.0;
+ for (byte qual : pileup.getMappingQuals()) {
+ mapQual += qual;
+ }
+
+ metrics.baseQual(baseQual);
+ metrics.mapQual(mapQual);
+ metrics.gccontent(baseIndex == 'C' || baseIndex == 'G' ? 1.0 : 0.0);
+ metrics.reads(nBases);
+ metrics.refs(1);
+
+ return metrics;
+ }
+
+ @Override
+ public Metrics reduce(Metrics value, Metrics sum) {
+ return sum.combine(value);
+ }
+
+ public void onTraversalDone(List> results) {
+ for (Pair r : results) {
+ GenomeLoc interval = r.getFirst();
+ Metrics metrics = r.getSecond();
+ simpleReport.addRow(
+ interval.toString(),
+ metrics.gccontent(),
+ metrics.baseQual(),
+ metrics.mapQual(),
+ getPositionInTarget(interval),
+ cds.overlaps(interval),
+ interval.size()
+ );
+ }
+ simpleReport.print(out);
+ out.close();
+ }
+
+ private static GenomeLoc parseInterval(String s, GenomeLocParser parser) {
+ if (s.isEmpty()) {
+ return null;
+ }
+ String[] first = s.split(":");
+ if (first.length == 2) {
+ String[] second = first[1].split("\\-");
+ return parser.createGenomeLoc(first[0], Integer.decode(second[0]), Integer.decode(second[1]));
+ } else {
+ throw new UserException.BadInput("Interval doesn't parse correctly: " + s);
+ }
+ }
+
+ private void parseFile(File file, GenomeLocSortedSet set, GenomeLocParser parser) {
+ try {
+ for (String s : new XReadLines(file) ) {
+ GenomeLoc interval = parseInterval(s, parser);
+ if (interval != null)
+ set.add(interval, true);
+ }
+ } catch (FileNotFoundException e) {
+ e.printStackTrace();
+ }
+ }
+
+ private int getPositionInTarget(GenomeLoc interval) {
+ final List hits = target.getOverlapping(interval);
+ int result = 0;
+ for (GenomeLoc hit : hits) {
+ result = interval.getStart() - hit.getStart(); // if there are multiple hits, we'll get the last one.
+ }
+ return result;
+ }
+}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IndelGenotypeLikelihoodsCalculationModel.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IndelGenotypeLikelihoodsCalculationModel.java
index c6e9ea379..0f3f7739d 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IndelGenotypeLikelihoodsCalculationModel.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/IndelGenotypeLikelihoodsCalculationModel.java
@@ -76,7 +76,8 @@ public class IndelGenotypeLikelihoodsCalculationModel extends GenotypeLikelihood
private List alleleList = new ArrayList();
- protected IndelGenotypeLikelihoodsCalculationModel(UnifiedArgumentCollection UAC, Logger logger) {
+ protected IndelGenotypeLikelihoodsCalculationModel(final UnifiedArgumentCollection UAC,
+ final Logger logger) {
super(UAC, logger);
pairModel = new PairHMMIndelErrorModel(UAC.INDEL_GAP_OPEN_PENALTY, UAC.INDEL_GAP_CONTINUATION_PENALTY,
UAC.OUTPUT_DEBUG_INDEL_INFO, UAC.pairHMM);
@@ -85,10 +86,11 @@ public class IndelGenotypeLikelihoodsCalculationModel extends GenotypeLikelihood
ignoreSNPAllelesWhenGenotypingIndels = UAC.IGNORE_SNP_ALLELES;
}
- protected static List computeConsensusAlleles(ReferenceContext ref,
- Map contexts,
- AlignmentContextUtils.ReadOrientation contextType,
- GenomeLocParser locParser, UnifiedArgumentCollection UAC) {
+ protected static List computeConsensusAlleles(final ReferenceContext ref,
+ final Map contexts,
+ final AlignmentContextUtils.ReadOrientation contextType,
+ final GenomeLocParser locParser,
+ final UnifiedArgumentCollection UAC) {
ConsensusAlleleCounter counter = new ConsensusAlleleCounter(locParser, true, UAC.MIN_INDEL_COUNT_FOR_GENOTYPING, UAC.MIN_INDEL_FRACTION_PER_SAMPLE);
return counter.computeConsensusAlleles(ref, contexts, contextType);
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/SNPGenotypeLikelihoodsCalculationModel.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/SNPGenotypeLikelihoodsCalculationModel.java
index ce5f94478..360f88e51 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/SNPGenotypeLikelihoodsCalculationModel.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/SNPGenotypeLikelihoodsCalculationModel.java
@@ -147,9 +147,17 @@ public class SNPGenotypeLikelihoodsCalculationModel extends GenotypeLikelihoodsC
// if we only want variants, then we don't need to calculate genotype likelihoods
if ( UAC.OutputMode == UnifiedGenotyperEngine.OUTPUT_MODE.EMIT_VARIANTS_ONLY )
return builder.make();
+ // if user requires all PLs at all sites, add all possible alt alleles
+ else if (UAC.annotateAllSitesWithPLs) {
+ for ( final byte base : BaseUtils.BASES ) {
+ if ( base != refBase )
+ alleles.add(Allele.create(base));
+ }
+ }
- // otherwise, choose any alternate allele (it doesn't really matter)
- alleles.add(Allele.create(BaseUtils.baseIndexToSimpleBase(indexOfRefBase == 0 ? 1 : 0)));
+ else
+ // otherwise, choose any alternate allele (it doesn't really matter)
+ alleles.add(Allele.create(BaseUtils.baseIndexToSimpleBase(indexOfRefBase == 0 ? 1 : 0)));
}
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedArgumentCollection.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedArgumentCollection.java
index e346b10b7..4fae3d6e3 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedArgumentCollection.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedArgumentCollection.java
@@ -52,6 +52,9 @@ import org.broadinstitute.sting.utils.pairhmm.PairHMM;
import org.broadinstitute.sting.utils.variant.GATKVariantContextUtils;
import org.broadinstitute.variant.variantcontext.VariantContext;
+import java.util.Collections;
+import java.util.List;
+
public class UnifiedArgumentCollection extends StandardCallerArgumentCollection {
@Argument(fullName = "genotype_likelihoods_model", shortName = "glm", doc = "Genotype likelihoods calculation model to employ -- SNP is the default option, while INDEL is also available for calling indels and BOTH is available for calling both together", required = false)
@@ -82,7 +85,7 @@ public class UnifiedArgumentCollection extends StandardCallerArgumentCollection
* The PairHMM implementation to use for -glm INDEL genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime.
*/
@Argument(fullName = "pair_hmm_implementation", shortName = "pairHMM", doc = "The PairHMM implementation to use for -glm INDEL genotype likelihood calculations", required = false)
- public PairHMM.HMM_IMPLEMENTATION pairHMM = PairHMM.HMM_IMPLEMENTATION.ORIGINAL;
+ public PairHMM.HMM_IMPLEMENTATION pairHMM = PairHMM.HMM_IMPLEMENTATION.LOGLESS_CACHING;
/**
* The minimum confidence needed in a given base for it to be used in variant calling. Note that the base quality of a base
@@ -95,6 +98,18 @@ public class UnifiedArgumentCollection extends StandardCallerArgumentCollection
@Argument(fullName = "max_deletion_fraction", shortName = "deletions", doc = "Maximum fraction of reads with deletions spanning this locus for it to be callable [to disable, set to < 0 or > 1; default:0.05]", required = false)
public Double MAX_DELETION_FRACTION = 0.05;
+ /**
+ * Advanced, experimental argument: if SNP likelihood model is specified, and if EMIT_ALL_SITES output mode is set, when we set this argument then we will also emit PLs at all sites.
+ * This will give a measure of reference confidence and a measure of which alt alleles are more plausible (if any).
+ * WARNINGS:
+ * - This feature will inflate VCF file size considerably.
+ * - All SNP ALT alleles will be emitted with corresponding 10 PL values.
+ * - An error will be emitted if EMIT_ALL_SITES is not set, or if anything other than diploid SNP model is used
+ */
+ @Advanced
+ @Argument(fullName = "allSitePLs", shortName = "allSitePLs", doc = "Annotate all sites with PLs", required = false)
+ public boolean annotateAllSitesWithPLs = false;
+
// indel-related arguments
/**
* A candidate indel is genotyped (and potentially called) if there are this number of reads with a consensus indel at a site.
@@ -199,6 +214,9 @@ public class UnifiedArgumentCollection extends StandardCallerArgumentCollection
@Argument(shortName="ef", fullName="exclude_filtered_reference_sites", doc="Don't include in the analysis sites where the reference sample VCF is filtered. Default: false.", required=false)
boolean EXCLUDE_FILTERED_REFERENCE_SITES = false;
+ @Argument(fullName = "output_mode", shortName = "out_mode", doc = "Specifies which type of calls we should output", required = false)
+ public UnifiedGenotyperEngine.OUTPUT_MODE OutputMode = UnifiedGenotyperEngine.OUTPUT_MODE.EMIT_VARIANTS_ONLY;
+
/**
* Create a new UAC with defaults for all UAC arguments
*/
@@ -247,7 +265,9 @@ public class UnifiedArgumentCollection extends StandardCallerArgumentCollection
this.EXCLUDE_FILTERED_REFERENCE_SITES = uac.EXCLUDE_FILTERED_REFERENCE_SITES;
this.IGNORE_LANE_INFO = uac.IGNORE_LANE_INFO;
this.pairHMM = uac.pairHMM;
+ this.OutputMode = uac.OutputMode;
+ this.annotateAllSitesWithPLs = uac.annotateAllSitesWithPLs;
// todo- arguments to remove
this.IGNORE_SNP_ALLELES = uac.IGNORE_SNP_ALLELES;
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java
index 3380efcc9..9f3368cf8 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java
@@ -83,6 +83,9 @@ public class UnifiedGenotyperEngine {
public static final double HUMAN_SNP_HETEROZYGOSITY = 1e-3;
public static final double HUMAN_INDEL_HETEROZYGOSITY = 1e-4;
+ private static final int SNP_MODEL = 0;
+ private static final int INDEL_MODEL = 1;
+
public enum OUTPUT_MODE {
/** produces calls only at variant sites */
EMIT_VARIANTS_ONLY,
@@ -165,6 +168,13 @@ public class UnifiedGenotyperEngine {
filter.add(LOW_QUAL_FILTER_NAME);
determineGLModelsToUse();
+
+ // do argument checking
+ if (UAC.annotateAllSitesWithPLs) {
+ if (!modelsToUse.contains(GenotypeLikelihoodsCalculationModel.Model.SNP))
+ throw new IllegalArgumentException("Invalid genotype likelihood model specification: Only diploid SNP model can be used in conjunction with option allSitePLs");
+
+ }
}
/**
@@ -436,7 +446,8 @@ public class UnifiedGenotyperEngine {
bestGuessIsRef = false;
}
// if in GENOTYPE_GIVEN_ALLELES mode, we still want to allow the use of a poor allele
- else if ( UAC.GenotypingMode == GenotypeLikelihoodsCalculationModel.GENOTYPING_MODE.GENOTYPE_GIVEN_ALLELES ) {
+ else if ( UAC.GenotypingMode == GenotypeLikelihoodsCalculationModel.GENOTYPING_MODE.GENOTYPE_GIVEN_ALLELES ||
+ UAC.annotateAllSitesWithPLs) {
myAlleles.add(alternateAllele);
alleleCountsofMLE.add(AFresult.getAlleleCountAtMLE(alternateAllele));
}
@@ -446,7 +457,7 @@ public class UnifiedGenotyperEngine {
// note the math.abs is necessary because -10 * 0.0 => -0.0 which isn't nice
final double phredScaledConfidence =
- Math.abs(! bestGuessIsRef || UAC.GenotypingMode == GenotypeLikelihoodsCalculationModel.GENOTYPING_MODE.GENOTYPE_GIVEN_ALLELES
+ Math.abs(! bestGuessIsRef || UAC.GenotypingMode == GenotypeLikelihoodsCalculationModel.GENOTYPING_MODE.GENOTYPE_GIVEN_ALLELES || UAC.annotateAllSitesWithPLs
? -10 * AFresult.getLog10PosteriorOfAFEq0()
: -10 * AFresult.getLog10PosteriorOfAFGT0());
@@ -540,11 +551,6 @@ public class UnifiedGenotyperEngine {
builder.attributes(attributes);
VariantContext vcCall = builder.make();
- // if we are subsetting alleles (either because there were too many or because some were not polymorphic)
- // then we may need to trim the alleles (because the original VariantContext may have had to pad at the end).
- if ( myAlleles.size() != vc.getAlleles().size() && !limitedContext ) // limitedContext callers need to handle allele trimming on their own to keep their perReadAlleleLikelihoodMap alleles in sync
- vcCall = GATKVariantContextUtils.reverseTrimAlleles(vcCall);
-
if ( annotationEngine != null && !limitedContext ) { // limitedContext callers need to handle annotations on their own by calling their own annotationEngine
// Note: we want to use the *unfiltered* and *unBAQed* context for the annotations
final ReadBackedPileup pileup = rawContext.getBasePileup();
@@ -553,6 +559,11 @@ public class UnifiedGenotyperEngine {
vcCall = annotationEngine.annotateContext(tracker, refContext, stratifiedContexts, vcCall, perReadAlleleLikelihoodMap);
}
+ // if we are subsetting alleles (either because there were too many or because some were not polymorphic)
+ // then we may need to trim the alleles (because the original VariantContext may have had to pad at the end).
+ if ( myAlleles.size() != vc.getAlleles().size() && !limitedContext ) // limitedContext callers need to handle allele trimming on their own to keep their perReadAlleleLikelihoodMap alleles in sync
+ vcCall = GATKVariantContextUtils.reverseTrimAlleles(vcCall);
+
return new VariantCallContext(vcCall, confidentlyCalled(phredScaledConfidence, PoFGT0));
}
@@ -693,13 +704,13 @@ public class UnifiedGenotyperEngine {
}
private void determineGLModelsToUse() {
-
String modelPrefix = "";
if ( !UAC.GLmodel.name().contains(GPSTRING) && UAC.samplePloidy != GATKVariantContextUtils.DEFAULT_PLOIDY )
modelPrefix = GPSTRING;
- if ( UAC.GLmodel.name().toUpperCase().contains("BOTH") ) {
- modelPrefix += UAC.GLmodel.name().toUpperCase().replaceAll("BOTH","");
+ // GGA mode => must initialize both the SNP and indel models
+ if ( UAC.GenotypingMode == GenotypeLikelihoodsCalculationModel.GENOTYPING_MODE.GENOTYPE_GIVEN_ALLELES ||
+ UAC.GLmodel.name().toUpperCase().contains("BOTH") ) {
modelsToUse.add(GenotypeLikelihoodsCalculationModel.Model.valueOf(modelPrefix+"SNP"));
modelsToUse.add(GenotypeLikelihoodsCalculationModel.Model.valueOf(modelPrefix+"INDEL"));
}
@@ -712,31 +723,24 @@ public class UnifiedGenotyperEngine {
private List getGLModelsToUse(final RefMetaDataTracker tracker,
final ReferenceContext refContext,
final AlignmentContext rawContext) {
-
if ( UAC.GenotypingMode != GenotypeLikelihoodsCalculationModel.GENOTYPING_MODE.GENOTYPE_GIVEN_ALLELES )
return modelsToUse;
+ if ( modelsToUse.size() != 2 )
+ throw new IllegalStateException("GGA mode assumes that we have initialized both the SNP and indel models but found " + modelsToUse);
+
// if we're genotyping given alleles then we need to choose the model corresponding to the variant type requested
- final List GGAmodel = new ArrayList(1);
final VariantContext vcInput = getVCFromAllelesRod(tracker, refContext, rawContext.getLocation(), false, logger, UAC.alleles);
- if ( vcInput == null )
- return GGAmodel; // no work to be done
- if ( vcInput.isSNP() ) {
- // use the SNP model unless the user chose INDEL mode only
- if ( modelsToUse.size() == 2 || modelsToUse.get(0).name().endsWith("SNP") )
- GGAmodel.add(modelsToUse.get(0));
+ if ( vcInput == null ) {
+ return Collections.emptyList(); // no work to be done
+ } else if ( vcInput.isSNP() ) {
+ return Collections.singletonList(modelsToUse.get(SNP_MODEL));
+ } else if ( vcInput.isIndel() || vcInput.isMixed() ) {
+ return Collections.singletonList(modelsToUse.get(INDEL_MODEL));
+ } else {
+ return Collections.emptyList(); // No support for other types yet
}
- else if ( vcInput.isIndel() || vcInput.isMixed() ) {
- // use the INDEL model unless the user chose SNP mode only
- if ( modelsToUse.size() == 2 )
- GGAmodel.add(modelsToUse.get(1));
- else if ( modelsToUse.get(0).name().endsWith("INDEL") )
- GGAmodel.add(modelsToUse.get(0));
- }
- // No support for other types yet
-
- return GGAmodel;
}
/**
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/afcalc/DiploidExactAFCalc.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/afcalc/DiploidExactAFCalc.java
index 170b6e250..2ece18002 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/afcalc/DiploidExactAFCalc.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/afcalc/DiploidExactAFCalc.java
@@ -106,7 +106,7 @@ public abstract class DiploidExactAFCalc extends ExactAFCalc {
alleles.add(vc.getReference());
alleles.addAll(chooseMostLikelyAlternateAlleles(vc, getMaxAltAlleles()));
builder.alleles(alleles);
- builder.genotypes(GATKVariantContextUtils.subsetDiploidAlleles(vc, alleles, false));
+ builder.genotypes(GATKVariantContextUtils.subsetDiploidAlleles(vc, alleles, GATKVariantContextUtils.GenotypeAssignmentMethod.SET_TO_NO_CALL));
return builder.make();
} else {
return vc;
@@ -352,6 +352,9 @@ public abstract class DiploidExactAFCalc extends ExactAFCalc {
final List allelesToUse,
final boolean assignGenotypes,
final int ploidy) {
- return GATKVariantContextUtils.subsetDiploidAlleles(vc, allelesToUse, assignGenotypes);
+ return allelesToUse.size() == 1
+ ? GATKVariantContextUtils.subsetToRefOnly(vc, ploidy)
+ : GATKVariantContextUtils.subsetDiploidAlleles(vc, allelesToUse,
+ assignGenotypes ? GATKVariantContextUtils.GenotypeAssignmentMethod.USE_PLS_TO_ASSIGN : GATKVariantContextUtils.GenotypeAssignmentMethod.SET_TO_NO_CALL);
}
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/AssemblyResult.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/AssemblyResult.java
new file mode 100644
index 000000000..f07dbb392
--- /dev/null
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/AssemblyResult.java
@@ -0,0 +1,89 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
+
+import org.broadinstitute.sting.gatk.walkers.haplotypecaller.graphs.SeqGraph;
+
+/**
+ * Result of assembling, with the resulting graph and status
+ *
+ * User: depristo
+ * Date: 7/1/13
+ * Time: 5:35 PM
+ */
+public class AssemblyResult {
+ private final Status status;
+ private final SeqGraph graph;
+
+ /**
+ * Create a new assembly result
+ * @param status the status, cannot be null
+ * @param graph the resulting graph of the assembly, can only be null if result is failed
+ */
+ public AssemblyResult(final Status status, final SeqGraph graph) {
+ if ( status == null ) throw new IllegalArgumentException("status cannot be null");
+ if ( status != Status.FAILED && graph == null ) throw new IllegalArgumentException("graph is null but status is " + status);
+
+ this.status = status;
+ this.graph = graph;
+ }
+
+ public Status getStatus() { return status; }
+ public SeqGraph getGraph() { return graph; }
+
+ /**
+ * Status of the assembly result
+ */
+ public enum Status {
+ /** Something went wrong, and we couldn't produce a meaningful graph */
+ FAILED,
+ /** Assembly succeeded, but graph degenerated into just the reference sequence */
+ JUST_ASSEMBLED_REFERENCE,
+ /** Assembly succeeded, and the graph has some meaningful structure */
+ ASSEMBLED_SOME_VARIATION
+ }
+}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
index 48972dfd5..1a59cdb63 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
@@ -49,7 +49,6 @@ package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
import com.google.java.contract.Requires;
import org.apache.log4j.Logger;
import org.broadinstitute.sting.gatk.walkers.haplotypecaller.graphs.DeBruijnGraph;
-import org.broadinstitute.sting.gatk.walkers.haplotypecaller.graphs.SeqGraph;
import org.broadinstitute.sting.utils.MathUtils;
import org.broadinstitute.sting.utils.exceptions.UserException;
import org.broadinstitute.sting.utils.haplotype.Haplotype;
@@ -77,6 +76,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
private final static int NUM_PATHS_PER_GRAPH = 25;
private static final int KMER_OVERLAP = 5; // the additional size of a valid chunk of sequence, used to string together k-mers
private static final int GRAPH_KMER_STEP = 6;
+ private static final int GGA_MODE_ARTIFICIAL_COUNTS = 1000;
private final int minKmer;
private final int onlyBuildKmersOfThisSizeWhenDebuggingGraphAlgorithms;
@@ -92,8 +92,8 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
@Override
- protected List assemble(final List reads, final Haplotype refHaplotype) {
- final List graphs = new LinkedList();
+ protected List assemble(final List reads, final Haplotype refHaplotype, final List activeAlleleHaplotypes ) {
+ final List results = new LinkedList<>();
final int maxKmer = ReadUtils.getMaxReadLength(reads) - KMER_OVERLAP - 1;
if( maxKmer < minKmer) {
@@ -106,7 +106,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
continue;
if ( debug ) logger.info("Creating de Bruijn graph for " + kmer + " kmer using " + reads.size() + " reads");
- DeBruijnGraph graph = createGraphFromSequences( reads, kmer, refHaplotype);
+ DeBruijnGraph graph = createGraphFromSequences(reads, kmer, refHaplotype, activeAlleleHaplotypes);
if( graph != null ) { // graphs that fail during creation ( for example, because there are cycles in the reference graph ) will show up here as a null graph object
// do a series of steps to clean up the raw assembly graph to make it analysis-ready
if ( debugGraphTransformations ) graph.printGraph(new File("unpruned.dot"), pruneFactor);
@@ -117,23 +117,18 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
" future subsystem will actually go and error correct the reads");
}
- final SeqGraph seqGraph = cleanupSeqGraph(graph.convertToSequenceGraph());
+ results.add(cleanupSeqGraph(graph.convertToSequenceGraph()));
- if ( seqGraph != null ) { // if the graph contains interesting variation from the reference
- graphs.add(seqGraph);
-
- if ( debugGraphTransformations ) // we only want to use one graph size
- break;
- }
+ if ( debugGraphTransformations ) // we only want to use one graph size
+ break;
}
-
}
- return graphs;
+ return results;
}
@Requires({"reads != null", "kmerLength > 0", "refHaplotype != null"})
- protected DeBruijnGraph createGraphFromSequences( final List reads, final int kmerLength, final Haplotype refHaplotype ) {
+ protected DeBruijnGraph createGraphFromSequences( final List reads, final int kmerLength, final Haplotype refHaplotype, final List activeAlleleHaplotypes ) {
final DeBruijnGraph graph = new DeBruijnGraph(kmerLength);
final DeBruijnGraphBuilder builder = new DeBruijnGraphBuilder(graph);
@@ -142,6 +137,11 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
// something went wrong, so abort right now with a null graph
return null;
+ // add the artificial GGA haplotypes to the graph
+ if ( ! addGGAKmersToGraph(builder, activeAlleleHaplotypes) )
+ // something went wrong, so abort right now with a null graph
+ return null;
+
// now go through the graph already seeded with the reference sequence and add the read kmers to it
if ( ! addReadKmersToGraph(builder, reads) )
// some problem was detected adding the reads to the graph, return null to indicate we failed
@@ -151,6 +151,28 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
return graph;
}
+ /**
+ * Add the high-quality kmers from the artificial GGA haplotypes to the graph
+ *
+ * @param builder a debruijn graph builder to add the read kmers to
+ * @param activeAlleleHaplotypes a list of haplotypes to add to the graph for GGA mode
+ * @return true if we successfully added the read kmers to the graph without corrupting it in some way
+ */
+ protected boolean addGGAKmersToGraph(final DeBruijnGraphBuilder builder, final List activeAlleleHaplotypes) {
+
+ final int kmerLength = builder.getKmerSize();
+
+ for( final Haplotype haplotype : activeAlleleHaplotypes ) {
+ final int end = haplotype.length() - kmerLength;
+ for( int start = 0; start < end; start++ ) {
+ builder.addKmerPairFromSeqToGraph( haplotype.getBases(), start, GGA_MODE_ARTIFICIAL_COUNTS );
+ }
+ }
+
+ // always returns true now, but it's possible that we'd add kmers and decide we don't like the graph in some way
+ return true;
+ }
+
/**
* Add the high-quality kmers from the reads to the graph
*
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
index 419ea378f..04173b64f 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
@@ -49,6 +49,7 @@ package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
import com.google.java.contract.Ensures;
import com.google.java.contract.Requires;
import org.apache.log4j.Logger;
+import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.annotator.VariantAnnotatorEngine;
import org.broadinstitute.sting.gatk.walkers.genotyper.GenotypeLikelihoodsCalculationModel;
import org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyperEngine;
@@ -71,7 +72,7 @@ public class GenotypingEngine {
private final boolean DEBUG;
private final boolean USE_FILTERED_READ_MAP_FOR_ANNOTATIONS;
- private final static List noCall = new ArrayList(); // used to noCall all genotypes until the exact model is applied
+ private final static List noCall = new ArrayList<>(); // used to noCall all genotypes until the exact model is applied
private final VariantAnnotatorEngine annotationEngine;
private final MergeVariantsAcrossHaplotypes crossHaplotypeEventMerger;
@@ -146,6 +147,7 @@ public class GenotypingEngine {
final GenomeLoc refLoc,
final GenomeLoc activeRegionWindow,
final GenomeLocParser genomeLocParser,
+ final RefMetaDataTracker tracker,
final List activeAllelesToGenotype ) {
// sanity check input arguments
if (UG_engine == null) throw new IllegalArgumentException("UG_Engine input can't be null, got "+UG_engine);
@@ -162,8 +164,8 @@ public class GenotypingEngine {
final TreeSet startPosKeySet = decomposeHaplotypesIntoVariantContexts(haplotypes, haplotypeReadMap, ref, refLoc, activeAllelesToGenotype);
// Walk along each position in the key set and create each event to be outputted
- final Set calledHaplotypes = new HashSet();
- final List returnCalls = new ArrayList();
+ final Set calledHaplotypes = new HashSet<>();
+ final List returnCalls = new ArrayList<>();
for( final int loc : startPosKeySet ) {
if( loc >= activeRegionWindow.getStart() && loc <= activeRegionWindow.getStop() ) { // genotyping an event inside this active region
final List eventsAtThisLoc = getVCsAtThisLocation(haplotypes, loc, activeAllelesToGenotype);
@@ -183,7 +185,7 @@ public class GenotypingEngine {
if( eventsAtThisLoc.size() != mergedVC.getAlternateAlleles().size() ) {
throw new ReviewedStingException("Record size mismatch! Something went wrong in the merging of alleles.");
}
- final Map mergeMap = new LinkedHashMap();
+ final Map mergeMap = new LinkedHashMap<>();
mergeMap.put(null, mergedVC.getReference()); // the reference event (null) --> the reference allele
for(int iii = 0; iii < mergedVC.getAlternateAlleles().size(); iii++) {
mergeMap.put(eventsAtThisLoc.get(iii), mergedVC.getAlternateAllele(iii)); // BUGBUG: This is assuming that the order of alleles is the same as the priority list given to simpleMerge function
@@ -204,13 +206,12 @@ public class GenotypingEngine {
convertHaplotypeReadMapToAlleleReadMap( haplotypeReadMap, alleleMapper, 0.0 ) );
final Map stratifiedReadMap = filterToOnlyOverlappingReads( genomeLocParser, alleleReadMap_annotations, perSampleFilteredReadList, call );
- VariantContext annotatedCall = call;
- if( annotatedCall.getAlleles().size() != mergedVC.getAlleles().size() ) { // some alleles were removed so reverseTrimming might be necessary!
+ VariantContext annotatedCall = annotationEngine.annotateContextForActiveRegion(tracker, stratifiedReadMap, call);
+
+ if( call.getAlleles().size() != mergedVC.getAlleles().size() ) { // some alleles were removed so reverseTrimming might be necessary!
annotatedCall = GATKVariantContextUtils.reverseTrimAlleles(annotatedCall);
}
- annotatedCall = annotationEngine.annotateContext(stratifiedReadMap, annotatedCall);
-
// maintain the set of all called haplotypes
for ( final Allele calledAllele : call.getAlleles() )
calledHaplotypes.addAll(alleleMapper.get(calledAllele));
@@ -244,7 +245,7 @@ public class GenotypingEngine {
if ( in_GGA_mode ) startPosKeySet.clear();
- cleanUpSymbolicUnassembledEvents( haplotypes );
+ //cleanUpSymbolicUnassembledEvents( haplotypes ); // We don't make symbolic alleles so this isn't needed currently
if ( !in_GGA_mode ) {
// run the event merger if we're not in GGA mode
final boolean mergedAnything = crossHaplotypeEventMerger.merge(haplotypes, haplotypeReadMap, startPosKeySet, ref, refLoc);
@@ -267,7 +268,7 @@ public class GenotypingEngine {
* @return the list of the sources of vcs in the same order
*/
private List makePriorityList(final List vcs) {
- final List priorityList = new LinkedList();
+ final List priorityList = new LinkedList<>();
for ( final VariantContext vc : vcs ) priorityList.add(vc.getSource());
return priorityList;
}
@@ -276,7 +277,7 @@ public class GenotypingEngine {
final int loc,
final List activeAllelesToGenotype) {
// the overlapping events to merge into a common reference view
- final List eventsAtThisLoc = new ArrayList();
+ final List eventsAtThisLoc = new ArrayList<>();
if( activeAllelesToGenotype.isEmpty() ) {
for( final Haplotype h : haplotypes ) {
@@ -292,7 +293,7 @@ public class GenotypingEngine {
if( compVC.getStart() == loc ) {
int alleleCount = 0;
for( final Allele compAltAllele : compVC.getAlternateAlleles() ) {
- List alleleSet = new ArrayList(2);
+ List alleleSet = new ArrayList<>(2);
alleleSet.add(compVC.getReference());
alleleSet.add(compAltAllele);
final String vcSourceName = "Comp" + compCount + "Allele" + alleleCount;
@@ -348,7 +349,7 @@ public class GenotypingEngine {
final Map> perSampleFilteredReadList,
final VariantContext call ) {
- final Map returnMap = new LinkedHashMap();
+ final Map returnMap = new LinkedHashMap<>();
final GenomeLoc callLoc = parser.createGenomeLoc(call);
for( final Map.Entry sample : perSampleReadMap.entrySet() ) {
final PerReadAlleleLikelihoodMap likelihoodMap = new PerReadAlleleLikelihoodMap();
@@ -384,7 +385,7 @@ public class GenotypingEngine {
// TODO - split into input haplotypes and output haplotypes as not to share I/O arguments
@Requires("haplotypes != null")
protected static void cleanUpSymbolicUnassembledEvents( final List haplotypes ) {
- final List haplotypesToRemove = new ArrayList();
+ final List haplotypesToRemove = new ArrayList<>();
for( final Haplotype h : haplotypes ) {
for( final VariantContext vc : h.getEventMap().getVariantContexts() ) {
if( vc.isSymbolic() ) {
@@ -407,7 +408,7 @@ public class GenotypingEngine {
final Map> alleleMapper,
final double downsamplingFraction ) {
- final Map alleleReadMap = new LinkedHashMap();
+ final Map alleleReadMap = new LinkedHashMap<>();
for( final Map.Entry haplotypeReadMapEntry : haplotypeReadMap.entrySet() ) { // for each sample
final PerReadAlleleLikelihoodMap perReadAlleleLikelihoodMap = new PerReadAlleleLikelihoodMap();
for( final Map.Entry> alleleMapperEntry : alleleMapper.entrySet() ) { // for each output allele
@@ -430,7 +431,7 @@ public class GenotypingEngine {
}
protected static Map> createAlleleMapper( final Map mergeMap, final Map> eventMap ) {
- final Map> alleleMapper = new LinkedHashMap>();
+ final Map> alleleMapper = new LinkedHashMap<>();
for( final Map.Entry