From e2d41f02820f0987764b5bfbe6e3d331b679c43a Mon Sep 17 00:00:00 2001
From: Mauricio Carneiro
Date: Tue, 5 Mar 2013 17:25:52 -0500
Subject: [PATCH 001/211] Turning @Output required to false
By default, all output is written to stdout if -o is not provided. Technically this makes @Output an optional parameter, and the documentation is misleading because it is generated from the annotation.
GSA-820 #resolve
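
For illustration only, a minimal sketch of why the annotation default matters: any field that uses @Output without setting required explicitly picks up the default, and anything that reads the annotation (such as a documentation generator) sees that value. The annotation and the ExampleTool class below are hypothetical stand-ins, not the real org.broadinstitute.sting.commandline.Output.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class OutputDefaultSketch {

    /** Hypothetical stand-in for @Output; only the required attribute is modeled. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Output {
        boolean required() default false;
    }

    /** Hypothetical tool whose output field does not set required explicitly. */
    static class ExampleTool {
        @Output
        Object out;
    }

    /** Reads the effective value of required() for the named field via reflection. */
    static boolean isOutputRequired(Class<?> tool, String fieldName) {
        try {
            Field field = tool.getDeclaredField(fieldName);
            return field.getAnnotation(Output.class).required();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No such field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        // The field inherits the annotation default, so this reports that -o is optional.
        System.out.println(isOutputRequired(ExampleTool.class, "out"));
    }
}
```

A tool that genuinely needs an explicit output file can still opt in with required=true on its own field.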
---
.../java/src/org/broadinstitute/sting/commandline/Output.java | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/public/java/src/org/broadinstitute/sting/commandline/Output.java b/public/java/src/org/broadinstitute/sting/commandline/Output.java
index 6c2b143c4..47a47602a 100644
--- a/public/java/src/org/broadinstitute/sting/commandline/Output.java
+++ b/public/java/src/org/broadinstitute/sting/commandline/Output.java
@@ -64,7 +64,7 @@ public @interface Output {
* fail if the type can't be populated.
* @return True if the argument is required. False otherwise.
*/
- boolean required() default true;
+ boolean required() default false;
/**
* Should this command-line argument be exclusive of others. Should be
From 78721ee09b14730c9cd054daea4d8563592330b3 Mon Sep 17 00:00:00 2001
From: Eric Banks
Date: Mon, 4 Mar 2013 14:13:42 -0500
Subject: [PATCH 002/211] Added new walker to split MNPs into their allelic
primitives (SNPs).
* Can be extended to complex alleles at some point.
* Currently only works for bi-allelics (documented).
* Added unit and integration tests.
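
A rough sketch of the splitting idea, assuming plain strings instead of VariantContext and Allele objects (the class name, method name, and the "position:ref>alt" rendering are illustrative choices, not part of the GATK API): walk the ref and alt alleles in parallel and emit one SNP for each position where the bases differ.

```java
import java.util.ArrayList;
import java.util.List;

public class MnpSplitSketch {

    /**
     * Splits a bi-allelic MNP into one SNP per mismatching base.
     * Each SNP is rendered as "position:ref>alt" purely for illustration.
     */
    static List<String> splitMnp(int start, String ref, String alt) {
        if (ref.length() != alt.length())
            throw new IllegalArgumentException("ref and alt alleles for an MNP must have the same length");

        List<String> snps = new ArrayList<>();
        for (int i = 0; i < ref.length(); i++) {
            // positions where ref and alt agree produce no record
            if (ref.charAt(i) != alt.charAt(i))
                snps.add((start + i) + ":" + ref.charAt(i) + ">" + alt.charAt(i));
        }
        return snps;
    }

    public static void main(String[] args) {
        // ACCCA -> TCCCG starting at position 100 differs at the first and last base only
        System.out.println(splitMnp(100, "ACCCA", "TCCCG")); // prints [100:A>T, 104:A>G]
    }
}
```

The real walker additionally remaps each genotype's alleles onto the new SNP alleles, which this sketch leaves out.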
---
...ntsToAllelicPrimitivesIntegrationTest.java | 67 +++++++++
.../VariantsToAllelicPrimitives.java | 140 ++++++++++++++++++
.../variant/GATKVariantContextUtils.java | 83 +++++++++--
.../GATKVariantContextUtilsUnitTest.java | 117 +++++++++++++++
4 files changed, 394 insertions(+), 13 deletions(-)
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitivesIntegrationTest.java
create mode 100644 public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitives.java
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitivesIntegrationTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitivesIntegrationTest.java
new file mode 100644
index 000000000..7b1b9b7d2
--- /dev/null
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitivesIntegrationTest.java
@@ -0,0 +1,67 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.variantutils;
+
+import org.broadinstitute.sting.WalkerTest;
+import org.testng.annotations.Test;
+
+import java.util.Arrays;
+
+/**
+ * Tests VariantsToAllelicPrimitives
+ */
+public class VariantsToAllelicPrimitivesIntegrationTest extends WalkerTest {
+
+ @Test
+ public void testMNPsToSNPs() {
+ WalkerTestSpec spec = new WalkerTestSpec(
+ "-T VariantsToAllelicPrimitives -o %s -R " + b37KGReference + " -V " + privateTestDir + "vcfWithMNPs.vcf --no_cmdline_in_header",
+ 1,
+ Arrays.asList("c5333d2e352312bdb7c5182ca3009594"));
+ executeTest("test MNPs To SNPs", spec);
+ }
+}
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitives.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitives.java
new file mode 100644
index 000000000..319183f28
--- /dev/null
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToAllelicPrimitives.java
@@ -0,0 +1,140 @@
+/*
+* Copyright (c) 2012 The Broad Institute
+*
+* Permission is hereby granted, free of charge, to any person
+* obtaining a copy of this software and associated documentation
+* files (the "Software"), to deal in the Software without
+* restriction, including without limitation the rights to use,
+* copy, modify, merge, publish, distribute, sublicense, and/or sell
+* copies of the Software, and to permit persons to whom the
+* Software is furnished to do so, subject to the following
+* conditions:
+*
+* The above copyright notice and this permission notice shall be
+* included in all copies or substantial portions of the Software.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
+* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.variantutils;
+
+import com.google.java.contract.Requires;
+import org.broadinstitute.sting.commandline.ArgumentCollection;
+import org.broadinstitute.sting.commandline.Output;
+import org.broadinstitute.sting.gatk.CommandLineGATK;
+import org.broadinstitute.sting.gatk.arguments.StandardVariantContextInputArgumentCollection;
+import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
+import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
+import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
+import org.broadinstitute.sting.gatk.walkers.RodWalker;
+import org.broadinstitute.sting.utils.SampleUtils;
+import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
+import org.broadinstitute.sting.utils.help.HelpConstants;
+import org.broadinstitute.sting.utils.variant.GATKVCFUtils;
+import org.broadinstitute.sting.utils.variant.GATKVariantContextUtils;
+import org.broadinstitute.variant.variantcontext.*;
+import org.broadinstitute.variant.variantcontext.writer.VariantContextWriter;
+import org.broadinstitute.variant.variantcontext.writer.VariantContextWriterFactory;
+import org.broadinstitute.variant.vcf.VCFHeader;
+import org.broadinstitute.variant.vcf.VCFHeaderLine;
+
+import java.util.*;
+
+/**
+ * Takes alleles from a variants file and breaks them up (if possible) into more basic/primitive alleles.
+ *
+ *
+ * For now this tool modifies only multi-nucleotide polymorphisms (MNPs) and leaves SNPs, indels, and complex substitutions as is,
+ * although one day it may be extended to handle the complex substitution case.
+ *
+ * This tool will take an MNP (e.g. ACCCA -> TCCCG) and break it up into separate records for each component part (A->T and A->G).
+ *
+ * Note that this tool modifies only bi-allelic variants.
+ *
+ * Input
+ *
+ * A variant set with any type of alleles.
+ *
+ *
+ * Output
+ *
+ * A VCF with alleles broken into primitive types.
+ *
+ *
+ */
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARMANIP, extraDocs = {CommandLineGATK.class} )
+public class VariantsToAllelicPrimitives extends RodWalker {
+
+ @ArgumentCollection
+ protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
+
+ @Output(doc="File to which variants should be written",required=true)
+ protected VariantContextWriter baseWriter = null;
+
+ private VariantContextWriter vcfWriter;
+
+ public void initialize() {
+ final String trackName = variantCollection.variants.getName();
+ final Set samples = SampleUtils.getSampleListWithVCFHeader(getToolkit(), Arrays.asList(trackName));
+
+ final Map vcfHeaders = GATKVCFUtils.getVCFHeadersFromRods(getToolkit(), Arrays.asList(trackName));
+ final Set headerLines = vcfHeaders.get(trackName).getMetaDataInSortedOrder();
+
+ baseWriter.writeHeader(new VCFHeader(headerLines, samples));
+
+ vcfWriter = VariantContextWriterFactory.sortOnTheFly(baseWriter, 200);
+ }
+
+ public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
+ if ( tracker == null )
+ return 0;
+
+ final Collection VCs = tracker.getValues(variantCollection.variants, context.getLocation());
+
+ int changedSites = 0;
+ for ( final VariantContext vc : VCs )
+ changedSites += writeVariants(vc);
+
+ return changedSites;
+ }
+
+ public Integer reduceInit() { return 0; }
+
+ public Integer reduce(Integer value, Integer sum) {
+ return sum + value;
+ }
+
+ public void onTraversalDone(Integer result) {
+ System.out.println(result + " MNPs were broken up into primitives");
+ vcfWriter.close();
+ }
+
+ @Requires("vc != null")
+ private int writeVariants(final VariantContext vc) {
+ // for now, we modify only bi-allelic MNPs; update docs above if this changes
+ if ( vc.isBiallelic() && vc.isMNP() ) {
+ for ( final VariantContext splitVC : GATKVariantContextUtils.splitIntoPrimitiveAlleles(vc) )
+ vcfWriter.add(splitVC);
+ return 1;
+ } else {
+ vcfWriter.add(vc);
+ return 0;
+ }
+ }
+}
diff --git a/public/java/src/org/broadinstitute/sting/utils/variant/GATKVariantContextUtils.java b/public/java/src/org/broadinstitute/sting/utils/variant/GATKVariantContextUtils.java
index 37bd798cf..398b32669 100644
--- a/public/java/src/org/broadinstitute/sting/utils/variant/GATKVariantContextUtils.java
+++ b/public/java/src/org/broadinstitute/sting/utils/variant/GATKVariantContextUtils.java
@@ -989,7 +989,6 @@ public class GATKVariantContextUtils {
return inputVC;
final List alleles = new LinkedList();
- final GenotypesContext genotypes = GenotypesContext.create();
final Map originalToTrimmedAlleleMap = new HashMap();
for (final Allele a : inputVC.getAlleles()) {
@@ -1006,17 +1005,8 @@ public class GATKVariantContextUtils {
}
// now we can recreate new genotypes with trimmed alleles
- for ( final Genotype genotype : inputVC.getGenotypes() ) {
- final List originalAlleles = genotype.getAlleles();
- final List trimmedAlleles = new ArrayList();
- for ( final Allele a : originalAlleles ) {
- if ( a.isCalled() )
- trimmedAlleles.add(originalToTrimmedAlleleMap.get(a));
- else
- trimmedAlleles.add(Allele.NO_CALL);
- }
- genotypes.add(new GenotypeBuilder(genotype).alleles(trimmedAlleles).make());
- }
+ final AlleleMapper alleleMapper = new AlleleMapper(originalToTrimmedAlleleMap);
+ final GenotypesContext genotypes = updateGenotypesWithMappedAlleles(inputVC.getGenotypes(), alleleMapper);
final int start = inputVC.getStart() + (fwdTrimEnd + 1);
final VariantContextBuilder builder = new VariantContextBuilder(inputVC);
@@ -1027,6 +1017,18 @@ public class GATKVariantContextUtils {
return builder.make();
}
+ @Requires("originalGenotypes != null && alleleMapper != null")
+ protected static GenotypesContext updateGenotypesWithMappedAlleles(final GenotypesContext originalGenotypes, final AlleleMapper alleleMapper) {
+ final GenotypesContext updatedGenotypes = GenotypesContext.create();
+
+ for ( final Genotype genotype : originalGenotypes ) {
+ final List updatedAlleles = alleleMapper.remap(genotype.getAlleles());
+ updatedGenotypes.add(new GenotypeBuilder(genotype).alleles(updatedAlleles).make());
+ }
+
+ return updatedGenotypes;
+ }
+
public static int computeReverseClipping(final List unclippedAlleles, final byte[] ref) {
int clipping = 0;
boolean stillClipping = true;
@@ -1263,7 +1265,7 @@ public class GATKVariantContextUtils {
}
- private static class AlleleMapper {
+ protected static class AlleleMapper {
private VariantContext vc = null;
private Map map = null;
public AlleleMapper(VariantContext vc) { this.vc = vc; }
@@ -1323,4 +1325,59 @@ public class GATKVariantContextUtils {
}
return new VariantContextBuilder(name, contig, start, start+length-1, alleles).make();
}
+
+ /**
+ * Splits the alleles for the provided variant context into its primitive parts.
+ * Requires that the input VC be bi-allelic, so calling methods should first call splitVariantContextToBiallelics() if needed.
+ * Currently works only for MNPs.
+ *
+ * @param vc the non-null VC to split
+ * @return a non-empty list of VCs split into primitive parts or the original VC otherwise
+ */
+ public static List splitIntoPrimitiveAlleles(final VariantContext vc) {
+ if ( vc == null )
+ throw new IllegalArgumentException("Trying to break a null Variant Context into primitive parts");
+
+ if ( !vc.isBiallelic() )
+ throw new IllegalArgumentException("Trying to break a multi-allelic Variant Context into primitive parts");
+
+ // currently only works for MNPs
+ if ( !vc.isMNP() )
+ return Arrays.asList(vc);
+
+ final byte[] ref = vc.getReference().getBases();
+ final byte[] alt = vc.getAlternateAllele(0).getBases();
+
+ if ( ref.length != alt.length )
+ throw new IllegalStateException("ref and alt alleles for MNP have different lengths");
+
+ final List result = new ArrayList(ref.length);
+
+ for ( int i = 0; i < ref.length; i++ ) {
+
+ // if the ref and alt bases are different at a given position, create a new SNP record (otherwise do nothing)
+ if ( ref[i] != alt[i] ) {
+
+ // create the ref and alt SNP alleles
+ final Allele newRefAllele = Allele.create(ref[i], true);
+ final Allele newAltAllele = Allele.create(alt[i], false);
+
+ // create a new VariantContext with the new SNP alleles
+ final VariantContextBuilder newVC = new VariantContextBuilder(vc).start(vc.getStart() + i).stop(vc.getStart() + i).alleles(Arrays.asList(newRefAllele, newAltAllele));
+
+ // create new genotypes with updated alleles
+ final Map alleleMap = new HashMap();
+ alleleMap.put(vc.getReference(), newRefAllele);
+ alleleMap.put(vc.getAlternateAllele(0), newAltAllele);
+ final GenotypesContext newGenotypes = updateGenotypesWithMappedAlleles(vc.getGenotypes(), new AlleleMapper(alleleMap));
+
+ result.add(newVC.genotypes(newGenotypes).make());
+ }
+ }
+
+ if ( result.isEmpty() )
+ result.add(vc);
+
+ return result;
+ }
}
diff --git a/public/java/test/org/broadinstitute/sting/utils/variant/GATKVariantContextUtilsUnitTest.java b/public/java/test/org/broadinstitute/sting/utils/variant/GATKVariantContextUtilsUnitTest.java
index 2a15d709a..ff42abb23 100644
--- a/public/java/test/org/broadinstitute/sting/utils/variant/GATKVariantContextUtilsUnitTest.java
+++ b/public/java/test/org/broadinstitute/sting/utils/variant/GATKVariantContextUtilsUnitTest.java
@@ -26,6 +26,8 @@
package org.broadinstitute.sting.utils.variant;
import org.broadinstitute.sting.BaseTest;
+import org.broadinstitute.sting.gatk.GenomeAnalysisEngine;
+import org.broadinstitute.sting.utils.BaseUtils;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.collections.Pair;
import org.broadinstitute.variant.variantcontext.*;
@@ -976,4 +978,119 @@ public class GATKVariantContextUtilsUnitTest extends BaseTest {
Assert.assertEquals(trimmed.getBaseString(), expected.get(i));
}
}
+
+ // --------------------------------------------------------------------------------
+ //
+ // test primitive allele splitting
+ //
+ // --------------------------------------------------------------------------------
+
+ @DataProvider(name = "PrimitiveAlleleSplittingData")
+ public Object[][] makePrimitiveAlleleSplittingData() {
+ List
*
- * Examples
+ * Examples
*
* java -Xmx4g -jar GenomeAnalysisTK.jar \
* -T BaseRecalibrator \
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java
index 5ab296a5f..ee2edee5a 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java
@@ -146,38 +146,38 @@ public class RecalibrationArgumentCollection {
public RecalUtils.SOLID_NOCALL_STRATEGY SOLID_NOCALL_STRATEGY = RecalUtils.SOLID_NOCALL_STRATEGY.THROW_EXCEPTION;
/**
- * The context covariate will use a context of this size to calculate it's covariate value for base mismatches
+ * The context covariate will use a context of this size to calculate its covariate value for base mismatches. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required Java heap size.
*/
- @Argument(fullName = "mismatches_context_size", shortName = "mcs", doc = "size of the k-mer context to be used for base mismatches", required = false)
+ @Argument(fullName = "mismatches_context_size", shortName = "mcs", doc = "Size of the k-mer context to be used for base mismatches", required = false)
public int MISMATCHES_CONTEXT_SIZE = 2;
/**
- * The context covariate will use a context of this size to calculate it's covariate value for base insertions and deletions
+ * The context covariate will use a context of this size to calculate its covariate value for base insertions and deletions. Must be between 1 and 13 (inclusive). Note that higher values will increase runtime and required Java heap size.
*/
- @Argument(fullName = "indels_context_size", shortName = "ics", doc = "size of the k-mer context to be used for base insertions and deletions", required = false)
+ @Argument(fullName = "indels_context_size", shortName = "ics", doc = "Size of the k-mer context to be used for base insertions and deletions", required = false)
public int INDELS_CONTEXT_SIZE = 3;
/**
* The cycle covariate will generate an error if it encounters a cycle greater than this value.
* This argument is ignored if the Cycle covariate is not used.
*/
- @Argument(fullName = "maximum_cycle_value", shortName = "maxCycle", doc = "the maximum cycle value permitted for the Cycle covariate", required = false)
+ @Argument(fullName = "maximum_cycle_value", shortName = "maxCycle", doc = "The maximum cycle value permitted for the Cycle covariate", required = false)
public int MAXIMUM_CYCLE_VALUE = 500;
/**
- * A default base qualities to use as a prior (reported quality) in the mismatch covariate model. This value will replace all base qualities in the read for this default value. Negative value turns it off (default is off)
+ * A default base quality to use as a prior (reported quality) in the mismatch covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is off]
*/
@Argument(fullName = "mismatches_default_quality", shortName = "mdq", doc = "default quality for the base mismatches covariate", required = false)
public byte MISMATCHES_DEFAULT_QUALITY = -1;
/**
- * A default base qualities to use as a prior (reported quality) in the insertion covariate model. This parameter is used for all reads without insertion quality scores for each base. (default is on)
+ * A default base quality to use as a prior (reported quality) in the insertion covariate model. This parameter is used for all reads without insertion quality scores for each base. [default is on]
*/
@Argument(fullName = "insertions_default_quality", shortName = "idq", doc = "default quality for the base insertions covariate", required = false)
public byte INSERTIONS_DEFAULT_QUALITY = 45;
/**
- * A default base qualities to use as a prior (reported quality) in the mismatch covariate model. This value will replace all base qualities in the read for this default value. Negative value turns it off (default is off)
+ * A default base quality to use as a prior (reported quality) in the deletion covariate model. This value will replace all base qualities in the read with this default value. A negative value turns it off. [default is on]
*/
@Argument(fullName = "deletions_default_quality", shortName = "ddq", doc = "default quality for the base deletions covariate", required = false)
public byte DELETIONS_DEFAULT_QUALITY = 45;
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/CompareBAM.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/CompareBAM.java
index a8a765ddc..36da92b4f 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/CompareBAM.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/CompareBAM.java
@@ -69,15 +69,15 @@ import java.util.Map;
*
* This is a test walker used for asserting that the ReduceReads procedure is not making blatant mistakes when compressing bam files.
*
- * Input
+ * Input
*
* Two BAM files (using -I) with different read group IDs
*
- * Output
+ * Output
*
* [Output description]
*
- * Examples
+ * Examples
*
* java
* -jar GenomeAnalysisTK.jar
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java
index e89158412..c2c154053 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java
@@ -86,17 +86,17 @@ import org.broadinstitute.sting.utils.sam.ReadUtils;
* shown to reduce a typical whole exome BAM file 100x. The higher the coverage, the bigger the
* savings in file size and performance of the downstream tools.
*
- *
* A modified VCF detailing each interval by sample
*
*
- * Examples
+ * Examples
*
* java
* -jar GenomeAnalysisTK.jar
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java
index b1a26b7a2..6b4d1f7a8 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java
@@ -63,6 +63,31 @@ import org.broadinstitute.sting.utils.help.HelpConstants;
import java.io.PrintStream;
+/**
+ * Outputs a list of intervals that are covered above a given threshold.
+ *
+ * The list can be used as an interval list for other walkers. Note that if the -uncovered argument is given, the tool will instead output intervals that fail the coverage threshold.
The system is under active and continuous development. All outputs, the underlying likelihood model, arguments, and
* file formats are likely to change.
@@ -167,7 +167,7 @@ public class UnifiedGenotyper extends LocusWalker, Unif
* Records that are filtered in the comp track will be ignored.
* Note that 'dbSNP' has been special-cased (see the --dbsnp argument).
*/
- @Input(fullName="comp", shortName = "comp", doc="comparison VCF file", required=false)
+ @Input(fullName="comp", shortName = "comp", doc="Comparison VCF file", required=false)
public List> comps = Collections.emptyList();
public List> getCompRodBindings() { return comps; }
@@ -205,7 +205,8 @@ public class UnifiedGenotyper extends LocusWalker, Unif
protected List annotationsToExclude = new ArrayList();
/**
- * Which groups of annotations to add to the output VCF file. See the VariantAnnotator -list argument to view available groups.
+ * If specified, all available annotations in the group will be applied. See the VariantAnnotator -list argument to view available groups.
+ * Keep in mind that RODRequiringAnnotations are not intended to be used as a group, because they require specific ROD inputs.
*/
@Argument(fullName="group", shortName="G", doc="One or more classes/groups of annotations to apply to variant calls", required=false)
protected String[] annotationClassesToUse = { "Standard" };
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
index 003b8197f..7948b93a9 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
@@ -96,17 +96,17 @@ import java.util.*;
/**
* Call SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. Haplotypes are evaluated using an affine gap penalty Pair HMM.
*
- *
Input
+ *
Input
*
* Input bam file(s) from which to make calls
*
*
- *
Output
+ *
Output
*
* VCF file with raw, unrecalibrated SNP and indel calls.
*
The system is under active and continuous development. All outputs, the underlying likelihood model, and command line arguments are likely to change often.
*
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java
index c7cc84b9c..4de9488e9 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java
@@ -84,17 +84,17 @@ import java.util.*;
* From that, it can resolve potential differences in variant calls that are inherently the same (or similar) variants.
* Records are annotated with the set and status attributes.
*
- *
Input
+ *
Input
*
* 2 variant files to resolve.
*
*
- *
Output
+ *
Output
*
* A single consensus VCF.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx1g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java
index c7d24f475..d3a13df29 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java
@@ -87,7 +87,7 @@ import java.io.IOException;
import java.util.*;
/**
- * Performs local realignment of reads based on misalignments due to the presence of indels.
+ * Performs local realignment of reads to correct misalignments due to the presence of indels.
*
*
* The local realignment tool is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases
@@ -100,39 +100,46 @@ import java.util.*;
* indel suitable for standard variant discovery approaches. Unlike most mappers, this walker uses the full alignment context to determine whether an
* appropriate alternate reference (i.e. indel) exists. Following local realignment, the GATK tool Unified Genotyper can be used to sensitively and
* specifically identify indels.
- *
+ *
* There are 2 steps to the realignment process:
*
Determining (small) suspicious intervals which are likely in need of realignment (see the RealignerTargetCreator tool)
*
Running the realigner over those intervals (IndelRealigner)
*
- *
- * An important note: the input bam(s), reference, and known indel file(s) should be the same ones used for the RealignerTargetCreator step.
*
- * Another important note: because reads produced from the 454 technology inherently contain false indels, the realigner will not currently work with them
- * (or with reads from similar technologies).
+ * For more details, see http://www.broadinstitute.org/gatk/guide/article?id=38
+ *
*
- *
Input
+ *
Input
*
* One or more aligned BAM files and optionally one or more lists of known indels.
*
*
- *
Output
+ *
Output
*
* A realigned version of your input BAM file(s).
*
*
- *
Examples
+ *
Example
*
* java -Xmx4g -jar GenomeAnalysisTK.jar \
- * -I input.bam \
- * -R ref.fasta \
* -T IndelRealigner \
+ * -R ref.fasta \
+ * -I input.bam \
* -targetIntervals intervalListFromRTC.intervals \
* -o realignedBam.bam \
* [-known /path/to/indels.vcf] \
* [-compress 0] (this argument recommended to speed up the process *if* this is only a temporary file; otherwise, use the default value)
*
*
+ *
Caveats
+ *
+ *
+ * An important note: the input bam(s), reference, and known indel file(s) should be the same ones used for the RealignerTargetCreator step.
+ *
+ * Another important note: because reads produced from the 454 technology inherently contain false indels, the realigner will not currently work with them
+ * (or with reads from similar technologies).
+ *
+ *
* @author ebanks
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_DATA, extraDocs = {CommandLineGATK.class} )
@@ -168,7 +175,7 @@ public class IndelRealigner extends ReadWalker {
/**
* The interval list output from the RealignerTargetCreator tool using the same bam(s), reference, and known indel file(s).
*/
- @Input(fullName="targetIntervals", shortName="targetIntervals", doc="intervals file output from RealignerTargetCreator", required=true)
+ @Input(fullName="targetIntervals", shortName="targetIntervals", doc="Intervals file output from RealignerTargetCreator", required=true)
protected IntervalBinding intervalsFile = null;
/**
@@ -203,7 +210,7 @@ public class IndelRealigner extends ReadWalker {
* push the mismatch column to another position). This parameter is just a heuristic and should be adjusted based on your particular data set.
*/
@Advanced
- @Argument(fullName="entropyThreshold", shortName="entropy", doc="percentage of mismatches at a locus to be considered having high entropy", required=false)
+ @Argument(fullName="entropyThreshold", shortName="entropy", doc="Percentage of mismatches at a locus to be considered having high entropy (0.0 < entropy <= 1.0)", required=false)
protected double MISMATCH_THRESHOLD = 0.15;
/**
@@ -225,21 +232,21 @@ public class IndelRealigner extends ReadWalker {
* For expert users only!
*/
@Advanced
- @Argument(fullName="maxPositionalMoveAllowed", shortName="maxPosMove", doc="maximum positional move in basepairs that a read can be adjusted during realignment", required=false)
+ @Argument(fullName="maxPositionalMoveAllowed", shortName="maxPosMove", doc="Maximum positional move in basepairs that a read can be adjusted during realignment", required=false)
protected int MAX_POS_MOVE_ALLOWED = 200;
/**
* For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.
*/
@Advanced
- @Argument(fullName="maxConsensuses", shortName="maxConsensuses", doc="max alternate consensuses to try (necessary to improve performance in deep coverage)", required=false)
+ @Argument(fullName="maxConsensuses", shortName="maxConsensuses", doc="Max alternate consensuses to try (necessary to improve performance in deep coverage)", required=false)
protected int MAX_CONSENSUSES = 30;
/**
* For expert users only! If you need to find the optimal solution regardless of running time, use a higher number.
*/
@Advanced
- @Argument(fullName="maxReadsForConsensuses", shortName="greedy", doc="max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)", required=false)
+ @Argument(fullName="maxReadsForConsensuses", shortName="greedy", doc="Max reads used for finding the alternate consensuses (necessary to improve performance in deep coverage)", required=false)
protected int MAX_READS_FOR_CONSENSUSES = 120;
/**
@@ -247,7 +254,7 @@ public class IndelRealigner extends ReadWalker {
* If you need to allow more reads (e.g. with very deep coverage) regardless of memory, use a higher number.
*/
@Advanced
- @Argument(fullName="maxReadsForRealignment", shortName="maxReads", doc="max reads allowed at an interval for realignment", required=false)
+ @Argument(fullName="maxReadsForRealignment", shortName="maxReads", doc="Max reads allowed at an interval for realignment", required=false)
protected int MAX_READS = 20000;
@Advanced
@@ -263,7 +270,7 @@ public class IndelRealigner extends ReadWalker {
*
* Note that some GATK arguments do NOT work in conjunction with nWayOut (e.g. --disable_bam_indexing).
*/
- @Argument(fullName="nWayOut", shortName="nWayOut", required=false, doc="Generate one output file for each input (-I) bam file")
+ @Argument(fullName="nWayOut", shortName="nWayOut", required=false, doc="Generate one output file for each input (-I) bam file (not compatible with -output)")
protected String N_WAY_OUT = null;
@Hidden
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/LeftAlignIndels.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/LeftAlignIndels.java
index ff21893f1..532d13690 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/LeftAlignIndels.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/LeftAlignIndels.java
@@ -68,17 +68,17 @@ import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
* placed at multiple positions and still represent the same haplotype. While a standard convention is to place an
* indel at the left-most position, this doesn't always happen, so this tool can be used to left-align them.
*
- *
Input
+ *
Input
*
* A bam file to left-align.
*
*
- *
Output
+ *
Output
*
* A left-aligned bam.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx3g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java
index 1ee04e317..caeb1e8d7 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/RealignerTargetCreator.java
@@ -99,22 +99,22 @@ import java.util.TreeSet;
* Important note 3: because reads produced from the 454 technology inherently contain false indels, the realigner will not currently work with them
* (or with reads from similar technologies). This tool also ignores MQ0 reads and reads with consecutive indel operators in the CIGAR string.
*
- *
Input
+ *
Input
*
* One or more aligned BAM files and optionally one or more lists of known indels.
*
*
- *
Output
+ *
Output
*
* A list of target intervals to pass to the Indel Realigner.
*
@@ -143,7 +143,7 @@ public class RealignerTargetCreator extends RodWalker> known = Collections.emptyList();
/**
- * Any two SNP calls and/or high entropy positions are considered clustered when they occur no more than this many basepairs apart.
+ * Any two SNP calls and/or high entropy positions are considered clustered when they occur no more than this many basepairs apart. Must be > 1.
*/
@Argument(fullName="windowSize", shortName="window", doc="window size for calculating entropy or SNP clusters", required=false)
protected int windowSize = 10;
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
index 54a324411..a4c1caf86 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/PhaseByTransmission.java
@@ -90,7 +90,7 @@ import java.util.*;
*
In trios: If two individuals are missing, the remaining individual is phased if it is homozygous. No phasing probability is emitted.
*
*
- *
Input
+ *
Input
*
*
*
A VCF variant set containing trio(s) and/or parent/child pair(s).
@@ -108,12 +108,12 @@ import java.util.*;
*
*
*
- *
Output
+ *
Output
*
* A VCF with genotypes recalibrated as most likely under the familial constraint and phased by descent where unambiguous.
*
* A BAM file to make calls on and a VCF file to use as truth validation dataset.
*
* You also have the option to invert the roles of the files using the command line options listed below.
*
*
- *
Output
+ *
Output
*
* GenotypeAndValidate has two outputs: the truth table and the optional VCF file. The truth table is a
* 2x2 table correlating what was called in the dataset with the truth of the call (whether it's a true
@@ -176,7 +176,7 @@ import static org.broadinstitute.sting.utils.IndelUtils.isInsideExtendedIndel;
*
*
*
- *
Examples
+ *
Examples
*
*
* Genotypes BAM file from new technology using the VCF as a truth dataset:
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/validation/validationsiteselector/ValidationSiteSelector.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/validation/validationsiteselector/ValidationSiteSelector.java
index 5c216928b..d587c305e 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/validation/validationsiteselector/ValidationSiteSelector.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/validation/validationsiteselector/ValidationSiteSelector.java
@@ -85,17 +85,17 @@ import java.util.*;
*
* User can additionally restrict output to a particular type of variant (SNP, Indel, etc.)
*
- *
Input
+ *
Input
*
* One or more variant sets to choose from.
*
*
- *
Output
+ *
Output
*
* A sites-only VCF with the desired number of randomly selected sites.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java
index f2120213a..22425e62e 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java
@@ -81,7 +81,7 @@ import java.util.*;
* to the desired level but also has the information necessary to pull out more variants for a higher sensitivity but a
* slightly lower quality level.
*
- *
Input
+ *
Input
*
* The input raw variants to be recalibrated.
*
@@ -89,11 +89,11 @@ import java.util.*;
*
* The tranches file that was generated by the VariantRecalibrator walker.
*
- *
Output
+ *
Output
*
* A recalibrated VCF file in which each variant is annotated with its VQSLOD and filtered if the score is below the desired quality level.
*
- *
* This walker is the first pass in a two-stage processing step. This walker is designed to be used in conjunction with the ApplyRecalibration walker.
+ *
*
*
* The purpose of the variant recalibrator is to assign a well-calibrated probability to each variant call in a call set.
@@ -91,24 +92,26 @@ import java.util.*;
* error model can then be applied to both known and novel variation discovered in the call set of interest to evaluate the
* probability that each call is real. The score that gets added to the INFO field of each variant is called the VQSLOD. It is
* the log odds ratio of being a true variant versus being false under the trained Gaussian mixture model.
+ *
*
*
* NOTE: In order to create the model reporting plots Rscript needs to be in your environment PATH (this is the scripting version of R, not the interactive version).
* See http://www.r-project.org for more info on how to download and install R.
+ *
*
- *
Input
+ *
Input
*
* The input raw variants to be recalibrated.
*
* Known, truth, and training sets to be used by the algorithm. How these various sets are used is described below.
*
- *
Output
+ *
Output
*
* A recalibration table file in VCF format that is used by the ApplyRecalibration walker.
*
* A tranches file which shows various metrics of the recalibration callset as a function of making several slices through the data.
*
- *
Example
+ *
Example
*
* java -Xmx4g -jar GenomeAnalysisTK.jar \
* -T VariantRecalibrator \
@@ -152,7 +155,7 @@ public class VariantRecalibrator extends RodWalker> resource = Collections.emptyList();
/////////////////////////////
@@ -170,7 +173,7 @@ public class VariantRecalibrator extends RodWalkerInput
+ *
* java
* -jar GenomeAnalysisTK.jar
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjectsIntegrationTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjectsIntegrationTest.java
index c93f68ef8..5a308928d 100644
--- a/protected/java/test/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjectsIntegrationTest.java
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjectsIntegrationTest.java
@@ -74,10 +74,10 @@ public class DiffObjectsIntegrationTest extends WalkerTest {
@DataProvider(name = "data")
public Object[][] createData() {
- new TestParams(privateTestDir + "diffTestMaster.vcf", privateTestDir + "diffTestTest.vcf", true, "aea3d5df32a2acd400da48d06b4dbc60");
- new TestParams(publicTestDir + "exampleBAM.bam", publicTestDir + "exampleBAM.simple.bam", true, "3f46f5a964f7c34015d972256fe49a35");
- new TestParams(privateTestDir + "diffTestMaster.vcf", privateTestDir + "diffTestTest.vcf", false, "e71e23e7ebfbe768e59527bc62f8918d");
- new TestParams(publicTestDir + "exampleBAM.bam", publicTestDir + "exampleBAM.simple.bam", false, "47bf16c27c9e2c657a7e1d13f20880c9");
+ new TestParams(privateTestDir + "diffTestMaster.vcf", privateTestDir + "diffTestTest.vcf", true, "71869ddf9665773a842a9def4cc5f3c8");
+ new TestParams(publicTestDir + "exampleBAM.bam", publicTestDir + "exampleBAM.simple.bam", true, "cec7c644c84ef9c96aacaed604d9ec9b");
+ new TestParams(privateTestDir + "diffTestMaster.vcf", privateTestDir + "diffTestTest.vcf", false, "47546e03344103020e49d8037a7e0727");
+ new TestParams(publicTestDir + "exampleBAM.bam", publicTestDir + "exampleBAM.simple.bam", false, "d27b37f7a366c8dacca5cd2590d3c6ce");
return TestParams.getTests(TestParams.class);
}
diff --git a/public/R/src/org/broadinstitute/sting/utils/R/gsalib/man/gsalib-package.Rd b/public/R/src/org/broadinstitute/sting/utils/R/gsalib/man/gsalib-package.Rd
index dc7a08287..4a49cf932 100644
--- a/public/R/src/org/broadinstitute/sting/utils/R/gsalib/man/gsalib-package.Rd
+++ b/public/R/src/org/broadinstitute/sting/utils/R/gsalib/man/gsalib-package.Rd
@@ -19,9 +19,11 @@ Medical and Population Genetics Program
Maintainer: Kiran Garimella
}
\references{
-GSA wiki page: http://www.broadinstitute.org/gatk
+GATK website: http://www.broadinstitute.org/gatk
-GATK help forum: http://www.broadinstitute.org/gatk
+GATK documentation guide: http://www.broadinstitute.org/gatk/guide
+
+GATK help forum: http://gatkforums.broadinstitute.org
}
\examples{
## get script arguments in interactive and non-interactive mode
diff --git a/public/doc/README b/public/doc/README
index ec5fa8500..e70ced0df 100644
--- a/public/doc/README
+++ b/public/doc/README
@@ -59,7 +59,7 @@ index (.fasta.fai).
Instructions for preparing input files are available here:
-http://www.broadinstitute.org/gsa/wiki/index.php/Preparing_input_files
+http://www.broadinstitute.org/gatk/guide/article?id=1204
The bundled 'resources' directory contains an example BAM and fasta.
@@ -69,7 +69,7 @@ The GATK is distributed with a few standard analyses, including PrintReads,
Pileup, and DepthOfCoverage. More information on the included walkers is
available here:
-http://www.broadinstitute.org/gsa/wiki/index.php/Built-in_walkers
+http://www.broadinstitute.org/gatk/gatkdocs
To print the reads of the included sample data, untar the package into
the GenomeAnalysisTK directory and run the following command:
@@ -81,6 +81,6 @@ java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar \
Support
-------
-Documentation for the GATK is available at http://www.broadinstitute.org/gsa/wiki.
+Documentation for the GATK is available at http://www.broadinstitute.org/gatk/guide.
For help using the GATK, developing analyses with the GATK, bug reports,
-or feature requests, please email gsadevelopers@broadinstitute.org.
+or feature requests, please visit our support forum at http://gatkforums.broadinstitute.org/
diff --git a/public/java/src/org/broadinstitute/sting/alignment/CheckAlignment.java b/public/java/src/org/broadinstitute/sting/alignment/CheckAlignment.java
index 93b4d5e6f..d313f35ce 100644
--- a/public/java/src/org/broadinstitute/sting/alignment/CheckAlignment.java
+++ b/public/java/src/org/broadinstitute/sting/alignment/CheckAlignment.java
@@ -42,9 +42,14 @@ import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
import java.util.Iterator;
/**
- * Validates consistency of the aligner interface by taking reads already aligned by BWA in a BAM file, stripping them
+ * Validates consistency of the aligner interface
+ *
+ *
Validates consistency of the aligner interface by taking reads already aligned by BWA in a BAM file, stripping them
* of their alignment data, realigning them, and making sure one of the best resulting realignments matches the original
- * alignment from the input file.
+ * alignment from the input file.
+ *
+ *
Caveat
+ *
This tool requires that BWA be available on the java path.
*
* @author mhanna
* @version 0.1
diff --git a/public/java/src/org/broadinstitute/sting/commandline/CommandLineProgram.java b/public/java/src/org/broadinstitute/sting/commandline/CommandLineProgram.java
index 08aa5f8b3..cf11bb61c 100644
--- a/public/java/src/org/broadinstitute/sting/commandline/CommandLineProgram.java
+++ b/public/java/src/org/broadinstitute/sting/commandline/CommandLineProgram.java
@@ -370,7 +370,7 @@ public abstract class CommandLineProgram {
errorPrintf("------------------------------------------------------------------------------------------%n");
errorPrintf("A GATK RUNTIME ERROR has occurred (version %s):%n", CommandLineGATK.getVersionNumber());
errorPrintf("%n");
- errorPrintf("Please visit the wiki to see if this is a known problem%n");
+ errorPrintf("Please check the documentation guide to see if this is a known problem%n");
errorPrintf("If not, please post the error, with stack trace, to the GATK forum%n");
printDocumentationReference();
if ( msg == null ) // some exceptions don't have detailed messages
diff --git a/public/java/src/org/broadinstitute/sting/gatk/arguments/GATKArgumentCollection.java b/public/java/src/org/broadinstitute/sting/gatk/arguments/GATKArgumentCollection.java
index a3e19b944..a9016708b 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/arguments/GATKArgumentCollection.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/arguments/GATKArgumentCollection.java
@@ -206,7 +206,7 @@ public class GATKArgumentCollection {
* Enables on-the-fly recalibration of base qualities. The covariates tables are produced by the BaseQualityScoreRecalibrator tool.
* Please be aware that one should only run recalibration with the covariates file created on the same input bam(s).
*/
- @Input(fullName="BQSR", shortName="BQSR", required=false, doc="The input covariates table file which enables on-the-fly base quality score recalibration")
+ @Input(fullName="BQSR", shortName="BQSR", required=false, doc="The input covariates table file which enables on-the-fly base quality score recalibration (intended for use with BaseRecalibrator and PrintReads)")
public File BQSR_RECAL_FILE = null;
/**
diff --git a/public/java/src/org/broadinstitute/sting/gatk/examples/GATKDocsExample.java b/public/java/src/org/broadinstitute/sting/gatk/examples/GATKDocsExample.java
index 362cb202e..fcae3cc68 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/examples/GATKDocsExample.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/examples/GATKDocsExample.java
@@ -41,17 +41,17 @@ import org.broadinstitute.sting.gatk.walkers.RodWalker;
* [Functionality of this walker]
*
*
- *
* BAM file(s) with one read mapping quality selectively reassigned as desired
*
*
- *
Examples
+ *
Examples
*
* java
* -jar GenomeAnalysisTK.jar
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/AlleleBalance.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/AlleleBalance.java
index 73c31ef66..6e7bc9805 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/AlleleBalance.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/AlleleBalance.java
@@ -46,7 +46,7 @@ import java.util.Map;
/**
- * The allele balance (fraction of ref bases over ref + alt bases) across all bialleleic het-called samples
+ * The allele balance (fraction of ref bases over ref + alt bases) across all biallelic het-called samples
*/
public class AlleleBalance extends InfoFieldAnnotation {
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
index 826dc9f22..fa3ab885d 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
@@ -55,17 +55,17 @@ import java.util.*;
* VariantAnnotator is a GATK tool for annotating variant calls based on their context.
* The tool is modular; new annotations can be written easily without modifying VariantAnnotator itself.
*
- *
Input
+ *
Input
*
* A variant set to annotate and optionally one or more BAM files.
*
*
- *
Output
+ *
Output
*
* An annotated VCF.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
@@ -142,7 +142,8 @@ public class VariantAnnotator extends RodWalker implements Ann
protected List annotationsToExclude = new ArrayList();
/**
- * See the -list argument to view available groups.
+ * If specified, all available annotations in the group will be applied. See the VariantAnnotator -list argument to view available groups.
+ * Keep in mind that RODRequiringAnnotations are not intended to be used as a group, because they require specific ROD inputs.
*/
@Argument(fullName="group", shortName="G", doc="One or more classes/groups of annotations to apply to variant calls", required=false)
protected List annotationGroupsToUse = new ArrayList();
@@ -166,13 +167,13 @@ public class VariantAnnotator extends RodWalker implements Ann
/**
* Note that the --list argument requires a fully resolved and correct command-line to work.
*/
- @Argument(fullName="list", shortName="ls", doc="List the available annotations and exit")
+ @Argument(fullName="list", shortName="ls", doc="List the available annotations and exit", required=false)
protected Boolean LIST = false;
/**
* By default, the dbSNP ID is added only when the ID field in the variant VCF is empty.
*/
- @Argument(fullName="alwaysAppendDbsnpId", shortName="alwaysAppendDbsnpId", doc="In conjunction with the dbSNP binding, append the dbSNP ID even when the variant VCF already has the ID field populated")
+ @Argument(fullName="alwaysAppendDbsnpId", shortName="alwaysAppendDbsnpId", doc="In conjunction with the dbSNP binding, append the dbSNP ID even when the variant VCF already has the ID field populated", required=false)
protected Boolean ALWAYS_APPEND_DBSNP_ID = false;
public boolean alwaysAppendDbsnpId() { return ALWAYS_APPEND_DBSNP_ID; }
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/BeagleOutputToVCF.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/BeagleOutputToVCF.java
index 2e85fe8f9..4b96dbffb 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/BeagleOutputToVCF.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/BeagleOutputToVCF.java
@@ -61,7 +61,7 @@ import static java.lang.Math.log10;
* Note that this walker requires all input files produced by Beagle.
*
*
- *
Example
+ *
Example
*
* java -Xmx4000m -jar dist/GenomeAnalysisTK.jar \
* -R reffile.fasta -T BeagleOutputToVCF \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/ProduceBeagleInput.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/ProduceBeagleInput.java
index 937c3abc0..618fda0df 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/ProduceBeagleInput.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/beagle/ProduceBeagleInput.java
@@ -57,7 +57,7 @@ import java.util.*;
* Converts the input VCF into a format accepted by the Beagle imputation/analysis program.
*
*
- *
Input
+ *
Input
*
* A VCF with variants to convert to Beagle format
*
@@ -70,7 +70,7 @@ import java.util.*;
* Optional: A file with a list of markers
*
*
- *
-o: an OutputFormatted (recommended BED) file with the callable status covering each base
@@ -83,7 +83,7 @@ import java.io.PrintStream;
*
*
*
- *
Examples
+ *
Examples
*
* -T CallableLociWalker \
* -I my.bam \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java
index 3bd114aa1..61574d947 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java
@@ -66,7 +66,7 @@ import java.util.*;
* and/or percentage of bases covered to or beyond a threshold.
* Additionally, reads and bases can be filtered by mapping or base quality score.
*
- *
Input
+ *
Input
*
* One or more bam files (with proper headers) to be analyzed for coverage statistics
*
@@ -75,7 +75,7 @@ import java.util.*;
*
* (for information about creating the REFSEQ Rod, please consult the RefSeqCodec documentation)
*
- *
Output
+ *
Output
*
* Tables pertaining to different coverage summaries. Suffix on the table files declares the contents:
*
@@ -98,7 +98,7 @@ import java.util.*;
* - _cumulative_coverage_proportions: proportions of loci with >= X coverage, aggregated over all bases
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/GCContentByInterval.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/GCContentByInterval.java
index 9a6ef61d8..2975df4a5 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/GCContentByInterval.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/GCContentByInterval.java
@@ -44,21 +44,21 @@ import java.util.List;
* Walks along reference and calculates the GC content for each interval.
*
*
- *
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java
index a5a8edb0c..169c2708b 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java
@@ -50,17 +50,17 @@ import java.util.Collection;
* CoveredByNSamplesSites is a GATK tool for filtering out sites based on their coverage.
* The sites that pass the filter are printed out to an intervals file.
*
- *
Input
+ *
Input
*
* A variant file and optionally min coverage and sample percentage values.
*
*
- *
Output
+ *
Output
*
* An intervals file.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ErrorRatePerCycle.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ErrorRatePerCycle.java
index 76f5478a4..86676ca54 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ErrorRatePerCycle.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ErrorRatePerCycle.java
@@ -49,12 +49,12 @@ import java.io.PrintStream;
* Emits a GATKReport containing readgroup, cycle, mismatches, counts, qual, and error rate for each read
* group in the input BAMs FOR ONLY THE FIRST OF PAIR READS.
*
- *
* java
* -jar GenomeAnalysisTK.jar
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ReadGroupProperties.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ReadGroupProperties.java
index de7ac3e41..0af1dbed5 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ReadGroupProperties.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/ReadGroupProperties.java
@@ -53,12 +53,12 @@ import java.util.Map;
* the median statistics are well determined. It is safe to run it on whole-genome (WG) data and it'll finish in an appropriate
* timeframe.
*
- *
Input
+ *
Input
*
* Any number of BAM files
*
*
- *
Output
+ *
Output
*
* GATKReport containing read group, sample, library, platform, center, median insert size and median read length.
*
@@ -86,7 +86,7 @@ import java.util.Map;
*
* A human/R readable table of tab separated values with one column per sample and one row per read.
*
*
- *
Examples
+ *
Examples
*
* java
* -jar GenomeAnalysisTK.jar
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffEngine.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffEngine.java
index 7ac59790c..c909eb2d5 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffEngine.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffEngine.java
@@ -83,7 +83,7 @@ public class DiffEngine {
DiffElement masterElt = master.getElement(name);
DiffElement testElt = test.getElement(name);
if ( masterElt == null && testElt == null ) {
- throw new ReviewedStingException("BUG: unexceptedly got two null elements for field: " + name);
+ throw new ReviewedStingException("BUG: unexpectedly got two null elements for field: " + name);
} else if ( masterElt == null || testElt == null ) { // if either is null, we are missing a value
// todo -- should one of these be a special MISSING item?
diffs.add(new Difference(masterElt, testElt));
@@ -283,8 +283,7 @@ public class DiffEngine {
// now that we have a specific list of values we want to show, display them
GATKReport report = new GATKReport();
final String tableName = "differences";
- // TODO for Geraldine -- link needs to be updated below
- report.addTable(tableName, "Summarized differences between the master and test files. See http://www.broadinstitute.org/gsa/wiki/index.php/DiffEngine for more information", 3);
+ report.addTable(tableName, "Summarized differences between the master and test files. See http://www.broadinstitute.org/gatk/guide/article?id=1299 for more information", 3);
final GATKReportTable table = report.getTable(tableName);
table.addColumn("Difference");
table.addColumn("NumberOfOccurrences");
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjects.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjects.java
index d1903c2bb..6b5189dfd 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjects.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/diffengine/DiffObjects.java
@@ -68,12 +68,12 @@ import java.util.List;
* The reason for this system is that it allows you to compare two structured files -- such as BAMs and VCFs -- for common differences among them. This is primarily useful in regression testing or optimization, where you want to ensure that the differences are those that you expect and not any others.
*
*
- *
Input
+ *
Input
*
* The DiffObjectsWalker works with BAM or VCF files.
*
*
- *
Output
+ *
Output
*
* The DiffEngine system compares to two hierarchical data structures for specific differences in the values of named
* nodes. Suppose I have two trees:
@@ -132,6 +132,10 @@ import java.util.List;
[testng] 64b991fd3850f83614518f7d71f0532f.integrationtest.20:10000598.AC 1
*
+ *
Caveat
+ *
Because this is a walker, it requires that you pass a reference file. However the reference is not actually used, so it does not matter what you pass as reference.
+ *
+ *
* @author Mark DePristo
* @since 7/4/11
*/
@@ -140,8 +144,7 @@ public class DiffObjects extends RodWalker {
/**
* Writes out a file of the DiffEngine format:
*
- * TODO for Geraldine -- link needs to be updated below (and also in SelectVariants and RefSeqCodec GATK docs)
- * http://www.broadinstitute.org/gsa/wiki/index.php/DiffEngine
+ * See http://www.broadinstitute.org/gatk/guide/article?id=1299 for details.
*/
@Output(doc="File to which results should be written",required=true)
protected PrintStream out;
@@ -169,7 +172,7 @@ public class DiffObjects extends RodWalker {
@Argument(fullName="maxObjectsToRead", shortName="motr", doc="Max. number of objects to read from the files. -1 [default] means unlimited", required=false)
int MAX_OBJECTS_TO_READ = -1;
- @Argument(fullName="maxRawDiffsToSummary", shortName="maxRawDiffsToSummary", doc="Max. number of objects to read from the files. -1 [default] means unlimited", required=false)
+ @Argument(fullName="maxRawDiffsToSummarize", shortName="maxRawDiffsToSummarize", doc="Max. number of differences to include in the summary. -1 [default] means unlimited", required=false)
int maxRawDiffsToSummary = -1;
@Argument(fullName="doPairwise", shortName="doPairwise", doc="If provided, we will compute the minimum pairwise differences to summary, which can be extremely expensive", required=false)
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaAlternateReferenceMaker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaAlternateReferenceMaker.java
index e881315b9..d2f2e32b3 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaAlternateReferenceMaker.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaAlternateReferenceMaker.java
@@ -60,17 +60,17 @@ import java.util.List;
* 3) this tool works only for SNPs and for simple indels (but not for things like complex substitutions).
* Reference bases for each interval will be output as a separate fasta sequence (named numerically in order).
*
- *
Input
+ *
Input
*
* The reference, requested intervals, and any number of variant rod files.
*
*
- *
Output
+ *
Output
*
* A fasta file representing the requested intervals.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaReferenceMaker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaReferenceMaker.java
index f2f5fb5fe..fb7941fec 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaReferenceMaker.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/fasta/FastaReferenceMaker.java
@@ -48,17 +48,17 @@ import java.io.PrintStream;
* Overlapping intervals are automatically merged; reference bases for each disjoint interval will be output as a
* separate fasta sequence (named numerically in order).
*
- *
Input
+ *
Input
*
* The reference and requested intervals.
*
*
- *
Output
+ *
Output
*
* A fasta file representing the requested intervals.
*
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
public class FastaStats extends RefWalker {
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/filters/VariantFiltration.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/filters/VariantFiltration.java
index 61a847f4c..c59c61803 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/filters/VariantFiltration.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/filters/VariantFiltration.java
@@ -55,17 +55,17 @@ import java.util.*;
* VariantFiltration is a GATK tool for hard-filtering variant calls based on certain criteria.
* Records are hard-filtered by changing the value in the FILTER field to something other than PASS.
*
- *
Input
+ *
Input
*
* A variant set to filter.
*
*
- *
Output
+ *
Output
*
* A filtered VCF.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
@@ -114,7 +114,7 @@ public class VariantFiltration extends RodWalker {
* One can filter normally based on most fields (e.g. "GQ < 5.0"), but the GT (genotype) field is an exception. We have put in convenience
* methods so that one can now filter out hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1").
*/
- @Argument(fullName="genotypeFilterExpression", shortName="G_filter", doc="One or more expression used with FORMAT (sample/genotype-level) fields to filter (see wiki docs for more info)", required=false)
+ @Argument(fullName="genotypeFilterExpression", shortName="G_filter", doc="One or more expression used with FORMAT (sample/genotype-level) fields to filter (see documentation guide for more info)", required=false)
protected ArrayList GENOTYPE_FILTER_EXPS = new ArrayList();
/**
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountBases.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountBases.java
index 503cdb6d6..8b82e50a7 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountBases.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountBases.java
@@ -38,17 +38,17 @@ import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
/**
* Walks over the input data set, calculating the number of bases seen for diagnostic purposes.
*
- *
Input
+ *
Input
*
* One or more BAM files.
*
*
- *
Output
+ *
Output
*
* Number of bases seen.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountIntervals.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountIntervals.java
index 3b8eba398..e7b6df623 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountIntervals.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountIntervals.java
@@ -45,9 +45,42 @@ import java.util.Collections;
import java.util.List;
/**
- * Counts the number of contiguous regions the walker traverses over. Slower than it needs to be, but
- * very useful since overlapping intervals get merged, so you can count the number of intervals the GATK merges down to.
- * This was its very first use.
+ * Count contiguous regions in an interval list.
+ *
+ *
When the GATK reads in intervals from an intervals list, any intervals that overlap each other get merged into
+ * a single interval spanning the original ones. For example, if you have the following intervals:
+ *
+ * 20:1-2000
+ *
+ * 20:1500-3000
+ *
+ * They will be merged into a single interval:
+ *
20:1-3000
+ *
+ * This tool allows you to check, for a given list of intervals, how many separate intervals the GATK will actually
+ * distinguish at runtime.
+ *
+ *
+ *
Input
+ *
+ * One or more rod files containing intervals to check.
+ *
+ *
+ *
Output
+ *
+ * Number of separate intervals identified by GATK after merging overlapping intervals.
+ *
+ *
+ * You can use the -numOverlaps argument to find out how many cases you have of a specific number of overlaps.
+ *
+ *
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
public class CountIntervals extends RefWalker {
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLoci.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLoci.java
index f2bd791c1..d999dfebf 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLoci.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountLoci.java
@@ -42,33 +42,34 @@ import java.io.PrintStream;
* Walks over the input data set, calculating the total number of covered loci for diagnostic purposes.
*
*
- * Simplest example of a locus walker.
+ * This is the simplest example of a locus walker.
+ *
*
- *
- *
Input
+ *
Input
*
* One or more BAM files.
*
*
- *
Output
+ *
Output
*
- * Number of loci traversed.
+ * Number of loci traversed. If an output file name is provided, then the result will be written to that file.
+ * Otherwise it will be sent to standard console output.
*
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
@Requires({DataSource.READS, DataSource.REFERENCE})
public class CountMales extends ReadWalker {
+ @Output
+ public PrintStream out;
+
public Integer map(ReferenceContext ref, GATKSAMRecord read, RefMetaDataTracker tracker) {
Sample sample = getSampleDB().getSample(read);
return sample.getGender() == Gender.MALE ? 1 : 0;
@@ -53,4 +78,8 @@ public class CountMales extends ReadWalker {
public Integer reduce(Integer value, Integer sum) {
return value + sum;
}
+
+ public void onTraversalDone( Integer c ) {
+ out.println(c);
+ }
}
\ No newline at end of file
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODs.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODs.java
index c01a1df89..65f82efe4 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODs.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODs.java
@@ -53,22 +53,32 @@ import java.util.*;
/**
* Prints out counts of the number of reference ordered data objects encountered.
*
+ *
CountRods is a RODWalker, and so traverses the data by ROD. For example if the ROD passed to it is a VCF file,
+ * it will count the variants in the file.
*
- *
Input
+ *
Note that this tool is different from CountRodsByRef which is a RefWalker, and so traverses the data by
+ * position along the reference. CountRodsByRef can count ROD elements (such as, but not limited to, variants) found
+ * at each position or within specific intervals if you use the -L argument (see CommandLineGATK).
+ *
+ *
Both these tools are different from CountVariants in that they are more generic (they can also count RODs that
+ * are not variants) and CountVariants is more detailed, in that it computes additional statistics (type of variants
+ * being indels vs. SNPs etc).
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODsByRef.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODsByRef.java
index 303f1704f..594ca239d 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODsByRef.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountRODsByRef.java
@@ -43,24 +43,34 @@ import java.util.Collections;
import java.util.List;
/**
- * Prints out counts of the number of reference ordered data objects encountered.
+ * Prints out counts of the number of reference ordered data objects encountered along the reference.
*
+ *
CountRodsByRef is a RefWalker, and so traverses the data by position along the reference. It counts ROD
+ * elements (such as, but not limited to, variants) found at each position or within specific intervals if you use
+ * the -L argument (see CommandLineGATK).
*
- *
Input
+ *
Note that this tool is different from the basic CountRods, which is a RODWalker, and so traverses the data by
+ * ROD. For example if the ROD passed to it is a VCF file, CountRods will simply count the variants in the file.
+ *
+ *
Both these tools are different from CountVariants in that they are more generic (they can also count RODs that
+ * are not variants) and CountVariants is more detailed, in that it computes additional statistics (type of variants
+ * being indels vs. SNPs etc).
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadEvents.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadEvents.java
index 8b0646092..cfb7325a9 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadEvents.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/qc/CountReadEvents.java
@@ -47,22 +47,22 @@ import java.util.Map;
/**
* Walks over the input data set, counting the number of read events (from the CIGAR operator)
*
- *
Input
+ *
Input
*
* One or more BAM files.
*
*
- *
Output
+ *
Output
*
- * Number of reads events for each category
+ * Number of read events for each category, formatted as a GATKReport table.
*
- *
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/LeftAlignVariants.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/LeftAlignVariants.java
index 65ec7a4f0..e6d3e6e94 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/LeftAlignVariants.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/LeftAlignVariants.java
@@ -60,17 +60,17 @@ import java.util.*;
* place an indel at the left-most position this doesn't always happen, so this tool can be used to left-align them.
* Note that this tool cannot handle anything other than bi-allelic, simple indels. Complex events are written out unchanged.
*
- *
Input
+ *
Input
*
* A variant set to left-align.
*
*
- *
Output
+ *
Output
*
* A left-aligned VCF.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectHeaders.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectHeaders.java
index 17aaa7513..9bbf728e1 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectHeaders.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectHeaders.java
@@ -58,17 +58,17 @@ import java.util.*;
* SelectHeaders can be used for this purpose. Given a single VCF file, one or more headers can be extracted from the
* file (based on a complete header name or a pattern match).
*
- *
Input
+ *
Input
*
* A set of VCFs.
*
*
- *
Output
+ *
Output
*
* A header selected VCF.
*
*
- *
Examples
+ *
Examples
*
* Select only the FILTER, FORMAT, and INFO headers:
* java -Xmx2g -jar GenomeAnalysisTK.jar \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
index 9c209ae2c..f72ce3bd6 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
@@ -62,20 +62,20 @@ import java.util.*;
* Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a
* pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of
* coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are
- * documented in the Using JEXL expressions section (http://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions).
+ * documented in the Using JEXL expressions section (http://www.broadinstitute.org/gatk/guide/article?id=1255).
* One can optionally include concordance or discordance tracks for use in selecting overlapping variants.
*
- *
Input
+ *
Input
*
* A variant set to select from.
*
*
- *
Output
+ *
Output
*
* A selected VCF.
*
*
- *
Examples
+ *
Examples
*
* Select two samples out of a VCF with many samples:
* java -Xmx2g -jar GenomeAnalysisTK.jar \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/ValidateVariants.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/ValidateVariants.java
index a242f9310..d11cf5aee 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/ValidateVariants.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/ValidateVariants.java
@@ -60,12 +60,12 @@ import java.util.Set;
*
* If you are looking simply to test the adherence to the VCF specification, use --validationType NONE.
*
- *
Input
+ *
Input
*
* A variant set to validate.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantValidationAssessor.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantValidationAssessor.java
index 02089eb6c..0e2a04bf2 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantValidationAssessor.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantValidationAssessor.java
@@ -55,12 +55,12 @@ import java.util.*;
* default is soft-filtered by high no-call rate or low Hardy-Weinberg probability.
* If you have .ped files, please first convert them to VCF format.
*
- *
Input
+ *
Input
*
* A validation VCF to annotate.
*
*
- *
Output
+ *
Output
*
* An annotated VCF. Additionally, a table like the following will be output:
*
@@ -74,7 +74,7 @@ import java.util.*;
*
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToTable.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToTable.java
index b12f51a1e..444eb745c 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToTable.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToTable.java
@@ -62,14 +62,13 @@ import java.util.*;
* genotypes), NO-CALL (count of no-call genotypes), TYPE (the type of event), VAR (count of
* non-reference genotypes), NSAMPLES (number of samples), NCALLED (number of called samples),
* GQ (from the genotype field; works only for a file with a single sample), and MULTI-ALLELIC
- * (is the record from a multi-allelic site). Note that this tool does not support capturing any
- * GENOTYPE field values. If a VCF record is missing a value, then the tool by
+ * (is the record from a multi-allelic site). Note that if a VCF record is missing a value, then the tool by
* default throws an error, but the special value NA can be emitted instead with
* appropriate tool arguments.
*
*
*
- *
Input
+ *
Input
*
*
*
A VCF file
@@ -77,12 +76,12 @@ import java.util.*;
*
*
*
- *
Output
+ *
Output
*
* A tab-delimited file containing the values of the requested fields in the VCF file
*
* Note that there must be a Tribble feature/codec for the file format as well as an adaptor.
*
- *
Input
+ *
Input
*
* A variant file to filter.
*
*
- *
Output
+ *
Output
*
* A VCF file.
*
*
- *
Examples
+ *
Examples
*
* java -Xmx2g -jar GenomeAnalysisTK.jar \
* -R ref.fasta \
diff --git a/public/java/src/org/broadinstitute/sting/tools/CatVariants.java b/public/java/src/org/broadinstitute/sting/tools/CatVariants.java
index 10fb606f9..e1dd2c255 100644
--- a/public/java/src/org/broadinstitute/sting/tools/CatVariants.java
+++ b/public/java/src/org/broadinstitute/sting/tools/CatVariants.java
@@ -35,6 +35,9 @@ import org.broadinstitute.sting.commandline.Argument;
import org.broadinstitute.sting.commandline.Input;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.commandline.CommandLineProgram;
+import org.broadinstitute.sting.gatk.CommandLineGATK;
+import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
+import org.broadinstitute.sting.utils.help.HelpConstants;
import org.broadinstitute.variant.bcf2.BCF2Codec;
import org.broadinstitute.sting.utils.collections.Pair;
import org.broadinstitute.variant.vcf.VCFCodec;
@@ -51,12 +54,48 @@ import java.util.*;
/**
*
- * Usage: java -cp dist/GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants [sorted (optional)]");
- * The input files can be of type: VCF (ends in .vcf or .VCF)");
- * BCF2 (ends in .bcf or .BCF)");
- * Output file must be vcf or bcf file (.vcf or .bcf)");
- * If the input files are already sorted, the last argument can indicate that");
+ * Concatenates VCF files of non-overlapped genome intervals, all with the same set of samples.
+ *
+ *
+ * The main purpose of this tool is to speed up the gather function when using scatter-gather parallelization.
+ * This tool concatenates the scattered output VCF files. It assumes that:
+ * - All the input VCFs (or BCFs) contain the same samples in the same order.
+ * - The variants in each input file are from non-overlapping (scattered) intervals.
+ *
+ * When the input files are already sorted by interval start position, use -assumeSorted.
+ *
+ * Note: Currently the tool is more efficient when working with VCFs; we will work to make it as efficient for BCFs.
+ *
+ *
+ *
+ *
Input
+ *
+ * One or more variant sets to combine. They should be of non-overlapping genome intervals and with the same samples (in the same order).
+ * The input files should be 'name.vcf' or 'name.VCF' or 'name.bcf' or 'name.BCF'.
+ * If the files are ordered according to the appearance of intervals in the ref genome, then one can use the -assumeSorted flag.
+ *
+ *
+ *
Output
+ *
+ * A combined VCF. The output file should be 'name.vcf' or 'name.VCF'.
+ * <\p>
+ *
+ *
+ *
+ *
+ * @author Ami Levy Moonshine
+ * @since Jan 2012
*/
+
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARMANIP, extraDocs = {CommandLineGATK.class} )
public class CatVariants extends CommandLineProgram {
// setup the logging system, used by some codecs
private static org.apache.log4j.Logger logger = org.apache.log4j.Logger.getRootLogger();
@@ -64,6 +103,14 @@ public class CatVariants extends CommandLineProgram {
@Input(fullName = "reference", shortName = "R", doc = "genome reference file .fasta", required = true)
private File refFile = null;
+ /**
+ * The VCF or BCF files to merge together
+ *
+ * CatVariants can take any number of -V arguments on the command line. Each -V argument
+ * will be included in the final merged output VCF. The order of arguments does not matter, but it runs more
+ * efficiently if they are sorted based on the intervals and the assumeSorted argument is used.
+ *
+ */
@Input(fullName="variant", shortName="V", doc="Input VCF file/s named .vcf or .bcf", required = true)
private List variant = null;
diff --git a/public/java/src/org/broadinstitute/sting/utils/codecs/refseq/RefSeqCodec.java b/public/java/src/org/broadinstitute/sting/utils/codecs/refseq/RefSeqCodec.java
index fb26f6c37..82ee76a81 100644
--- a/public/java/src/org/broadinstitute/sting/utils/codecs/refseq/RefSeqCodec.java
+++ b/public/java/src/org/broadinstitute/sting/utils/codecs/refseq/RefSeqCodec.java
@@ -45,8 +45,8 @@ import java.util.ArrayList;
*
*
*
* The RefSeq Rod can be bound as any other rod, and is specified by REFSEQ, for example
diff --git a/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/GATKResourcesBundle.scala b/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/GATKResourcesBundle.scala
index 8a8c76806..e20d285e1 100644
--- a/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/GATKResourcesBundle.scala
+++ b/public/scala/qscript/org/broadinstitute/sting/queue/qscripts/GATKResourcesBundle.scala
@@ -171,7 +171,7 @@ class GATKResourcesBundle extends QScript {
"CEUTrio.HiSeq.WGS.b37.bestPractices.phased",b37,true,false))
//
- // example call set for wiki tutorial
+ // example call set for documentation guide tutorial
//
addResource(new Resource("/humgen/gsa-hpprojects/NA12878Collection/exampleCalls/NA12878.HiSeq.WGS.bwa.cleaned.raw.b37.subset.vcf",
"NA12878.HiSeq.WGS.bwa.cleaned.raw.subset", b37, true, true))
diff --git a/public/scala/src/org/broadinstitute/sting/queue/extensions/snpeff/SnpEff.scala b/public/scala/src/org/broadinstitute/sting/queue/extensions/snpeff/SnpEff.scala
index 344f5fe5b..529615c24 100644
--- a/public/scala/src/org/broadinstitute/sting/queue/extensions/snpeff/SnpEff.scala
+++ b/public/scala/src/org/broadinstitute/sting/queue/extensions/snpeff/SnpEff.scala
@@ -31,7 +31,7 @@ import org.broadinstitute.sting.commandline.{Argument, Output, Input}
/**
* Basic snpEff support.
- * See: http://www.broadinstitute.org/gsa/wiki/index.php/Adding_Genomic_Annotations_Using_SnpEff_and_VariantAnnotator
+ * See: http://www.broadinstitute.org/gatk/guide/article?id=50
*/
class SnpEff extends JavaCommandLineFunction {
javaMainClass = "ca.mcgill.mcb.pcingola.snpEffect.commandLine.SnpEff"
diff --git a/settings/helpTemplates/generic.template.html b/settings/helpTemplates/generic.template.html
index 587828d1e..b05ad65c0 100644
--- a/settings/helpTemplates/generic.template.html
+++ b/settings/helpTemplates/generic.template.html
@@ -130,7 +130,7 @@
 </#if>
-
Introduction
+
Overview
${description}
<#-- Create references to additional capabilities if appropriate -->
From cdb1fa110547a23182005c98787d9bf6c861a526 Mon Sep 17 00:00:00 2001
From: David Roazen
Date: Tue, 12 Mar 2013 13:41:29 -0400
Subject: [PATCH 015/211] Fix more tests that fail when run in parallel on the
farm
-Allow the default S3 put timeout of 30 seconds for GATKRunReports
to be overridden via a constructor argument, and use a timeout
of 300 seconds for tests. The timeout remains 30 seconds in all
other cases.
-Change integration tests that themselves dispatch farm jobs
into pipeline tests. Necessary because some farm nodes are
not set up as submit hosts. Pipeline tests are still run
directly on gsa4.
-Bump up the timeout for the MaxRuntimeIntegrationTest even more
(was still occasionally failing on the farm!)
---
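Reviewer note (not part of the commit message): the S3 put timeout changed
by this patch is enforced with a start/join/interrupt sequence. The sketch
below is a minimal, self-contained illustration of that sequence under
stated assumptions: the class name TimedPutSketch and the method
runWithTimeout are hypothetical and do not exist in GATK. Only the thread
name and the join/isAlive/interrupt calls mirror the hunk in
GATKRunReport.java.

```java
public class TimedPutSketch {
    /**
     * Runs the task on a worker thread and waits at most timeoutMillis for it.
     * Hypothetical helper for illustration only; not GATK API.
     *
     * @return true if the task finished in time, false if it had to be interrupted
     */
    public static boolean runWithTimeout(final Runnable task, final long timeoutMillis)
            throws InterruptedException {
        if (timeoutMillis <= 0) {
            // Thread.join(0) waits forever, so reject non-positive timeouts
            throw new IllegalArgumentException("timeoutMillis must be > 0");
        }
        final Thread worker = new Thread(task);
        worker.setName("S3Put-Thread");
        worker.start();

        // Wait for the worker, but no longer than the configured timeout
        worker.join(timeoutMillis);

        if (worker.isAlive()) {
            // Still running after the timeout: ask it to stop and report failure
            worker.interrupt();
            return false;
        }
        return true;
    }

    public static void main(final String[] args) throws InterruptedException {
        final Runnable slow = new Runnable() {
            public void run() {
                try {
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    // interrupted by the timeout handling; just exit
                }
            }
        };
        System.out.println("finished in time: " + runWithTimeout(slow, 50L));
    }
}
```

Under this reading, a test would pass a larger timeout (the commit message
mentions 300 seconds) through the new constructor argument, while all other
callers keep the 30-second default.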
.../sting/gatk/phonehome/GATKRunReport.java | 43 ++++++++++++++-----
.../sting/gatk/MaxRuntimeIntegrationTest.java | 5 ++-
...nTest.java => JnaSessionPipelineTest.java} | 2 +-
...ionTest.java => LibDrmaaPipelineTest.java} | 2 +-
...ationTest.java => LibBatPipelineTest.java} | 2 +-
5 files changed, 39 insertions(+), 15 deletions(-)
rename public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/{JnaSessionIntegrationTest.java => JnaSessionPipelineTest.java} (99%)
rename public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/{LibDrmaaIntegrationTest.java => LibDrmaaPipelineTest.java} (99%)
rename public/java/test/org/broadinstitute/sting/jna/lsf/v7_0_6/{LibBatIntegrationTest.java => LibBatPipelineTest.java} (99%)
diff --git a/public/java/src/org/broadinstitute/sting/gatk/phonehome/GATKRunReport.java b/public/java/src/org/broadinstitute/sting/gatk/phonehome/GATKRunReport.java
index 02f2f9f02..de84809bd 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/phonehome/GATKRunReport.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/phonehome/GATKRunReport.java
@@ -78,17 +78,11 @@ public class GATKRunReport {
private static final DateFormat DATE_FORMAT = new SimpleDateFormat("yyyy/MM/dd HH.mm.ss");
- /**
- * number of milliseconds before the S3 put operation is timed-out:
- */
- private static final long S3_PUT_TIME_OUT = 30 * 1000;
-
/**
* The root file system directory where we keep common report data
*/
private final static File REPORT_DIR = new File("/humgen/gsa-hpprojects/GATK/reports");
-
/**
* The full path to the direct where submitted (and uncharacterized) report files are written
*/
@@ -105,6 +99,17 @@ public class GATKRunReport {
*/
protected static final Logger logger = Logger.getLogger(GATKRunReport.class);
+ /**
+ * Default value for the number of milliseconds before an S3 put operation is timed-out.
+ * Can be overridden via a constructor argument.
+ */
+ private static final long S3_DEFAULT_PUT_TIME_OUT_IN_MILLISECONDS = 30 * 1000;
+
+ /**
+ * Number of milliseconds before an S3 put operation is timed-out.
+ */
+ private long s3PutTimeOutInMilliseconds = S3_DEFAULT_PUT_TIME_OUT_IN_MILLISECONDS;
+
// -----------------------------------------------------------------
// elements captured for the report
// -----------------------------------------------------------------
@@ -230,13 +235,31 @@ public class GATKRunReport {
}
/**
- * Create a new RunReport and population all of the fields with values from the walker and engine
+ * Create a new RunReport and populate all of the fields with values from the walker and engine.
+ * Allows the S3 put timeout to be explicitly set.
*
* @param walker the GATK walker that we ran
* @param e the exception caused by running this walker, or null if we completed successfully
* @param engine the GAE we used to run the walker, so we can fetch runtime, args, etc
+ * @param type the GATK phone home setting
+ * @param s3PutTimeOutInMilliseconds number of milliseconds to wait before timing out an S3 put operation
*/
- public GATKRunReport(Walker<?,?> walker, Exception e, GenomeAnalysisEngine engine, PhoneHomeOption type) {
+ public GATKRunReport(final Walker<?,?> walker, final Exception e, final GenomeAnalysisEngine engine, final PhoneHomeOption type,
+ final long s3PutTimeOutInMilliseconds) {
+ this(walker, e, engine, type);
+ this.s3PutTimeOutInMilliseconds = s3PutTimeOutInMilliseconds;
+ }
+
+ /**
+ * Create a new RunReport and populate all of the fields with values from the walker and engine.
+ * Leaves the S3 put timeout set to the default value of S3_DEFAULT_PUT_TIME_OUT_IN_MILLISECONDS.
+ *
+ * @param walker the GATK walker that we ran
+ * @param e the exception caused by running this walker, or null if we completed successfully
+ * @param engine the GAE we used to run the walker, so we can fetch runtime, args, etc
+ * @param type the GATK phone home setting
+ */
+ public GATKRunReport(final Walker<?,?> walker, final Exception e, final GenomeAnalysisEngine engine, final PhoneHomeOption type) {
if ( type == PhoneHomeOption.NO_ET )
throw new ReviewedStingException("Trying to create a run report when type is NO_ET!");
@@ -563,7 +586,7 @@ public class GATKRunReport {
throw new IllegalStateException("We are throwing an exception for testing purposes");
case TIMEOUT:
try {
- Thread.sleep(S3_PUT_TIME_OUT * 100);
+ Thread.sleep(s3PutTimeOutInMilliseconds * 100);
} catch ( InterruptedException e ) {
// supposed to be empty
}
@@ -625,7 +648,7 @@ public class GATKRunReport {
s3thread.setName("S3Put-Thread");
s3thread.start();
- s3thread.join(S3_PUT_TIME_OUT);
+ s3thread.join(s3PutTimeOutInMilliseconds);
if(s3thread.isAlive()){
s3thread.interrupt();
diff --git a/public/java/test/org/broadinstitute/sting/gatk/MaxRuntimeIntegrationTest.java b/public/java/test/org/broadinstitute/sting/gatk/MaxRuntimeIntegrationTest.java
index 25ee9ff09..9df768e70 100644
--- a/public/java/test/org/broadinstitute/sting/gatk/MaxRuntimeIntegrationTest.java
+++ b/public/java/test/org/broadinstitute/sting/gatk/MaxRuntimeIntegrationTest.java
@@ -39,7 +39,8 @@ import java.util.concurrent.TimeUnit;
*
*/
public class MaxRuntimeIntegrationTest extends WalkerTest {
- private static final long STARTUP_TIME = TimeUnit.NANOSECONDS.convert(120, TimeUnit.SECONDS);
+ // Assume a ridiculous amount of startup overhead to allow for running these tests on slow farm nodes
+ private static final long STARTUP_TIME = TimeUnit.NANOSECONDS.convert(300, TimeUnit.SECONDS);
private class MaxRuntimeTestProvider extends TestDataProvider {
final long maxRuntime;
@@ -68,7 +69,7 @@ public class MaxRuntimeIntegrationTest extends WalkerTest {
//
// Loop over errors to throw, make sure they are the errors we get back from the engine, regardless of NT type
//
- @Test(enabled = true, dataProvider = "MaxRuntimeProvider", timeOut = 300 * 1000)
+ @Test(enabled = true, dataProvider = "MaxRuntimeProvider", timeOut = 600 * 1000)
public void testMaxRuntime(final MaxRuntimeTestProvider cfg) {
WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
"-T PrintReads -R " + hg18Reference
diff --git a/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/JnaSessionIntegrationTest.java b/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/JnaSessionPipelineTest.java
similarity index 99%
rename from public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/JnaSessionIntegrationTest.java
rename to public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/JnaSessionPipelineTest.java
index 677f87cac..d2da0e228 100644
--- a/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/JnaSessionIntegrationTest.java
+++ b/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/JnaSessionPipelineTest.java
@@ -34,7 +34,7 @@ import org.testng.annotations.Test;
import java.io.File;
import java.util.*;
-public class JnaSessionIntegrationTest extends BaseTest {
+public class JnaSessionPipelineTest extends BaseTest {
private String implementation = null;
private static final SessionFactory factory = new JnaSessionFactory();
diff --git a/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/LibDrmaaIntegrationTest.java b/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/LibDrmaaPipelineTest.java
similarity index 99%
rename from public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/LibDrmaaIntegrationTest.java
rename to public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/LibDrmaaPipelineTest.java
index 038bfd85d..efeeb3640 100644
--- a/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/LibDrmaaIntegrationTest.java
+++ b/public/java/test/org/broadinstitute/sting/jna/drmaa/v1_0/LibDrmaaPipelineTest.java
@@ -40,7 +40,7 @@ import java.io.File;
import java.util.Arrays;
import java.util.List;
-public class LibDrmaaIntegrationTest extends BaseTest {
+public class LibDrmaaPipelineTest extends BaseTest {
private String implementation = null;
@Test
diff --git a/public/java/test/org/broadinstitute/sting/jna/lsf/v7_0_6/LibBatIntegrationTest.java b/public/java/test/org/broadinstitute/sting/jna/lsf/v7_0_6/LibBatPipelineTest.java
similarity index 99%
rename from public/java/test/org/broadinstitute/sting/jna/lsf/v7_0_6/LibBatIntegrationTest.java
rename to public/java/test/org/broadinstitute/sting/jna/lsf/v7_0_6/LibBatPipelineTest.java
index 4898f17c3..af8d0e7b1 100644
--- a/public/java/test/org/broadinstitute/sting/jna/lsf/v7_0_6/LibBatIntegrationTest.java
+++ b/public/java/test/org/broadinstitute/sting/jna/lsf/v7_0_6/LibBatPipelineTest.java
@@ -40,7 +40,7 @@ import java.io.File;
/**
* Really unit tests, but these tests will only run on systems with LSF set up.
*/
-public class LibBatIntegrationTest extends BaseTest {
+public class LibBatPipelineTest extends BaseTest {
@BeforeClass
public void initLibBat() {
Assert.assertFalse(LibBat.lsb_init("LibBatIntegrationTest") < 0, LibBat.lsb_sperror("lsb_init() failed"));
From 8ed78b453f1a2c3e07a9efc703df057c0fa27c0c Mon Sep 17 00:00:00 2001
From: David Roazen
Date: Tue, 12 Mar 2013 23:53:26 -0400
Subject: [PATCH 018/211] Increase timeout for a test in the
EngineFeaturesIntegrationTest
-This test was intermittently failing when run on the farm
---
.../sting/gatk/EngineFeaturesIntegrationTest.java | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/public/java/test/org/broadinstitute/sting/gatk/EngineFeaturesIntegrationTest.java b/public/java/test/org/broadinstitute/sting/gatk/EngineFeaturesIntegrationTest.java
index 8d0874ea1..2a9bbeb09 100644
--- a/public/java/test/org/broadinstitute/sting/gatk/EngineFeaturesIntegrationTest.java
+++ b/public/java/test/org/broadinstitute/sting/gatk/EngineFeaturesIntegrationTest.java
@@ -117,7 +117,7 @@ public class EngineFeaturesIntegrationTest extends WalkerTest {
//
// Loop over errors to throw, make sure they are the errors we get back from the engine, regardless of NT type
//
- @Test(enabled = true, dataProvider = "EngineErrorHandlingTestProvider", timeOut = 60 * 1000 )
+ @Test(enabled = true, dataProvider = "EngineErrorHandlingTestProvider", timeOut = 300 * 1000 )
public void testEngineErrorHandlingTestProvider(final EngineErrorHandlingTestProvider cfg) {
for ( int i = 0; i < cfg.iterationsToTest; i++ ) {
final String root = "-T ErrorThrowing -R " + exampleFASTA;
From 925846c65f9a64ac16daecc2e2f33b901fe1cd8d Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Sun, 10 Feb 2013 19:21:26 -0800
Subject: [PATCH 019/211] Cleanup of FragmentUtils
-- Code was undocumented, big, and not well tested. All three things fixed.
-- Currently not passing, but the framework works well for testing
-- Added concat(byte[] ... arrays) to utils
---
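Reviewer note (placed below the "---" marker, so it is not part of the commit message): the ordering test shared by the two mergeOverlappingPairedFragments overloads can be read as an interval check on soft-clipped coordinates. The following is a minimal sketch under stated assumptions, not the patch's implementation: plain int coordinates stand in for GATKSAMRecord, and the class and method names (OverlapOrderSketch, canMergeInOrder, mergeOrder) are hypothetical and do not exist in GATK.

```java
// Hypothetical illustration only: ints stand in for the soft-clipped
// start/end of a GATKSAMRecord. None of these names are GATK API.
final class OverlapOrderSketch {

    /**
     * Mirrors the condition in the patch: the second read must start inside
     * the first read and must end at or after the first read's end.
     */
    static boolean canMergeInOrder(int firstStart, int firstEnd, int secondStart, int secondEnd) {
        return secondStart <= firstEnd
                && secondStart >= firstStart
                && secondEnd >= firstEnd;
    }

    /**
     * Tries the pair as given, then swapped, as the List-based overload does.
     * Returns "AB", "BA", or "none" when neither order satisfies the check.
     */
    static String mergeOrder(int aStart, int aEnd, int bStart, int bEnd) {
        if (canMergeInOrder(aStart, aEnd, bStart, bEnd)) return "AB";
        if (canMergeInOrder(bStart, bEnd, aStart, aEnd)) return "BA";
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(mergeOrder(1, 10, 5, 15));  // AB: B starts inside A and extends past it
        System.out.println(mergeOrder(5, 15, 1, 10));  // BA: same pair, supplied in reverse order
        System.out.println(mergeOrder(1, 20, 5, 10));  // none: B is contained entirely inside A
        System.out.println(mergeOrder(1, 10, 11, 20)); // none: the reads do not overlap
    }
}
```

In the patch itself, the "none" case corresponds to the two-read overload returning null, after which the List-based overload hands back the original pair unmerged.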
.../org/broadinstitute/sting/utils/Utils.java | 18 +++
.../sting/utils/fragments/FragmentUtils.java | 109 +++++++++++++-----
.../sting/utils/UtilsUnitTest.java | 13 +++
.../fragments/FragmentUtilsUnitTest.java | 87 +++++++++++++-
4 files changed, 196 insertions(+), 31 deletions(-)
diff --git a/public/java/src/org/broadinstitute/sting/utils/Utils.java b/public/java/src/org/broadinstitute/sting/utils/Utils.java
index 45a2fa58d..ff64133a7 100644
--- a/public/java/src/org/broadinstitute/sting/utils/Utils.java
+++ b/public/java/src/org/broadinstitute/sting/utils/Utils.java
@@ -415,6 +415,24 @@ public class Utils {
return C;
}
+ /**
+ * Concatenates byte arrays
+ * @return a concat of all bytes in allBytes in order
+ */
+ public static byte[] concat(final byte[] ... allBytes) {
+ int size = 0;
+ for ( final byte[] bytes : allBytes ) size += bytes.length;
+
+ final byte[] c = new byte[size];
+ int offset = 0;
+ for ( final byte[] bytes : allBytes ) {
+ System.arraycopy(bytes, 0, c, offset, bytes.length);
+ offset += bytes.length;
+ }
+
+ return c;
+ }
+
/**
* Appends String(s) B to array A.
* @param A First array.
diff --git a/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java b/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java
index 76ccede62..fa0187728 100644
--- a/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java
+++ b/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java
@@ -25,6 +25,8 @@
package org.broadinstitute.sting.utils.fragments;
+import com.google.java.contract.Ensures;
+import com.google.java.contract.Requires;
import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.CigarOperator;
@@ -56,7 +58,8 @@ import java.util.*;
* Date: 3/26/11
* Time: 10:09 PM
*/
-public class FragmentUtils {
+public final class FragmentUtils {
+ protected final static byte MIN_QUAL_BAD_OVERLAP = 16;
private FragmentUtils() {} // private constructor
/**
@@ -65,18 +68,28 @@ public class FragmentUtils {
* Allows us to write a generic T -> Fragment algorithm that works with any object containing
* a read.
*
- * @param <T>
+ * @param <T> The type of the object that contains a GATKSAMRecord
*/
public interface ReadGetter<T> {
+ /**
+ * Get the GATKSAMRecord associated with object
+ *
+ * @param object the thing that contains the read
+ * @return a non-null GATKSAMRecord read
+ */
public GATKSAMRecord get(T object);
}
- /** Identify getter for SAMRecords themselves */
+ /**
+ * Identity getter for SAMRecords themselves
+ */
private final static ReadGetter<GATKSAMRecord> SamRecordGetter = new ReadGetter<GATKSAMRecord>() {
@Override public GATKSAMRecord get(final GATKSAMRecord object) { return object; }
};
- /** Gets the SAMRecord in a PileupElement */
+ /**
+ * Gets the SAMRecord in a PileupElement
+ */
private final static ReadGetter<PileupElement> PileupElementGetter = new ReadGetter<PileupElement>() {
@Override public GATKSAMRecord get(final PileupElement object) { return object.getRead(); }
};
@@ -87,13 +100,20 @@ public class FragmentUtils {
* and returns a FragmentCollection that contains the T objects whose underlying reads either overlap (or
* not) with their mate pairs.
*
- * @param readContainingObjects
- * @param nElements
- * @param getter
+ * @param readContainingObjects An iterator of objects that contain GATKSAMRecords
+ * @param nElements the number of elements to be provided by the iterator, which is usually known upfront and
+ * greatly improves the efficiency of the fragment calculation
+ * @param getter a helper function that takes an object of type T and returns its associated GATKSAMRecord
* @param <T>
- * @return
+ * @return a fragment collection
*/
- private final static <T> FragmentCollection<T> create(Iterable<T> readContainingObjects, int nElements, ReadGetter<T> getter) {
+ @Requires({
+ "readContainingObjects != null",
+ "nElements >= 0",
+ "getter != null"
+ })
+ @Ensures("result != null")
+ private static <T> FragmentCollection<T> create(final Iterable<T> readContainingObjects, final int nElements, final ReadGetter<T> getter) {
Collection<T> singletons = null;
Collection<List<T>> overlapping = null;
Map<String, T> nameMap = null;
@@ -145,30 +165,69 @@ public class FragmentUtils {
return new FragmentCollection(singletons, overlapping);
}
- public final static FragmentCollection create(ReadBackedPileup rbp) {
+ /**
+ * Create a FragmentCollection containing PileupElements from the ReadBackedPileup rbp
+ * @param rbp a non-null read-backed pileup. The elements in this ReadBackedPileup must be ordered
+ * @return a non-null FragmentCollection
+ */
+ @Ensures("result != null")
+ public static FragmentCollection create(final ReadBackedPileup rbp) {
+ if ( rbp == null ) throw new IllegalArgumentException("Pileup cannot be null");
return create(rbp, rbp.getNumberOfElements(), PileupElementGetter);
}
- public final static FragmentCollection create(List reads) {
+ /**
+ * Create a FragmentCollection containing GATKSAMRecords from a list of reads
+ *
+ * @param reads a non-null list of reads, ordered by their start location
+ * @return a non-null FragmentCollection
+ */
+ @Ensures("result != null")
+ public static FragmentCollection create(final List reads) {
+ if ( reads == null ) throw new IllegalArgumentException("reads cannot be null");
return create(reads, reads.size(), SamRecordGetter);
}
- public final static List mergeOverlappingPairedFragments( final List overlappingPair ) {
- final byte MIN_QUAL_BAD_OVERLAP = 16;
+ public static List mergeOverlappingPairedFragments( final List overlappingPair ) {
if( overlappingPair.size() != 2 ) { throw new ReviewedStingException("Found overlapping pair with " + overlappingPair.size() + " reads, but expecting exactly 2."); }
- GATKSAMRecord firstRead = overlappingPair.get(0);
- GATKSAMRecord secondRead = overlappingPair.get(1);
+ final GATKSAMRecord firstRead = overlappingPair.get(0);
+ final GATKSAMRecord secondRead = overlappingPair.get(1);
+
+ final GATKSAMRecord merged;
+ if( !(secondRead.getSoftStart() <= firstRead.getSoftEnd() && secondRead.getSoftStart() >= firstRead.getSoftStart() && secondRead.getSoftEnd() >= firstRead.getSoftEnd()) ) {
+ merged = mergeOverlappingPairedFragments(secondRead, firstRead);
+ } else {
+ merged = mergeOverlappingPairedFragments(firstRead, secondRead);
+ }
+
+ return merged == null ? overlappingPair : Collections.singletonList(merged);
+ }
+
+ /**
+ * Merge two overlapping reads from the same fragment into a single super read, if possible
+ *
+ * firstRead and secondRead must be part of the same fragment (though this isn't checked). Looks
+ * at the bases and alignment, and tries its best to create a meaningful synthetic single super read
+ * that represents the entire sequenced fragment.
+ *
+ * Assumes that firstRead starts before secondRead (according to their soft clipped starts)
+ *
+ * @param firstRead the left most read
+ * @param secondRead the right most read
+ *
+ * @return a strandless merged read of first and second, or null if the algorithm cannot create a meaningful one
+ */
+ public static GATKSAMRecord mergeOverlappingPairedFragments(final GATKSAMRecord firstRead, final GATKSAMRecord secondRead) {
+ if ( firstRead == null ) throw new IllegalArgumentException("firstRead cannot be null");
+ if ( secondRead == null ) throw new IllegalArgumentException("secondRead cannot be null");
+ if ( ! firstRead.getReadName().equals(secondRead.getReadName()) ) throw new IllegalArgumentException("attempting to merge two reads with different names " + firstRead + " and " + secondRead);
if( !(secondRead.getSoftStart() <= firstRead.getSoftEnd() && secondRead.getSoftStart() >= firstRead.getSoftStart() && secondRead.getSoftEnd() >= firstRead.getSoftEnd()) ) {
- firstRead = overlappingPair.get(1); // swap them
- secondRead = overlappingPair.get(0);
- }
- if( !(secondRead.getSoftStart() <= firstRead.getSoftEnd() && secondRead.getSoftStart() >= firstRead.getSoftStart() && secondRead.getSoftEnd() >= firstRead.getSoftEnd()) ) {
- return overlappingPair; // can't merge them, yet: AAAAAAAAAAA-BBBBBBBBBBB-AAAAAAAAAAAAAA, B is contained entirely inside A
+ return null; // can't merge them, yet: AAAAAAAAAAA-BBBBBBBBBBB-AAAAAAAAAAAAAA, B is contained entirely inside A
}
if( firstRead.getCigarString().contains("I") || firstRead.getCigarString().contains("D") || secondRead.getCigarString().contains("I") || secondRead.getCigarString().contains("D") ) {
- return overlappingPair; // fragments contain indels so don't merge them
+ return null; // fragments contain indels so don't merge them
}
final Pair pair = ReadUtils.getReadCoordinateForReferenceCoordinate(firstRead, secondRead.getSoftStart());
@@ -190,10 +249,10 @@ public class FragmentUtils {
}
for(int iii = firstReadStop; iii < firstRead.getReadLength(); iii++) {
if( firstReadQuals[iii] > MIN_QUAL_BAD_OVERLAP && secondReadQuals[iii-firstReadStop] > MIN_QUAL_BAD_OVERLAP && firstReadBases[iii] != secondReadBases[iii-firstReadStop] ) {
- return overlappingPair; // high qual bases don't match exactly, probably indel in only one of the fragments, so don't merge them
+ return null; // high qual bases don't match exactly, probably indel in only one of the fragments, so don't merge them
}
if( firstReadQuals[iii] < MIN_QUAL_BAD_OVERLAP && secondReadQuals[iii-firstReadStop] < MIN_QUAL_BAD_OVERLAP ) {
- return overlappingPair; // both reads have low qual bases in the overlap region so don't merge them because don't know what is going on
+ return null; // both reads have low qual bases in the overlap region so don't merge them because don't know what is going on
}
bases[iii] = ( firstReadQuals[iii] > secondReadQuals[iii-firstReadStop] ? firstReadBases[iii] : secondReadBases[iii-firstReadStop] );
quals[iii] = ( firstReadQuals[iii] > secondReadQuals[iii-firstReadStop] ? firstReadQuals[iii] : secondReadQuals[iii-firstReadStop] );
@@ -237,8 +296,6 @@ public class FragmentUtils {
returnRead.setBaseQualities( deletionQuals, EventType.BASE_DELETION );
}
- final ArrayList returnList = new ArrayList();
- returnList.add(returnRead);
- return returnList;
+ return returnRead;
}
}
diff --git a/public/java/test/org/broadinstitute/sting/utils/UtilsUnitTest.java b/public/java/test/org/broadinstitute/sting/utils/UtilsUnitTest.java
index 705db6f85..154b000ce 100644
--- a/public/java/test/org/broadinstitute/sting/utils/UtilsUnitTest.java
+++ b/public/java/test/org/broadinstitute/sting/utils/UtilsUnitTest.java
@@ -112,6 +112,19 @@ public class UtilsUnitTest extends BaseTest {
Assert.assertTrue("one-1;two-2;three-1;four-2;five-1;six-2".equals(joined));
}
+ @Test
+ public void testConcat() {
+ final String s1 = "A";
+ final String s2 = "CC";
+ final String s3 = "TTT";
+ final String s4 = "GGGG";
+ Assert.assertEquals(new String(Utils.concat()), "");
+ Assert.assertEquals(new String(Utils.concat(s1.getBytes())), s1);
+ Assert.assertEquals(new String(Utils.concat(s1.getBytes(), s2.getBytes())), s1 + s2);
+ Assert.assertEquals(new String(Utils.concat(s1.getBytes(), s2.getBytes(), s3.getBytes())), s1 + s2 + s3);
+ Assert.assertEquals(new String(Utils.concat(s1.getBytes(), s2.getBytes(), s3.getBytes(), s4.getBytes())), s1 + s2 + s3 + s4);
+ }
+
@Test
public void testEscapeExpressions() {
String[] expected, actual;
diff --git a/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java b/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java
index 15d69c400..89d192f9e 100644
--- a/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java
+++ b/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java
@@ -27,23 +27,30 @@ package org.broadinstitute.sting.utils.fragments;
import net.sf.samtools.SAMFileHeader;
import org.broadinstitute.sting.BaseTest;
+import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.pileup.PileupElement;
import org.broadinstitute.sting.utils.pileup.ReadBackedPileup;
import org.broadinstitute.sting.utils.pileup.ReadBackedPileupImpl;
+import org.broadinstitute.sting.utils.recalibration.EventType;
import org.broadinstitute.sting.utils.sam.ArtificialSAMUtils;
+import org.broadinstitute.sting.utils.sam.GATKSAMReadGroupRecord;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
import org.testng.Assert;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
-import java.util.*;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
/**
* Test routines for read-backed pileup.
*/
public class FragmentUtilsUnitTest extends BaseTest {
private static SAMFileHeader header;
+ private static GATKSAMReadGroupRecord rgForMerged;
+ private final static boolean DEBUG = false;
private class FragmentUtilsTest extends TestDataProvider {
List statesForPileup = new ArrayList();
@@ -119,7 +126,7 @@ public class FragmentUtilsUnitTest extends BaseTest {
return FragmentUtilsTest.getTests(FragmentUtilsTest.class);
}
- @Test(enabled = true, dataProvider = "fragmentUtilsTest")
+ @Test(enabled = !DEBUG, dataProvider = "fragmentUtilsTest")
public void testAsPileup(FragmentUtilsTest test) {
for ( TestState testState : test.statesForPileup ) {
ReadBackedPileup rbp = testState.pileup;
@@ -129,7 +136,7 @@ public class FragmentUtilsUnitTest extends BaseTest {
}
}
- @Test(enabled = true, dataProvider = "fragmentUtilsTest")
+ @Test(enabled = !DEBUG, dataProvider = "fragmentUtilsTest")
public void testAsListOfReadsFromPileup(FragmentUtilsTest test) {
for ( TestState testState : test.statesForPileup ) {
FragmentCollection fp = FragmentUtils.create(testState.pileup.getReads());
@@ -138,7 +145,7 @@ public class FragmentUtilsUnitTest extends BaseTest {
}
}
- @Test(enabled = true, dataProvider = "fragmentUtilsTest")
+ @Test(enabled = !DEBUG, dataProvider = "fragmentUtilsTest")
public void testAsListOfReads(FragmentUtilsTest test) {
for ( TestState testState : test.statesForReads ) {
FragmentCollection fp = FragmentUtils.create(testState.rawReads);
@@ -147,7 +154,7 @@ public class FragmentUtilsUnitTest extends BaseTest {
}
}
- @Test(enabled = true, expectedExceptions = IllegalArgumentException.class)
+ @Test(enabled = !DEBUG, expectedExceptions = IllegalArgumentException.class)
public void testOutOfOrder() {
final List pair = ArtificialSAMUtils.createPair(header, "readpair", 100, 1, 50, true, true);
final GATKSAMRecord left = pair.get(0);
@@ -161,5 +168,75 @@ public class FragmentUtilsUnitTest extends BaseTest {
@BeforeTest
public void setup() {
header = ArtificialSAMUtils.createArtificialSamHeader(1,1,1000);
+ rgForMerged = new GATKSAMReadGroupRecord("RG1");
+ }
+
+ @DataProvider(name = "MergeFragmentsTest")
+ public Object[][] createMergeFragmentsTest() throws Exception {
+ List tests = new ArrayList();
+
+ final String leftFlank = "CCC";
+ final String rightFlank = "AAA";
+ final String allOverlappingBases = "ACGTACGTGGAACCTTAG";
+ for ( int overlapSize = 1; overlapSize < allOverlappingBases.length(); overlapSize++ ) {
+ final String overlappingBases = allOverlappingBases.substring(0, overlapSize);
+ final byte[] overlappingBaseQuals = new byte[overlapSize];
+ for ( int i = 0; i < overlapSize; i++ ) overlappingBaseQuals[i] = (byte)(i + 30);
+ final GATKSAMRecord read1 = makeOverlappingRead(leftFlank, 20, overlappingBases, overlappingBaseQuals, "", 30, 1);
+ final GATKSAMRecord read2 = makeOverlappingRead("", 20, overlappingBases, overlappingBaseQuals, rightFlank, 30, leftFlank.length() + 1);
+ final GATKSAMRecord merged = makeOverlappingRead(leftFlank, 20, overlappingBases, overlappingBaseQuals, rightFlank, 30, 1);
+ tests.add(new Object[]{"equalQuals", read1, read2, merged});
+
+ // test that the merged read base quality is the
+ tests.add(new Object[]{"lowQualLeft", modifyBaseQualities(read1, leftFlank.length(), overlapSize), read2, merged});
+ tests.add(new Object[]{"lowQualRight", read1, modifyBaseQualities(read2, 0, overlapSize), merged});
+ }
+
+ return tests.toArray(new Object[][]{});
+ }
+
+ private GATKSAMRecord modifyBaseQualities(final GATKSAMRecord read, final int startOffset, final int length) throws Exception {
+ final GATKSAMRecord readWithLowQuals = (GATKSAMRecord)read.clone();
+ final byte[] withLowQuals = Arrays.copyOf(read.getBaseQualities(), read.getBaseQualities().length);
+ for ( int i = startOffset; i < startOffset + length; i++ )
+ withLowQuals[i] = (byte)(read.getBaseQualities()[i] + (i % 2 == 0 ? -1 : 0));
+ readWithLowQuals.setBaseQualities(withLowQuals);
+ return readWithLowQuals;
+ }
+
+ private GATKSAMRecord makeOverlappingRead(final String leftFlank, final int leftQual, final String overlapBases,
+ final byte[] overlapQuals, final String rightFlank, final int rightQual,
+ final int alignmentStart) {
+ final String bases = leftFlank + overlapBases + rightFlank;
+ final int readLength = bases.length();
+ final GATKSAMRecord read = ArtificialSAMUtils.createArtificialRead(header, "myRead", 0, alignmentStart, readLength);
+ final byte[] leftQuals = Utils.dupBytes((byte) leftQual, leftFlank.length());
+ final byte[] rightQuals = Utils.dupBytes((byte) rightQual, rightFlank.length());
+ final byte[] quals = Utils.concat(leftQuals, overlapQuals, rightQuals);
+ read.setCigarString(readLength + "M");
+ read.setReadBases(bases.getBytes());
+ for ( final EventType type : EventType.values() )
+ read.setBaseQualities(quals, type);
+ read.setReadGroup(rgForMerged);
+ read.setMappingQuality(60);
+ return read;
+ }
+
+ @Test(enabled = true, dataProvider = "MergeFragmentsTest")
+ public void testMergingTwoReads(final String name, final GATKSAMRecord read1, GATKSAMRecord read2, final GATKSAMRecord expectedMerged) {
+ final GATKSAMRecord actual = FragmentUtils.mergeOverlappingPairedFragments(read1, read2);
+
+ if ( expectedMerged == null ) {
+ Assert.assertNull(actual, "Expected reads not to merge, but got non-null result from merging");
+ } else {
+ Assert.assertNotNull(actual, "Expected reads to merge, but got null result from merging");
+ // I really care about the bases, the quals, the CIGAR, and the read group tag
+ Assert.assertEquals(actual.getCigarString(), expectedMerged.getCigarString());
+ Assert.assertEquals(actual.getReadBases(), expectedMerged.getReadBases());
+ Assert.assertEquals(actual.getReadGroup(), expectedMerged.getReadGroup());
+ Assert.assertEquals(actual.getMappingQuality(), expectedMerged.getMappingQuality());
+ for ( final EventType type : EventType.values() )
+ Assert.assertEquals(actual.getBaseQualities(type), expectedMerged.getBaseQualities(type), "Failed base qualities for event type " + type);
+ }
}
}
From b5b63eaac708ecc1c3b08725d1c66611d58a9be1 Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Mon, 11 Mar 2013 14:54:20 -0400
Subject: [PATCH 020/211] New GATKSAMRecord concept of a strandless read,
update to FS
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads. Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands. This means that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called. Added new GATKSAMRecord method setReducedCounts() that does the right thing. Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation. Differences are just minor updates to the FS
---
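Reviewer note (placed below the "---" marker, so it is not part of the commit message): the strandless rule in the FisherStrand change is easiest to check with a few concrete counts. This is a minimal sketch assuming a bare 2x2 int table, with rows for ref/alt and columns for forward/reverse. The class StrandlessCountSketch and its addRead method are hypothetical names used only for illustration; the arithmetic is copied from the patch.

```java
// Hypothetical illustration only: a bare 2x2 table and boolean flags stand in
// for the GATKSAMRecord and allele-matching logic used by FisherStrand.
final class StrandlessCountSketch {

    /**
     * Adds one read's contribution to the contingency table.
     * Rows: 0 = matches ref, 1 = matches alt. Columns: 0 = forward, 1 = reverse.
     */
    static void addRead(int[][] table, boolean matchesRef, boolean strandless,
                        boolean negativeStrand, int representativeCount) {
        final int row = matchesRef ? 0 : 1;
        if (strandless) {
            // Half the representative count goes to each strand, with a floor of 1
            // so that a single merged read is still observed on both strands.
            final int toAdd = Math.max(representativeCount / 2, 1);
            table[row][0] += toAdd;
            table[row][1] += toAdd;
        } else {
            // A normal read contributes its full count to its own strand.
            table[row][negativeStrand ? 1 : 0] += representativeCount;
        }
    }

    public static void main(String[] args) {
        final int[][] table = new int[2][2];
        addRead(table, true, true, false, 1);   // strandless, count 1: ref row becomes 1 / 1
        addRead(table, false, true, false, 5);  // strandless, count 5: alt row becomes 2 / 2
        addRead(table, false, false, true, 3);  // reverse-strand read: alt row becomes 2 / 5
        System.out.printf("%d %d; %d %d%n", table[0][0], table[0][1], table[1][0], table[1][1]);
    }
}
```

Two consequences of the integer arithmetic are worth keeping in mind when reading the updated MD5s: a strandless read with a count of 1 adds two observations in total (one per strand), and an odd count such as 5 adds 2 + 2, so one observation is dropped.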
.../gatk/walkers/annotator/FisherStrand.java | 30 +++++++----
.../reducereads/SlidingWindow.java | 6 +--
.../reducereads/SyntheticRead.java | 12 ++---
.../reducereads/SyntheticReadUnitTest.java | 2 +-
...lexAndSymbolicVariantsIntegrationTest.java | 2 +-
.../HaplotypeCallerIntegrationTest.java | 12 ++---
.../sting/utils/fragments/FragmentUtils.java | 1 +
.../sting/utils/sam/GATKSAMRecord.java | 50 +++++++++++++++++++
.../fragments/FragmentUtilsUnitTest.java | 1 +
.../utils/sam/GATKSAMRecordUnitTest.java | 31 +++++++++++-
10 files changed, 117 insertions(+), 30 deletions(-)
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/annotator/FisherStrand.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/annotator/FisherStrand.java
index 14c785678..39fdcb707 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/annotator/FisherStrand.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/annotator/FisherStrand.java
@@ -47,6 +47,7 @@
package org.broadinstitute.sting.gatk.walkers.annotator;
import cern.jet.math.Arithmetic;
+import org.apache.log4j.Logger;
import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
@@ -74,6 +75,8 @@ import java.util.*;
* calculated for certain complex indel cases or for multi-allelic sites.
*/
public class FisherStrand extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation {
+ private final static Logger logger = Logger.getLogger(FisherStrand.class);
+
private static final String FS = "FS";
private static final double MIN_PVALUE = 1E-320;
private static final int MIN_QUAL_FOR_FILTERED_TEST = 17;
@@ -95,6 +98,8 @@ public class FisherStrand extends InfoFieldAnnotation implements StandardAnnotat
else if (stratifiedPerReadAlleleLikelihoodMap != null) {
// either SNP with no alignment context, or indels: per-read likelihood map needed
final int[][] table = getContingencyTable(stratifiedPerReadAlleleLikelihoodMap, vc);
+// logger.info("VC " + vc);
+// printTable(table, 0.0);
return pValueForBestTable(table, null);
}
else
@@ -131,9 +136,6 @@ public class FisherStrand extends InfoFieldAnnotation implements StandardAnnotat
private Map annotationForOneTable(final double pValue) {
final Object value = String.format("%.3f", QualityUtils.phredScaleErrorRate(Math.max(pValue, MIN_PVALUE))); // prevent INFINITYs
return Collections.singletonMap(FS, value);
-// Map map = new HashMap();
-// map.put(FS, String.format("%.3f", QualityUtils.phredScaleErrorRate(pValue)));
-// return map;
}
public List getKeyNames() {
@@ -192,7 +194,7 @@ public class FisherStrand extends InfoFieldAnnotation implements StandardAnnotat
private static void printTable(int[][] table, double pValue) {
- System.out.printf("%d %d; %d %d : %f\n", table[0][0], table[0][1], table[1][0], table[1][1], pValue);
+ logger.info(String.format("%d %d; %d %d : %f", table[0][0], table[0][1], table[1][0], table[1][1], pValue));
}
private static boolean rotateTable(int[][] table) {
@@ -315,13 +317,21 @@ public class FisherStrand extends InfoFieldAnnotation implements StandardAnnotat
final boolean matchesAlt = allele.equals(alt, true);
if ( matchesRef || matchesAlt ) {
+ final int row = matchesRef ? 0 : 1;
- final boolean isFW = !read.getReadNegativeStrandFlag();
-
- int row = matchesRef ? 0 : 1;
- int column = isFW ? 0 : 1;
-
- table[row][column] += representativeCount;
+ if ( read.isStrandless() ) {
+ // a strandless read counts as observations on both strands, at 50% weight, with a minimum of 1
+ // (the 1 is to ensure that a strandless read always counts as an observation on both strands, even
+ // if the read is only seen once, because it's a merged read or other)
+ final int toAdd = Math.max(representativeCount / 2, 1);
+ table[row][0] += toAdd;
+ table[row][1] += toAdd;
+ } else {
+ // a normal read with an actual strand
+ final boolean isFW = !read.getReadNegativeStrandFlag();
+ final int column = isFW ? 0 : 1;
+ table[row][column] += representativeCount;
+ }
}
}
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java
index 6c063110e..11e023b9b 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java
@@ -567,7 +567,7 @@ public class SlidingWindow {
ObjectArrayList result = new ObjectArrayList();
if (filteredDataConsensus == null)
- filteredDataConsensus = new SyntheticRead(samHeader, readGroupAttribute, contig, contigIndex, filteredDataReadName + filteredDataConsensusCounter++, header.get(start).getLocation(), GATKSAMRecord.REDUCED_READ_CONSENSUS_TAG, hasIndelQualities, isNegativeStrand);
+ filteredDataConsensus = new SyntheticRead(samHeader, readGroupAttribute, contig, contigIndex, filteredDataReadName + filteredDataConsensusCounter++, header.get(start).getLocation(), hasIndelQualities, isNegativeStrand);
ListIterator headerElementIterator = header.listIterator(start);
for (int index = start; index < end; index++) {
@@ -583,7 +583,7 @@ public class SlidingWindow {
if ( filteredDataConsensus.getRefStart() + filteredDataConsensus.size() != headerElement.getLocation() ) {
result.add(finalizeFilteredDataConsensus());
- filteredDataConsensus = new SyntheticRead(samHeader, readGroupAttribute, contig, contigIndex, filteredDataReadName + filteredDataConsensusCounter++, headerElement.getLocation(), GATKSAMRecord.REDUCED_READ_CONSENSUS_TAG, hasIndelQualities, isNegativeStrand);
+ filteredDataConsensus = new SyntheticRead(samHeader, readGroupAttribute, contig, contigIndex, filteredDataReadName + filteredDataConsensusCounter++, headerElement.getLocation(), hasIndelQualities, isNegativeStrand);
}
genericAddBaseToConsensus(filteredDataConsensus, headerElement.getFilteredBaseCounts(), headerElement.getRMS());
@@ -606,7 +606,7 @@ public class SlidingWindow {
@Requires({"start >= 0 && (end >= start || end == 0)"})
private void addToRunningConsensus(LinkedList header, int start, int end, boolean isNegativeStrand) {
if (runningConsensus == null)
- runningConsensus = new SyntheticRead(samHeader, readGroupAttribute, contig, contigIndex, consensusReadName + consensusCounter++, header.get(start).getLocation(), GATKSAMRecord.REDUCED_READ_CONSENSUS_TAG, hasIndelQualities, isNegativeStrand);
+ runningConsensus = new SyntheticRead(samHeader, readGroupAttribute, contig, contigIndex, consensusReadName + consensusCounter++, header.get(start).getLocation(), hasIndelQualities, isNegativeStrand);
Iterator headerElementIterator = header.listIterator(start);
for (int index = start; index < end; index++) {
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticRead.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticRead.java
index 72fd52ebe..451e50286 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticRead.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticRead.java
@@ -124,8 +124,7 @@ public class SyntheticRead {
private final ObjectArrayList basesCountsQuals;
- private double mappingQuality; // the average of the rms of the mapping qualities of all the reads that contributed to this consensus
- private String readTag;
+ private double mappingQuality;
// Information to produce a GATKSAMRecord
private SAMFileHeader header;
@@ -147,14 +146,12 @@ public class SyntheticRead {
* @param contigIndex the read's contig index
* @param readName the read's name
* @param refStart the alignment start (reference based)
- * @param readTag the reduce reads tag for the synthetic read
*/
- public SyntheticRead(SAMFileHeader header, GATKSAMReadGroupRecord readGroupRecord, String contig, int contigIndex, String readName, int refStart, String readTag, boolean hasIndelQualities, boolean isNegativeRead) {
+ public SyntheticRead(SAMFileHeader header, GATKSAMReadGroupRecord readGroupRecord, String contig, int contigIndex, String readName, int refStart, boolean hasIndelQualities, boolean isNegativeRead) {
final int initialCapacity = 10000;
basesCountsQuals = new ObjectArrayList(initialCapacity);
mappingQuality = 0.0;
- this.readTag = readTag;
this.header = header;
this.readGroupRecord = readGroupRecord;
this.contig = contig;
@@ -165,13 +162,12 @@ public class SyntheticRead {
this.isNegativeStrand = isNegativeRead;
}
- public SyntheticRead(ObjectArrayList bases, ByteArrayList counts, ByteArrayList quals, ByteArrayList insertionQuals, ByteArrayList deletionQuals, double mappingQuality, String readTag, SAMFileHeader header, GATKSAMReadGroupRecord readGroupRecord, String contig, int contigIndex, String readName, int refStart, boolean hasIndelQualities, boolean isNegativeRead) {
+ public SyntheticRead(ObjectArrayList bases, ByteArrayList counts, ByteArrayList quals, ByteArrayList insertionQuals, ByteArrayList deletionQuals, double mappingQuality, SAMFileHeader header, GATKSAMReadGroupRecord readGroupRecord, String contig, int contigIndex, String readName, int refStart, boolean hasIndelQualities, boolean isNegativeRead) {
basesCountsQuals = new ObjectArrayList(bases.size());
for (int i = 0; i < bases.size(); ++i) {
basesCountsQuals.add(new SingleBaseInfo(bases.get(i).getOrdinalByte(), counts.get(i), quals.get(i), insertionQuals.get(i), deletionQuals.get(i)));
}
this.mappingQuality = mappingQuality;
- this.readTag = readTag;
this.header = header;
this.readGroupRecord = readGroupRecord;
this.contig = contig;
@@ -228,7 +224,7 @@ public class SyntheticRead {
read.setReadBases(convertReadBases());
read.setMappingQuality((int) Math.ceil(mappingQuality / basesCountsQuals.size()));
read.setReadGroup(readGroupRecord);
- read.setAttribute(readTag, convertBaseCounts());
+ read.setReducedReadCounts(convertBaseCounts());
if (hasIndelQualities) {
read.setBaseQualities(convertInsertionQualities(), EventType.BASE_INSERTION);
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticReadUnitTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticReadUnitTest.java
index 1ed28dec2..570b797ca 100644
--- a/protected/java/test/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticReadUnitTest.java
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SyntheticReadUnitTest.java
@@ -77,7 +77,7 @@ public void testBaseCounts() {
new TestRead(bases, quals, new byte[] {1, 127, 51, 126}, new byte [] {1, 126, 50, 125})};
for (TestRead testRead : testReads) {
- SyntheticRead syntheticRead = new SyntheticRead(new ObjectArrayList(testRead.getBases()), new ByteArrayList(testRead.getCounts()), new ByteArrayList(testRead.getQuals()), new ByteArrayList(testRead.getInsQuals()), new ByteArrayList(testRead.getDelQuals()), artificialMappingQuality, GATKSAMRecord.REDUCED_READ_CONSENSUS_TAG, artificialSAMHeader, artificialGATKRG, artificialContig, artificialContigIndex, artificialReadName, artificialRefStart, false, false);
+ SyntheticRead syntheticRead = new SyntheticRead(new ObjectArrayList(testRead.getBases()), new ByteArrayList(testRead.getCounts()), new ByteArrayList(testRead.getQuals()), new ByteArrayList(testRead.getInsQuals()), new ByteArrayList(testRead.getDelQuals()), artificialMappingQuality, artificialSAMHeader, artificialGATKRG, artificialContig, artificialContigIndex, artificialReadName, artificialRefStart, false, false);
Assert.assertEquals(syntheticRead.convertBaseCounts(), testRead.getExpectedCounts());
}
}
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest.java
index 2e3e45247..fcf9168b3 100644
--- a/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest.java
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest.java
@@ -63,7 +63,7 @@ public class HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest extends Wa
@Test
public void testHaplotypeCallerMultiSampleComplex() {
- HCTestComplexVariants(privateTestDir + "AFR.complex.variants.bam", "", "a2232995ca9bec143e664748845a0045");
+ HCTestComplexVariants(privateTestDir + "AFR.complex.variants.bam", "", "b83b53741edb07218045d6f25f20a18b");
}
private void HCTestSymbolicVariants(String bam, String args, String md5) {
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerIntegrationTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerIntegrationTest.java
index bf2ddea12..8ed589c63 100644
--- a/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerIntegrationTest.java
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCallerIntegrationTest.java
@@ -68,12 +68,12 @@ public class HaplotypeCallerIntegrationTest extends WalkerTest {
@Test
public void testHaplotypeCallerMultiSample() {
- HCTest(CEUTRIO_BAM, "", "8f33e40686443b9a72de45d5a9da1861");
+ HCTest(CEUTRIO_BAM, "", "4a2880f0753e6e813b9e0c35209b3708");
}
@Test
public void testHaplotypeCallerSingleSample() {
- HCTest(NA12878_BAM, "", "8f2b047cdace0ef122d6ad162e7bc5b9");
+ HCTest(NA12878_BAM, "", "588892934f2e81247bf32e457db88449");
}
@Test(enabled = false)
@@ -84,7 +84,7 @@ public class HaplotypeCallerIntegrationTest extends WalkerTest {
@Test
public void testHaplotypeCallerMultiSampleGGA() {
HCTest(CEUTRIO_BAM, "--max_alternate_alleles 3 -gt_mode GENOTYPE_GIVEN_ALLELES -out_mode EMIT_ALL_SITES -alleles " + validationDataLocation + "combined.phase1.chr20.raw.indels.sites.vcf",
- "9d4be26a2c956ba4b7b4044820eab030");
+ "fa1b92373c89d2238542a319ad25c257");
}
private void HCTestIndelQualityScores(String bam, String args, String md5) {
@@ -112,7 +112,7 @@ public class HaplotypeCallerIntegrationTest extends WalkerTest {
@Test
public void HCTestStructuralIndels() {
final String base = String.format("-T HaplotypeCaller -R %s -I %s", REF, privateTestDir + "AFR.structural.indels.bam") + " --no_cmdline_in_header -o %s -minPruning 6 -L 20:8187565-8187800 -L 20:18670537-18670730";
- final WalkerTestSpec spec = new WalkerTestSpec(base, Arrays.asList("03557376242bdf78c5237703b762573b"));
+ final WalkerTestSpec spec = new WalkerTestSpec(base, Arrays.asList("9296f1af6cf1f1cc4b79494eb366e976"));
executeTest("HCTestStructuralIndels: ", spec);
}
@@ -134,7 +134,7 @@ public class HaplotypeCallerIntegrationTest extends WalkerTest {
public void HCTestReducedBam() {
WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
"-T HaplotypeCaller -R " + b37KGReference + " --no_cmdline_in_header -I " + privateTestDir + "bamExample.ReducedRead.ADAnnotation.bam -o %s -L 1:67,225,396-67,288,518", 1,
- Arrays.asList("adb08cb25e902cfe0129404a682b2169"));
+ Arrays.asList("cf0a1bfded656153578df6cf68aa68a2"));
executeTest("HC calling on a ReducedRead BAM", spec);
}
@@ -142,7 +142,7 @@ public class HaplotypeCallerIntegrationTest extends WalkerTest {
public void testReducedBamWithReadsNotFullySpanningDeletion() {
WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
"-T HaplotypeCaller -R " + b37KGReference + " --no_cmdline_in_header -I " + privateTestDir + "reduced.readNotFullySpanningDeletion.bam -o %s -L 1:167871297", 1,
- Arrays.asList("a43c595a617589388ff3d7e2ddc661e7"));
+ Arrays.asList("addceb63f5bfa9f11e15335d5bf641e9"));
executeTest("test calling on a ReducedRead BAM where the reads do not fully span a deletion", spec);
}
}
diff --git a/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java b/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java
index fa0187728..99f1d99c7 100644
--- a/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java
+++ b/public/java/src/org/broadinstitute/sting/utils/fragments/FragmentUtils.java
@@ -263,6 +263,7 @@ public final class FragmentUtils {
}
final GATKSAMRecord returnRead = new GATKSAMRecord( firstRead.getHeader() );
+ returnRead.setIsStrandless(true);
returnRead.setAlignmentStart( firstRead.getSoftStart() );
returnRead.setReadBases( bases );
returnRead.setBaseQualities( quals );
diff --git a/public/java/src/org/broadinstitute/sting/utils/sam/GATKSAMRecord.java b/public/java/src/org/broadinstitute/sting/utils/sam/GATKSAMRecord.java
index 01a8c1996..c5f9f606b 100644
--- a/public/java/src/org/broadinstitute/sting/utils/sam/GATKSAMRecord.java
+++ b/public/java/src/org/broadinstitute/sting/utils/sam/GATKSAMRecord.java
@@ -74,6 +74,8 @@ public class GATKSAMRecord extends BAMRecord {
private int softEnd = UNINITIALIZED;
private Integer adapterBoundary = null;
+ private boolean isStrandlessRead = false;
+
// because some values can be null, we don't want to duplicate effort
private boolean retrievedReadGroup = false;
private boolean retrievedReduceReadCounts = false;
@@ -141,6 +143,45 @@ public class GATKSAMRecord extends BAMRecord {
return ArtificialSAMUtils.createArtificialRead(cigar);
}
+ ///////////////////////////////////////////////////////////////////////////////
+ // *** support for reads without meaningful strand information ***//
+ ///////////////////////////////////////////////////////////////////////////////
+
+ /**
+ * Does this read have a meaningful strandedness value?
+ *
+ * Some advanced types of reads, such as reads coming from merged fragments,
+ * don't have meaningful strandedness values, as they are composites of multiple
+ * other reads. Strandless reads need to be handled specially by code that cares about
+ * stranded information, such as FS.
+ *
+ * @return true if this read doesn't have meaningful strand information
+ */
+ public boolean isStrandless() {
+ return isStrandlessRead;
+ }
+
+ /**
+ * Set the strandless state of this read to isStrandless
+ * @param isStrandless true if this read doesn't have a meaningful strandedness value
+ */
+ public void setIsStrandless(final boolean isStrandless) {
+ this.isStrandlessRead = isStrandless;
+ }
+
+ @Override
+ public boolean getReadNegativeStrandFlag() {
+ return ! isStrandless() && super.getReadNegativeStrandFlag();
+ }
+
+ @Override
+ public void setReadNegativeStrandFlag(boolean flag) {
+ if ( isStrandless() )
+ throw new IllegalStateException("Cannot set the strand of a strandless read");
+ super.setReadNegativeStrandFlag(flag);
+ }
+
+
///////////////////////////////////////////////////////////////////////////////
// *** The following methods are overloaded to cache the appropriate data ***//
///////////////////////////////////////////////////////////////////////////////
@@ -313,6 +354,15 @@ public class GATKSAMRecord extends BAMRecord {
return getReducedReadCounts() != null;
}
+ /**
+ * Set the reduced read counts for this record to counts
+ * @param counts the count array
+ */
+ public void setReducedReadCounts(final byte[] counts) {
+ retrievedReduceReadCounts = false;
+ setAttribute(REDUCED_READ_CONSENSUS_TAG, counts);
+ }
+
/**
* The number of bases corresponding the i'th base of the reduced read.
*
diff --git a/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java b/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java
index 89d192f9e..4f49eb933 100644
--- a/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java
+++ b/public/java/test/org/broadinstitute/sting/utils/fragments/FragmentUtilsUnitTest.java
@@ -229,6 +229,7 @@ public class FragmentUtilsUnitTest extends BaseTest {
if ( expectedMerged == null ) {
Assert.assertNull(actual, "Expected reads not to merge, but got non-null result from merging");
} else {
Assert.assertNotNull(actual, "Expected reads to merge, but got null result from merging");
+ Assert.assertTrue(actual.isStrandless(), "Merged reads should be strandless");
// I really care about the bases, the quals, the CIGAR, and the read group tag
Assert.assertEquals(actual.getCigarString(), expectedMerged.getCigarString());
diff --git a/public/java/test/org/broadinstitute/sting/utils/sam/GATKSAMRecordUnitTest.java b/public/java/test/org/broadinstitute/sting/utils/sam/GATKSAMRecordUnitTest.java
index baf4bfbb0..38840fab1 100644
--- a/public/java/test/org/broadinstitute/sting/utils/sam/GATKSAMRecordUnitTest.java
+++ b/public/java/test/org/broadinstitute/sting/utils/sam/GATKSAMRecordUnitTest.java
@@ -64,6 +64,7 @@ public class GATKSAMRecordUnitTest extends BaseTest {
for (int i = 0; i < reducedRead.getReadLength(); i++) {
Assert.assertEquals(reducedRead.getReducedCount(i), REDUCED_READ_COUNTS[i], "Reduced read count not set to the expected value at " + i);
}
+ Assert.assertEquals(reducedRead.isStrandless(), false, "Reduced reads should not be flagged as strandless");
}
@Test
@@ -103,7 +104,35 @@ public class GATKSAMRecordUnitTest extends BaseTest {
read.setAttribute(GATKSAMRecord.REDUCED_READ_ORIGINAL_ALIGNMENT_START_SHIFT, null);
Assert.assertEquals(read.getAlignmentStart(), read.getOriginalAlignmentStart());
Assert.assertEquals(read.getAlignmentEnd() - alignmentShift, read.getOriginalAlignmentEnd());
-
}
+ @Test
+ public void testStrandlessReads() {
+ final byte [] bases = {'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'};
+ final byte [] quals = {20 , 20 , 20 , 20 , 20 , 20 , 20 , 20 };
+ GATKSAMRecord read = ArtificialSAMUtils.createArtificialRead(bases, quals, "8M");
+ Assert.assertEquals(read.isStrandless(), false);
+
+ read.setReadNegativeStrandFlag(false);
+ Assert.assertEquals(read.isStrandless(), false);
+ Assert.assertEquals(read.getReadNegativeStrandFlag(), false);
+
+ read.setReadNegativeStrandFlag(true);
+ Assert.assertEquals(read.isStrandless(), false);
+ Assert.assertEquals(read.getReadNegativeStrandFlag(), true);
+
+ read.setReadNegativeStrandFlag(true);
+ read.setIsStrandless(true);
+ Assert.assertEquals(read.isStrandless(), true);
+ Assert.assertEquals(read.getReadNegativeStrandFlag(), false, "negative strand flag should return false even though it's set for a strandless read");
+ }
+
+ @Test(expectedExceptions = IllegalStateException.class)
+ public void testStrandlessReadsFailSetStrand() {
+ final byte [] bases = {'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'};
+ final byte [] quals = {20 , 20 , 20 , 20 , 20 , 20 , 20 , 20 };
+ GATKSAMRecord read = ArtificialSAMUtils.createArtificialRead(bases, quals, "8M");
+ read.setIsStrandless(true);
+ read.setReadNegativeStrandFlag(true);
+ }
}
From ff87b62fe3e711775c4995facd2c31718d029f79 Mon Sep 17 00:00:00 2001
From: Eric Banks
Date: Tue, 12 Mar 2013 13:58:20 -0400
Subject: [PATCH 021/211] Fixed bug in SelectVariants where maxIndelSize
argument wasn't getting applied to deletions.
Added unit tests and docs.
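
The fix can be illustrated with a minimal, self-contained sketch. It assumes that indel lengths are signed (alternate allele length minus reference allele length, so deletions are negative), which is what the Math.abs() in this patch implies. The class and method names below are hypothetical and only mirror the patched helper; they are not the SelectVariants code itself.

```java
import java.util.Arrays;
import java.util.List;

final class IndelSizeSketch {

    // Patched behaviour: compare the magnitude of each signed indel length to the maximum.
    // Assumption: lengths are alt length minus ref length, so deletions are negative.
    static boolean containsIndelLargerThan(final List<Integer> indelLengths, final int maxIndelSize) {
        if (indelLengths == null)
            return false; // no indel alleles at all
        for (final int length : indelLengths) {
            if (Math.abs(length) > maxIndelSize)
                return true;
        }
        return false;
    }

    // Pre-patch behaviour, kept only for comparison: a negative length can never exceed the maximum.
    static boolean signedCheck(final List<Integer> indelLengths, final int maxIndelSize) {
        if (indelLengths == null)
            return false;
        for (final int length : indelLengths) {
            if (length > maxIndelSize)
                return true;
        }
        return false;
    }

    public static void main(final String[] args) {
        final List<Integer> tenBaseDeletion = Arrays.asList(-10);
        System.out.println(signedCheck(tenBaseDeletion, 5));             // prints false
        System.out.println(containsIndelLargerThan(tenBaseDeletion, 5)); // prints true
    }
}
```

With the signed comparison, a 10-base deletion passes a maxIndelSize of 5; with the absolute value it is filtered, as an insertion of the same size would be.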
---
.../walkers/variantutils/SelectVariants.java | 20 +++--
.../variantutils/SelectVariantsUnitTest.java | 88 +++++++++++++++++++
2 files changed, 102 insertions(+), 6 deletions(-)
create mode 100644 public/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariantsUnitTest.java
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
index f72ce3bd6..b64c64d11 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariants.java
@@ -507,7 +507,7 @@ public class SelectVariants extends RodWalker implements TreeR
if (!selectedTypes.contains(vc.getType()))
continue;
- if ( badIndelSize(vc) )
+ if ( containsIndelLargerThan(vc, maxIndelSize) )
continue;
VariantContext sub = subsetRecord(vc, EXCLUDE_NON_VARIANTS);
@@ -531,12 +531,20 @@ public class SelectVariants extends RodWalker implements TreeR
return 1;
}
- private boolean badIndelSize(final VariantContext vc) {
- List lengths = vc.getIndelLengths();
+ /**
+ * Determines whether any alternate allele is an indel (insertion or deletion) longer than the max indel size
+ *
+ * @param vc the variant context to check
+ * @param maxIndelSize the maximum size of allowed indels
+ * @return true if the VC contains an indel larger than maxIndelSize and false otherwise
+ */
+ protected static boolean containsIndelLargerThan(final VariantContext vc, final int maxIndelSize) {
+ final List lengths = vc.getIndelLengths();
if ( lengths == null )
- return false; // VC does not harbor indel
- for ( Integer indelLength : vc.getIndelLengths() ) {
- if ( indelLength > maxIndelSize )
+ return false;
+
+ for ( Integer indelLength : lengths ) {
+ if ( Math.abs(indelLength) > maxIndelSize )
return true;
}
diff --git a/public/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariantsUnitTest.java b/public/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariantsUnitTest.java
new file mode 100644
index 000000000..ca60c6cfe
--- /dev/null
+++ b/public/java/test/org/broadinstitute/sting/gatk/walkers/variantutils/SelectVariantsUnitTest.java
@@ -0,0 +1,88 @@
+/*
+* Copyright (c) 2012 The Broad Institute
+*
+* Permission is hereby granted, free of charge, to any person
+* obtaining a copy of this software and associated documentation
+* files (the "Software"), to deal in the Software without
+* restriction, including without limitation the rights to use,
+* copy, modify, merge, publish, distribute, sublicense, and/or sell
+* copies of the Software, and to permit persons to whom the
+* Software is furnished to do so, subject to the following
+* conditions:
+*
+* The above copyright notice and this permission notice shall be
+* included in all copies or substantial portions of the Software.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
+* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.variantutils;
+
+import org.broadinstitute.sting.BaseTest;
+import org.broadinstitute.sting.utils.Utils;
+import org.broadinstitute.variant.variantcontext.Allele;
+import org.broadinstitute.variant.variantcontext.VariantContext;
+import org.broadinstitute.variant.variantcontext.VariantContextBuilder;
+import org.testng.Assert;
+import org.testng.annotations.DataProvider;
+import org.testng.annotations.Test;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+
+public class SelectVariantsUnitTest extends BaseTest {
+
+ //////////////////////////////////////////
+ // Tests for maxIndelSize functionality //
+ //////////////////////////////////////////
+
+ @DataProvider(name = "MaxIndelSize")
+ public Object[][] MaxIndelSizeTestData() {
+
+ List tests = new ArrayList();
+
+ for ( final int size : Arrays.asList(1, 3, 10, 100) ) {
+ for ( final int otherSize : Arrays.asList(0, 1) ) {
+ for ( final int max : Arrays.asList(0, 1, 5, 50, 100000) ) {
+ for ( final String op : Arrays.asList("D", "I") ) {
+ tests.add(new Object[]{size, otherSize, max, op});
+ }
+ }
+ }
+ }
+
+ return tests.toArray(new Object[][]{});
+ }
+
+ @Test(dataProvider = "MaxIndelSize")
+ public void maxIndelSizeTest(final int size, final int otherSize, final int max, final String op) {
+
+ final byte[] largerAllele = Utils.dupBytes((byte) 'A', size+1);
+ final byte[] smallerAllele = Utils.dupBytes((byte) 'A', 1);
+
+ final List alleles = new ArrayList(2);
+ final Allele ref = Allele.create(op.equals("I") ? smallerAllele : largerAllele, true);
+ final Allele alt = Allele.create(op.equals("D") ? smallerAllele : largerAllele, false);
+ alleles.add(ref);
+ alleles.add(alt);
+ if ( otherSize > 0 && otherSize != size ) {
+ final Allele otherAlt = Allele.create(op.equals("D") ? Utils.dupBytes((byte) 'A', size-otherSize+1) : Utils.dupBytes((byte) 'A', otherSize+1), false);
+ alleles.add(otherAlt);
+ }
+
+ final VariantContext vc = new VariantContextBuilder("test", "1", 10, 10 + ref.length() - 1, alleles).make();
+
+ boolean hasTooLargeIndel = SelectVariants.containsIndelLargerThan(vc, max);
+ Assert.assertEquals(hasTooLargeIndel, size > max);
+ }
+
+}
\ No newline at end of file
From 573ed07ad06ec022b4bf9896384cd452162ca116 Mon Sep 17 00:00:00 2001
From: Eric Banks
Date: Thu, 14 Mar 2013 11:06:45 -0400
Subject: [PATCH 024/211] Fixed reported bug in BQSR for RNA seq alignments
with Ns.
* ClippingOp updated to incorporate Ns in the hard clips.
* ReadUtils.getReadCoordinateForReferenceCoordinate() updated to account for Ns.
* Added test that covers the BQSR case we saw.
* Created GSA-856 (for Mauricio) to add lots of tests to ReadUtils.
* That will require refactoring code, which is beyond the scope of this fix.
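
Why Ns matter here: like a deletion, an N (skipped region) consumes reference positions without consuming read bases. The sketch below walks a CIGAR string under that rule. It is a simplified illustration, not the ReadUtils implementation: the class name, the method name and the "first aligned base at or after the coordinate" convention are assumptions made for this example, and the ClippingTail handling of the real method is not modelled.

```java
final class CigarWalkSketch {

    /**
     * Returns the 0-based offset of the first aligned read base at or after refCoord,
     * or -1 if refCoord lies past the end of the alignment.
     */
    static int readOffsetAtOrAfter(final int alignmentStart, final String cigar, final int refCoord) {
        int refPos = alignmentStart; // reference position of the next unconsumed reference base
        int readPos = 0;             // offset of the next unconsumed read base

        int i = 0;
        while (i < cigar.length()) {
            // parse one "<length><op>" element
            int j = i;
            while (Character.isDigit(cigar.charAt(j)))
                j++;
            final int length = Integer.parseInt(cigar.substring(i, j));
            final char op = cigar.charAt(j);
            i = j + 1;

            switch (op) {
                case 'M': case '=': case 'X': // consumes both read and reference
                    if (refCoord < refPos + length)
                        return readPos + Math.max(0, refCoord - refPos);
                    refPos += length;
                    readPos += length;
                    break;
                case 'D': case 'N':           // consumes reference only: no read bases to count
                    refPos += length;
                    break;
                case 'I': case 'S':           // consumes read only
                    readPos += length;
                    break;
                default:                      // H and P consume neither
                    break;
            }
        }
        return -1;
    }

    public static void main(final String[] args) {
        // Same read as the unit test in this patch: start 8975, CIGAR 3M414N1D73M.
        System.out.println(readOffsetAtOrAfter(8975, "3M414N1D73M", 9392)); // prints 3
    }
}
```

For the read in the new unit test, 3M covers 8975-8977, 414N covers 8978-9391 and 1D covers 9392, so reference position 9392 has no read base of its own and the next aligned base is at read offset 3, the value the test asserts.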
---
.../sting/utils/clipping/ClippingOp.java | 4 ++--
.../sting/utils/sam/ReadUtils.java | 2 +-
.../sting/utils/sam/ReadUtilsUnitTest.java | 22 +++++++++++++++++++
3 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/public/java/src/org/broadinstitute/sting/utils/clipping/ClippingOp.java b/public/java/src/org/broadinstitute/sting/utils/clipping/ClippingOp.java
index fe1a386fb..ad6f05563 100644
--- a/public/java/src/org/broadinstitute/sting/utils/clipping/ClippingOp.java
+++ b/public/java/src/org/broadinstitute/sting/utils/clipping/ClippingOp.java
@@ -581,8 +581,8 @@ public class ClippingOp {
if (cigarElement.getOperator() == CigarOperator.INSERTION)
return -clippedLength;
- // Deletions should be added to the total hard clip count
- else if (cigarElement.getOperator() == CigarOperator.DELETION)
+ // Deletions and Ns should be added to the total hard clip count (because we want to maintain the original alignment start)
+ else if (cigarElement.getOperator() == CigarOperator.DELETION || cigarElement.getOperator() == CigarOperator.SKIPPED_REGION)
return cigarElement.getLength();
// There is no shift if we are not clipping an indel
diff --git a/public/java/src/org/broadinstitute/sting/utils/sam/ReadUtils.java b/public/java/src/org/broadinstitute/sting/utils/sam/ReadUtils.java
index 95e0d55f3..c84e4245d 100644
--- a/public/java/src/org/broadinstitute/sting/utils/sam/ReadUtils.java
+++ b/public/java/src/org/broadinstitute/sting/utils/sam/ReadUtils.java
@@ -524,7 +524,7 @@ public class ReadUtils {
// If we reached our goal inside a deletion, but the deletion is the next cigar element then we need
// to add the shift of the current cigar element but go back to it's last element to return the last
// base before the deletion (see warning in function contracts)
- else if (fallsInsideDeletion && !endsWithinCigar)
+ else if (fallsInsideDeletion && !endsWithinCigar && cigarElement.getOperator().consumesReadBases())
readBases += shift - 1;
// If we reached our goal inside a deletion then we must backtrack to the last base before the deletion
diff --git a/public/java/test/org/broadinstitute/sting/utils/sam/ReadUtilsUnitTest.java b/public/java/test/org/broadinstitute/sting/utils/sam/ReadUtilsUnitTest.java
index baad67d53..331121c55 100644
--- a/public/java/test/org/broadinstitute/sting/utils/sam/ReadUtilsUnitTest.java
+++ b/public/java/test/org/broadinstitute/sting/utils/sam/ReadUtilsUnitTest.java
@@ -25,13 +25,19 @@
package org.broadinstitute.sting.utils.sam;
+import net.sf.picard.reference.IndexedFastaSequenceFile;
+import net.sf.samtools.SAMFileHeader;
import org.broadinstitute.sting.BaseTest;
import org.broadinstitute.sting.gatk.GenomeAnalysisEngine;
import org.broadinstitute.sting.utils.BaseUtils;
+import org.broadinstitute.sting.utils.Utils;
+import org.broadinstitute.sting.utils.fasta.CachingIndexedFastaSequenceFile;
import org.testng.Assert;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
+import java.io.File;
+import java.io.FileNotFoundException;
import java.util.*;
@@ -179,4 +185,20 @@ public class ReadUtilsUnitTest extends BaseTest {
final List reads = new LinkedList();
Assert.assertEquals(ReadUtils.getMaxReadLength(reads), 0, "Empty list should have max length of zero");
}
+
+ @Test (enabled = true)
+ public void testReadWithNs() throws FileNotFoundException {
+
+ final IndexedFastaSequenceFile seq = new CachingIndexedFastaSequenceFile(new File(b37KGReference));
+ final SAMFileHeader header = ArtificialSAMUtils.createArtificialSamHeader(seq.getSequenceDictionary());
+ final int readLength = 76;
+
+ final GATKSAMRecord read = ArtificialSAMUtils.createArtificialRead(header, "myRead", 0, 8975, readLength);
+ read.setReadBases(Utils.dupBytes((byte) 'A', readLength));
+ read.setBaseQualities(Utils.dupBytes((byte)30, readLength));
+ read.setCigarString("3M414N1D73M");
+
+ final int result = ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, 9392, ReadUtils.ClippingTail.LEFT_TAIL);
+ Assert.assertEquals(result, 3);
+ }
}
From 7cab709a88c86145d3be601c5ec2ea6476aa02a3 Mon Sep 17 00:00:00 2001
From: Eric Banks
Date: Wed, 13 Mar 2013 14:57:28 -0400
Subject: [PATCH 025/211] Fixed the logic of the @Output annotation and its
interaction with 'required'.
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:
I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
* The 'defaultToStdout' tag lets walkers specify whether to default to stdout if -o is not provided.
* The logic for @Output is now:
* if required==true then -o MUST be provided or a User Error is generated.
* if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
* this is the default behavior (i.e. @Output with no modifiers).
* if required==false and defaultToStdout==false then the output object is null.
* use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).
* I have updated walkers so that previous behavior has been maintained (as best I could).
* In general, all @Outputs with default long/short names have required=false.
* Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
* I added unit tests for @Output changes with David's help (thanks!).
* #resolve GSA-837
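
The three cases above can be summarised as a small decision function. This is only a sketch of the rules as described in this message: the class and method names are hypothetical, IllegalArgumentException stands in for the User Error, and a PrintStream stands in for whatever output type a walker declares. The real logic lives in the argument type descriptors touched by this patch.

```java
import java.io.PrintStream;

final class OutputResolutionSketch {

    /**
     * Resolves the output for an @Output-style argument.
     *
     * @param required        whether -o must be given
     * @param defaultToStdout whether to fall back to stdout when -o is missing
     * @param provided        the stream built from -o, or null if -o was not given
     * @return the stream to write to, or null for an optional output that was not requested
     */
    static PrintStream resolve(final boolean required, final boolean defaultToStdout, final PrintStream provided) {
        if (provided != null)
            return provided; // -o was given, so it always wins
        if (required)
            throw new IllegalArgumentException("-o is required but was not provided");
        return defaultToStdout ? System.out : null;
    }

    public static void main(final String[] args) {
        System.out.println(resolve(false, true, null) == System.out); // prints true
        System.out.println(resolve(false, false, null) == null);      // prints true
    }
}
```

Walkers that use the required==false, defaultToStdout==false combination need a null check before writing, since the output object is null when the option is omitted.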
---
.../bqsr/RecalibrationArgumentCollection.java | 4 +-
.../bqsr/RecalibrationPerformance.java | 2 +-
.../compression/reducereads/ReduceReads.java | 11 +-
.../targets/BaseCoverageDistribution.java | 2 +-
.../diagnostics/targets/DiagnoseTargets.java | 2 +-
.../targets/FindCoveredIntervals.java | 2 +-
.../walkers/genotyper/UnifiedGenotyper.java | 2 +-
.../haplotypecaller/HaplotypeCaller.java | 8 +-
.../haplotypecaller/HaplotypeResolver.java | 2 +-
.../gatk/walkers/indels/IndelRealigner.java | 8 +-
.../walkers/phasing/ReadBackedPhasing.java | 2 +-
.../ValidationSiteSelector.java | 2 +-
.../ApplyRecalibration.java | 2 +-
.../VariantRecalibrator.java | 2 +-
.../variantutils/RegenotypeVariants.java | 2 +-
.../sting/commandline/ArgumentSource.java | 8 +
.../sting/commandline/Output.java | 7 +
.../OutputStreamArgumentTypeDescriptor.java | 6 +-
.../SAMFileWriterArgumentTypeDescriptor.java | 6 +-
.../VCFWriterArgumentTypeDescriptor.java | 8 +-
.../gatk/walkers/ActiveRegionWalker.java | 4 +-
.../walkers/annotator/VariantAnnotator.java | 2 +-
.../walkers/beagle/BeagleOutputToVCF.java | 2 +-
.../walkers/beagle/ProduceBeagleInput.java | 4 +-
.../beagle/VariantsToBeagleUnphased.java | 2 +-
.../diagnostics/CoveredByNSamplesSites.java | 2 +-
.../gatk/walkers/diffengine/DiffObjects.java | 2 +-
.../walkers/filters/VariantFiltration.java | 2 +-
.../gatk/walkers/qc/DocumentationTest.java | 2 +-
.../gatk/walkers/readutils/ClipReads.java | 4 +-
.../gatk/walkers/readutils/PrintReads.java | 2 +-
.../walkers/variantutils/CombineVariants.java | 2 +-
.../variantutils/FilterLiftedVariants.java | 2 +-
.../variantutils/LeftAlignVariants.java | 2 +-
.../variantutils/LiftoverVariants.java | 2 +-
.../walkers/variantutils/SelectHeaders.java | 2 +-
.../walkers/variantutils/SelectVariants.java | 2 +-
.../VariantValidationAssessor.java | 2 +-
.../VariantsToAllelicPrimitives.java | 2 +-
.../walkers/variantutils/VariantsToTable.java | 2 +-
.../walkers/variantutils/VariantsToVCF.java | 2 +-
.../ArgumentTypeDescriptorUnitTest.java | 183 ++++++++++++++++++
42 files changed, 262 insertions(+), 57 deletions(-)
create mode 100644 public/java/test/org/broadinstitute/sting/commandline/ArgumentTypeDescriptorUnitTest.java
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java
index ee2edee5a..447569643 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationArgumentCollection.java
@@ -91,7 +91,7 @@ public class RecalibrationArgumentCollection {
* If not provided, then no plots will be generated (useful for queue scatter/gathering).
* However, we *highly* recommend that users generate these plots whenever possible for QC checking.
*/
- @Output(fullName = "plot_pdf_file", shortName = "plots", doc = "The output recalibration pdf file to create", required = false)
+ @Output(fullName = "plot_pdf_file", shortName = "plots", doc = "The output recalibration pdf file to create", required = false, defaultToStdout = false)
public File RECAL_PDF_FILE = null;
/**
@@ -220,7 +220,7 @@ public class RecalibrationArgumentCollection {
public String FORCE_PLATFORM = null;
@Hidden
- @Output(fullName = "recal_table_update_log", shortName = "recal_table_update_log", required = false, doc = "If provided, log all updates to the recalibration tables to the given file. For debugging/testing purposes only")
+ @Output(fullName = "recal_table_update_log", shortName = "recal_table_update_log", required = false, doc = "If provided, log all updates to the recalibration tables to the given file. For debugging/testing purposes only", defaultToStdout = false)
public PrintStream RECAL_TABLE_UPDATE_LOG = null;
/**
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationPerformance.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationPerformance.java
index fb11f6249..d0af08d90 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationPerformance.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/bqsr/RecalibrationPerformance.java
@@ -66,7 +66,7 @@ import java.io.*;
@PartitionBy(PartitionType.READ)
public class RecalibrationPerformance extends RodWalker implements NanoSchedulable {
- @Output(doc="Write output to this file", required = true)
+ @Output(doc="Write output to this file")
public PrintStream out;
@Input(fullName="recal", shortName="recal", required=false, doc="The input covariates table file")
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java
index bc582fd49..da9bc1b37 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/ReduceReads.java
@@ -69,6 +69,7 @@ import org.broadinstitute.sting.utils.GenomeLoc;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.clipping.ReadClipper;
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
+import org.broadinstitute.sting.utils.exceptions.UserException;
import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
import org.broadinstitute.sting.utils.help.HelpConstants;
import org.broadinstitute.sting.utils.sam.BySampleSAMFileWriter;
@@ -112,7 +113,7 @@ import org.broadinstitute.sting.utils.sam.ReadUtils;
@Downsample(by=DownsampleType.BY_SAMPLE, toCoverage=40)
public class ReduceReads extends ReadWalker, ReduceReadsStash> {
- @Output(required=true)
+ @Output(required = false, defaultToStdout = false)
private StingSAMFileWriter out = null;
private SAMFileWriter writerToUse = null;
@@ -259,6 +260,13 @@ public class ReduceReads extends ReadWalker, Redu
@Override
public void initialize() {
super.initialize();
+
+ if ( !nwayout && out == null )
+ throw new UserException.MissingArgument("out", "the output must be provided and is optional only for certain debugging modes");
+
+ if ( nwayout && out != null )
+ throw new UserException.CommandLineException("--out and --nwayout can not be used simultaneously; please use one or the other");
+
GenomeAnalysisEngine toolkit = getToolkit();
readNameHash = new Object2LongOpenHashMap(100000); // prepare the read name hash to keep track of what reads have had their read names compressed
intervalList = new ObjectAVLTreeSet(); // get the interval list from the engine. If no interval list was provided, the walker will work in WGS mode
@@ -266,7 +274,6 @@ public class ReduceReads extends ReadWalker, Redu
if (toolkit.getIntervals() != null)
intervalList.addAll(toolkit.getIntervals());
-
final boolean preSorted = true;
final boolean indexOnTheFly = true;
final boolean keep_records = true;
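Note on the ReduceReads check above: the two new guards in initialize() amount to an "exactly one of --out or --nwayout" rule. A minimal standalone sketch of that rule follows. The class and method names are hypothetical, and IllegalArgumentException stands in for the GATK UserException types, which are not reproduced here.

```java
public class OutputModeCheckSketch {
    /**
     * Hypothetical stand-in for the guards added to ReduceReads.initialize():
     * exactly one of "an explicit output" or "n-way-out mode" must be chosen.
     */
    static void checkOutputMode(boolean nwayout, Object out) {
        if (!nwayout && out == null)
            throw new IllegalArgumentException("an output must be provided unless n-way-out mode is enabled");
        if (nwayout && out != null)
            throw new IllegalArgumentException("an explicit output and n-way-out mode cannot be combined");
    }

    /** Convenience wrapper so the four combinations can be inspected without try/catch at the call site. */
    static boolean isValid(boolean nwayout, Object out) {
        try {
            checkOutputMode(nwayout, out);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("out only      -> " + isValid(false, "reduced.bam"));
        System.out.println("nwayout only  -> " + isValid(true, null));
        System.out.println("neither       -> " + isValid(false, null));
        System.out.println("both          -> " + isValid(true, "reduced.bam"));
    }
}
```

Because the annotation is now declared with required = false and defaultToStdout = false, the argument parser no longer enforces presence, so a runtime check of this kind is what keeps the normal (non-debug) mode from silently running without an output.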
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/BaseCoverageDistribution.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/BaseCoverageDistribution.java
index 9bd08a020..b70581dd3 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/BaseCoverageDistribution.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/BaseCoverageDistribution.java
@@ -99,7 +99,7 @@ public class BaseCoverageDistribution extends LocusWalker, Ma
/**
* The output GATK Report table
*/
- @Output(required = true, doc = "The output GATK Report table")
+ @Output(doc = "The output GATK Report table")
private PrintStream out;
/**
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/DiagnoseTargets.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/DiagnoseTargets.java
index e4310588e..b302a967c 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/DiagnoseTargets.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/DiagnoseTargets.java
@@ -110,7 +110,7 @@ import java.util.*;
@PartitionBy(PartitionType.INTERVAL)
public class DiagnoseTargets extends LocusWalker {
- @Output(doc = "File to which variants should be written", required = true)
+ @Output(doc = "File to which variants should be written")
private VariantContextWriter vcfWriter = null;
@Argument(fullName = "minimum_base_quality", shortName = "BQ", doc = "The minimum Base Quality that is considered for calls", required = false)
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java
index 6b4d1f7a8..eef581160 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/targets/FindCoveredIntervals.java
@@ -92,7 +92,7 @@ import java.io.PrintStream;
@PartitionBy(PartitionType.CONTIG)
@ActiveRegionTraversalParameters(extension = 0, maxRegion = 50000)
public class FindCoveredIntervals extends ActiveRegionWalker {
- @Output(required = true)
+ @Output
private PrintStream out;
@Argument(fullName = "uncovered", shortName = "u", required = false, doc = "output intervals that fail the coverage threshold instead")
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyper.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyper.java
index 4347a1a84..54fcad1df 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyper.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyper.java
@@ -180,7 +180,7 @@ public class UnifiedGenotyper extends LocusWalker, Unif
* A raw, unfiltered, highly sensitive callset in VCF format.
*/
//@Gather(className = "org.broadinstitute.sting.queue.extensions.gatk.CatVariantsGatherer")
- @Output(doc="File to which variants should be written",required=true)
+ @Output(doc="File to which variants should be written")
protected VariantContextWriter writer = null;
@Hidden
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
index 7948b93a9..4bf09ad2d 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
@@ -139,10 +139,10 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
/**
* A raw, unfiltered, highly sensitive callset in VCF format.
*/
- @Output(doc="File to which variants should be written", required = true)
+ @Output(doc="File to which variants should be written")
protected VariantContextWriter vcfWriter = null;
- @Output(fullName="graphOutput", shortName="graph", doc="File to which debug assembly graph information should be written", required = false)
+ @Output(fullName="graphOutput", shortName="graph", doc="File to which debug assembly graph information should be written", required = false, defaultToStdout = false)
protected PrintStream graphWriter = null;
/**
@@ -170,14 +170,14 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
* in the following screenshot: https://www.dropbox.com/s/xvy7sbxpf13x5bp/haplotypecaller%20bamout%20for%20docs.png
*
*/
- @Output(fullName="bamOutput", shortName="bamout", doc="File to which assembled haplotypes should be written", required = false)
+ @Output(fullName="bamOutput", shortName="bamout", doc="File to which assembled haplotypes should be written", required = false, defaultToStdout = false)
protected StingSAMFileWriter bamWriter = null;
private HaplotypeBAMWriter haplotypeBAMWriter;
/**
* The type of BAM output we want to see.
*/
- @Output(fullName="bamWriterType", shortName="bamWriterType", doc="How should haplotypes be written to the BAM?", required = false)
+ @Argument(fullName="bamWriterType", shortName="bamWriterType", doc="How should haplotypes be written to the BAM?", required = false)
public HaplotypeBAMWriter.Type bamWriterType = HaplotypeBAMWriter.Type.CALLED_HAPLOTYPES;
/**
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java
index 4de9488e9..facc929cd 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeResolver.java
@@ -125,7 +125,7 @@ public class HaplotypeResolver extends RodWalker {
@Input(fullName="variant", shortName = "V", doc="Input VCF file", required=true)
public List> variants;
- @Output(doc="File to which variants should be written", required=true)
+ @Output(doc="File to which variants should be written")
protected VariantContextWriter baseWriter = null;
private VariantContextWriter writer;
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java
index d3a13df29..7d8243c98 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/indels/IndelRealigner.java
@@ -189,7 +189,7 @@ public class IndelRealigner extends ReadWalker {
/**
* The realigned bam file.
*/
- @Output(required=false, doc="Output bam")
+ @Output(required=false, doc="Output bam", defaultToStdout=false)
protected StingSAMFileWriter writer = null;
protected ConstrainedMateFixingManager manager = null;
protected SAMFileWriter writerToUse = null;
@@ -295,15 +295,15 @@ public class IndelRealigner extends ReadWalker {
protected boolean KEEP_ALL_PG_RECORDS = false;
@Hidden
- @Output(fullName="indelsFileForDebugging", shortName="indels", required=false, doc="Output file (text) for the indels found; FOR DEBUGGING PURPOSES ONLY")
+ @Output(fullName="indelsFileForDebugging", shortName="indels", required=false, defaultToStdout=false, doc="Output file (text) for the indels found; FOR DEBUGGING PURPOSES ONLY")
protected String OUT_INDELS = null;
@Hidden
- @Output(fullName="statisticsFileForDebugging", shortName="stats", doc="print out statistics (what does or doesn't get cleaned); FOR DEBUGGING PURPOSES ONLY", required=false)
+ @Output(fullName="statisticsFileForDebugging", shortName="stats", doc="print out statistics (what does or doesn't get cleaned); FOR DEBUGGING PURPOSES ONLY", required=false, defaultToStdout=false)
protected String OUT_STATS = null;
@Hidden
- @Output(fullName="SNPsFileForDebugging", shortName="snps", doc="print out whether mismatching columns do or don't get cleaned out; FOR DEBUGGING PURPOSES ONLY", required=false)
+ @Output(fullName="SNPsFileForDebugging", shortName="snps", doc="print out whether mismatching columns do or don't get cleaned out; FOR DEBUGGING PURPOSES ONLY", required=false, defaultToStdout=false)
protected String OUT_SNPS = null;
// fasta reference reader to supplement the edges of the reference sequence
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/ReadBackedPhasing.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/ReadBackedPhasing.java
index c1b484542..a297b38cf 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/ReadBackedPhasing.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/phasing/ReadBackedPhasing.java
@@ -131,7 +131,7 @@ public class ReadBackedPhasing extends RodWalker {
/**
* The output VCF file
*/
- @Output(doc="File to which variants should be written",required=true)
+ @Output(doc="File to which variants should be written")
protected VariantContextWriter vcfWriter = null;
/**
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java
index 22425e62e..7de0c7e60 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/ApplyRecalibration.java
@@ -128,7 +128,7 @@ public class ApplyRecalibration extends RodWalker implements T
/////////////////////////////
// Outputs
/////////////////////////////
- @Output( doc="The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value", required=true)
+ @Output( doc="The output filtered and recalibrated VCF file in which each variant is annotated with its VQSLOD value")
private VariantContextWriter vcfWriter = null;
/////////////////////////////
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java
index 99d926ea5..320328ab1 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/variantrecalibration/VariantRecalibrator.java
@@ -194,7 +194,7 @@ public class VariantRecalibrator extends RodWalker implements T
@ArgumentCollection protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
- @Output(doc="File to which variants should be written",required=true)
+ @Output(doc="File to which variants should be written")
protected VariantContextWriter vcfWriter = null;
private UnifiedGenotyperEngine UG_engine = null;
diff --git a/public/java/src/org/broadinstitute/sting/commandline/ArgumentSource.java b/public/java/src/org/broadinstitute/sting/commandline/ArgumentSource.java
index b9c785879..efacde231 100644
--- a/public/java/src/org/broadinstitute/sting/commandline/ArgumentSource.java
+++ b/public/java/src/org/broadinstitute/sting/commandline/ArgumentSource.java
@@ -175,6 +175,14 @@ public class ArgumentSource {
return field.isAnnotationPresent(Deprecated.class);
}
+ /**
+ * Returns whether the field should default to stdout if not provided explicitly on the command-line.
+ * @return True if field should default to stdout.
+ */
+ public boolean defaultsToStdout() {
+ return field.isAnnotationPresent(Output.class) && (Boolean)CommandLineUtils.getValue(ArgumentTypeDescriptor.getArgumentAnnotation(this),"defaultToStdout");
+ }
+
/**
* Returns false if a type-specific default can be employed.
* @return True to throw in a type specific default. False otherwise.
diff --git a/public/java/src/org/broadinstitute/sting/commandline/Output.java b/public/java/src/org/broadinstitute/sting/commandline/Output.java
index 47a47602a..0db870f2e 100644
--- a/public/java/src/org/broadinstitute/sting/commandline/Output.java
+++ b/public/java/src/org/broadinstitute/sting/commandline/Output.java
@@ -66,6 +66,13 @@ public @interface Output {
*/
boolean required() default false;
+ /**
+ * If this argument is not required, should it default to use stdout if no
+ * output file is explicitly provided on the command-line?
+ * @return True if the argument should default to stdout. False otherwise.
+ */
+ boolean defaultToStdout() default true;
+
/**
* Should this command-line argument be exclusive of others. Should be
* a comma-separated list of names of arguments of which this should be
diff --git a/public/java/src/org/broadinstitute/sting/gatk/io/stubs/OutputStreamArgumentTypeDescriptor.java b/public/java/src/org/broadinstitute/sting/gatk/io/stubs/OutputStreamArgumentTypeDescriptor.java
index fbcc32d78..18185f12e 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/io/stubs/OutputStreamArgumentTypeDescriptor.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/io/stubs/OutputStreamArgumentTypeDescriptor.java
@@ -66,7 +66,7 @@ public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {
@Override
public boolean createsTypeDefault(ArgumentSource source) {
- return source.isRequired();
+ return !source.isRequired() && source.defaultsToStdout();
}
@Override
@@ -76,7 +76,7 @@ public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {
@Override
public Object createTypeDefault(ParsingEngine parsingEngine,ArgumentSource source, Type type) {
- if(!source.isRequired())
+ if(source.isRequired() || !source.defaultsToStdout())
throw new ReviewedStingException("BUG: tried to create type default for argument type descriptor that can't support a type default.");
OutputStreamStub stub = new OutputStreamStub(defaultOutputStream);
engine.addOutput(stub);
@@ -90,7 +90,7 @@ public class OutputStreamArgumentTypeDescriptor extends ArgumentTypeDescriptor {
// This parser has been passed a null filename and the GATK is not responsible for creating a type default for the object;
// therefore, the user must have failed to specify a type default
- if(fileName == null && !source.isRequired())
+ if(fileName == null && source.isRequired())
throw new MissingArgumentValueException(definition);
OutputStreamStub stub = new OutputStreamStub(new File(fileName));
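Note on the descriptor change above: after this patch a type default (stdout) is created only when the argument is not required and its annotation leaves defaultToStdout at true. The sketch below illustrates that decision in isolation. The nested Output annotation and the Fields holder are hypothetical stand-ins that carry only the two flags discussed here; they are not the real Sting classes.

```java
import java.io.PrintStream;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class OutputDefaultSketch {
    /** Hypothetical stand-in for @Output, with the same defaults as in the patch. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Output {
        boolean required() default false;
        boolean defaultToStdout() default true;
    }

    /** Hypothetical walker-like holder with one field per case. */
    static class Fields {
        @Output PrintStream main;                               // optional, falls back to stdout
        @Output(defaultToStdout = false) PrintStream debugLog;  // optional, stays null if not given
        @Output(required = true) PrintStream mandatory;         // must be given, no fallback
    }

    /** Mirrors the new condition: !isRequired() && defaultsToStdout(). */
    static boolean createsTypeDefault(Field field) {
        Output output = field.getAnnotation(Output.class);
        return output != null && !output.required() && output.defaultToStdout();
    }

    /** Looks the field up by name on the hypothetical Fields holder. */
    static boolean createsTypeDefault(String fieldName) {
        try {
            return createsTypeDefault(Fields.class.getDeclaredField(fieldName));
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("unknown field: " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"main", "debugLog", "mandatory"})
            System.out.println(name + " -> " + createsTypeDefault(name));
    }
}
```

Under these assumptions only the first field gets a stdout default, which matches why debug-only outputs in this patch (plots, graph, bamout, indel and stats logs) are all tagged defaultToStdout = false.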
diff --git a/public/java/src/org/broadinstitute/sting/gatk/io/stubs/SAMFileWriterArgumentTypeDescriptor.java b/public/java/src/org/broadinstitute/sting/gatk/io/stubs/SAMFileWriterArgumentTypeDescriptor.java
index 34a7f967f..458846db0 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/io/stubs/SAMFileWriterArgumentTypeDescriptor.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/io/stubs/SAMFileWriterArgumentTypeDescriptor.java
@@ -89,7 +89,7 @@ public class SAMFileWriterArgumentTypeDescriptor extends ArgumentTypeDescriptor
@Override
public boolean createsTypeDefault(ArgumentSource source) {
- return source.isRequired();
+ return !source.isRequired() && source.defaultsToStdout();
}
@Override
@@ -99,7 +99,7 @@ public class SAMFileWriterArgumentTypeDescriptor extends ArgumentTypeDescriptor
@Override
public Object createTypeDefault(ParsingEngine parsingEngine,ArgumentSource source, Type type) {
- if(!source.isRequired())
+ if(source.isRequired() || !source.defaultsToStdout())
throw new ReviewedStingException("BUG: tried to create type default for argument type descriptor that can't support a type default.");
SAMFileWriterStub stub = new SAMFileWriterStub(engine,defaultOutputStream);
engine.addOutput(stub);
@@ -162,7 +162,7 @@ public class SAMFileWriterArgumentTypeDescriptor extends ArgumentTypeDescriptor
DEFAULT_ARGUMENT_FULLNAME,
DEFAULT_ARGUMENT_SHORTNAME,
ArgumentDefinition.getDoc(annotation),
- false,
+ source.isRequired(),
false,
source.isMultiValued(),
source.isHidden(),
diff --git a/public/java/src/org/broadinstitute/sting/gatk/io/stubs/VCFWriterArgumentTypeDescriptor.java b/public/java/src/org/broadinstitute/sting/gatk/io/stubs/VCFWriterArgumentTypeDescriptor.java
index 5b03859f5..91013673f 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/io/stubs/VCFWriterArgumentTypeDescriptor.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/io/stubs/VCFWriterArgumentTypeDescriptor.java
@@ -110,7 +110,7 @@ public class VCFWriterArgumentTypeDescriptor extends ArgumentTypeDescriptor {
*/
@Override
public boolean createsTypeDefault(ArgumentSource source) {
- return source.isRequired();
+ return !source.isRequired() && source.defaultsToStdout();
}
@Override
@@ -119,8 +119,8 @@ public class VCFWriterArgumentTypeDescriptor extends ArgumentTypeDescriptor {
}
@Override
- public Object createTypeDefault(ParsingEngine parsingEngine,ArgumentSource source, Type type) {
- if(!source.isRequired())
+ public Object createTypeDefault(ParsingEngine parsingEngine, ArgumentSource source, Type type) {
+ if(source.isRequired() || !source.defaultsToStdout())
throw new ReviewedStingException("BUG: tried to create type default for argument type descriptor that can't support a type default.");
VariantContextWriterStub stub = new VariantContextWriterStub(engine, defaultOutputStream, argumentSources);
engine.addOutput(stub);
@@ -143,7 +143,7 @@ public class VCFWriterArgumentTypeDescriptor extends ArgumentTypeDescriptor {
// This parser has been passed a null filename and the GATK is not responsible for creating a type default for the object;
// therefore, the user must have failed to specify a type default
- if(writerFile == null && !source.isRequired())
+ if(writerFile == null && source.isRequired())
throw new MissingArgumentValueException(defaultArgumentDefinition);
// Create a stub for the given object.
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/ActiveRegionWalker.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/ActiveRegionWalker.java
index e14e50b1a..ebfc52d3f 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/ActiveRegionWalker.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/ActiveRegionWalker.java
@@ -67,7 +67,7 @@ public abstract class ActiveRegionWalker extends Walker
- * User: carneiro
- * Date: 1/27/13
- * Time: 11:16 AM
+ *
+ * @author carneiro
+ * @since 1/27/13
*/
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
public class BaseCoverageDistribution extends LocusWalker, Map>> {
/**
* The output GATK Report table
diff --git a/protected/java/src/org/broadinstitute/sting/utils/recalibration/RecalUtils.java b/protected/java/src/org/broadinstitute/sting/utils/recalibration/RecalUtils.java
index ce2869e94..ae6b56e19 100644
--- a/protected/java/src/org/broadinstitute/sting/utils/recalibration/RecalUtils.java
+++ b/protected/java/src/org/broadinstitute/sting/utils/recalibration/RecalUtils.java
@@ -82,7 +82,7 @@ import java.util.*;
*
* This helper class holds the data HashMap as well as submaps that represent the marginal distributions collapsed over all needed dimensions.
* It also has static methods that are used to perform the various solid recalibration modes that attempt to correct the reference bias.
- * This class holds the parsing methods that are shared between CountCovariates and TableRecalibration.
+ * This class holds the parsing methods that are shared between BaseRecalibrator and PrintReads.
*/
public class RecalUtils {
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java
index 61574d947..29016af43 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/coverage/DepthOfCoverage.java
@@ -117,7 +117,7 @@ import java.util.*;
// todo -- alter logarithmic scaling to spread out bins more
// todo -- allow for user to set linear binning (default is logarithmic)
// todo -- formatting --> do something special for end bins in getQuantile(int[] foo), this gets mushed into the end+-1 bins for now
-@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_DATA, extraDocs = {CommandLineGATK.class} )
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
@By(DataSource.REFERENCE)
@PartitionBy(PartitionType.NONE)
@Downsample(by= DownsampleType.NONE, toCoverage=Integer.MAX_VALUE)
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java
index 92034da70..506ef2c72 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/diagnostics/CoveredByNSamplesSites.java
@@ -29,12 +29,15 @@ package org.broadinstitute.sting.gatk.walkers.diagnostics;
import org.broadinstitute.sting.commandline.Argument;
import org.broadinstitute.sting.commandline.ArgumentCollection;
import org.broadinstitute.sting.commandline.Output;
+import org.broadinstitute.sting.gatk.CommandLineGATK;
import org.broadinstitute.sting.gatk.arguments.StandardVariantContextInputArgumentCollection;
import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.*;
import org.broadinstitute.sting.utils.GenomeLoc;
+import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
+import org.broadinstitute.sting.utils.help.HelpConstants;
import org.broadinstitute.variant.variantcontext.Genotype;
import org.broadinstitute.variant.variantcontext.GenotypesContext;
import org.broadinstitute.variant.variantcontext.VariantContext;
@@ -44,12 +47,15 @@ import java.io.*;
import java.util.Collection;
/**
- * print intervals file with all the variant sites that have "most" ( >= 90% by default) of the samples with "good" (>= 10 by default)coverage ("most" and "good" can be set in the command line).
+ * Print intervals file with all the variant sites for which most of the samples have good coverage
*
*
- * CoveredByNSamplesSites is a GATK tool for filter out sites based on their coverage.
+ * CoveredByNSamplesSites is a GATK tool for filtering out sites based on their coverage.
* The sites that pass the filter are printed out to an intervals file.
*
+ * See argument defaults for what constitutes "most" samples and "good" coverage. These parameters can be modified from the command line.
+ *
+ *
*
Input
*
* A variant file and optionally min coverage and sample percentage values.
@@ -60,7 +66,7 @@ import java.util.Collection;
* An intervals file.
*
*
*/
-
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
@By(DataSource.REFERENCE_ORDERED_DATA)
public class CoveredByNSamplesSites extends RodWalker implements TreeReducible {
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/GenotypeConcordance.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/GenotypeConcordance.java
index 048c7ef77..35213af34 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/GenotypeConcordance.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/GenotypeConcordance.java
@@ -26,6 +26,7 @@
package org.broadinstitute.sting.gatk.walkers.variantutils;
import org.broadinstitute.sting.commandline.*;
+import org.broadinstitute.sting.gatk.CommandLineGATK;
import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
@@ -33,6 +34,8 @@ import org.broadinstitute.sting.gatk.report.GATKReport;
import org.broadinstitute.sting.gatk.report.GATKReportTable;
import org.broadinstitute.sting.gatk.walkers.RodWalker;
import org.broadinstitute.sting.utils.collections.Pair;
+import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
+import org.broadinstitute.sting.utils.help.HelpConstants;
import org.broadinstitute.sting.utils.variant.GATKVCFUtils;
import org.broadinstitute.variant.variantcontext.*;
import org.broadinstitute.variant.vcf.VCFHeader;
@@ -41,29 +44,30 @@ import java.io.PrintStream;
import java.util.*;
/**
- * A simple walker for performing genotype concordance calculations between two callsets. Outputs a GATK table with
- * per-sample and aggregate counts and frequencies, a summary table for NRD/NRS, and a table for site allele overlaps.
+ * Genotype concordance (per-sample and aggregate counts and frequencies, NRD/NRS and site allele overlaps) between two callsets
*
*
- * Genotype concordance takes in two callsets (vcfs) and tabulates the number of sites which overlap and share alleles,
+ * GenotypeConcordance takes in two callsets (vcfs) and tabulates the number of sites which overlap and share alleles,
* and for each sample, the genotype-by-genotype counts (for instance, the number of sites at which a sample was
* called homozygous reference in the EVAL callset, but homozygous variant in the COMP callset). It outputs these
* counts as well as convenient proportions (such as the proportion of het calls in the EVAL which were called REF in
* the COMP) and metrics (such as NRD and NRS).
*
- *
INPUT
+ *
Input
*
* Genotype concordance requires two callsets (as it does a comparison): an EVAL and a COMP callset, specified via
- * the -eval and -comp arguments
- *
+ * the -eval and -comp arguments.
+ *
* (Optional) Jexl expressions for genotype-level filtering of EVAL or COMP genotypes, specified via the -gfe and
* -cfe arguments, respectively.
+ *
*
- *
OUTPUT
- * Genotype Concordance writes a GATK report to the specified (via -o) file, consisting of multiple tables of counts
+ *
Output
+ * Genotype Concordance writes a GATK report to the specified file (via -o), consisting of multiple tables of counts
* and proportions. These tables may be optionally moltenized via the -moltenize argument.
*
*/
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARMANIP, extraDocs = {CommandLineGATK.class} )
public class GenotypeConcordance extends RodWalker>,ConcordanceMetrics> {
/**
From 6b4d88ebe96d3383a0778c6f8a3bbf6bd88ccaee Mon Sep 17 00:00:00 2001
From: Geraldine Van der Auwera
Date: Fri, 15 Mar 2013 16:34:29 -0400
Subject: [PATCH 048/211] Created ListAnnotations utility (extends CommandLineProgram)

* Refactored listAnnotations basic method out of VA into HelpUtils
* HelpUtils.listAnnotations() is now called by both VA and the new ListAnnotations utility (lives in sting.tools)
* This way we keep the VA --list option but we also offer a way to list annotations without a full valid VA command-line, which was a pain users continually complained about
* We could get rid of the VA --list option altogether ...?
---
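Reviewer note: the listing routine that this patch moves into HelpUtils marks "standard" annotations with a '*' by testing each discovered class against a marker interface. The sketch below shows only that marking step, under stated assumptions: StandardMarker, CoverageLike, ExperimentalLike and AnnotationListingSketch are hypothetical stand-ins invented for illustration, and the class list is passed in by hand rather than discovered through PluginManager. It is not the code in this patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for the GATK annotation interfaces; not the real classes.
interface StandardMarker {}
class CoverageLike implements StandardMarker {}
class ExperimentalLike {}

class AnnotationListingSketch {
    // Build one display line per class, prefixing '*' when the class
    // implements the marker interface (the same isAssignableFrom test
    // the patch uses against StandardAnnotation).
    static List<String> describe(List<Class<?>> classes) {
        List<String> lines = new ArrayList<String>();
        for (Class<?> c : classes) {
            String mark = StandardMarker.class.isAssignableFrom(c) ? "*" : "";
            lines.add("\t" + mark + c.getSimpleName());
        }
        return lines;
    }

    public static void main(String[] args) {
        // In the real tool the classes come from plugin discovery; here they are listed by hand.
        List<Class<?>> found = Arrays.<Class<?>>asList(CoverageLike.class, ExperimentalLike.class);
        for (String line : describe(found)) {
            System.out.println(line);
        }
    }
}
```

Returning the lines instead of printing them directly is a choice made for this sketch only, so the marking logic can be checked without capturing stdout.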
.../walkers/annotator/VariantAnnotator.java | 30 ++-----
.../sting/tools/CatVariants.java | 15 ++--
.../sting/tools/ListAnnotations.java | 85 +++++++++++++++++++
.../sting/utils/help/HelpConstants.java | 1 +
.../sting/utils/help/HelpUtils.java | 29 +++++++
5 files changed, 129 insertions(+), 31 deletions(-)
create mode 100644 public/java/src/org/broadinstitute/sting/tools/ListAnnotations.java
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
index 330d29c79..301baaba3 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
@@ -36,10 +36,10 @@ import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.*;
import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.*;
import org.broadinstitute.sting.utils.help.HelpConstants;
+import org.broadinstitute.sting.utils.help.HelpUtils;
import org.broadinstitute.sting.utils.variant.GATKVCFUtils;
import org.broadinstitute.sting.utils.BaseUtils;
import org.broadinstitute.sting.utils.SampleUtils;
-import org.broadinstitute.sting.utils.classloader.PluginManager;
import org.broadinstitute.variant.vcf.*;
import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
import org.broadinstitute.variant.variantcontext.VariantContext;
@@ -47,7 +47,6 @@ import org.broadinstitute.variant.variantcontext.writer.VariantContextWriter;
import java.util.*;
-
/**
* Annotates variant calls with context information.
*
@@ -165,7 +164,7 @@ public class VariantAnnotator extends RodWalker implements Ann
protected Boolean USE_ALL_ANNOTATIONS = false;
/**
- * Note that the --list argument requires a fully resolved and correct command-line to work.
+ * Note that the --list argument requires a fully resolved and correct command-line to work. As a simpler alternative, you can use ListAnnotations (see Help Utilities).
*/
@Argument(fullName="list", shortName="ls", doc="List the available annotations and exit", required=false)
protected Boolean LIST = false;
@@ -177,7 +176,7 @@ public class VariantAnnotator extends RodWalker implements Ann
protected Boolean ALWAYS_APPEND_DBSNP_ID = false;
public boolean alwaysAppendDbsnpId() { return ALWAYS_APPEND_DBSNP_ID; }
- @Argument(fullName="MendelViolationGenotypeQualityThreshold",shortName="mvq",required=false,doc="The genotype quality treshold in order to annotate mendelian violation ratio")
+ @Argument(fullName="MendelViolationGenotypeQualityThreshold",shortName="mvq",required=false,doc="The genotype quality threshold in order to annotate mendelian violation ratio")
public double minGenotypeQualityP = 0.0;
@Argument(fullName="requireStrictAlleleMatch", shortName="strict", doc="If provided only comp tracks that exactly match both reference and alternate alleles will be counted as concordant", required=false)
@@ -185,33 +184,14 @@ public class VariantAnnotator extends RodWalker implements Ann
private VariantAnnotatorEngine engine;
-
- private void listAnnotationsAndExit() {
- System.out.println("\nStandard annotations in the list below are marked with a '*'.");
- List<Class<? extends InfoFieldAnnotation>> infoAnnotationClasses = new PluginManager<InfoFieldAnnotation>(InfoFieldAnnotation.class).getPlugins();
- System.out.println("\nAvailable annotations for the VCF INFO field:");
- for (int i = 0; i < infoAnnotationClasses.size(); i++)
- System.out.println("\t" + (StandardAnnotation.class.isAssignableFrom(infoAnnotationClasses.get(i)) ? "*" : "") + infoAnnotationClasses.get(i).getSimpleName());
- System.out.println();
- List<Class<? extends GenotypeAnnotation>> genotypeAnnotationClasses = new PluginManager<GenotypeAnnotation>(GenotypeAnnotation.class).getPlugins();
- System.out.println("\nAvailable annotations for the VCF FORMAT field:");
- for (int i = 0; i < genotypeAnnotationClasses.size(); i++)
- System.out.println("\t" + (StandardAnnotation.class.isAssignableFrom(genotypeAnnotationClasses.get(i)) ? "*" : "") + genotypeAnnotationClasses.get(i).getSimpleName());
- System.out.println();
- System.out.println("\nAvailable classes/groups of annotations:");
- for ( Class c : new PluginManager(AnnotationType.class).getInterfaces() )
- System.out.println("\t" + c.getSimpleName());
- System.out.println();
- System.exit(0);
- }
-
/**
* Prepare the output file and the list of available features.
*/
public void initialize() {
if ( LIST )
- listAnnotationsAndExit();
+ HelpUtils.listAnnotations();
+ System.exit(0);
// get the list of all sample names from the variant VCF input rod, if applicable
List rodName = Arrays.asList(variantCollection.variants.getName());
diff --git a/public/java/src/org/broadinstitute/sting/tools/CatVariants.java b/public/java/src/org/broadinstitute/sting/tools/CatVariants.java
index e1dd2c255..ad77b2548 100644
--- a/public/java/src/org/broadinstitute/sting/tools/CatVariants.java
+++ b/public/java/src/org/broadinstitute/sting/tools/CatVariants.java
@@ -35,7 +35,6 @@ import org.broadinstitute.sting.commandline.Argument;
import org.broadinstitute.sting.commandline.Input;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.commandline.CommandLineProgram;
-import org.broadinstitute.sting.gatk.CommandLineGATK;
import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
import org.broadinstitute.sting.utils.help.HelpConstants;
import org.broadinstitute.variant.bcf2.BCF2Codec;
@@ -54,7 +53,7 @@ import java.util.*;
/**
*
- * Concatenates VCF files of non-overlapped genome intervals, all with the same set of samples.
+ * Concatenates VCF files of non-overlapping genome intervals, all with the same set of samples
*
*
* The main purpose of this tool is to speed up the gather function when using scatter-gather parallelization.
@@ -80,10 +79,14 @@ import java.util.*;
* A combined VCF. The output file should be 'name.vcf' or 'name.VCF'.
* <\p>
*
+ *
Important note
+ *
This is a command-line utility that bypasses the GATK engine. As a result, the command-line you must use to
+ * invoke it is a little different from other GATK tools (see example below), and it does not accept any of the
+ * classic "CommandLineGATK" arguments.
*
- *
Examples
+ *
Example
*
- * java -cp dist/GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants \
+ * java -cp GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants \
* -R ref.fasta \
* -V input1.vcf \
* -V input2.vcf \
@@ -95,7 +98,7 @@ import java.util.*;
* @since Jan 2012
*/
-@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARMANIP, extraDocs = {CommandLineGATK.class} )
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARMANIP )
public class CatVariants extends CommandLineProgram {
// setup the logging system, used by some codecs
private static org.apache.log4j.Logger logger = org.apache.log4j.Logger.getRootLogger();
@@ -124,7 +127,7 @@ public class CatVariants extends CommandLineProgram {
* print usage information
*/
private static void printUsage() {
- System.err.println("Usage: java -cp dist/GenomeAnalysisTK.jar org.broadinstitute.sting.tools.AppendVariants [sorted (optional)]");
+ System.err.println("Usage: java -cp dist/GenomeAnalysisTK.jar org.broadinstitute.sting.tools.CatVariants [sorted (optional)]");
System.err.println(" The input files can be of type: VCF (ends in .vcf or .VCF)");
System.err.println(" BCF2 (ends in .bcf or .BCF)");
System.err.println(" Output file must be vcf or bcf file (.vcf or .bcf)");
diff --git a/public/java/src/org/broadinstitute/sting/tools/ListAnnotations.java b/public/java/src/org/broadinstitute/sting/tools/ListAnnotations.java
new file mode 100644
index 000000000..fabcf828a
--- /dev/null
+++ b/public/java/src/org/broadinstitute/sting/tools/ListAnnotations.java
@@ -0,0 +1,85 @@
+/*
+* Copyright (c) 2012 The Broad Institute
+*
+* Permission is hereby granted, free of charge, to any person
+* obtaining a copy of this software and associated documentation
+* files (the "Software"), to deal in the Software without
+* restriction, including without limitation the rights to use,
+* copy, modify, merge, publish, distribute, sublicense, and/or sell
+* copies of the Software, and to permit persons to whom the
+* Software is furnished to do so, subject to the following
+* conditions:
+*
+* The above copyright notice and this permission notice shall be
+* included in all copies or substantial portions of the Software.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
+* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+*/
+
+package org.broadinstitute.sting.tools;
+
+import org.broadinstitute.sting.commandline.CommandLineProgram;
+import org.broadinstitute.sting.utils.exceptions.UserException;
+import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
+import org.broadinstitute.sting.utils.help.HelpConstants;
+import org.broadinstitute.sting.utils.help.HelpUtils;
+
+/**
+ * Utility program to print a list of available annotations
+ *
+ *
This is a very simple utility tool that retrieves available annotations for use with tools such as
+ * UnifiedGenotyper, HaplotypeCaller and VariantAnnotator.
+ *
+ *
Important note
+ *
This is a command-line utility that bypasses the GATK engine. As a result, the command-line you must use to
+ * invoke it is a little different from other GATK tools (see usage below), and it does not accept any of the
+ * classic "CommandLineGATK" arguments.
+ *
+ * @author vdauwera
+ * @since 3/14/13
+ */
+@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_HELPUTILS )
+public class ListAnnotations extends CommandLineProgram {
+
+ /*
+ * Print usage information
+ *
+ * TODO: would be more convenient if we could just call the program by name instead of the full classpath
+ */
+ private static void printUsage() {
+ System.err.println("Usage: java -cp dist/GenomeAnalysisTK.jar org.broadinstitute.sting.tools.ListAnnotations");
+ System.err.println(" Prints a list of available annotations and exits.");
+ }
+
+ // TODO: override CommandLineProgram bit that offers version, logging etc arguments. We don't need that stuff here and it makes the doc confusing.
+
+ @Override
+ protected int execute() throws Exception {
+
+ HelpUtils.listAnnotations();
+ return 0;
+ }
+
+ public static void main(String[] args){
+ try {
+ ListAnnotations instance = new ListAnnotations();
+ start(instance, args);
+ System.exit(CommandLineProgram.result);
+ } catch ( UserException e ) {
+ printUsage();
+ exitSystemWithUserError(e);
+ } catch ( Exception e ) {
+ exitSystemWithError(e);
+ }
+ }
+}
diff --git a/public/java/src/org/broadinstitute/sting/utils/help/HelpConstants.java b/public/java/src/org/broadinstitute/sting/utils/help/HelpConstants.java
index f99ff7538..2ed35d848 100644
--- a/public/java/src/org/broadinstitute/sting/utils/help/HelpConstants.java
+++ b/public/java/src/org/broadinstitute/sting/utils/help/HelpConstants.java
@@ -56,6 +56,7 @@ public class HelpConstants {
public final static String DOCS_CAT_VARDISC = "Variant Discovery Tools";
public final static String DOCS_CAT_VARMANIP = "Variant Evaluation and Manipulation Tools";
public final static String DOCS_CAT_TEST = "Testing Tools";
+ public final static String DOCS_CAT_HELPUTILS = "Help Utilities";
public static String forumPost(String post) {
return GATK_FORUM_URL + post;
diff --git a/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java b/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java
index 81606d2f3..9a23fd022 100644
--- a/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java
+++ b/public/java/src/org/broadinstitute/sting/utils/help/HelpUtils.java
@@ -28,9 +28,15 @@ package org.broadinstitute.sting.utils.help;
import com.sun.javadoc.FieldDoc;
import com.sun.javadoc.PackageDoc;
import com.sun.javadoc.ProgramElementDoc;
+import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.AnnotationType;
+import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.GenotypeAnnotation;
+import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.InfoFieldAnnotation;
+import org.broadinstitute.sting.gatk.walkers.annotator.interfaces.StandardAnnotation;
import org.broadinstitute.sting.utils.classloader.JVMUtils;
+import org.broadinstitute.sting.utils.classloader.PluginManager;
import java.lang.reflect.Field;
+import java.util.List;
public class HelpUtils {
@@ -70,4 +76,27 @@ public class HelpUtils {
String.format("%s", doc.name());
}
+ /**
+ * Simple method to print a list of available annotations.
+ */
+ public static void listAnnotations() {
+ System.out.println("\nThis is a list of available Variant Annotations for use with tools such as UnifiedGenotyper, HaplotypeCaller and VariantAnnotator. Please see the Technical Documentation for more details about these annotations:");
+ System.out.println("http://www.broadinstitute.org/gatk/gatkdocs/");
+ System.out.println("\nStandard annotations in the list below are marked with a '*'.");
+ List<Class<? extends InfoFieldAnnotation>> infoAnnotationClasses = new PluginManager<InfoFieldAnnotation>(InfoFieldAnnotation.class).getPlugins();
+ System.out.println("\nAvailable annotations for the VCF INFO field:");
+ for (int i = 0; i < infoAnnotationClasses.size(); i++)
+ System.out.println("\t" + (StandardAnnotation.class.isAssignableFrom(infoAnnotationClasses.get(i)) ? "*" : "") + infoAnnotationClasses.get(i).getSimpleName());
+ System.out.println();
+ List<Class<? extends GenotypeAnnotation>> genotypeAnnotationClasses = new PluginManager<GenotypeAnnotation>(GenotypeAnnotation.class).getPlugins();
+ System.out.println("\nAvailable annotations for the VCF FORMAT field:");
+ for (int i = 0; i < genotypeAnnotationClasses.size(); i++)
+ System.out.println("\t" + (StandardAnnotation.class.isAssignableFrom(genotypeAnnotationClasses.get(i)) ? "*" : "") + genotypeAnnotationClasses.get(i).getSimpleName());
+ System.out.println();
+ System.out.println("\nAvailable classes/groups of annotations:");
+ for ( Class c : new PluginManager(AnnotationType.class).getInterfaces() )
+ System.out.println("\t" + c.getSimpleName());
+ System.out.println();
+ }
+
}
\ No newline at end of file
From d70bf647379cc8f1eb9a2018d539ec88a86f4702 Mon Sep 17 00:00:00 2001
From: Geraldine Van der Auwera
Date: Fri, 15 Mar 2013 16:41:14 -0400
Subject: [PATCH 049/211] Created new DeprecatedToolChecks class
--Based on existing code in GenomeAnalysisEngine --Hashmaps hold
mapping of deprecated tool name to version number and recommended replacement
(if any) --Using FastUtils for maps; specifically Object2ObjectMap but
there could be a better type for Strings... --Added user exception for
deprecated annotations --Added deprecation check to
AnnotationInterfaceManager.validateAnnotations --Run when annotations
are initialized --Made annotation sets instead of lists
---
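Reviewer note: the lookup this patch introduces maps a deprecated tool name to a string holding the version and the suggested replacement, and that string is dropped into the user-facing error message. The sketch below illustrates the idea under stated assumptions: it uses java.util.HashMap rather than the FastUtil maps in the patch, DeprecationLookupSketch and describeFailure are hypothetical names, and the "was not found" fallback text is invented for the example. The two map entries and the deprecation message format are copied from the patch.

```java
import java.util.HashMap;
import java.util.Map;

class DeprecationLookupSketch {
    // Name of the removed walker -> "version (replacement hint)", as in the patch.
    private static final Map<String, String> DEPRECATED_WALKERS = new HashMap<String, String>();
    static {
        DEPRECATED_WALKERS.put("CountCovariates", "2.0 (use BaseRecalibrator instead; see documentation for usage)");
        DEPRECATED_WALKERS.put("AlignmentWalker", "2.2 (no replacement)");
    }

    static boolean isDeprecatedWalker(final String walkerName) {
        return DEPRECATED_WALKERS.containsKey(walkerName);
    }

    // Hypothetical helper: picks the message a user would see when a walker
    // cannot be created. Checking containsKey first avoids dereferencing a
    // null lookup result for names that were never registered.
    static String describeFailure(final String walkerName) {
        if (isDeprecatedWalker(walkerName)) {
            return String.format(
                "Walker %s is no longer available in the GATK; it has been deprecated since version %s",
                walkerName, DEPRECATED_WALKERS.get(walkerName));
        }
        return String.format("Walker %s was not found", walkerName);
    }

    public static void main(String[] args) {
        System.out.println(describeFailure("CountCovariates"));
        System.out.println(describeFailure("NoSuchWalker"));
    }
}
```

Keeping the replacement hint inside the value string, as the patch does, means the exception class needs no extra field; the trade-off is that the hint cannot be queried separately from the version.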
.../sting/gatk/GenomeAnalysisEngine.java | 41 ++------
.../walkers/annotator/VariantAnnotator.java | 6 +-
.../annotator/VariantAnnotatorEngine.java | 2 +-
.../AnnotationInterfaceManager.java | 12 ++-
.../sting/utils/DeprecatedToolChecks.java | 95 +++++++++++++++++++
.../sting/utils/exceptions/UserException.java | 8 +-
6 files changed, 121 insertions(+), 43 deletions(-)
create mode 100644 public/java/src/org/broadinstitute/sting/utils/DeprecatedToolChecks.java
diff --git a/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java b/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java
index e45a750ba..2d8b9cd9a 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/GenomeAnalysisEngine.java
@@ -67,6 +67,9 @@ import java.io.File;
import java.util.*;
import java.util.concurrent.TimeUnit;
+import static org.broadinstitute.sting.utils.DeprecatedToolChecks.getWalkerDeprecationInfo;
+import static org.broadinstitute.sting.utils.DeprecatedToolChecks.isDeprecatedWalker;
+
/**
* A GenomeAnalysisEngine that runs a specified walker.
*/
@@ -288,40 +291,6 @@ public class GenomeAnalysisEngine {
//return result;
}
- // TODO -- Let's move this to a utility class in unstable - but which one?
- // **************************************************************************************
- // * Handle Deprecated Walkers *
- // **************************************************************************************
-
- // Mapping from walker name to major version number where the walker first disappeared
- private static Map<String, String> deprecatedGATKWalkers = new HashMap<String, String>();
- static {
- deprecatedGATKWalkers.put("CountCovariates", "2.0");
- deprecatedGATKWalkers.put("TableRecalibration", "2.0");
- deprecatedGATKWalkers.put("AlignmentWalker", "2.2");
- deprecatedGATKWalkers.put("CountBestAlignments", "2.2");
- }
-
- /**
- * Utility method to check whether a given walker has been deprecated in a previous GATK release
- *
- * @param walkerName the walker class name (not the full package) to check
- */
- public static boolean isDeprecatedWalker(final String walkerName) {
- return deprecatedGATKWalkers.containsKey(walkerName);
- }
-
- /**
- * Utility method to check whether a given walker has been deprecated in a previous GATK release
- *
- * @param walkerName the walker class name (not the full package) to check
- */
- public static String getDeprecatedMajorVersionNumber(final String walkerName) {
- return deprecatedGATKWalkers.get(walkerName);
- }
-
- // **************************************************************************************
-
/**
* Retrieves an instance of the walker based on the walker name.
*
@@ -333,7 +302,7 @@ public class GenomeAnalysisEngine {
return walkerManager.createByName(walkerName);
} catch ( UserException e ) {
if ( isDeprecatedWalker(walkerName) ) {
- e = new UserException.DeprecatedWalker(walkerName, getDeprecatedMajorVersionNumber(walkerName));
+ e = new UserException.DeprecatedWalker(walkerName, getWalkerDeprecationInfo(walkerName));
}
throw e;
}
@@ -565,6 +534,8 @@ public class GenomeAnalysisEngine {
if ( intervals != null && intervals.isEmpty() ) {
logger.warn("The given combination of -L and -XL options results in an empty set. No intervals to process.");
}
+
+ // TODO: add a check for ActiveRegion walkers to prevent users from passing an entire contig/chromosome
}
/**
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
index 301baaba3..f2bd6c14c 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotator.java
@@ -44,6 +44,7 @@ import org.broadinstitute.variant.vcf.*;
import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
import org.broadinstitute.variant.variantcontext.VariantContext;
import org.broadinstitute.variant.variantcontext.writer.VariantContextWriter;
+import it.unimi.dsi.fastutil.objects.ObjectOpenHashSet;
import java.util.*;
@@ -155,7 +156,7 @@ public class VariantAnnotator extends RodWalker implements Ann
* If multiple records in the rod overlap the given position, one is chosen arbitrarily.
*/
@Argument(fullName="expression", shortName="E", doc="One or more specific expressions to apply to variant calls; see documentation for more details", required=false)
- protected List expressionsToUse = new ArrayList();
+ protected Set expressionsToUse = new ObjectOpenHashSet();
/**
* Note that the -XL argument can be used along with this one to exclude annotations.
@@ -189,9 +190,10 @@ public class VariantAnnotator extends RodWalker implements Ann
*/
public void initialize() {
- if ( LIST )
+ if ( LIST ) {
HelpUtils.listAnnotations();
System.exit(0);
+ }
// get the list of all sample names from the variant VCF input rod, if applicable
List rodName = Arrays.asList(variantCollection.variants.getName());
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotatorEngine.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotatorEngine.java
index c5703afc8..695868bb1 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotatorEngine.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/VariantAnnotatorEngine.java
@@ -104,7 +104,7 @@ public class VariantAnnotatorEngine {
}
// select specific expressions to use
- public void initializeExpressions(List expressionsToUse) {
+ public void initializeExpressions(Set expressionsToUse) {
// set up the expressions
for ( String expression : expressionsToUse )
requestedExpressions.add(new VAExpression(expression, walker.getResourceRodBindings()));
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/interfaces/AnnotationInterfaceManager.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/interfaces/AnnotationInterfaceManager.java
index 221887158..59b4b1b3b 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/interfaces/AnnotationInterfaceManager.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/annotator/interfaces/AnnotationInterfaceManager.java
@@ -25,6 +25,7 @@
package org.broadinstitute.sting.gatk.walkers.annotator.interfaces;
+import org.broadinstitute.sting.utils.DeprecatedToolChecks;
import org.broadinstitute.sting.utils.classloader.PluginManager;
import org.broadinstitute.sting.utils.exceptions.UserException;
@@ -58,7 +59,7 @@ public class AnnotationInterfaceManager {
if ( interfaceClass == null )
interfaceClass = classMap.get(group + "Annotation");
if ( interfaceClass == null )
- throw new UserException.BadArgumentValue("group", "Class " + group + " is not found; please check that you have specified the class name correctly");
+ throw new UserException.BadArgumentValue("group", "Annotation group " + group + " was not found; please check that you have specified the group name correctly");
}
}
@@ -67,8 +68,13 @@ public class AnnotationInterfaceManager {
Class annotationClass = classMap.get(annotation);
if ( annotationClass == null )
annotationClass = classMap.get(annotation + "Annotation");
- if ( annotationClass == null )
- throw new UserException.BadArgumentValue("annotation", "Class " + annotation + " is not found; please check that you have specified the class name correctly");
+ if ( annotationClass == null ) {
+ if (DeprecatedToolChecks.isDeprecatedAnnotation(annotation) ) {
+ throw new UserException.DeprecatedAnnotation(annotation, DeprecatedToolChecks.getAnnotationDeprecationInfo(annotation));
+ } else {
+ throw new UserException.BadArgumentValue("annotation", "Annotation " + annotation + " was not found; please check that you have specified the annotation name correctly");
+ }
+ }
}
}
diff --git a/public/java/src/org/broadinstitute/sting/utils/DeprecatedToolChecks.java b/public/java/src/org/broadinstitute/sting/utils/DeprecatedToolChecks.java
new file mode 100644
index 000000000..e20872c5b
--- /dev/null
+++ b/public/java/src/org/broadinstitute/sting/utils/DeprecatedToolChecks.java
@@ -0,0 +1,95 @@
+/*
+* Copyright (c) 2012 The Broad Institute
+*
+* Permission is hereby granted, free of charge, to any person
+* obtaining a copy of this software and associated documentation
+* files (the "Software"), to deal in the Software without
+* restriction, including without limitation the rights to use,
+* copy, modify, merge, publish, distribute, sublicense, and/or sell
+* copies of the Software, and to permit persons to whom the
+* Software is furnished to do so, subject to the following
+* conditions:
+*
+* The above copyright notice and this permission notice shall be
+* included in all copies or substantial portions of the Software.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
+* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+*/
+
+package org.broadinstitute.sting.utils;
+
+import it.unimi.dsi.fastutil.objects.Object2ObjectMap;
+import it.unimi.dsi.fastutil.objects.Object2ObjectOpenHashMap;
+
+import java.util.*;
+
+/**
+ * Utility class for handling deprecated tools gracefully
+ *
+ * @author vdauwera
+ * @since 3/11/13
+ */
+public class DeprecatedToolChecks {
+
+ // Mapping from walker name to major version number where the walker first disappeared and optional replacement options
+ private static Object2ObjectMap deprecatedGATKWalkers = new Object2ObjectOpenHashMap();
+ static {
+ // Indicate recommended replacement in parentheses if applicable
+ deprecatedGATKWalkers.put("CountCovariates", "2.0 (use BaseRecalibrator instead; see documentation for usage)");
+ deprecatedGATKWalkers.put("AnalyzeCovariates", "2.0 (use BaseRecalibrator instead; see documentation for usage)");
+ deprecatedGATKWalkers.put("TableRecalibration", "2.0 (use PrintReads with -BQSR instead; see documentation for usage)");
+ deprecatedGATKWalkers.put("AlignmentWalker", "2.2 (no replacement)");
+ deprecatedGATKWalkers.put("CountBestAlignments", "2.2 (no replacement)");
+ }
+
+ // Mapping from annotation name to major version number where the annotation first disappeared and optional replacement options
+ private static Object2ObjectMap deprecatedGATKAnnotations = new Object2ObjectOpenHashMap();
+ static {
+ // Same comments as for walkers
+ deprecatedGATKAnnotations.put("DepthOfCoverage", "2.4 (renamed to Coverage)");
+ }
+
+ /**
+ * Utility method to check whether a given walker has been deprecated in a previous GATK release
+ *
+ * @param walkerName the walker class name (not the full package) to check
+ */
+ public static boolean isDeprecatedWalker(final String walkerName) {
+ return deprecatedGATKWalkers.containsKey(walkerName);
+ }
+
+ /**
+ * Utility method to check whether a given annotation has been deprecated in a previous GATK release
+ *
+ * @param annotationName the annotation class name (not the full package) to check
+ */
+ public static boolean isDeprecatedAnnotation(final String annotationName) {
+ return deprecatedGATKAnnotations.containsKey(annotationName);
+ }
+
+ /**
+ * Utility method to pull up the version number at which a walker was deprecated and the suggested replacement, if any
+ *
+ * @param walkerName the walker class name (not the full package) to check
+ */
+ public static String getWalkerDeprecationInfo(final String walkerName) {
+ return deprecatedGATKWalkers.get(walkerName).toString();
+ }
+
+ /**
+ * Utility method to pull up the version number at which an annotation was deprecated and the suggested replacement, if any
+ *
+ * @param annotationName the annotation class name (not the full package) to check
+ */
+ public static String getAnnotationDeprecationInfo(final String annotationName) {
+ return deprecatedGATKAnnotations.get(annotationName).toString();
+ }
+
+}
diff --git a/public/java/src/org/broadinstitute/sting/utils/exceptions/UserException.java b/public/java/src/org/broadinstitute/sting/utils/exceptions/UserException.java
index b3c5bd2c7..fcc132ffe 100644
--- a/public/java/src/org/broadinstitute/sting/utils/exceptions/UserException.java
+++ b/public/java/src/org/broadinstitute/sting/utils/exceptions/UserException.java
@@ -371,14 +371,18 @@ public class UserException extends ReviewedStingException {
}
}
-
-
public static class DeprecatedWalker extends UserException {
public DeprecatedWalker(String walkerName, String version) {
super(String.format("Walker %s is no longer available in the GATK; it has been deprecated since version %s", walkerName, version));
}
}
+ public static class DeprecatedAnnotation extends UserException {
+ public DeprecatedAnnotation(String annotationName, String version) {
+ super(String.format("Annotation %s is no longer available in the GATK; it has been deprecated since version %s", annotationName, version));
+ }
+ }
+
public static class CannotExecuteQScript extends UserException {
public CannotExecuteQScript(String message) {
super(String.format("Unable to execute QScript: " + message));
From ea01dbf1309b56657477c6f0886b577fe0844be3 Mon Sep 17 00:00:00 2001
From: Guillermo del Angel
Date: Tue, 19 Mar 2013 15:26:50 -0400
Subject: [PATCH 050/211] Fix to issue encountered when running HaplotypeCaller
in GGA mode with data from other 1000G callers. In particular, someone
produced a tandem repeat site with 57 alt alleles (sic) which made the caller
blow up. Inelegant fix is to detect if # of alleles is > our max cached
capacity, and if so, emit an informative warning and skip site. -- Added unit
test to UG engine to cover this case. -- Commit to posterity private scala
script currently used for 1000G indel consensus (still very much subject to
changes). GSA-878 #resolve
---
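Reviewer note: the guard added here decides, before any allele-frequency calculation, whether a site can be genotyped at all, and then either emits an empty call or skips the site. The sketch below reproduces that decision with plain integers and booleans. It is only an illustration: AlleleCapSketch, Outcome and decide are hypothetical names, and the cap of 50 used in main is an assumed value chosen for the example, not a statement about GenotypeLikelihoods.MAX_ALT_ALLELES_THAT_CAN_BE_GENOTYPED. Like the patch, the sketch compares the total allele count (reference included) against the cap.

```java
class AlleleCapSketch {
    enum Outcome { GENOTYPE, EMIT_EMPTY_CALL, SKIP }

    // Mirrors the shape of canVCbeGenotyped(): true when the allele count fits under the cap.
    static boolean canBeGenotyped(final int nAlleles, final int cap) {
        return nAlleles <= cap;
    }

    // Mirrors the new early-exit branch: an over-cap site yields an empty
    // call only when all sites must be emitted and full context is available;
    // otherwise the site is skipped (the patch returns null in that case).
    static Outcome decide(final int nAlleles, final int cap,
                          final boolean emitAllSites, final boolean limitedContext) {
        if (!canBeGenotyped(nAlleles, cap)) {
            return (emitAllSites && !limitedContext) ? Outcome.EMIT_EMPTY_CALL : Outcome.SKIP;
        }
        return Outcome.GENOTYPE;
    }

    public static void main(String[] args) {
        final int assumedCap = 50; // illustrative value only
        // A site with 57 alt alleles plus the reference, as described in the commit message.
        System.out.println(decide(58, assumedCap, true, false));
        System.out.println(decide(58, assumedCap, false, false));
        System.out.println(decide(3, assumedCap, false, false));
    }
}
```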
.../genotyper/UnifiedGenotyperEngine.java | 29 +++++++++++++++++++
.../UnifiedGenotyperEngineUnitTest.java | 25 ++++++++++++++++
2 files changed, 54 insertions(+)
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java
index 1d0c10795..4259dbdb6 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngine.java
@@ -385,11 +385,23 @@ public class UnifiedGenotyperEngine {
boolean limitedContext = tracker == null || refContext == null || rawContext == null || stratifiedContexts == null;
+ // TODO TODO TODO TODO
+ // REFACTOR THIS FUNCTION, TOO UNWIELDY!!
+
// initialize the data for this thread if that hasn't been done yet
if ( afcm.get() == null ) {
afcm.set(AFCalcFactory.createAFCalc(UAC, N, logger));
}
+ // if input VC can't be genotyped, exit with either null VCC or, in case where we need to emit all sites, an empty call
+ if (!canVCbeGenotyped(vc)) {
+ if (UAC.OutputMode == OUTPUT_MODE.EMIT_ALL_SITES && !limitedContext)
+ return generateEmptyContext(tracker, refContext, stratifiedContexts, rawContext);
+ else
+ return null;
+
+ }
+
// estimate our confidence in a reference call and return
if ( vc.getNSamples() == 0 ) {
if ( limitedContext )
@@ -544,6 +556,23 @@ public class UnifiedGenotyperEngine {
return new VariantCallContext(vcCall, confidentlyCalled(phredScaledConfidence, PoFGT0));
}
+ /**
+ * Determine whether input VC to calculateGenotypes() can be genotyped and AF can be computed.
+ * @param vc Input VC
+ * @return True if the VC can be genotyped (its allele count does not exceed the supported maximum). False otherwise.
+ */
+ @Requires("vc != null")
+ protected boolean canVCbeGenotyped(final VariantContext vc) {
+ // protect against too many alternate alleles that we can't even run AF on:
+ if (vc.getNAlleles()> GenotypeLikelihoods.MAX_ALT_ALLELES_THAT_CAN_BE_GENOTYPED) {
+ logger.warn("Attempting to genotype more than "+GenotypeLikelihoods.MAX_ALT_ALLELES_THAT_CAN_BE_GENOTYPED +
+ " alleles. Site will be skipped at location "+vc.getChr()+":"+vc.getStart());
+ return false;
+ }
+ else return true;
+
+ }
+
private Map getFilteredAndStratifiedContexts(UnifiedArgumentCollection UAC, ReferenceContext refContext, AlignmentContext rawContext, final GenotypeLikelihoodsCalculationModel.Model model) {
if ( !BaseUtils.isRegularBase(refContext.getBase()) )
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngineUnitTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngineUnitTest.java
index 23596db83..657cd9c0c 100644
--- a/protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngineUnitTest.java
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/genotyper/UnifiedGenotyperEngineUnitTest.java
@@ -50,10 +50,16 @@ package org.broadinstitute.sting.gatk.walkers.genotyper;
// the imports for unit testing.
+import org.apache.commons.lang.ArrayUtils;
import org.broadinstitute.sting.BaseTest;
import org.broadinstitute.sting.gatk.GenomeAnalysisEngine;
import org.broadinstitute.sting.gatk.arguments.GATKArgumentCollection;
import org.broadinstitute.sting.utils.MathUtils;
+import org.broadinstitute.sting.utils.Utils;
+import org.broadinstitute.variant.variantcontext.Allele;
+import org.broadinstitute.variant.variantcontext.GenotypeLikelihoods;
+import org.broadinstitute.variant.variantcontext.VariantContext;
+import org.broadinstitute.variant.variantcontext.VariantContextBuilder;
import org.testng.Assert;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.BeforeMethod;
@@ -102,4 +108,23 @@ public class UnifiedGenotyperEngineUnitTest extends BaseTest {
Assert.assertTrue(MathUtils.goodLog10Probability(ref), "Reference calculation wasn't a well formed log10 prob " + ref);
Assert.assertEquals(ref, expected, TOLERANCE, "Failed reference confidence for single sample");
}
+
+ @Test(enabled=true)
+ public void testTooManyAlleles() {
+
+ for ( int numAltAlleles = 0; numAltAlleles < 100; numAltAlleles++ ) {
+
+ Set<Allele> alleles = new HashSet<Allele>();
+ alleles.add(Allele.create("A", true)); // ref allele
+
+ for (int len = 1; len <=numAltAlleles; len++) {
+ // add alt allele of length len+1
+ alleles.add(Allele.create(Utils.dupString('A', len + 1), false));
+ }
+ final VariantContext vc = new VariantContextBuilder("test", "chr1", 1000, 1000, alleles).make();
+ final boolean result = ugEngine.canVCbeGenotyped(vc);
+ Assert.assertTrue(result == (vc.getNAlleles()<= GenotypeLikelihoods.MAX_ALT_ALLELES_THAT_CAN_BE_GENOTYPED));
+ }
+ }
+
}
\ No newline at end of file
From a783f19ab12060084c9811902365d7629b1631ca Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Wed, 6 Mar 2013 13:45:53 -0500
Subject: [PATCH 052/211] Fix for potential HaplotypeCaller bug in annotation
ordering
-- Annotations were being called on a VariantContext that might need to be trimmed. Simply inverted the order of operations so trimming occurs before the annotations are added.
-- Minor cleanup of call to PairHMM in LikelihoodCalculationEngine
---
.../walkers/haplotypecaller/GenotypingEngine.java | 13 ++++++++-----
.../LikelihoodCalculationEngine.java | 9 ++++++---
2 files changed, 14 insertions(+), 8 deletions(-)
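Note on why the ordering matters: the sketch below is a minimal illustration on plain strings, not the GATKVariantContextUtils implementation. It assumes that reverse trimming strips the trailing bases shared by every allele while leaving each allele at least one base long; the class and method names are hypothetical. Any annotation that depends on allele length would describe the untrimmed alleles if it were computed before this step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ReverseTrimSketch {
    /** Strip trailing bases shared by every allele, keeping each allele at least one base long (assumed semantics). */
    static List<String> reverseTrim(final List<String> alleles) {
        if (alleles.isEmpty()) return new ArrayList<>(alleles);

        int minLen = Integer.MAX_VALUE;
        for (final String a : alleles) minLen = Math.min(minLen, a.length());

        // count how many trailing positions carry the same base in all alleles
        int trim = 0;
        while (trim < minLen - 1 && sameBaseFromEnd(alleles, trim)) trim++;

        final List<String> trimmed = new ArrayList<>(alleles.size());
        for (final String a : alleles) trimmed.add(a.substring(0, a.length() - trim));
        return trimmed;
    }

    /** True if every allele has the same base at the given offset from its last base. */
    private static boolean sameBaseFromEnd(final List<String> alleles, final int offset) {
        final String first = alleles.get(0);
        final char base = first.charAt(first.length() - 1 - offset);
        for (final String a : alleles)
            if (a.charAt(a.length() - 1 - offset) != base) return false;
        return true;
    }

    public static void main(final String[] args) {
        // the shared trailing T is removed from both alleles
        System.out.println(reverseTrim(Arrays.asList("ACGT", "AT")));
    }
}
```

With the annotation step run after trimming, length-dependent values are computed on the alleles that are actually emitted.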
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
index 1cfc65581..400de6485 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/GenotypingEngine.java
@@ -273,16 +273,19 @@ public class GenotypingEngine {
final Map alleleReadMap_annotations = ( USE_FILTERED_READ_MAP_FOR_ANNOTATIONS ? alleleReadMap :
convertHaplotypeReadMapToAlleleReadMap( haplotypeReadMap, alleleMapper, 0.0, UG_engine.getUAC().contaminationLog ) );
final Map stratifiedReadMap = filterToOnlyOverlappingReads( genomeLocParser, alleleReadMap_annotations, perSampleFilteredReadList, call );
- VariantContext annotatedCall = annotationEngine.annotateContext(stratifiedReadMap, call);
+
+ VariantContext annotatedCall = call;
+ // TODO -- should be before annotated call, so that QDL works correctly
+ if( annotatedCall.getAlleles().size() != mergedVC.getAlleles().size() ) { // some alleles were removed so reverseTrimming might be necessary!
+ annotatedCall = GATKVariantContextUtils.reverseTrimAlleles(annotatedCall);
+ }
+
+ annotatedCall = annotationEngine.annotateContext(stratifiedReadMap, annotatedCall);
// maintain the set of all called haplotypes
for ( final Allele calledAllele : call.getAlleles() )
calledHaplotypes.addAll(alleleMapper.get(calledAllele));
- if( annotatedCall.getAlleles().size() != mergedVC.getAlleles().size() ) { // some alleles were removed so reverseTrimming might be necessary!
- annotatedCall = GATKVariantContextUtils.reverseTrimAlleles(annotatedCall);
- }
-
returnCalls.add( annotatedCall );
}
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
index a7d85b969..87b488b3e 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/LikelihoodCalculationEngine.java
@@ -151,9 +151,12 @@ public class LikelihoodCalculationEngine {
final int haplotypeStart = ( previousHaplotypeSeen == null ? 0 : PairHMM.findFirstPositionWhereHaplotypesDiffer(haplotype.getBases(), previousHaplotypeSeen.getBases()) );
previousHaplotypeSeen = haplotype;
- perReadAlleleLikelihoodMap.add(read, alleleVersions.get(haplotype),
- pairHMM.computeReadLikelihoodGivenHaplotypeLog10(haplotype.getBases(), read.getReadBases(),
- readQuals, readInsQuals, readDelQuals, overallGCP, haplotypeStart, jjj == 0));
+ final boolean isFirstHaplotype = jjj == 0;
+ final double log10l = pairHMM.computeReadLikelihoodGivenHaplotypeLog10(haplotype.getBases(),
+ read.getReadBases(), readQuals, readInsQuals, readDelQuals,
+ overallGCP, haplotypeStart, isFirstHaplotype);
+
+ perReadAlleleLikelihoodMap.add(read, alleleVersions.get(haplotype), log10l);
}
}
return perReadAlleleLikelihoodMap;
From 752440707d6005104410ff67f79fe410723df964 Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Wed, 6 Mar 2013 13:52:53 -0500
Subject: [PATCH 053/211] AlignmentUtils.calcNumDifferentBases computes the
number of bases that differ between a reference and read sequence given a
cigar between the two.
---
.../sting/utils/sam/AlignmentUtils.java | 39 +++++++++++++++++++
.../utils/sam/AlignmentUtilsUnitTest.java | 30 +++++++++++++-
2 files changed, 68 insertions(+), 1 deletion(-)
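Note on the counting rule: the following is a self-contained sketch of the same logic, written against plain strings so it runs without net.sf.samtools. The class name, the regex-based CIGAR parser and the String-based signature are assumptions made for illustration only. Aligned positions (M, X, =) add one per mismatching base, insertions and deletions add their full length, and S, N, H and P only move the read or reference index (or nothing).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CigarDiffSketch {
    // one CIGAR element: a length followed by an operator
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHPX=])");

    /** Count the bases at which ref and read differ, given the CIGAR of read against ref. */
    static int countDifferences(final String cigar, final String ref, final String read) {
        int refIdx = 0, readIdx = 0, delta = 0, parsedTo = 0;
        final Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            if (m.start() != parsedTo) throw new IllegalArgumentException("Malformed CIGAR: " + cigar);
            parsedTo = m.end();
            final int len = Integer.parseInt(m.group(1));
            switch (m.group(2).charAt(0)) {
                case 'M': case 'X': case '=':
                    // aligned block: compare base by base
                    for (int j = 0; j < len; j++, refIdx++, readIdx++)
                        if (ref.charAt(refIdx) != read.charAt(readIdx)) delta++;
                    break;
                case 'I': delta += len; readIdx += len; break; // inserted bases all count as differences
                case 'S': readIdx += len; break;               // soft clip: skip read bases only
                case 'D': delta += len; refIdx += len; break;  // deleted bases all count as differences
                case 'N': refIdx += len; break;                // skipped region: skip reference bases only
                default: break;                                // H and P consume neither sequence
            }
        }
        if (parsedTo != cigar.length()) throw new IllegalArgumentException("Malformed CIGAR: " + cigar);
        return delta;
    }

    public static void main(final String[] args) {
        System.out.println(countDifferences("2M3I3M", "ACGTA", "ACNNNGTT"));
    }
}
```

The inputs used here are taken from the CalcNumDifferentBasesData provider in this patch, so the sketch can be checked against the expected values listed there.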
diff --git a/public/java/src/org/broadinstitute/sting/utils/sam/AlignmentUtils.java b/public/java/src/org/broadinstitute/sting/utils/sam/AlignmentUtils.java
index d59d0ef63..58f70d4b6 100644
--- a/public/java/src/org/broadinstitute/sting/utils/sam/AlignmentUtils.java
+++ b/public/java/src/org/broadinstitute/sting/utils/sam/AlignmentUtils.java
@@ -48,6 +48,45 @@ public final class AlignmentUtils {
// cannot be instantiated
private AlignmentUtils() { }
+ /**
+ * Get the number of bases at which refSeq and readSeq differ, given their alignment
+ *
+ * @param cigar the alignment of readSeq to refSeq
+ * @param refSeq the bases of the reference sequence
+ * @param readSeq the bases of the read sequence
+ * @return the number of bases that differ between refSeq and readSeq
+ */
+ public static int calcNumDifferentBases(final Cigar cigar, final byte[] refSeq, final byte[] readSeq) {
+ int refIndex = 0, readIdx = 0, delta = 0;
+
+ for (final CigarElement ce : cigar.getCigarElements()) {
+ final int elementLength = ce.getLength();
+ switch (ce.getOperator()) {
+ case X:case EQ:case M:
+ for (int j = 0; j < elementLength; j++, refIndex++, readIdx++)
+ delta += refSeq[refIndex] != readSeq[readIdx] ? 1 : 0;
+ break;
+ case I:
+ delta += elementLength; // fall through: an insertion also consumes read bases
+ case S:
+ readIdx += elementLength;
+ break;
+ case D:
+ delta += elementLength; // fall through: a deletion also consumes reference bases
+ case N:
+ refIndex += elementLength;
+ break;
+ case H:
+ case P:
+ break;
+ default:
+ throw new ReviewedStingException("The " + ce.getOperator() + " cigar element is not currently supported");
+ }
+ }
+
+ return delta;
+ }
+
public static class MismatchCount {
public int numMismatches = 0;
public long mismatchQualities = 0;
diff --git a/public/java/test/org/broadinstitute/sting/utils/sam/AlignmentUtilsUnitTest.java b/public/java/test/org/broadinstitute/sting/utils/sam/AlignmentUtilsUnitTest.java
index ae01c6c63..660dadc00 100644
--- a/public/java/test/org/broadinstitute/sting/utils/sam/AlignmentUtilsUnitTest.java
+++ b/public/java/test/org/broadinstitute/sting/utils/sam/AlignmentUtilsUnitTest.java
@@ -37,7 +37,7 @@ import org.testng.annotations.Test;
import java.util.*;
public class AlignmentUtilsUnitTest {
- private final static boolean DEBUG = false;
+ private final static boolean DEBUG = true;
private SAMFileHeader header;
/** Basic aligned and mapped read. */
@@ -145,6 +145,34 @@ public class AlignmentUtilsUnitTest {
}
+ @DataProvider(name = "CalcNumDifferentBasesData")
+ public Object[][] makeCalcNumDifferentBasesData() {
+ List<Object[]> tests = new ArrayList<Object[]>();
+
+ tests.add(new Object[]{"5M", "ACGTA", "ACGTA", 0});
+ tests.add(new Object[]{"5M", "ACGTA", "ACGTT", 1});
+ tests.add(new Object[]{"5M", "ACGTA", "TCGTT", 2});
+ tests.add(new Object[]{"5M", "ACGTA", "TTGTT", 3});
+ tests.add(new Object[]{"5M", "ACGTA", "TTTTT", 4});
+ tests.add(new Object[]{"5M", "ACGTA", "TTTCT", 5});
+ tests.add(new Object[]{"2M3I3M", "ACGTA", "ACNNNGTA", 3});
+ tests.add(new Object[]{"2M3I3M", "ACGTA", "ACNNNGTT", 4});
+ tests.add(new Object[]{"2M3I3M", "ACGTA", "TCNNNGTT", 5});
+ tests.add(new Object[]{"2M2D1M", "ACGTA", "ACA", 2});
+ tests.add(new Object[]{"2M2D1M", "ACGTA", "ACT", 3});
+ tests.add(new Object[]{"2M2D1M", "ACGTA", "TCT", 4});
+ tests.add(new Object[]{"2M2D1M", "ACGTA", "TGT", 5});
+
+ return tests.toArray(new Object[][]{});
+ }
+
+ @Test(enabled = true, dataProvider = "CalcNumDifferentBasesData")
+ public void testCalcNumDifferentBases(final String cigarString, final String ref, final String read, final int expectedDifferences) {
+ final Cigar cigar = TextCigarCodec.getSingleton().decode(cigarString);
+ Assert.assertEquals(AlignmentUtils.calcNumDifferentBases(cigar, ref.getBytes(), read.getBytes()), expectedDifferences);
+ }
+
+
@DataProvider(name = "NumAlignedBasesCountingSoftClips")
public Object[][] makeNumAlignedBasesCountingSoftClips() {
List<Object[]> tests = new ArrayList<Object[]>();
From a8fb26bf0167147bae2c3896e41be5049dd0bb48 Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Wed, 6 Mar 2013 21:39:18 -0500
Subject: [PATCH 054/211] A generic downsampler that reduces coverage for a
bunch of reads
-- Exposed the underlying minElementsPerStack parameter for LevelingDownsampler
---
.../gatk/downsampling/DownsamplingUtils.java | 107 ++++++++++++++++++
.../downsampling/LevelingDownsampler.java | 26 ++++-
.../walkers/readutils/DownsampleReadsQC.java | 105 +++++++++++++++++
3 files changed, 235 insertions(+), 3 deletions(-)
create mode 100644 public/java/src/org/broadinstitute/sting/gatk/downsampling/DownsamplingUtils.java
create mode 100644 public/java/src/org/broadinstitute/sting/gatk/walkers/readutils/DownsampleReadsQC.java
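Note on the leveling rule with minElementsPerStack: a minimal sketch of the size-leveling loop, operating on group sizes only. It follows the loop visible in the LevelingDownsampler hunk of this patch; the else branch and the round-robin index advance are not shown in that hunk and are assumptions here, as are the class and method names. The real downsampler then removes items from each group to reach the computed sizes, which this sketch omits.

```java
import java.util.Arrays;

final class LevelingSketch {
    /**
     * Reduce group sizes round-robin until their sum is at most targetSize,
     * never taking a group below minElementsPerStack.
     */
    static int[] levelGroupSizes(final int[] sizes, final int targetSize, final int minElementsPerStack) {
        final int[] result = sizes.clone();
        int total = 0;
        for (final int s : result) total += s;

        int numItemsToRemove = total - targetSize;
        int current = 0;
        int consecutiveUnmodifiable = 0;

        // stop when enough was removed, or when a full pass found no group that may shrink
        while (numItemsToRemove > 0 && consecutiveUnmodifiable < result.length) {
            if (result[current] > minElementsPerStack) {
                result[current]--;
                numItemsToRemove--;
                consecutiveUnmodifiable = 0;
            } else {
                consecutiveUnmodifiable++;
            }
            current = (current + 1) % result.length;
        }
        return result;
    }

    public static void main(final String[] args) {
        System.out.println(Arrays.toString(levelGroupSizes(new int[]{10, 2, 1}, 6, 2)));
    }
}
```

Note that the minimum takes priority over the target: if every group is already at or below minElementsPerStack, the total can stay above targetSize.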
diff --git a/public/java/src/org/broadinstitute/sting/gatk/downsampling/DownsamplingUtils.java b/public/java/src/org/broadinstitute/sting/gatk/downsampling/DownsamplingUtils.java
new file mode 100644
index 000000000..877083829
--- /dev/null
+++ b/public/java/src/org/broadinstitute/sting/gatk/downsampling/DownsamplingUtils.java
@@ -0,0 +1,107 @@
+/*
+ * Copyright (c) 2012 The Broad Institute
+ *
+ * Permission is hereby granted, free of charge, to any person
+ * obtaining a copy of this software and associated documentation
+ * files (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use,
+ * copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following
+ * conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+ * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+ * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+ * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
+ * THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+package org.broadinstitute.sting.gatk.downsampling;
+
+import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
+import org.broadinstitute.sting.utils.sam.ReadUtils;
+
+import java.util.*;
+
+/**
+ * Utilities for using the downsamplers for common tasks
+ *
+ * User: depristo
+ * Date: 3/6/13
+ * Time: 4:26 PM
+ */
+public class DownsamplingUtils {
+ private DownsamplingUtils() { }
+
+ /**
+ * Level the coverage of the reads in each sample to no more than downsampleTo reads, without reducing
+ * coverage at any read start to less than minReadsPerAlignmentStart
+ *
+ * This algorithm can be used to handle the situation where you have lots of coverage in some interval, and
+ * want to reduce the coverage of the big peak down without removing the many reads at the edge of this
+ * interval that are in fact good
+ *
+ * This algorithm operates on the reads of each sample independently.
+ *
+ * @param reads a sorted list of reads
+ * @param downsampleTo the target number of reads to keep per sample
+ * @param minReadsPerAlignmentStart don't reduce the number of reads starting at a specific alignment start
+ * to below this. That is, if this value is 2, we'll never reduce the number
+ * of reads starting at a specific start site to less than 2
+ * @return a sorted list of reads
+ */
+ public static List<GATKSAMRecord> levelCoverageByPosition(final List<GATKSAMRecord> reads, final int downsampleTo, final int minReadsPerAlignmentStart) {
+ if ( reads == null ) throw new IllegalArgumentException("reads must not be null");
+
+ final List<GATKSAMRecord> downsampled = new ArrayList<GATKSAMRecord>(reads.size());
+
+ final Map<String, Map<Integer, List<GATKSAMRecord>>> readsBySampleByStart = partitionReadsBySampleAndStart(reads);
+ for ( final Map<Integer, List<GATKSAMRecord>> readsByPosMap : readsBySampleByStart.values() ) {
+ final LevelingDownsampler<List<GATKSAMRecord>, GATKSAMRecord> downsampler = new LevelingDownsampler<List<GATKSAMRecord>, GATKSAMRecord>(downsampleTo, minReadsPerAlignmentStart);
+ downsampler.submit(readsByPosMap.values());
+ downsampler.signalEndOfInput();
+ for ( final List<GATKSAMRecord> downsampledReads : downsampler.consumeFinalizedItems())
+ downsampled.addAll(downsampledReads);
+ }
+
+ return ReadUtils.sortReadsByCoordinate(downsampled);
+ }
+
+ /**
+ * Build the data structure mapping for each sample -> (position -> reads at position)
+ *
+ * Note that the map position -> reads isn't ordered in any meaningful way
+ *
+ * @param reads a list of sorted reads
+ * @return a map containing the list of reads at each start location, for each sample independently
+ */
+ private static Map<String, Map<Integer, List<GATKSAMRecord>>> partitionReadsBySampleAndStart(final List<GATKSAMRecord> reads) {
+ final Map<String, Map<Integer, List<GATKSAMRecord>>> readsBySampleByStart = new LinkedHashMap<String, Map<Integer, List<GATKSAMRecord>>>();
+
+ for ( final GATKSAMRecord read : reads ) {
+ Map<Integer, List<GATKSAMRecord>> readsByStart = readsBySampleByStart.get(read.getReadGroup().getSample());
+
+ if ( readsByStart == null ) {
+ readsByStart = new LinkedHashMap<Integer, List<GATKSAMRecord>>();
+ readsBySampleByStart.put(read.getReadGroup().getSample(), readsByStart);
+ }
+
+ List<GATKSAMRecord> readsAtStart = readsByStart.get(read.getAlignmentStart());
+ if ( readsAtStart == null ) {
+ readsAtStart = new LinkedList<GATKSAMRecord>();
+ readsByStart.put(read.getAlignmentStart(), readsAtStart);
+ }
+
+ readsAtStart.add(read);
+ }
+
+ return readsBySampleByStart;
+ }
+}
diff --git a/public/java/src/org/broadinstitute/sting/gatk/downsampling/LevelingDownsampler.java b/public/java/src/org/broadinstitute/sting/gatk/downsampling/LevelingDownsampler.java
index 9b4b2adcb..a8a808333 100644
--- a/public/java/src/org/broadinstitute/sting/gatk/downsampling/LevelingDownsampler.java
+++ b/public/java/src/org/broadinstitute/sting/gatk/downsampling/LevelingDownsampler.java
@@ -47,8 +47,8 @@ import java.util.*;
* @author David Roazen
*/
public class LevelingDownsampler<T extends List<E>, E> implements Downsampler<T> {
-
- private int targetSize;
+ private final int minElementsPerStack;
+ private final int targetSize;
private List<T> groups;
@@ -59,12 +59,32 @@ public class LevelingDownsampler<T extends List<E>, E> implements Downsampler<T>
/**
* Construct a LevelingDownsampler
*
+ * Uses the default minElementsPerStack of 1
+ *
* @param targetSize the sum of the sizes of all individual Lists this downsampler is fed may not exceed
* this value -- if it does, items are removed from Lists evenly until the total size
* is <= this value
*/
public LevelingDownsampler( int targetSize ) {
+ this(targetSize, 1);
+ }
+
+ /**
+ * Construct a LevelingDownsampler
+ *
+ * @param targetSize the sum of the sizes of all individual Lists this downsampler is fed may not exceed
+ * this value -- if it does, items are removed from Lists evenly until the total size
+ * is <= this value
+ * @param minElementsPerStack no stack will be reduced below this size during downsampling. That is,
+ * if a stack has only 3 elements and minElementsPerStack is 3, no matter what
+ * we'll not reduce this stack below 3.
+ */
+ public LevelingDownsampler(final int targetSize, final int minElementsPerStack) {
+ if ( targetSize < 0 ) throw new IllegalArgumentException("targetSize must be >= 0 but got " + targetSize);
+ if ( minElementsPerStack < 0 ) throw new IllegalArgumentException("minElementsPerStack must be >= 0 but got " + minElementsPerStack);
+
this.targetSize = targetSize;
+ this.minElementsPerStack = minElementsPerStack;
clear();
reset();
}
@@ -148,7 +168,7 @@ public class LevelingDownsampler<T extends List<E>, E> implements Downsampler<T>
// remove any more items without violating the constraint that all groups must
// be left with at least one item
while ( numItemsToRemove > 0 && numConsecutiveUmodifiableGroups < groupSizes.length ) {
- if ( groupSizes[currentGroupIndex] > 1 ) {
+ if ( groupSizes[currentGroupIndex] > minElementsPerStack ) {
groupSizes[currentGroupIndex]--;
numItemsToRemove--;
numConsecutiveUmodifiableGroups = 0;
diff --git a/public/java/src/org/broadinstitute/sting/gatk/walkers/readutils/DownsampleReadsQC.java b/public/java/src/org/broadinstitute/sting/gatk/walkers/readutils/DownsampleReadsQC.java
new file mode 100644
index 000000000..1141a9164
--- /dev/null
+++ b/public/java/src/org/broadinstitute/sting/gatk/walkers/readutils/DownsampleReadsQC.java
@@ -0,0 +1,105 @@
+/*
+ * Copyright (c) 2012 The Broad Institute
+ *
+ * Permission is hereby granted, free of charge, to any person
+ * obtaining a copy of this software and associated documentation
+ * files (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use,
+ * copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following
+ * conditions:
+ *
+ * The above copyright notice and this permission notice shall be
+ * included in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
+ * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
+ * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
+ * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
+ * THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+ */
+
+package org.broadinstitute.sting.gatk.walkers.readutils;
+
+import org.broadinstitute.sting.commandline.Argument;
+import org.broadinstitute.sting.commandline.Output;
+import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
+import org.broadinstitute.sting.gatk.downsampling.DownsamplingUtils;
+import org.broadinstitute.sting.gatk.io.StingSAMFileWriter;
+import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
+import org.broadinstitute.sting.gatk.walkers.DataSource;
+import org.broadinstitute.sting.gatk.walkers.NanoSchedulable;
+import org.broadinstitute.sting.gatk.walkers.ReadWalker;
+import org.broadinstitute.sting.gatk.walkers.Requires;
+import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.LinkedList;
+
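+/** QC walker that collects all reads and writes them out after leveling coverage with DownsamplingUtils.levelCoverageByPosition
+ */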
+@Requires({DataSource.READS, DataSource.REFERENCE})
+public class DownsampleReadsQC extends ReadWalker<GATKSAMRecord, Collection<GATKSAMRecord>> implements NanoSchedulable {
+ @Output(doc="Write output to this BAM filename instead of STDOUT", required = true)
+ StingSAMFileWriter out;
+
+ @Argument(fullName = "minReadsPerAlignmentStart", shortName = "minReadsPerAlignmentStart", doc ="Never reduce the number of reads starting at a given alignment start below this value", required = false)
+ private int minReadsPerAlignmentStart = 5;
+
+ @Argument(fullName = "downsampleTo", shortName = "downsampleTo", doc ="Target number of reads to keep per sample", required = false)
+ private int downsampleTo = 1000;
+
+ /**
+ * The initialize function.
+ */
+ public void initialize() {
+// final boolean preSorted = true;
+// if (getToolkit() != null && getToolkit().getArguments().BQSR_RECAL_FILE != null && !NO_PG_TAG ) {
+// Utils.setupWriter(out, getToolkit(), getToolkit().getSAMFileHeader(), !preSorted, keep_records, this, PROGRAM_RECORD_NAME);
+// }
+ }
+
+ /**
+ * The reads map function.
+ *
+ * @param ref the reference bases that correspond to our read, if a reference was provided
+ * @param readIn the read itself, as a GATKSAMRecord
+ * @return the read itself
+ */
+ public GATKSAMRecord map( ReferenceContext ref, GATKSAMRecord readIn, RefMetaDataTracker metaDataTracker ) {
+ return readIn;
+ }
+
+ /**
+ * reduceInit is called once before any calls to the map function. We use it here to create the
+ * collection that accumulates the reads seen during the traversal
+ *
+ * @return an empty collection of reads
+ */
+ public Collection<GATKSAMRecord> reduceInit() {
+ return new LinkedList<GATKSAMRecord>();
+ }
+
+ /**
+ * Given a read and the reads accumulated so far, reduce by adding the read to the collection
+ *
+ * @param read the read itself
+ * @param output the reads accumulated so far
+ * @return the same collection, so that the next reduce can add to it
+ */
+ public Collection<GATKSAMRecord> reduce( GATKSAMRecord read, Collection<GATKSAMRecord> output ) {
+ output.add(read);
+ return output;
+ }
+
+ @Override
+ public void onTraversalDone(Collection<GATKSAMRecord> result) {
+ for ( final GATKSAMRecord read : DownsamplingUtils.levelCoverageByPosition(new ArrayList<GATKSAMRecord>(result), downsampleTo, minReadsPerAlignmentStart) )
+ out.addAlignment(read);
+ }
+}
From ffea6dd95f34de0c979273c0783d6da75bbe16f0 Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Mon, 18 Mar 2013 17:06:32 -0400
Subject: [PATCH 055/211] HaplotypeCaller now has the ability to only consider
the best N haplotypes for genotyping
-- Added a -dontGenotype mode for testing assembly efficiency
-- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted
---
.../haplotypecaller/DeBruijnAssembler.java | 74 +++++++++++++------
.../haplotypecaller/HaplotypeCaller.java | 22 +++++-
.../broadinstitute/sting/utils/Haplotype.java | 32 +++++++-
3 files changed, 101 insertions(+), 27 deletions(-)
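Note on the best-N selection: a minimal, self-contained sketch of the idea behind selectHighestScoringHaplotypes. The ScoredHaplotype type is a hypothetical stand-in for Haplotype, and the descending-by-score ordering is an assumption about what Haplotype.ScoreComparator does, since that comparator is not shown in this part of the patch. Unlike the patched method, the sketch copies the sublist so the result does not keep a view onto the sorted list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class BestHaplotypesSketch {
    /** Hypothetical stand-in for Haplotype: bases plus the score of the path that produced them. */
    static final class ScoredHaplotype {
        final String bases;
        final double score;
        ScoredHaplotype(final String bases, final double score) { this.bases = bases; this.score = score; }
    }

    /** Return at most maxToConsider haplotypes, keeping the highest scoring ones. */
    static List<ScoredHaplotype> selectHighestScoring(final List<ScoredHaplotype> haplotypes, final int maxToConsider) {
        if (haplotypes.size() <= maxToConsider)
            return haplotypes;

        // sort a copy so the caller's list keeps its original order
        final List<ScoredHaplotype> sorted = new ArrayList<>(haplotypes);
        Collections.sort(sorted, new Comparator<ScoredHaplotype>() {
            @Override
            public int compare(final ScoredHaplotype a, final ScoredHaplotype b) {
                return Double.compare(b.score, a.score); // assumed: higher score is better
            }
        });
        return new ArrayList<>(sorted.subList(0, maxToConsider));
    }

    public static void main(final String[] args) {
        final List<ScoredHaplotype> all = Arrays.asList(
                new ScoredHaplotype("ACGT", 1.0), new ScoredHaplotype("ACGA", 5.0), new ScoredHaplotype("ACGC", 3.0));
        for (final ScoredHaplotype h : selectHighestScoring(all, 2))
            System.out.println(h.bases + " score " + h.score);
    }
}
```

One design point worth checking in the real code: a pure score cutoff gives no special treatment to the reference haplotype, so whether it survives the cut depends on its score.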
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
index 566605a8c..bf08d1526 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
@@ -52,6 +52,7 @@ import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.CigarOperator;
import org.apache.commons.lang.ArrayUtils;
+import org.apache.log4j.Logger;
import org.broadinstitute.sting.utils.GenomeLoc;
import org.broadinstitute.sting.utils.Haplotype;
import org.broadinstitute.sting.utils.MathUtils;
@@ -73,6 +74,7 @@ import java.util.*;
*/
public class DeBruijnAssembler extends LocalAssemblyEngine {
+ private final static Logger logger = Logger.getLogger(DeBruijnAssembler.class);
private static final int KMER_OVERLAP = 5; // the additional size of a valid chunk of sequence, used to string together k-mers
private static final int NUM_BEST_PATHS_PER_KMER_GRAPH = 11;
@@ -85,18 +87,20 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
private static final double SW_GAP = -22.0; //-1.0-1.0/3.0;
private static final double SW_GAP_EXTEND = -1.2; //-1.0/.0;
- private final boolean DEBUG;
- private final PrintStream GRAPH_WRITER;
+ private final boolean debug;
+ private final PrintStream graphWriter;
private final List<DeBruijnAssemblyGraph> graphs = new ArrayList<DeBruijnAssemblyGraph>();
- private final int MIN_KMER;
+ private final int minKmer;
+ private final int maxHaplotypesToConsider;
private int PRUNE_FACTOR = 2;
- public DeBruijnAssembler(final boolean debug, final PrintStream graphWriter, final int minKmer) {
+ public DeBruijnAssembler(final boolean debug, final PrintStream graphWriter, final int minKmer, final int maxHaplotypesToConsider) {
super();
- DEBUG = debug;
- GRAPH_WRITER = graphWriter;
- MIN_KMER = minKmer;
+ this.debug = debug;
+ this.graphWriter = graphWriter;
+ this.minKmer = minKmer;
+ this.maxHaplotypesToConsider = maxHaplotypesToConsider;
}
/**
@@ -123,7 +127,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
createDeBruijnGraphs( activeRegion.getReads(), refHaplotype );
// print the graphs if the appropriate debug option has been turned on
- if( GRAPH_WRITER != null ) {
+ if( graphWriter != null ) {
printGraphs();
}
@@ -136,11 +140,12 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
graphs.clear();
final int maxKmer = ReadUtils.getMaxReadLength(reads) - KMER_OVERLAP - 1;
- if( maxKmer < MIN_KMER ) { return; } // Reads are too small for assembly so don't try to create any assembly graphs
+ if( maxKmer < minKmer) { return; } // Reads are too small for assembly so don't try to create any assembly graphs
// create the graph for each possible kmer
- for( int kmer = maxKmer; kmer >= MIN_KMER; kmer -= GRAPH_KMER_STEP ) {
- final DeBruijnAssemblyGraph graph = createGraphFromSequences( reads, kmer, refHaplotype, DEBUG );
+ for( int kmer = maxKmer; kmer >= minKmer; kmer -= GRAPH_KMER_STEP ) {
+ //if ( debug ) logger.info("Creating de Bruijn graph for " + kmer + " kmer using " + reads.size() + " reads");
+ final DeBruijnAssemblyGraph graph = createGraphFromSequences( reads, kmer, refHaplotype, debug);
if( graph != null ) { // graphs that fail during creation ( for example, because there are cycles in the reference graph ) will show up here as a null graph object
// do a series of steps to clean up the raw assembly graph to make it analysis-ready
pruneGraph(graph, PRUNE_FACTOR);
@@ -320,22 +325,22 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
protected void printGraphs() {
- GRAPH_WRITER.println("digraph assemblyGraphs {");
+ graphWriter.println("digraph assemblyGraphs {");
for( final DeBruijnAssemblyGraph graph : graphs ) {
for( final DeBruijnEdge edge : graph.edgeSet() ) {
if( edge.getMultiplicity() > PRUNE_FACTOR ) {
- GRAPH_WRITER.println("\t" + graph.getEdgeSource(edge).toString() + " -> " + graph.getEdgeTarget(edge).toString() + " [" + (edge.getMultiplicity() <= PRUNE_FACTOR ? "style=dotted,color=grey" : "label=\""+ edge.getMultiplicity() +"\"") + "];");
+ graphWriter.println("\t" + graph.getEdgeSource(edge).toString() + " -> " + graph.getEdgeTarget(edge).toString() + " [" + (edge.getMultiplicity() <= PRUNE_FACTOR ? "style=dotted,color=grey" : "label=\"" + edge.getMultiplicity() + "\"") + "];");
}
if( edge.isRef() ) {
- GRAPH_WRITER.println("\t" + graph.getEdgeSource(edge).toString() + " -> " + graph.getEdgeTarget(edge).toString() + " [color=red];");
+ graphWriter.println("\t" + graph.getEdgeSource(edge).toString() + " -> " + graph.getEdgeTarget(edge).toString() + " [color=red];");
}
if( !edge.isRef() && edge.getMultiplicity() <= PRUNE_FACTOR ) { System.out.println("Graph pruning warning!"); }
}
for( final DeBruijnVertex v : graph.vertexSet() ) {
- GRAPH_WRITER.println("\t" + v.toString() + " [label=\"" + new String(graph.getAdditionalSequence(v)) + "\"]");
+ graphWriter.println("\t" + v.toString() + " [label=\"" + new String(graph.getAdditionalSequence(v)) + "\"]");
}
}
- GRAPH_WRITER.println("}");
+ graphWriter.println("}");
}
@Requires({"refWithPadding.length > refHaplotype.getBases().length", "refLoc.containsP(activeRegionWindow)"})
@@ -343,6 +348,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
private List<Haplotype> findBestPaths( final Haplotype refHaplotype, final byte[] refWithPadding, final GenomeLoc refLoc, final List<VariantContext> activeAllelesToGenotype, final GenomeLoc activeRegionWindow ) {
// add the reference haplotype separately from all the others to ensure that it is present in the list of haplotypes
+ // TODO -- this use of a list with contains() below may be a performance problem, resulting in an O(N^2) algorithm
final List<Haplotype> returnHaplotypes = new ArrayList<Haplotype>();
refHaplotype.setAlignmentStartHapwrtRef(activeRegionWindow.getStart() - refLoc.getStart());
final Cigar c = new Cigar();
@@ -383,7 +389,8 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
if( !returnHaplotypes.contains(h) ) {
h.setAlignmentStartHapwrtRef(activeRegionStart);
- h.setCigar( leftAlignedCigar );
+ h.setCigar(leftAlignedCigar);
+ h.setScore(path.getScore());
returnHaplotypes.add(h);
// for GGA mode, add the desired allele into the haplotype if it isn't already present
@@ -409,18 +416,39 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
}
- if( DEBUG ) {
- if( returnHaplotypes.size() > 1 ) {
- System.out.println("Found " + returnHaplotypes.size() + " candidate haplotypes to evaluate every read against.");
+ final List<Haplotype> finalHaplotypes = selectHighestScoringHaplotypes(returnHaplotypes);
+ if ( finalHaplotypes.size() < returnHaplotypes.size() )
+ logger.info("Found " + finalHaplotypes.size() + " candidate haplotypes of " + returnHaplotypes.size() + " possible combinations to evaluate every read against at " + refLoc);
+
+ if( debug ) {
+ if( finalHaplotypes.size() > 1 ) {
+ System.out.println("Found " + finalHaplotypes.size() + " candidate haplotypes of " + returnHaplotypes.size() + " possible combinations to evaluate every read against.");
} else {
System.out.println("Found only the reference haplotype in the assembly graph.");
}
- for( final Haplotype h : returnHaplotypes ) {
+ for( final Haplotype h : finalHaplotypes ) {
System.out.println( h.toString() );
- System.out.println( "> Cigar = " + h.getCigar() + " : " + h.getCigar().getReferenceLength() );
+ System.out.println( "> Cigar = " + h.getCigar() + " : " + h.getCigar().getReferenceLength() + " score " + h.getScore() );
}
}
- return returnHaplotypes;
+
+ return finalHaplotypes;
+ }
+
+ /**
+ * Select the best scoring haplotypes among all present, returning no more than maxHaplotypesToConsider
+ *
+ * @param haplotypes a list of haplotypes to consider
+ * @return a sublist of the best haplotypes, with size() <= maxHaplotypesToConsider
+ */
+ private List<Haplotype> selectHighestScoringHaplotypes(final List<Haplotype> haplotypes) {
+ if ( haplotypes.size() <= maxHaplotypesToConsider )
+ return haplotypes;
+ else {
+ final List<Haplotype> sorted = new ArrayList<Haplotype>(haplotypes);
+ Collections.sort(sorted, new Haplotype.ScoreComparator());
+ return sorted.subList(0, maxHaplotypesToConsider);
+ }
}
/**
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
index 4fc075807..cff631802 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
@@ -55,6 +55,7 @@ import org.broadinstitute.sting.gatk.contexts.AlignmentContext;
import org.broadinstitute.sting.gatk.contexts.AlignmentContextUtils;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.downsampling.DownsampleType;
+import org.broadinstitute.sting.gatk.downsampling.DownsamplingUtils;
import org.broadinstitute.sting.gatk.filters.BadMateFilter;
import org.broadinstitute.sting.gatk.io.StingSAMFileWriter;
import org.broadinstitute.sting.gatk.iterators.ReadTransformer;
@@ -205,6 +206,10 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
@Argument(fullName="minKmer", shortName="minKmer", doc="Minimum kmer length to use in the assembly graph", required = false)
protected int minKmer = 11;
+ @Advanced
+ @Argument(fullName="maxHaplotypesToConsider", shortName="maxHaplotypesToConsider", doc="Maximum number of haplotypes to consider in the likelihood calculation. Setting this number too high can have dramatic performance implications", required = false)
+ protected int maxHaplotypesToConsider = 100000;
+
/**
* If this flag is provided, the haplotype caller will include unmapped reads in the assembly and calling
* when these reads occur in the region being analyzed. Typically, for paired end analyses, one pair of the
@@ -227,6 +232,10 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
@Argument(fullName="justDetermineActiveRegions", shortName="justDetermineActiveRegions", doc = "If specified, the HC won't actually do any assembly or calling, it'll just run the upfront active region determination code. Useful for benchmarking and scalability testing", required=false)
protected boolean justDetermineActiveRegions = false;
+ @Hidden
+ @Argument(fullName="dontGenotype", shortName="dontGenotype", doc = "If specified, the HC will do the assembly but won't do calling. Useful for benchmarking and scalability testing", required=false)
+ protected boolean dontGenotype = false;
+
/**
* rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate.
* dbSNP is not used in any way for the calculations themselves.
@@ -296,6 +305,9 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
// reference base padding size
private static final int REFERENCE_PADDING = 500;
+ private final static int maxReadsInRegionPerSample = 1000; // TODO -- should be an argument
+ private final static int minReadsPerAlignmentStart = 5; // TODO -- should be an argument
+
// bases with quality less than or equal to this value are trimmed off the tails of the reads
private static final byte MIN_TAIL_QUALITY = 20;
@@ -374,7 +386,7 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
throw new UserException.CouldNotReadInputFile(getToolkit().getArguments().referenceFile, e);
}
- assemblyEngine = new DeBruijnAssembler( DEBUG, graphWriter, minKmer );
+ assemblyEngine = new DeBruijnAssembler( DEBUG, graphWriter, minKmer, maxHaplotypesToConsider );
likelihoodCalculationEngine = new LikelihoodCalculationEngine( (byte)gcpHMM, DEBUG, pairHMM );
genotypingEngine = new GenotypingEngine( DEBUG, annotationEngine, USE_FILTERED_READ_MAP_FOR_ANNOTATIONS );
@@ -514,6 +526,9 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
// sort haplotypes to take full advantage of haplotype start offset optimizations in PairHMM
Collections.sort( haplotypes, new Haplotype.HaplotypeBaseComparator() );
+ if (dontGenotype)
+ return 1;
+
// evaluate each sample's reads against all haplotypes
final Map stratifiedReadMap = likelihoodCalculationEngine.computeReadLikelihoods( haplotypes, splitReadsBySample( activeRegion.getReads() ) );
final Map> perSampleFilteredReadList = splitReadsBySample( filteredReads );
@@ -575,7 +590,7 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
//
//---------------------------------------------------------------------------------------------------------------
- private void finalizeActiveRegion( final org.broadinstitute.sting.utils.activeregion.ActiveRegion activeRegion ) {
+ private void finalizeActiveRegion( final ActiveRegion activeRegion ) {
if( DEBUG ) { System.out.println("\nAssembling " + activeRegion.getLocation() + " with " + activeRegion.size() + " reads: (with overlap region = " + activeRegion.getExtendedLoc() + ")"); }
final List finalizedReadList = new ArrayList();
final FragmentCollection fragmentCollection = FragmentUtils.create( activeRegion.getReads() );
@@ -599,7 +614,8 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
}
}
}
- activeRegion.addAll(ReadUtils.sortReadsByCoordinate(readsToUse));
+
+ activeRegion.addAll(DownsamplingUtils.levelCoverageByPosition(ReadUtils.sortReadsByCoordinate(readsToUse), maxReadsInRegionPerSample, minReadsPerAlignmentStart));
}
private List filterNonPassingReads( final org.broadinstitute.sting.utils.activeregion.ActiveRegion activeRegion ) {
diff --git a/public/java/src/org/broadinstitute/sting/utils/Haplotype.java b/public/java/src/org/broadinstitute/sting/utils/Haplotype.java
index 415cb73ac..070ae4f5d 100644
--- a/public/java/src/org/broadinstitute/sting/utils/Haplotype.java
+++ b/public/java/src/org/broadinstitute/sting/utils/Haplotype.java
@@ -41,12 +41,12 @@ import java.io.Serializable;
import java.util.*;
public class Haplotype extends Allele {
-
private GenomeLoc genomeLocation = null;
private Map eventMap = null;
private Cigar cigar;
private int alignmentStartHapwrtRef;
private Event artificialEvent = null;
+ private double score = 0;
/**
* Main constructor
@@ -259,4 +259,34 @@ public class Haplotype extends Allele {
this.pos = pos;
}
}
+
+ /**
+ * Get the score (an estimate of the support) of this haplotype
+ * @return a double, where higher values are better
+ */
+ public double getScore() {
+ return this.isReference() ? Double.MAX_VALUE : score;
+ }
+
+ /**
+ * Set the score (an estimate of the support) of this haplotype.
+ *
+ * Note that if this is the reference haplotype it is always given Double.MAX_VALUE score
+ *
+ * @param score a double, where higher values are better
+ */
+ public void setScore(double score) {
+ this.score = this.isReference() ? Double.MAX_VALUE : score;
+ }
+
+ /**
+ * A comparator that sorts haplotypes in decreasing order of score, so that the best supported
+ * haplotypes are at the top
+ */
+ public static class ScoreComparator implements Comparator<Haplotype> {
+ @Override
+ public int compare(Haplotype o1, Haplotype o2) {
+ return -1 * Double.valueOf(o1.getScore()).compareTo(o2.getScore());
+ }
+ }
}
From 53a904bcbd8ec63420a76e98e7dda6432d2907f8 Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Fri, 8 Mar 2013 11:28:22 -0500
Subject: [PATCH 056/211] Bugfix for HaplotypeCaller: GSA-822 for trimming
softclipped reads
-- Previous version would not trim down soft clip bases that extend beyond the active region, causing the assembly graph to go haywire. The new code explicitly reverts soft clips to M bases with the ever useful ReadClipper, and then trims. Note this isn't a 100% fix for the issue, as it's possible that the newly unclipped bases might in reality extend beyond the active region, should their true alignment include a deletion in the reference. Needs to be fixed. JIRA added
-- See https://jira.broadinstitute.org/browse/GSA-822
-- #resolve #fix GSA-822
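
The effect of reverting soft clips before trimming can be illustrated with a small standalone sketch. It does not use the real ReadClipper or GATKSAMRecord classes: ToyRead, Element, revertSoftClips and basesInsideRegion are hypothetical names invented for this illustration, and the toy model only understands M and S operators. A read aligned at 105 with CIGAR 5S20M3S looks as if it ends at 124, inside a region ending at 125, although its clipped tail reaches 127. Once the clips are treated as matches the true span (100 to 127) is visible and can be trimmed.

```java
import java.util.ArrayList;
import java.util.List;

public class SoftClipRevertSketch {
    /** One CIGAR element. Hypothetical stand-in, not the SAM-JDK class. */
    static final class Element {
        final char op;
        final int length;
        Element(final char op, final int length) { this.op = op; this.length = length; }
    }

    /** Toy read: 1-based start of the first aligned base plus a CIGAR of M and S elements only. */
    static final class ToyRead {
        final int alignmentStart;
        final List<Element> cigar;
        ToyRead(final int alignmentStart, final List<Element> cigar) {
            this.alignmentStart = alignmentStart;
            this.cigar = cigar;
        }

        /** Last reference position covered by aligned (M) bases. */
        int alignmentEnd() {
            int refLength = 0;
            for (final Element e : cigar)
                if (e.op == 'M') refLength += e.length;
            return alignmentStart + refLength - 1;
        }
    }

    /** Turn every S element into M; a leading clip also moves the alignment start left. */
    static ToyRead revertSoftClips(final ToyRead read) {
        final List<Element> reverted = new ArrayList<Element>();
        int start = read.alignmentStart;
        for (int i = 0; i < read.cigar.size(); i++) {
            final Element e = read.cigar.get(i);
            if (e.op == 'S') {
                if (i == 0) start -= e.length; // leading clip extends the read to the left
                reverted.add(new Element('M', e.length));
            } else {
                reverted.add(e);
            }
        }
        return new ToyRead(start, reverted);
    }

    /** Number of aligned bases that fall inside [regionStart, regionStop]. */
    static int basesInsideRegion(final ToyRead read, final int regionStart, final int regionStop) {
        final int first = Math.max(read.alignmentStart, regionStart);
        final int last = Math.min(read.alignmentEnd(), regionStop);
        return Math.max(0, last - first + 1);
    }

    public static void main(final String[] args) {
        final List<Element> cigar = new ArrayList<Element>();
        cigar.add(new Element('S', 5));
        cigar.add(new Element('M', 20));
        cigar.add(new Element('S', 3));
        final ToyRead read = new ToyRead(105, cigar);
        final ToyRead reverted = revertSoftClips(read);

        System.out.println("before: " + read.alignmentStart + "-" + read.alignmentEnd());
        System.out.println("after:  " + reverted.alignmentStart + "-" + reverted.alignmentEnd());
        System.out.println("kept in 100-125: " + basesInsideRegion(reverted, 100, 125));
    }
}
```

As the message above notes, this is only an approximation: if the true alignment of the unclipped bases contains a deletion, their real reference span is longer than the match-only span computed here.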
---
.../haplotypecaller/DeBruijnAssembler.java | 18 +++++++++++--
.../DeBruijnAssemblyGraph.java | 27 ++++++++++++++++---
.../haplotypecaller/HaplotypeCaller.java | 12 +++++++++
3 files changed, 52 insertions(+), 5 deletions(-)
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
index bf08d1526..33198ce8c 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
@@ -271,9 +271,10 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
@Requires({"reads != null", "KMER_LENGTH > 0", "refHaplotype != null"})
protected static DeBruijnAssemblyGraph createGraphFromSequences( final List<GATKSAMRecord> reads, final int KMER_LENGTH, final Haplotype refHaplotype, final boolean DEBUG ) {
- final DeBruijnAssemblyGraph graph = new DeBruijnAssemblyGraph();
+ final DeBruijnAssemblyGraph graph = new DeBruijnAssemblyGraph(KMER_LENGTH);
// First pull kmers from the reference haplotype and add them to the graph
+ //logger.info("Adding reference sequence to graph " + refHaplotype.getBaseString());
final byte[] refSequence = refHaplotype.getBases();
if( refSequence.length >= KMER_LENGTH + KMER_OVERLAP ) {
final int kmersInSequence = refSequence.length - KMER_LENGTH + 1;
@@ -289,6 +290,8 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
// Next pull kmers out of every read and throw them on the graph
for( final GATKSAMRecord read : reads ) {
+ //if ( ! read.getReadName().equals("H06JUADXX130110:1:1213:15422:11590")) continue;
+ //logger.info("Adding read " + read + " with sequence " + read.getReadString());
final byte[] sequence = read.getReadBases();
final byte[] qualities = read.getBaseQualities();
final byte[] reducedReadCounts = read.getReducedReadCounts(); // will be null if read is not reduced
@@ -325,8 +328,16 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
protected void printGraphs() {
+ final boolean onlyWriteOneGraph = false; // debugging flag -- if true we'll only write a graph for a single kmer size
+ final int writeFirstGraphWithSizeSmallerThan = 50;
+
graphWriter.println("digraph assemblyGraphs {");
for( final DeBruijnAssemblyGraph graph : graphs ) {
+ if ( onlyWriteOneGraph && graph.getKmerSize() >= writeFirstGraphWithSizeSmallerThan ) {
+ logger.info("Skipping writing of graph with kmersize " + graph.getKmerSize());
+ continue;
+ }
+
for( final DeBruijnEdge edge : graph.edgeSet() ) {
if( edge.getMultiplicity() > PRUNE_FACTOR ) {
graphWriter.println("\t" + graph.getEdgeSource(edge).toString() + " -> " + graph.getEdgeTarget(edge).toString() + " [" + (edge.getMultiplicity() <= PRUNE_FACTOR ? "style=dotted,color=grey" : "label=\"" + edge.getMultiplicity() + "\"") + "];");
@@ -337,8 +348,11 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
if( !edge.isRef() && edge.getMultiplicity() <= PRUNE_FACTOR ) { System.out.println("Graph pruning warning!"); }
}
for( final DeBruijnVertex v : graph.vertexSet() ) {
- graphWriter.println("\t" + v.toString() + " [label=\"" + new String(graph.getAdditionalSequence(v)) + "\"]");
+ graphWriter.println("\t" + v.toString() + " [label=\"" + new String(graph.getAdditionalSequence(v)) + "\",shape=box]");
}
+
+ if ( onlyWriteOneGraph )
+ break;
}
graphWriter.println("}");
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
index 6a95049d1..d28f81b55 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
@@ -47,9 +47,7 @@
package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
import com.google.java.contract.Ensures;
-import com.google.java.contract.Requires;
import org.apache.commons.lang.ArrayUtils;
-import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import org.jgrapht.graph.DefaultDirectedGraph;
import java.io.PrintStream;
@@ -62,9 +60,32 @@ import java.util.Arrays;
*/
public class DeBruijnAssemblyGraph extends DefaultDirectedGraph {
+ private final int kmerSize;
- public DeBruijnAssemblyGraph() {
+ /**
+ * Construct a DeBruijnAssemblyGraph with kmerSize
+ * @param kmerSize
+ */
+ public DeBruijnAssemblyGraph(final int kmerSize) {
super(DeBruijnEdge.class);
+
+ if ( kmerSize < 1 ) throw new IllegalArgumentException("kmerSize must be >= 1 but got " + kmerSize);
+ this.kmerSize = kmerSize;
+ }
+
+ /**
+ * Test constructor that makes a DeBruijnAssemblyGraph assuming a kmerSize of 11
+ */
+ protected DeBruijnAssemblyGraph() {
+ this(11);
+ }
+
+ /**
+ * How big of a kmer did we use to create this graph?
+ * @return
+ */
+ public int getKmerSize() {
+ return kmerSize;
}
/**
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
index cff631802..affad6450 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
@@ -608,8 +608,20 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
final GATKSAMRecord postAdapterRead = ( myRead.getReadUnmappedFlag() ? myRead : ReadClipper.hardClipAdaptorSequence( myRead ) );
if( postAdapterRead != null && !postAdapterRead.isEmpty() && postAdapterRead.getCigar().getReadLength() > 0 ) {
GATKSAMRecord clippedRead = ReadClipper.hardClipLowQualEnds( postAdapterRead, MIN_TAIL_QUALITY );
+
+ // revert soft clips so that we see the alignment start and end assuming the soft clips are all matches
+ // TODO -- WARNING -- still possibility that unclipping the soft clips will introduce bases that aren't
+ // TODO -- truly in the extended region, as the unclipped bases might actually include a deletion
+ // TODO -- w.r.t. the reference. What really needs to happen is that kmers that occur before the
+ // TODO -- reference haplotype start must be removed
+ clippedRead = ReadClipper.revertSoftClippedBases(clippedRead);
+
+ // uncomment to remove soft-clipped bases from consideration entirely
+ //clippedRead = ReadClipper.hardClipSoftClippedBases(clippedRead);
+
clippedRead = ReadClipper.hardClipToRegion( clippedRead, activeRegion.getExtendedLoc().getStart(), activeRegion.getExtendedLoc().getStop() );
if( activeRegion.readOverlapsRegion(clippedRead) && clippedRead.getReadLength() > 0 ) {
+ //logger.info("Keeping read " + clippedRead + " start " + clippedRead.getAlignmentStart() + " end " + clippedRead.getAlignmentEnd());
readsToUse.add(clippedRead);
}
}
From 0f4328f6fe0bdb08e0d82553a27bd2fd0d5668d5 Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Fri, 8 Mar 2013 13:10:15 -0500
Subject: [PATCH 057/211] Basic kmer error correction algorithm for the
HaplotypeCaller
-- Error correction algorithm for the assembler. Only error correct reads to others that are exactly 1 mismatch away
-- The assembler logic is now: build initial graph, error correct*, merge nodes*, prune dead nodes, merge again, make haplotypes. The * elements are new
-- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing.
-- Easier way to control the testing of the graph assembly algorithms
-- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler
-- Simple protected parsing function for making DeBruijnAssemblyGraph
-- Change the default prune factor for the graph to 1, from 2
-- debugging graph transformations are controllable from command line
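
The "exactly 1 mismatch away" rule can be sketched in a few lines of standalone Java. This is not the KMerErrorCorrector added by this patch: the class name KmerCorrectionSketch, the minCount threshold, and the tie-break (prefer the best-supported neighbour) are assumptions made for illustration only. Kmers seen fewer than minCount times are mapped to a well-supported kmer of the same length that differs at exactly one position; rare kmers with no such neighbour are left alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KmerCorrectionSketch {
    /** Hamming distance between two kmers of equal length. */
    static int mismatches(final String a, final String b) {
        if (a.length() != b.length())
            throw new IllegalArgumentException("kmers must have equal length");
        int n = 0;
        for (int i = 0; i < a.length(); i++)
            if (a.charAt(i) != b.charAt(i)) n++;
        return n;
    }

    /**
     * Map each rare kmer (count < minCount) to the best-supported kmer
     * (count >= minCount) that is exactly one mismatch away, if any.
     * The threshold and the tie-break are assumptions of this sketch.
     */
    static Map<String, String> buildCorrections(final Map<String, Integer> counts, final int minCount) {
        final Map<String, String> corrections = new LinkedHashMap<String, String>();
        for (final Map.Entry<String, Integer> rare : counts.entrySet()) {
            if (rare.getValue() >= minCount) continue; // already well supported
            String best = null;
            int bestCount = -1;
            for (final Map.Entry<String, Integer> solid : counts.entrySet()) {
                if (solid.getValue() < minCount) continue; // only correct towards supported kmers
                if (mismatches(rare.getKey(), solid.getKey()) == 1 && solid.getValue() > bestCount) {
                    best = solid.getKey();
                    bestCount = solid.getValue();
                }
            }
            if (best != null) corrections.put(rare.getKey(), best);
        }
        return corrections;
    }

    public static void main(final String[] args) {
        final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        counts.put("ACGTA", 10);
        counts.put("ACGAA", 8);
        counts.put("ACGTT", 1); // one mismatch from ACGTA, two from ACGAA
        counts.put("GGGGG", 1); // no supported neighbour, left uncorrected
        System.out.println(buildCorrections(counts, 2)); // prints {ACGTT=ACGTA}
    }
}
```

The pairwise scan is quadratic in the number of distinct kmers, which is acceptable for a sketch but would need an index for large graphs.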
---
.../haplotypecaller/DeBruijnAssembler.java | 107 ++++++--
.../DeBruijnAssemblyGraph.java | 115 ++++++--
.../haplotypecaller/DeBruijnVertex.java | 12 +
.../haplotypecaller/HaplotypeCaller.java | 7 +-
.../haplotypecaller/KMerErrorCorrector.java | 253 ++++++++++++++++++
.../DeBruijnAssemblerUnitTest.java | 68 ++++-
.../KMerErrorCorrectorUnitTest.java | 78 ++++++
7 files changed, 594 insertions(+), 46 deletions(-)
create mode 100644 protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrectorUnitTest.java
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
index 33198ce8c..0caebebee 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
@@ -64,6 +64,9 @@ import org.broadinstitute.sting.utils.sam.ReadUtils;
import org.broadinstitute.variant.variantcontext.Allele;
import org.broadinstitute.variant.variantcontext.VariantContext;
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.*;
@@ -88,16 +91,19 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
private static final double SW_GAP_EXTEND = -1.2; //-1.0/.0;
private final boolean debug;
+ private final int onlyBuildKmerGraphOfThisSite = -1; // 35;
+ private final boolean debugGraphTransformations;
private final PrintStream graphWriter;
private final List<DeBruijnAssemblyGraph> graphs = new ArrayList<DeBruijnAssemblyGraph>();
private final int minKmer;
private final int maxHaplotypesToConsider;
private int PRUNE_FACTOR = 2;
-
- public DeBruijnAssembler(final boolean debug, final PrintStream graphWriter, final int minKmer, final int maxHaplotypesToConsider) {
+
+ public DeBruijnAssembler(final boolean debug, final boolean debugGraphTransformations, final PrintStream graphWriter, final int minKmer, final int maxHaplotypesToConsider) {
super();
this.debug = debug;
+ this.debugGraphTransformations = debugGraphTransformations;
this.graphWriter = graphWriter;
this.minKmer = minKmer;
this.maxHaplotypesToConsider = maxHaplotypesToConsider;
@@ -144,13 +150,23 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
// create the graph for each possible kmer
for( int kmer = maxKmer; kmer >= minKmer; kmer -= GRAPH_KMER_STEP ) {
- //if ( debug ) logger.info("Creating de Bruijn graph for " + kmer + " kmer using " + reads.size() + " reads");
- final DeBruijnAssemblyGraph graph = createGraphFromSequences( reads, kmer, refHaplotype, debug);
+ if ( onlyBuildKmerGraphOfThisSite != -1 && kmer != onlyBuildKmerGraphOfThisSite )
+ continue;
+
+ if ( debug ) logger.info("Creating de Bruijn graph for " + kmer + " kmer using " + reads.size() + " reads");
+ DeBruijnAssemblyGraph graph = createGraphFromSequences( reads, kmer, refHaplotype, debug);
if( graph != null ) { // graphs that fail during creation ( for example, because there are cycles in the reference graph ) will show up here as a null graph object
// do a series of steps to clean up the raw assembly graph to make it analysis-ready
- pruneGraph(graph, PRUNE_FACTOR);
+ if ( debugGraphTransformations ) graph.printGraph(new File("unpruned.dot"), PRUNE_FACTOR);
+ graph = graph.errorCorrect();
+ if ( debugGraphTransformations ) graph.printGraph(new File("errorCorrected.dot"), PRUNE_FACTOR);
cleanNonRefPaths(graph);
mergeNodes(graph);
+ if ( debugGraphTransformations ) graph.printGraph(new File("merged.dot"), PRUNE_FACTOR);
+ pruneGraph(graph, PRUNE_FACTOR);
+ if ( debugGraphTransformations ) graph.printGraph(new File("pruned.dot"), PRUNE_FACTOR);
+ mergeNodes(graph);
+ if ( debugGraphTransformations ) graph.printGraph(new File("merged2.dot"), PRUNE_FACTOR);
if( graph.getReferenceSourceVertex() != null ) { // if the graph contains interesting variation from the reference
sanityCheckReferenceGraph(graph, refHaplotype);
graphs.add(graph);
@@ -169,7 +185,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
final DeBruijnVertex outgoingVertex = graph.getEdgeTarget(e);
final DeBruijnVertex incomingVertex = graph.getEdgeSource(e);
if( !outgoingVertex.equals(incomingVertex) && graph.outDegreeOf(incomingVertex) == 1 && graph.inDegreeOf(outgoingVertex) == 1 &&
- graph.inDegreeOf(incomingVertex) <= 1 && graph.outDegreeOf(outgoingVertex) <= 1 && graph.isReferenceNode(incomingVertex) == graph.isReferenceNode(outgoingVertex) ) {
+ graph.inDegreeOf(incomingVertex) <= 1 && graph.outDegreeOf(outgoingVertex) <= 1 && graph.isReferenceNode(incomingVertex) == graph.isReferenceNode(outgoingVertex) ) {
final Set outEdges = graph.outgoingEdgesOf(outgoingVertex);
final Set inEdges = graph.incomingEdgesOf(incomingVertex);
if( inEdges.size() == 1 && outEdges.size() == 1 ) {
@@ -199,6 +215,59 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
}
+ //
+ // X -> ABC -> Y
+ // -> aBC -> Y
+ //
+ // becomes
+ //
+ // X -> A -> BCY
+ // -> a -> BCY
+ //
+// @Requires({"graph != null"})
+// protected static void simplifyMergedGraph(final DeBruijnAssemblyGraph graph) {
+// boolean foundNodesToMerge = true;
+// while( foundNodesToMerge ) {
+// foundNodesToMerge = false;
+//
+// for( final DeBruijnVertex v : graph.vertexSet() ) {
+// if ( isRootOfComplexDiamond(v) ) {
+// foundNodesToMerge = simplifyComplexDiamond(graph, v);
+// if ( foundNodesToMerge )
+// break;
+// }
+// }
+// }
+// }
+//
+// private static boolean simplifyComplexDiamond(final DeBruijnAssemblyGraph graph, final DeBruijnVertex root) {
+// final Set outEdges = graph.outgoingEdgesOf(root);
+// final DeBruijnVertex diamondBottom = graph.getEdge(graph.getEdgeTarget(outEdges.iterator().next());
+// // all of the edges point to the same sink, so it's time to merge
+// final byte[] commonSuffix = commonSuffixOfEdgeTargets(outEdges, targetSink);
+// if ( commonSuffix != null ) {
+// final DeBruijnVertex suffixVertex = new DeBruijnVertex(commonSuffix, graph.getKmerSize());
+// graph.addVertex(suffixVertex);
+// graph.addEdge(suffixVertex, targetSink);
+//
+// for( final DeBruijnEdge edge : outEdges ) {
+// final DeBruijnVertex target = graph.getEdgeTarget(edge);
+// final DeBruijnVertex prefix = target.withoutSuffix(commonSuffix);
+// graph.addEdge(prefix, suffixVertex, new DeBruijnEdge(edge.isRef(), edge.getMultiplicity()));
+// graph.removeVertex(graph.getEdgeTarget(edge));
+// graph.removeAllEdges(root, target);
+// graph.removeAllEdges(target, targetSink);
+// }
+//
+// graph.removeAllEdges(outEdges);
+// graph.removeVertex(targetSink);
+//
+// return true;
+// } else {
+// return false;
+// }
+// }
+
protected static void cleanNonRefPaths( final DeBruijnAssemblyGraph graph ) {
if( graph.getReferenceSourceVertex() == null || graph.getReferenceSinkVertex() == null ) {
return;
@@ -279,7 +348,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
if( refSequence.length >= KMER_LENGTH + KMER_OVERLAP ) {
final int kmersInSequence = refSequence.length - KMER_LENGTH + 1;
for( int iii = 0; iii < kmersInSequence - 1; iii++ ) {
- if( !graph.addKmersToGraph(Arrays.copyOfRange(refSequence, iii, iii + KMER_LENGTH), Arrays.copyOfRange(refSequence, iii + 1, iii + 1 + KMER_LENGTH), true) ) {
+ if( !graph.addKmersToGraph(Arrays.copyOfRange(refSequence, iii, iii + KMER_LENGTH), Arrays.copyOfRange(refSequence, iii + 1, iii + 1 + KMER_LENGTH), true, 1) ) {
if( DEBUG ) {
System.out.println("Cycle detected in reference graph for kmer = " + KMER_LENGTH + " ...skipping");
}
@@ -297,7 +366,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
final byte[] reducedReadCounts = read.getReducedReadCounts(); // will be null if read is not reduced
if( sequence.length > KMER_LENGTH + KMER_OVERLAP ) {
final int kmersInSequence = sequence.length - KMER_LENGTH + 1;
- for( int iii = 0; iii < kmersInSequence - 1; iii++ ) {
+ for( int iii = 0; iii < kmersInSequence - 1; iii++ ) {
// if the qualities of all the bases in the kmers are high enough
boolean badKmer = false;
for( int jjj = iii; jjj < iii + KMER_LENGTH + 1; jjj++) {
@@ -318,42 +387,32 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
final byte[] kmer2 = Arrays.copyOfRange(sequence, iii + 1, iii + 1 + KMER_LENGTH);
for( int kkk=0; kkk < countNumber; kkk++ ) {
- graph.addKmersToGraph(kmer1, kmer2, false);
+ graph.addKmersToGraph(kmer1, kmer2, false, 1);
}
}
}
}
}
+
return graph;
}
protected void printGraphs() {
- final boolean onlyWriteOneGraph = false; // debugging flag -- if true we'll only write a graph for a single kmer size
final int writeFirstGraphWithSizeSmallerThan = 50;
graphWriter.println("digraph assemblyGraphs {");
for( final DeBruijnAssemblyGraph graph : graphs ) {
- if ( onlyWriteOneGraph && graph.getKmerSize() >= writeFirstGraphWithSizeSmallerThan ) {
+ if ( debugGraphTransformations && graph.getKmerSize() >= writeFirstGraphWithSizeSmallerThan ) {
logger.info("Skipping writing of graph with kmersize " + graph.getKmerSize());
continue;
}
- for( final DeBruijnEdge edge : graph.edgeSet() ) {
- if( edge.getMultiplicity() > PRUNE_FACTOR ) {
- graphWriter.println("\t" + graph.getEdgeSource(edge).toString() + " -> " + graph.getEdgeTarget(edge).toString() + " [" + (edge.getMultiplicity() <= PRUNE_FACTOR ? "style=dotted,color=grey" : "label=\"" + edge.getMultiplicity() + "\"") + "];");
- }
- if( edge.isRef() ) {
- graphWriter.println("\t" + graph.getEdgeSource(edge).toString() + " -> " + graph.getEdgeTarget(edge).toString() + " [color=red];");
- }
- if( !edge.isRef() && edge.getMultiplicity() <= PRUNE_FACTOR ) { System.out.println("Graph pruning warning!"); }
- }
- for( final DeBruijnVertex v : graph.vertexSet() ) {
- graphWriter.println("\t" + v.toString() + " [label=\"" + new String(graph.getAdditionalSequence(v)) + "\",shape=box]");
- }
+ graph.printGraph(graphWriter, false, PRUNE_FACTOR);
- if ( onlyWriteOneGraph )
+ if ( debugGraphTransformations )
break;
}
+
graphWriter.println("}");
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
index d28f81b55..a78a5c627 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
@@ -48,8 +48,12 @@ package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
import com.google.java.contract.Ensures;
import org.apache.commons.lang.ArrayUtils;
+import org.apache.log4j.Logger;
import org.jgrapht.graph.DefaultDirectedGraph;
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.Arrays;
@@ -60,6 +64,7 @@ import java.util.Arrays;
*/
public class DeBruijnAssemblyGraph extends DefaultDirectedGraph {
+ private final static Logger logger = Logger.getLogger(DeBruijnAssemblyGraph.class);
private final int kmerSize;
/**
@@ -73,6 +78,24 @@ public class DeBruijnAssemblyGraph extends DefaultDirectedGraph " + getEdgeTarget(edge).toString() + " [" + "label=\""+ edge.getMultiplicity() +"\"" + "];");
+// if( edge.getMultiplicity() > PRUNE_FACTOR ) {
+ graphWriter.println("\t" + getEdgeSource(edge).toString() + " -> " + getEdgeTarget(edge).toString() + " [" + (edge.getMultiplicity() <= pruneFactor ? "style=dotted,color=grey," : "") + "label=\"" + edge.getMultiplicity() + "\"];");
+// }
if( edge.isRef() ) {
- GRAPH_WRITER.println("\t" + getEdgeSource(edge).toString() + " -> " + getEdgeTarget(edge).toString() + " [color=red];");
+ graphWriter.println("\t" + getEdgeSource(edge).toString() + " -> " + getEdgeTarget(edge).toString() + " [color=red];");
}
+ //if( !edge.isRef() && edge.getMultiplicity() <= PRUNE_FACTOR ) { System.out.println("Graph pruning warning!"); }
}
+
for( final DeBruijnVertex v : vertexSet() ) {
- final String label = ( inDegreeOf(v) == 0 ? v.toString() : v.getSuffixString() );
- GRAPH_WRITER.println("\t" + v.toString() + " [label=\"" + label + "\"]");
+ graphWriter.println("\t" + v.toString() + " [label=\"" + new String(getAdditionalSequence(v)) + "\",shape=box]");
}
- GRAPH_WRITER.println("}");
+
+ if ( writeHeader )
+ graphWriter.println("}");
}
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java
index 1390b0ee9..aa8e24576 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java
@@ -68,6 +68,18 @@ public class DeBruijnVertex {
this.kmer = kmer;
}
+ protected DeBruijnVertex( final String sequence, final int kmer ) {
+ this(sequence.getBytes(), kmer);
+ }
+
+ protected DeBruijnVertex( final String sequence ) {
+ this(sequence.getBytes(), sequence.length());
+ }
+
+ public int getKmer() {
+ return kmer;
+ }
+
@Override
public boolean equals( Object v ) {
return v instanceof DeBruijnVertex && Arrays.equals(sequence, ((DeBruijnVertex) v).sequence);
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
index affad6450..d5f283475 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
@@ -192,7 +192,7 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
protected String keepRG = null;
@Argument(fullName="minPruning", shortName="minPruning", doc = "The minimum allowed pruning factor in assembly graph. Paths with <= X supporting kmers are pruned from the graph", required = false)
- protected int MIN_PRUNE_FACTOR = 2;
+ protected int MIN_PRUNE_FACTOR = 1;
@Advanced
@Argument(fullName="gcpHMM", shortName="gcpHMM", doc="Flat gap continuation penalty for use in the Pair HMM", required = false)
@@ -284,6 +284,9 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
@Argument(fullName="debug", shortName="debug", doc="If specified, print out very verbose debug information about each triggering active region", required = false)
protected boolean DEBUG;
+ @Argument(fullName="debugGraphTransformations", shortName="debugGraphTransformations", doc="If specified, we will write DOT formatted graph files out of the assembler", required = false)
+ protected boolean debugGraphTransformations = false;
+
// the UG engines
private UnifiedGenotyperEngine UG_engine = null;
private UnifiedGenotyperEngine UG_engine_simple_genotyper = null;
@@ -386,7 +389,7 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
throw new UserException.CouldNotReadInputFile(getToolkit().getArguments().referenceFile, e);
}
- assemblyEngine = new DeBruijnAssembler( DEBUG, graphWriter, minKmer, maxHaplotypesToConsider );
+ assemblyEngine = new DeBruijnAssembler( DEBUG, debugGraphTransformations, graphWriter, minKmer, maxHaplotypesToConsider );
likelihoodCalculationEngine = new LikelihoodCalculationEngine( (byte)gcpHMM, DEBUG, pairHMM );
genotypingEngine = new GenotypingEngine( DEBUG, annotationEngine, USE_FILTERED_READ_MAP_FOR_ANNOTATIONS );
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java
new file mode 100644
index 000000000..66ea8a078
--- /dev/null
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java
@@ -0,0 +1,253 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
+
+import java.util.*;
+
+/**
+ * Generic utility class that error corrects kmers based on counts
+ *
+ * This class provides a generic facility for remapping kmers (byte[] of constant size)
+ * that occur infrequently to those that occur frequently, based on their simple edit distance
+ * as measured by mismatches.
+ *
+ * The overall workflow of using this class is simple. First, you create the class with
+ * parameters determining how the error correction should proceed. Next, you provide all
+ * of the kmers you see in your data. Once all kmers have been added, you call computeErrorCorrectionMap
+ * to tell this class that all kmers have been added and it's time to determine the error-correcting
+ * mapping from observed kmers to corrected kmers. This correction looks for low-count kmers (those
+ * with at most maxCountToCorrect occurrences) and chooses the best kmer (minimizing mismatches) among
+ * those with at least minCountOfKmerToBeCorrection occurrences to error correct the kmer to. If
+ * there is no such kmer within maxMismatchesToCorrect mismatches then the kmer will be mapped to
+ * null, indicating the kmer should not be used.
+ *
+ * TODO -- for ease of implementation this class uses strings instead of byte[] as arrays cannot
+ * TODO -- be used as hashmap keys (more specifically, arrays use identity-based equals and hashCode). A more efficient
+ * TODO -- version would use the byte[] directly
+ *
+ * User: depristo
+ * Date: 3/8/13
+ * Time: 1:16 PM
+ */
+public class KMerErrorCorrector {
+ /**
+ * A map from each kmer to its number of occurrences, as added via addKmer
+ */
+ Map<String, Integer> countsByKMer = new HashMap<String, Integer>();
+
+ /**
+ * A map from raw kmer -> error corrected kmer
+ */
+ Map<String, String> rawToErrorCorrectedMap = null;
+
+ final int kmerLength;
+ final int maxCountToCorrect;
+ final int maxMismatchesToCorrect;
+ final int minCountOfKmerToBeCorrection;
+
+ /**
+ * Create a new kmer corrector
+ *
+ * @param kmerLength the length of kmers we'll be counting to error correct, must be >= 1
+ * @param maxCountToCorrect kmers with <= maxCountToCorrect occurrences will be error corrected to another kmer where possible, must be >= 0
+ * @param maxMismatchesToCorrect the maximum number of mismatches between a to-be-corrected kmer and its
+ * best match that we attempt to error correct. If no sufficiently similar
+ * kmer exists, it will be remapped to null. Must be >= 1
+ * @param minCountOfKmerToBeCorrection the minimum count of a kmer to be considered a target for correction.
+ * That is, kmers that need correction will only be matched with kmers
+ * with at least minCountOfKmerToBeCorrection occurrences. Must be >= 1
+ */
+ public KMerErrorCorrector(final int kmerLength,
+ final int maxCountToCorrect,
+ final int maxMismatchesToCorrect,
+ final int minCountOfKmerToBeCorrection) {
+ if ( kmerLength < 1 ) throw new IllegalArgumentException("kmerLength must be > 0 but got " + kmerLength);
+ if ( maxCountToCorrect < 0 ) throw new IllegalArgumentException("maxCountToCorrect must be >= 0 but got " + maxCountToCorrect);
+ if ( maxMismatchesToCorrect < 1 ) throw new IllegalArgumentException("maxMismatchesToCorrect must be >= 1 but got " + maxMismatchesToCorrect);
+ if ( minCountOfKmerToBeCorrection < 1 ) throw new IllegalArgumentException("minCountOfKmerToBeCorrection must be >= 1 but got " + minCountOfKmerToBeCorrection);
+
+ this.kmerLength = kmerLength;
+ this.maxCountToCorrect = maxCountToCorrect;
+ this.maxMismatchesToCorrect = maxMismatchesToCorrect;
+ this.minCountOfKmerToBeCorrection = minCountOfKmerToBeCorrection;
+ }
+
+ /**
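+ * For testing purposes: adds each kmer with a count of 1, then computes the error correction map
+ *
+ * @param kmers the kmers to add, each counted once per occurrence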
+ */
+ protected void addKmers(final String ... kmers) {
+ for ( final String kmer : kmers )
+ addKmer(kmer, 1);
+ computeErrorCorrectionMap();
+ }
+
+ /**
+ * Add a kmer that occurred kmerCount times
+ *
+ * @param rawKmer a kmer
+ * @param kmerCount the number of occurrences
+ */
+ public void addKmer(final byte[] rawKmer, final int kmerCount) {
+ addKmer(new String(rawKmer), kmerCount);
+ }
+
+
+ /**
+ * Get the error corrected kmer for rawKmer
+ *
+ * @param rawKmer a kmer that was already added that we want to get an error corrected version for
+ * @return an error corrected kmer to use instead of rawKmer. May equal rawKmer if no error correction
+ * is necessary. May be null, indicating the rawKmer shouldn't be used at all
+ */
+ public byte[] getErrorCorrectedKmer(final byte[] rawKmer) {
+ final String result = getErrorCorrectedKmer(new String(rawKmer));
+ return result == null ? null : result.getBytes();
+ }
+
+ /**
+ * Indicate that no more kmers will be added to the kmer error corrector, so that the
+ * error correction data structure should be computed from the added kmers. Enables calls
+ * to getErrorCorrectedKmer, and disables calls to addKmer.
+ */
+ public void computeErrorCorrectionMap() {
+ if ( countsByKMer == null )
+ throw new IllegalStateException("computeErrorCorrectionMap can only be called once");
+
+ final LinkedList<String> needsCorrection = new LinkedList<String>();
+ final LinkedList<String> goodKmers = new LinkedList<String>();
+
+ rawToErrorCorrectedMap = new HashMap<String, String>();
+ for ( Map.Entry<String, Integer> kmerCounts: countsByKMer.entrySet() ) {
+ if ( kmerCounts.getValue() <= maxCountToCorrect )
+ needsCorrection.add(kmerCounts.getKey());
+ else {
+ // todo -- optimization could make not in map mean ==
+ rawToErrorCorrectedMap.put(kmerCounts.getKey(), kmerCounts.getKey());
+
+ // only allow corrections to kmers with at least this count
+ if ( kmerCounts.getValue() >= minCountOfKmerToBeCorrection )
+ goodKmers.add(kmerCounts.getKey());
+ }
+ }
+
+ for ( final String toCorrect : needsCorrection ) {
+ final String corrected = findClosestKMer(toCorrect, goodKmers);
+ rawToErrorCorrectedMap.put(toCorrect, corrected);
+ }
+
+ // cleanup memory -- we don't need the counts for each kmer any longer
+ countsByKMer = null;
+ }
+
+ protected void addKmer(final String rawKmer, final int kmerCount) {
+ if ( rawKmer.length() != kmerLength ) throw new IllegalArgumentException("bad kmer length " + rawKmer + " expected size " + kmerLength);
+ if ( kmerCount < 0 ) throw new IllegalArgumentException("bad kmerCount " + kmerCount);
+ if ( countsByKMer == null ) throw new IllegalStateException("Cannot add kmers to an already finalized error corrector");
+
+ final Integer countFromMap = countsByKMer.get(rawKmer);
+ final int count = countFromMap == null ? 0 : countFromMap;
+ countsByKMer.put(rawKmer, count + kmerCount);
+ }
+
+ protected String findClosestKMer(final String kmer, final Collection<String> goodKmers) {
+ String bestMatch = null;
+ int minMismatches = Integer.MAX_VALUE;
+
+ for ( final String goodKmer : goodKmers ) {
+ final int mismatches = countMismatches(kmer, goodKmer);
+ if ( mismatches < minMismatches ) {
+ minMismatches = mismatches;
+ bestMatch = goodKmer;
+ }
+ }
+
+ return minMismatches > maxMismatchesToCorrect ? null : bestMatch;
+ }
+
+ protected int countMismatches(final String one, final String two) {
+ int mismatches = 0;
+ for ( int i = 0; i < one.length(); i++ )
+ mismatches += one.charAt(i) == two.charAt(i) ? 0 : 1;
+ return mismatches;
+ }
+
+ protected String getErrorCorrectedKmer(final String rawKmer) {
+ if ( rawToErrorCorrectedMap == null ) throw new IllegalStateException("Cannot get error corrected kmers until after computeErrorCorrectionMap has been called");
+ if ( rawKmer.length() != kmerLength ) throw new IllegalArgumentException("bad kmer length " + rawKmer + " expected size " + kmerLength);
+ return rawToErrorCorrectedMap.get(rawKmer);
+ }
+
+ @Override
+ public String toString() {
+ final StringBuilder b = new StringBuilder("KMerErrorCorrector{");
+ for ( Map.Entry<String, String> toCorrect : rawToErrorCorrectedMap.entrySet() ) {
+ final boolean correcting = ! toCorrect.getKey().equals(toCorrect.getValue());
+ if ( correcting )
+ b.append(String.format("%n\t%s / %d -> %s / %d [correcting? %b]",
+ toCorrect.getKey(), getCounts(toCorrect.getKey()),
+ toCorrect.getValue(), getCounts(toCorrect.getValue()),
+ correcting));
+ }
+ b.append("\n}");
+ return b.toString();
+ }
+
+ /**
+ * Get a simple count estimate for printing for kmer
+ * @param kmer the kmer
+ * @return an integer count for kmer
+ */
+ private int getCounts(final String kmer) {
+ if ( kmer == null ) return 0;
+ final Integer count = countsByKMer == null ? Integer.valueOf(-1) : countsByKMer.get(kmer); // boxed -1 avoids unboxing a null lookup result
+ if ( count == null )
+ throw new IllegalArgumentException("kmer not found in counts -- bug " + kmer);
+ return count;
+ }
+}
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblerUnitTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblerUnitTest.java
index f4a6d5494..2096b487e 100644
--- a/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblerUnitTest.java
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblerUnitTest.java
@@ -67,6 +67,7 @@ import org.testng.annotations.Test;
import java.util.*;
public class DeBruijnAssemblerUnitTest extends BaseTest {
+ private final static boolean DEBUG = true;
private class MergeNodesWithNoVariationTestProvider extends TestDataProvider {
@@ -97,7 +98,7 @@ public class DeBruijnAssemblerUnitTest extends BaseTest {
final byte[] kmer2 = new byte[KMER_LENGTH];
System.arraycopy(sequence, i+1, kmer2, 0, KMER_LENGTH);
- graph.addKmersToGraph(kmer1, kmer2, false);
+ graph.addKmersToGraph(kmer1, kmer2, false, 1);
}
DeBruijnAssembler.mergeNodes(graph);
return graph;
@@ -118,13 +119,70 @@ public class DeBruijnAssemblerUnitTest extends BaseTest {
return MergeNodesWithNoVariationTestProvider.getTests(MergeNodesWithNoVariationTestProvider.class);
}
- @Test(dataProvider = "MergeNodesWithNoVariationTestProvider", enabled = true)
+ @Test(dataProvider = "MergeNodesWithNoVariationTestProvider", enabled = !DEBUG)
public void testMergeNodesWithNoVariation(MergeNodesWithNoVariationTestProvider cfg) {
logger.warn(String.format("Test: %s", cfg.toString()));
Assert.assertTrue(graphEquals(cfg.calcGraph(), cfg.expectedGraph()));
}
- @Test(enabled = true)
+// @DataProvider(name = "SimpleMergeOperationsData")
+// public Object[][] makeSimpleMergeOperationsData() {
+// List tests = new ArrayList();
+//
+// {
+// DeBruijnAssemblyGraph graph = new DeBruijnAssemblyGraph();
+// DeBruijnVertex v1 = new DeBruijnVertex("AT");
+// DeBruijnVertex v2 = new DeBruijnVertex("TC");
+// DeBruijnVertex v3 = new DeBruijnVertex("CT");
+// DeBruijnVertex v4 = new DeBruijnVertex("TG");
+// DeBruijnVertex v5 = new DeBruijnVertex("AG");
+// DeBruijnVertex v6 = new DeBruijnVertex("GG");
+// DeBruijnVertex v7 = new DeBruijnVertex("GA");
+// DeBruijnVertex v8 = new DeBruijnVertex("AA");
+//
+// graph.addVertices(v1, v2, v3, v4, v5, v6, v7, v8);
+// graph.addEdge(v1, v2, new DeBruijnEdge(false, 2));
+// graph.addEdge(v2, v3, new DeBruijnEdge(false, 3));
+// graph.addEdge(v2, v4, new DeBruijnEdge(false, 5));
+// graph.addEdge(v3, v5, new DeBruijnEdge(false, 3));
+// graph.addEdge(v4, v6, new DeBruijnEdge(false, 3));
+// graph.addEdge(v5, v7, new DeBruijnEdge(false, 2));
+// graph.addEdge(v6, v7, new DeBruijnEdge(false, 6));
+// graph.addEdge(v7, v8, new DeBruijnEdge(false, 2));
+//
+// graph.printGraph(new File("unittest.dot"), 1);
+//
+// DeBruijnAssemblyGraph expected = new DeBruijnAssemblyGraph();
+// DeBruijnVertex e1 = new DeBruijnVertex("ATC");
+// DeBruijnVertex e2 = new DeBruijnVertex("T");
+// DeBruijnVertex e3 = new DeBruijnVertex("G");
+// DeBruijnVertex e4 = new DeBruijnVertex("GAA");
+//
+// expected.addVertices(e1,e2,e3,e4);
+// expected.addEdge(e1, e2, new DeBruijnEdge(false, 3));
+// expected.addEdge(e1, e3, new DeBruijnEdge(false, 5));
+// expected.addEdge(e2, e4, new DeBruijnEdge(false, 2));
+// expected.addEdge(e3, e4, new DeBruijnEdge(false, 6));
+//
+// expected.printGraph(new File("expected.dot"), 1);
+//
+// tests.add(new Object[]{graph.clone(), expected});
+// }
+//
+// return tests.toArray(new Object[][]{});
+// }
+//
+// @Test(dataProvider = "SimpleMergeOperationsData", enabled = true)
+// public void testSimpleMergeOperations(final DeBruijnAssemblyGraph unmergedGraph, final DeBruijnAssemblyGraph expectedGraph) throws Exception {
+// final DeBruijnAssemblyGraph mergedGraph = (DeBruijnAssemblyGraph)unmergedGraph.clone();
+// DeBruijnAssembler.mergeNodes(mergedGraph);
+// mergedGraph.printGraph(new File("merged.dot"), 1);
+// DeBruijnAssembler.simplifyMergedGraph(mergedGraph);
+// mergedGraph.printGraph(new File("reduced.dot"), 1);
+// Assert.assertTrue(graphEquals(mergedGraph, expectedGraph));
+// }
+
+ @Test(enabled = !DEBUG)
public void testPruneGraph() {
DeBruijnAssemblyGraph graph = new DeBruijnAssemblyGraph();
DeBruijnAssemblyGraph expectedGraph = new DeBruijnAssemblyGraph();
@@ -210,7 +268,7 @@ public class DeBruijnAssemblerUnitTest extends BaseTest {
return true;
}
- @Test(enabled = true)
+ @Test(enabled = !DEBUG)
public void testReferenceCycleGraph() {
String refCycle = "ATCGAGGAGAGCGCCCCGAGATATATATATATATATTTGCGAGCGCGAGCGTTTTAAAAATTTTAGACGGAGAGATATATATATATGGGAGAGGGGATATATATATATCCCCCC";
String noCycle = "ATCGAGGAGAGCGCCCCGAGATATTATTTGCGAGCGCGAGCGTTTTAAAAATTTTAGACGGAGAGATGGGAGAGGGGATATATAATATCCCCCC";
@@ -221,7 +279,7 @@ public class DeBruijnAssemblerUnitTest extends BaseTest {
Assert.assertTrue(g2 != null, "Reference non-cycle graph should not return null during creation.");
}
- @Test(enabled = true)
+ @Test(enabled = !DEBUG)
public void testLeftAlignCigarSequentially() {
String preRefString = "GATCGATCGATC";
String postRefString = "TTT";
diff --git a/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrectorUnitTest.java b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrectorUnitTest.java
new file mode 100644
index 000000000..f88d7ee7f
--- /dev/null
+++ b/protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrectorUnitTest.java
@@ -0,0 +1,78 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
+
+import org.broadinstitute.sting.BaseTest;
+import org.testng.Assert;
+import org.testng.annotations.Test;
+
+public class KMerErrorCorrectorUnitTest extends BaseTest {
+ @Test
+ public void testMyData() {
+ final KMerErrorCorrector corrector = new KMerErrorCorrector(3, 1, 2, 2);
+
+ corrector.addKmers(
+ "ATG", "ATG", "ATG", "ATG",
+ "ACC", "ACC", "ACC",
+ "AAA", "AAA",
+ "CTG", // -> ATG
+ "NNA", // -> AAA
+ "CCC", // => ACC
+ "NNN", // => null
+ "NNC" // => ACC [because of min count won't go to NNA]
+ );
+
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("ATG"), "ATG");
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("ACC"), "ACC");
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("AAA"), "AAA");
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("CTG"), "ATG");
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("NNA"), "AAA");
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("CCC"), "ACC");
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("NNN"), null);
+ Assert.assertEquals(corrector.getErrorCorrectedKmer("NNC"), "ACC");
+ }
+}
From 98c4cd060d098323655e9b0899a8253ef1be4b25 Mon Sep 17 00:00:00 2001
From: Mark DePristo
Date: Thu, 14 Mar 2013 10:03:04 -0400
Subject: [PATCH 058/211] HaplotypeCaller now uses SeqGraph instead of kmer
graph to build haplotypes.
-- DeBruijnAssembler functions are no longer static. Making them static isn't the right way to unit test your code
-- Add a HaplotypeCaller command line option to use low-quality bases in the assembly
-- Refactored DeBruijnGraph and associated libraries into base class
-- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents. These DeBruijn versions now inherit from these base classes. Added some reasonable unit tests for the base and Debruijn edges and vertex classes.
-- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct
-- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split
-- Moved generic methods in DeBruijnAssembler into BaseGraph
-- Created a simple SeqGraph that contains SeqVertex objects
-- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs
-- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -> Y -> a, a -> Z
-- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges
-- Fix all unit tests so they work with the new SeqGraph system. All tests passed without modification.
-- Debugging makes it easier to tell which kmer graph contributes to a haplotype
-- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector
-- Remove unnecessary printing of cleaning info in BaseGraph
-- Turn off kmer graph creation in DeBruijnAssembler.java
-- Only print SeqGraphs when debugGraphTransformations is set to true
-- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest. Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality.
-- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs
-- DeBruijnVertex no longer takes a kmer argument -- it's implicit that the kmer length is sequence.length now
---
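Note on the diamond remodeling bullet above: the following is a minimal Java
sketch of the idea only, not the SeqGraph code in this patch. It works on plain
strings rather than SeqVertex objects, and the names DiamondSketch,
commonSuffix and uniquePrefix are hypothetical. It assumes the middle vertices
of the diamond (Xa and Ya) end in a shared suffix, which is pulled out into its
own vertex so that each branch keeps only its unique prefix. It does not handle
the case where one branch consists entirely of the shared suffix.

```java
import java.util.Arrays;
import java.util.List;

class DiamondSketch {
    /** Longest suffix shared by every sequence in seqs (empty string if none). */
    public static String commonSuffix(final List<String> seqs) {
        String suffix = seqs.get(0);
        for (final String s : seqs) {
            // count how many trailing characters of suffix and s agree
            int n = 0;
            while (n < suffix.length() && n < s.length()
                    && suffix.charAt(suffix.length() - 1 - n) == s.charAt(s.length() - 1 - n)) {
                n++;
            }
            suffix = suffix.substring(suffix.length() - n);
        }
        return suffix;
    }

    /** The part of a middle vertex that remains once the shared suffix is removed. */
    public static String uniquePrefix(final String seq, final String suffix) {
        return seq.substring(0, seq.length() - suffix.length());
    }

    public static void main(final String[] args) {
        // middle vertices of the diamond A -> Xa, A -> Ya, Xa -> Z, Ya -> Z
        final List<String> middles = Arrays.asList("Xa", "Ya");
        final String suffix = commonSuffix(middles);
        for (final String m : middles) {
            System.out.println("A -> " + uniquePrefix(m, suffix) + " -> " + suffix);
        }
        System.out.println(suffix + " -> Z");
    }
}
```

Run on the example from the commit message, the sketch prints the remodeled
edges A -> X -> a, A -> Y -> a and a -> Z.
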
.../{DeBruijnEdge.java => BaseEdge.java} | 70 ++--
...ruijnAssemblyGraph.java => BaseGraph.java} | 318 ++++++++++--------
.../walkers/haplotypecaller/BaseVertex.java | 148 ++++++++
.../haplotypecaller/DeBruijnAssembler.java | 249 ++++----------
.../haplotypecaller/DeBruijnGraph.java | 179 ++++++++++
.../haplotypecaller/DeBruijnVertex.java | 63 ++--
.../haplotypecaller/HaplotypeCaller.java | 12 +-
.../walkers/haplotypecaller/KBestPaths.java | 96 +++---
.../haplotypecaller/KMerErrorCorrector.java | 28 +-
.../walkers/haplotypecaller/SeqGraph.java | 280 +++++++++++++++
.../walkers/haplotypecaller/SeqVertex.java | 153 +++++++++
.../haplotypecaller/BaseEdgeUnitTest.java | 105 ++++++
.../haplotypecaller/BaseGraphUnitTest.java | 192 +++++++++++
.../haplotypecaller/BaseVertexUnitTest.java | 91 +++++
.../DeBruijnAssemblerUnitTest.java | 205 +----------
.../DeBruijnAssemblyGraphUnitTest.java | 2 +-
.../DeBruijnVertexUnitTest.java | 69 ++++
.../haplotypecaller/KBestPathsUnitTest.java | 183 ++++++----
.../KMerErrorCorrectorUnitTest.java | 25 +-
.../haplotypecaller/SeqGraphUnitTest.java | 106 ++++++
.../haplotypecaller/SeqVertexUnitTest.java | 109 ++++++
.../org/broadinstitute/sting/utils/Utils.java | 13 +
22 files changed, 1964 insertions(+), 732 deletions(-)
rename protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/{DeBruijnEdge.java => BaseEdge.java} (83%)
rename protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/{DeBruijnAssemblyGraph.java => BaseGraph.java} (70%)
create mode 100644 protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseVertex.java
create mode 100644 protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnGraph.java
create mode 100644 protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/SeqGraph.java
create mode 100644 protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/SeqVertex.java
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseEdgeUnitTest.java
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseGraphUnitTest.java
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseVertexUnitTest.java
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertexUnitTest.java
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/SeqGraphUnitTest.java
create mode 100644 protected/java/test/org/broadinstitute/sting/gatk/walkers/haplotypecaller/SeqVertexUnitTest.java
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnEdge.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseEdge.java
similarity index 83%
rename from protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnEdge.java
rename to protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseEdge.java
index 28c735b5c..053f0e1a1 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnEdge.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseEdge.java
@@ -46,68 +46,94 @@
package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
-import org.jgrapht.graph.DefaultDirectedGraph;
-
import java.io.Serializable;
import java.util.Comparator;
/**
- * Created by IntelliJ IDEA.
+ * simple edge class for connecting nodes in the graph
+ *
+ * Works equally well for all graph types (kmer or sequence)
+ *
* User: ebanks
* Date: Mar 23, 2011
*/
-
-// simple edge class for connecting nodes in the graph
-public class DeBruijnEdge {
-
+public class BaseEdge {
private int multiplicity;
private boolean isRef;
- public DeBruijnEdge() {
- multiplicity = 1;
- isRef = false;
- }
+ /**
+ * Create a new BaseEdge with weight multiplicity and, if isRef == true, indicates a path through the reference
+ *
+ * @param isRef indicates whether this edge is a path through the reference
+ * @param multiplicity the number of observations of this edge
+ */
+ public BaseEdge(final boolean isRef, final int multiplicity) {
+ if ( multiplicity < 0 ) throw new IllegalArgumentException("multiplicity must be >= 0");
- public DeBruijnEdge( final boolean isRef ) {
- multiplicity = 1;
- this.isRef = isRef;
- }
-
- public DeBruijnEdge( final boolean isRef, final int multiplicity ) {
this.multiplicity = multiplicity;
this.isRef = isRef;
}
+ /**
+ * Copy constructor
+ *
+ * @param toCopy the edge to copy
+ */
+ public BaseEdge(final BaseEdge toCopy) {
+ this(toCopy.isRef(), toCopy.getMultiplicity());
+ }
+
+ /**
+ * Get the number of observations of paths connecting two vertices
+ * @return a non-negative integer (>= 0)
+ */
public int getMultiplicity() {
return multiplicity;
}
+ /**
+ * Set the multiplicity of this edge to value
+ * @param value an integer >= 0
+ */
public void setMultiplicity( final int value ) {
+ if ( value < 0 ) throw new IllegalArgumentException("multiplicity must be >= 0");
multiplicity = value;
}
+ /**
+ * Does this edge indicate a path through the reference graph?
+ * @return true if so
+ */
public boolean isRef() {
return isRef;
}
+ /**
+ * Indicate that this edge follows the reference sequence, or not
+ * @param isRef true if this is a reference edge
+ */
public void setIsRef( final boolean isRef ) {
this.isRef = isRef;
}
// For use when comparing edges pulled from the same graph
- public boolean equals( final DeBruijnAssemblyGraph graph, final DeBruijnEdge edge ) {
+ public boolean equals( final BaseGraph graph, final BaseEdge edge ) {
return (graph.getEdgeSource(this).equals(graph.getEdgeSource(edge))) && (graph.getEdgeTarget(this).equals(graph.getEdgeTarget(edge)));
}
// For use when comparing edges across graphs!
- public boolean equals( final DeBruijnAssemblyGraph graph, final DeBruijnEdge edge, final DeBruijnAssemblyGraph graph2 ) {
+ public boolean equals( final BaseGraph graph, final BaseEdge edge, final BaseGraph graph2 ) {
return (graph.getEdgeSource(this).equals(graph2.getEdgeSource(edge))) && (graph.getEdgeTarget(this).equals(graph2.getEdgeTarget(edge)));
}
- public static class EdgeWeightComparator implements Comparator<DeBruijnEdge>, Serializable {
+ /**
+ * Sorts a collection of BaseEdges in decreasing order of weight, so that the most
+ * heavily weighted is at the start of the list
+ */
+ public static class EdgeWeightComparator implements Comparator<BaseEdge>, Serializable {
@Override
- public int compare(final DeBruijnEdge edge1, final DeBruijnEdge edge2) {
- return edge1.multiplicity - edge2.multiplicity;
+ public int compare(final BaseEdge edge1, final BaseEdge edge2) {
+ return edge2.multiplicity - edge1.multiplicity;
}
}
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseGraph.java
similarity index 70%
rename from protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
rename to protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseGraph.java
index a78a5c627..6aa687312 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssemblyGraph.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseGraph.java
@@ -49,13 +49,15 @@ package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
import com.google.java.contract.Ensures;
import org.apache.commons.lang.ArrayUtils;
import org.apache.log4j.Logger;
+import org.jgrapht.EdgeFactory;
import org.jgrapht.graph.DefaultDirectedGraph;
+import org.jgrapht.traverse.DepthFirstIterator;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;
-import java.util.Arrays;
+import java.util.*;
/**
* Created with IntelliJ IDEA.
@@ -63,44 +65,37 @@ import java.util.Arrays;
* Date: 2/6/13
*/
-public class DeBruijnAssemblyGraph extends DefaultDirectedGraph<DeBruijnVertex, DeBruijnEdge> {
- private final static Logger logger = Logger.getLogger(DeBruijnAssemblyGraph.class);
+public class BaseGraph<T extends BaseVertex> extends DefaultDirectedGraph<T, BaseEdge> {
+ protected final static Logger logger = Logger.getLogger(BaseGraph.class);
private final int kmerSize;
/**
- * Construct a DeBruijnAssemblyGraph with kmerSize
- * @param kmerSize
+ * Construct an empty BaseGraph
*/
- public DeBruijnAssemblyGraph(final int kmerSize) {
- super(DeBruijnEdge.class);
-
- if ( kmerSize < 1 ) throw new IllegalArgumentException("kmerSize must be >= 1 but got " + kmerSize);
- this.kmerSize = kmerSize;
- }
-
- public static DeBruijnAssemblyGraph parse(final int kmerSize, final int multiplicity, final String ... reads) {
- final DeBruijnAssemblyGraph graph = new DeBruijnAssemblyGraph(kmerSize);
-
- for ( final String read : reads ) {
- final int kmersInSequence = read.length() - kmerSize + 1;
- for (int i = 0; i < kmersInSequence - 1; i++) {
- // get the kmers
- final byte[] kmer1 = new byte[kmerSize];
- System.arraycopy(read.getBytes(), i, kmer1, 0, kmerSize);
- final byte[] kmer2 = new byte[kmerSize];
- System.arraycopy(read.getBytes(), i+1, kmer2, 0, kmerSize);
- graph.addKmersToGraph(kmer1, kmer2, false, multiplicity);
- }
- }
-
- return graph;
+ public BaseGraph() {
+ this(11);
}
/**
- * Test construct that makes DeBruijnAssemblyGraph assuming a kmerSize of 11
+ * Edge factory that creates non-reference multiplicity 1 edges
+ * @param <T> the type of our vertices
*/
- protected DeBruijnAssemblyGraph() {
- this(11);
+ private static class MyEdgeFactory<T extends BaseVertex> implements EdgeFactory<T, BaseEdge> {
+ @Override
+ public BaseEdge createEdge(T sourceVertex, T targetVertex) {
+ return new BaseEdge(false, 1);
+ }
+ }
+
+ /**
+ * Construct a BaseGraph with kmerSize
+ * @param kmerSize the kmer size of this graph, must be >= 1
+ */
+ public BaseGraph(final int kmerSize) {
+ super(new MyEdgeFactory<T>());
+
+ if ( kmerSize < 1 ) throw new IllegalArgumentException("kmerSize must be >= 1 but got " + kmerSize);
+ this.kmerSize = kmerSize;
}
/**
@@ -115,9 +110,9 @@ public class DeBruijnAssemblyGraph extends DefaultDirectedGraph outgoingVerticesOf(final T v) {
+ final Set<T> s = new HashSet<T>();
+ for ( final BaseEdge e : outgoingEdgesOf(v) ) {
+ s.add(getEdgeTarget(e));
+ }
+ return s;
+ }
+
+ /**
+ * Get the set of vertices connected to v by incoming edges
+ * @param v a non-null vertex
+ * @return a set of vertices {X} connected X -> v
+ */
+ public Set<T> incomingVerticesOf(final T v) {
+ final Set<T> s = new HashSet<T>();
+ for ( final BaseEdge e : incomingEdgesOf(v) ) {
+ s.add(getEdgeSource(e));
+ }
+ return s;
+ }
+
/**
* Print out the graph in the dot language for visualization
* @param destination File to write to
@@ -403,11 +353,12 @@ public class DeBruijnAssemblyGraph extends DefaultDirectedGraph PRUNE_FACTOR ) {
graphWriter.println("\t" + getEdgeSource(edge).toString() + " -> " + getEdgeTarget(edge).toString() + " [" + (edge.getMultiplicity() <= pruneFactor ? "style=dotted,color=grey," : "") + "label=\"" + edge.getMultiplicity() + "\"];");
// }
@@ -417,11 +368,114 @@ public class DeBruijnAssemblyGraph extends DefaultDirectedGraph edgesToCheck = new HashSet();
+ edgesToCheck.addAll(incomingEdgesOf(getReferenceSourceVertex()));
+ while( !edgesToCheck.isEmpty() ) {
+ final BaseEdge e = edgesToCheck.iterator().next();
+ if( !e.isRef() ) {
+ edgesToCheck.addAll( incomingEdgesOf(getEdgeSource(e)) );
+ removeEdge(e);
+ }
+ edgesToCheck.remove(e);
+ }
+
+ edgesToCheck.addAll(outgoingEdgesOf(getReferenceSinkVertex()));
+ while( !edgesToCheck.isEmpty() ) {
+ final BaseEdge e = edgesToCheck.iterator().next();
+ if( !e.isRef() ) {
+ edgesToCheck.addAll( outgoingEdgesOf(getEdgeTarget(e)) );
+ removeEdge(e);
+ }
+ edgesToCheck.remove(e);
+ }
+
+ // Run through the graph and clean up singular orphaned nodes
+ final List<T> verticesToRemove = new LinkedList<T>();
+ for( final T v : vertexSet() ) {
+ if( inDegreeOf(v) == 0 && outDegreeOf(v) == 0 ) {
+ verticesToRemove.add(v);
+ }
+ }
+ removeAllVertices(verticesToRemove);
+ }
+
+ protected void pruneGraph( final int pruneFactor ) {
+ final List<BaseEdge> edgesToRemove = new ArrayList<BaseEdge>();
+ for( final BaseEdge e : edgeSet() ) {
+ if( e.getMultiplicity() <= pruneFactor && !e.isRef() ) { // remove non-reference edges with weight less than or equal to the pruning factor
+ edgesToRemove.add(e);
+ }
+ }
+ removeAllEdges(edgesToRemove);
+
+ // Run through the graph and clean up singular orphaned nodes
+ final List<T> verticesToRemove = new ArrayList<T>();
+ for( final T v : vertexSet() ) {
+ if( inDegreeOf(v) == 0 && outDegreeOf(v) == 0 ) {
+ verticesToRemove.add(v);
+ }
+ }
+
+ removeAllVertices(verticesToRemove);
+ }
+
+ public void removeVerticesNotConnectedToRef() {
+ final HashSet<T> toRemove = new HashSet<T>(vertexSet());
+ final HashSet<T> visited = new HashSet<T>();
+
+ final LinkedList<T> toVisit = new LinkedList<T>();
+ final T refV = getReferenceSourceVertex();
+ if ( refV != null ) {
+ toVisit.add(refV);
+ while ( ! toVisit.isEmpty() ) {
+ final T v = toVisit.pop();
+ if ( ! visited.contains(v) ) {
+ toRemove.remove(v);
+ visited.add(v);
+ for ( final T prev : incomingVerticesOf(v) ) toVisit.add(prev);
+ for ( final T next : outgoingVerticesOf(v) ) toVisit.add(next);
+ }
+ }
+
+// for ( final T remove : toRemove )
+// logger.info("Cleaning up nodes not attached to any reference node: " + remove.toString());
+
+ removeAllVertices(toRemove);
+ }
+ }
+
+ public static boolean graphEquals(final BaseGraph g1, BaseGraph g2) {
+ if( !(g1.vertexSet().containsAll(g2.vertexSet()) && g2.vertexSet().containsAll(g1.vertexSet())) ) {
+ return false;
+ }
+ for( BaseEdge e1 : g1.edgeSet() ) {
+ boolean found = false;
+ for( BaseEdge e2 : g2.edgeSet() ) {
+ if( e1.equals(g1, e2, g2) ) { found = true; break; }
+ }
+ if( !found ) { return false; }
+ }
+ for( BaseEdge e2 : g2.edgeSet() ) {
+ boolean found = false;
+ for( BaseEdge e1 : g1.edgeSet() ) {
+ if( e2.equals(g2, e1, g1) ) { found = true; break; }
+ }
+ if( !found ) { return false; }
+ }
+ return true;
+ }
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseVertex.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseVertex.java
new file mode 100644
index 000000000..fad7a51d1
--- /dev/null
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/BaseVertex.java
@@ -0,0 +1,148 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
+
+import com.google.java.contract.Ensures;
+
+import java.util.Arrays;
+
+/**
+ * A graph vertex that holds some sequence information
+ *
+ * @author: depristo
+ * @since 03/2013
+ */
+public class BaseVertex {
+ final byte[] sequence;
+
+ /**
+ * Create a new sequence vertex with sequence
+ * @param sequence a non-null, non-empty sequence of bases contained in this vertex
+ */
+ public BaseVertex(final byte[] sequence) {
+ if ( sequence == null ) throw new IllegalArgumentException("Sequence cannot be null");
+ if ( sequence.length == 0 ) throw new IllegalArgumentException("Sequence cannot be empty");
+
+ // TODO -- should we really be cloning here?
+ this.sequence = sequence.clone();
+ }
+
+ /**
+ * Get the length of this sequence
+ * @return a positive integer >= 1
+ */
+ public int length() {
+ return sequence.length;
+ }
+
+ /**
+ * For testing purposes only -- low performance
+ * @param sequence
+ */
+ protected BaseVertex(final String sequence) {
+ this(sequence.getBytes());
+ }
+
+ @Override
+ public boolean equals(Object o) {
+ if (this == o) return true;
+ if (o == null || getClass() != o.getClass()) return false;
+
+ BaseVertex that = (BaseVertex) o;
+
+ if (!Arrays.equals(sequence, that.sequence)) return false;
+
+ return true;
+ }
+
+ @Override
+ public int hashCode() { // necessary to override here so that graph.containsVertex() works the same way as vertex.equals() as one might expect
+ return Arrays.hashCode(sequence);
+ }
+
+ @Override
+ public String toString() {
+ return getSequenceString();
+ }
+
+ /**
+ * Get the sequence of bases contained in this vertex
+ *
+ * Do not modify these bytes in any way!
+ *
+ * @return a non-null pointer to the bases contained in this vertex
+ */
+ @Ensures("result != null")
+ public byte[] getSequence() {
+ // TODO -- why is this cloning? It's likely extremely expensive
+ return sequence.clone();
+ }
+
+ /**
+ * Get a string representation of the bases in this vertex
+ * @return a non-null String
+ */
+ @Ensures("result != null")
+ public String getSequenceString() {
+ return new String(sequence);
+ }
+
+ /**
+ * Get the sequence unique to this vertex
+ *
+ * This function may not return the entire sequence stored in the vertex, as kmer graphs
+ * really only provide 1 base of additional sequence (the last base of the kmer).
+ *
+ * The base implementation simply returns the sequence.
+ *
+ * @param source is this vertex a source vertex (i.e., no in nodes) in the graph
+ * @return a byte[] of the sequence added by this vertex to the overall sequence
+ */
+ public byte[] getAdditionalSequence(final boolean source) {
+ return getSequence();
+ }
+}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
index 0caebebee..9d84d611f 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnAssembler.java
@@ -65,8 +65,6 @@ import org.broadinstitute.variant.variantcontext.Allele;
import org.broadinstitute.variant.variantcontext.VariantContext;
import java.io.File;
-import java.io.FileNotFoundException;
-import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.*;
@@ -81,7 +79,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
private static final int KMER_OVERLAP = 5; // the additional size of a valid chunk of sequence, used to string together k-mers
private static final int NUM_BEST_PATHS_PER_KMER_GRAPH = 11;
- private static final byte MIN_QUALITY = (byte) 16;
+ public static final byte DEFAULT_MIN_BASE_QUALITY_TO_USE = (byte) 16;
private static final int GRAPH_KMER_STEP = 6;
// Smith-Waterman parameters originally copied from IndelRealigner, only used during GGA mode
@@ -91,22 +89,34 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
private static final double SW_GAP_EXTEND = -1.2; //-1.0/.0;
private final boolean debug;
- private final int onlyBuildKmerGraphOfThisSite = -1; // 35;
private final boolean debugGraphTransformations;
private final PrintStream graphWriter;
- private final List<DeBruijnAssemblyGraph> graphs = new ArrayList<DeBruijnAssemblyGraph>();
private final int minKmer;
private final int maxHaplotypesToConsider;
+ private final byte minBaseQualityToUseInAssembly;
+
+ private final int onlyBuildKmersOfThisSizeWhenDebuggingGraphAlgorithms;
private int PRUNE_FACTOR = 2;
- public DeBruijnAssembler(final boolean debug, final boolean debugGraphTransformations, final PrintStream graphWriter, final int minKmer, final int maxHaplotypesToConsider) {
+ protected DeBruijnAssembler() {
+ this(false, -1, null, 11, 1000, DEFAULT_MIN_BASE_QUALITY_TO_USE);
+ }
+
+ public DeBruijnAssembler(final boolean debug,
+ final int debugGraphTransformations,
+ final PrintStream graphWriter,
+ final int minKmer,
+ final int maxHaplotypesToConsider,
+ final byte minBaseQualityToUseInAssembly) {
super();
this.debug = debug;
- this.debugGraphTransformations = debugGraphTransformations;
+ this.debugGraphTransformations = debugGraphTransformations > 0;
+ this.onlyBuildKmersOfThisSizeWhenDebuggingGraphAlgorithms = debugGraphTransformations;
this.graphWriter = graphWriter;
this.minKmer = minKmer;
this.maxHaplotypesToConsider = maxHaplotypesToConsider;
+ this.minBaseQualityToUseInAssembly = minBaseQualityToUseInAssembly;
}
/**
@@ -130,199 +140,73 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
this.PRUNE_FACTOR = PRUNE_FACTOR;
// create the graphs
- createDeBruijnGraphs( activeRegion.getReads(), refHaplotype );
+ final List<SeqGraph> graphs = createDeBruijnGraphs( activeRegion.getReads(), refHaplotype );
// print the graphs if the appropriate debug option has been turned on
if( graphWriter != null ) {
- printGraphs();
+ printGraphs(graphs);
}
// find the best paths in the graphs and return them as haplotypes
- return findBestPaths( refHaplotype, fullReferenceWithPadding, refLoc, activeAllelesToGenotype, activeRegion.getExtendedLoc() );
+ return findBestPaths( graphs, refHaplotype, fullReferenceWithPadding, refLoc, activeAllelesToGenotype, activeRegion.getExtendedLoc() );
}
@Requires({"reads != null", "refHaplotype != null"})
- protected void createDeBruijnGraphs( final List<GATKSAMRecord> reads, final Haplotype refHaplotype ) {
- graphs.clear();
+ protected List<SeqGraph> createDeBruijnGraphs( final List<GATKSAMRecord> reads, final Haplotype refHaplotype ) {
+ final List<SeqGraph> graphs = new LinkedList<SeqGraph>();
final int maxKmer = ReadUtils.getMaxReadLength(reads) - KMER_OVERLAP - 1;
- if( maxKmer < minKmer) { return; } // Reads are too small for assembly so don't try to create any assembly graphs
-
+ if( maxKmer < minKmer) {
+ // Reads are too small for assembly so don't try to create any assembly graphs
+ return Collections.emptyList();
+ }
// create the graph for each possible kmer
for( int kmer = maxKmer; kmer >= minKmer; kmer -= GRAPH_KMER_STEP ) {
- if ( onlyBuildKmerGraphOfThisSite != -1 && kmer != onlyBuildKmerGraphOfThisSite )
+ if ( debugGraphTransformations && kmer > onlyBuildKmersOfThisSizeWhenDebuggingGraphAlgorithms)
continue;
if ( debug ) logger.info("Creating de Bruijn graph for " + kmer + " kmer using " + reads.size() + " reads");
- DeBruijnAssemblyGraph graph = createGraphFromSequences( reads, kmer, refHaplotype, debug);
+ DeBruijnGraph graph = createGraphFromSequences( reads, kmer, refHaplotype, debug);
if( graph != null ) { // graphs that fail during creation ( for example, because there are cycles in the reference graph ) will show up here as a null graph object
// do a series of steps to clean up the raw assembly graph to make it analysis-ready
if ( debugGraphTransformations ) graph.printGraph(new File("unpruned.dot"), PRUNE_FACTOR);
graph = graph.errorCorrect();
if ( debugGraphTransformations ) graph.printGraph(new File("errorCorrected.dot"), PRUNE_FACTOR);
- cleanNonRefPaths(graph);
- mergeNodes(graph);
- if ( debugGraphTransformations ) graph.printGraph(new File("merged.dot"), PRUNE_FACTOR);
- pruneGraph(graph, PRUNE_FACTOR);
- if ( debugGraphTransformations ) graph.printGraph(new File("pruned.dot"), PRUNE_FACTOR);
- mergeNodes(graph);
- if ( debugGraphTransformations ) graph.printGraph(new File("merged2.dot"), PRUNE_FACTOR);
- if( graph.getReferenceSourceVertex() != null ) { // if the graph contains interesting variation from the reference
- sanityCheckReferenceGraph(graph, refHaplotype);
- graphs.add(graph);
+ graph.cleanNonRefPaths();
+
+ final SeqGraph seqGraph = toSeqGraph(graph);
+
+ if( seqGraph.getReferenceSourceVertex() != null ) { // if the graph contains interesting variation from the reference
+ sanityCheckReferenceGraph(seqGraph, refHaplotype);
+ graphs.add(seqGraph);
+
+ if ( debugGraphTransformations ) // we only want to use one graph size
+ break;
}
}
+
}
+
+ return graphs;
}
- @Requires({"graph != null"})
- protected static void mergeNodes( final DeBruijnAssemblyGraph graph ) {
- boolean foundNodesToMerge = true;
- while( foundNodesToMerge ) {
- foundNodesToMerge = false;
-
- for( final DeBruijnEdge e : graph.edgeSet() ) {
- final DeBruijnVertex outgoingVertex = graph.getEdgeTarget(e);
- final DeBruijnVertex incomingVertex = graph.getEdgeSource(e);
- if( !outgoingVertex.equals(incomingVertex) && graph.outDegreeOf(incomingVertex) == 1 && graph.inDegreeOf(outgoingVertex) == 1 &&
- graph.inDegreeOf(incomingVertex) <= 1 && graph.outDegreeOf(outgoingVertex) <= 1 && graph.isReferenceNode(incomingVertex) == graph.isReferenceNode(outgoingVertex) ) {
- final Set outEdges = graph.outgoingEdgesOf(outgoingVertex);
- final Set inEdges = graph.incomingEdgesOf(incomingVertex);
- if( inEdges.size() == 1 && outEdges.size() == 1 ) {
- inEdges.iterator().next().setMultiplicity( inEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() / 2 ) );
- outEdges.iterator().next().setMultiplicity( outEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() / 2 ) );
- } else if( inEdges.size() == 1 ) {
- inEdges.iterator().next().setMultiplicity( inEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() - 1 ) );
- } else if( outEdges.size() == 1 ) {
- outEdges.iterator().next().setMultiplicity( outEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() - 1 ) );
- }
-
- final DeBruijnVertex addedVertex = new DeBruijnVertex( ArrayUtils.addAll(incomingVertex.getSequence(), outgoingVertex.getSuffix()), outgoingVertex.kmer );
- graph.addVertex(addedVertex);
- for( final DeBruijnEdge edge : outEdges ) {
- graph.addEdge(addedVertex, graph.getEdgeTarget(edge), new DeBruijnEdge(edge.isRef(), edge.getMultiplicity()));
- }
- for( final DeBruijnEdge edge : inEdges ) {
- graph.addEdge(graph.getEdgeSource(edge), addedVertex, new DeBruijnEdge(edge.isRef(), edge.getMultiplicity()));
- }
-
- graph.removeVertex( incomingVertex );
- graph.removeVertex( outgoingVertex );
- foundNodesToMerge = true;
- break;
- }
- }
- }
+ private SeqGraph toSeqGraph(final DeBruijnGraph deBruijnGraph) {
+ final SeqGraph seqGraph = deBruijnGraph.convertToSequenceGraph();
+ if ( debugGraphTransformations ) seqGraph.printGraph(new File("sequenceGraph.1.dot"), PRUNE_FACTOR);
+ seqGraph.pruneGraph(PRUNE_FACTOR);
+ if ( debugGraphTransformations ) seqGraph.printGraph(new File("sequenceGraph.2.pruned.dot"), PRUNE_FACTOR);
+ seqGraph.mergeNodes();
+ if ( debugGraphTransformations ) seqGraph.printGraph(new File("sequenceGraph.3.merged.preclean.dot"), PRUNE_FACTOR);
+ seqGraph.removeVerticesNotConnectedToRef();
+ if ( debugGraphTransformations ) seqGraph.printGraph(new File("sequenceGraph.4.merged.dot"), PRUNE_FACTOR);
+ seqGraph.mergeBranchingNodes();
+ if ( debugGraphTransformations ) seqGraph.printGraph(new File("sequenceGraph.5.simplified.dot"), PRUNE_FACTOR);
+ seqGraph.mergeNodes();
+ if ( debugGraphTransformations ) seqGraph.printGraph(new File("sequenceGraph.6.simplified.merged.dot"), PRUNE_FACTOR);
+ return seqGraph;
}
- //
- // X -> ABC -> Y
- // -> aBC -> Y
- //
- // becomes
- //
- // X -> A -> BCY
- // -> a -> BCY
- //
-// @Requires({"graph != null"})
-// protected static void simplifyMergedGraph(final DeBruijnAssemblyGraph graph) {
-// boolean foundNodesToMerge = true;
-// while( foundNodesToMerge ) {
-// foundNodesToMerge = false;
-//
-// for( final DeBruijnVertex v : graph.vertexSet() ) {
-// if ( isRootOfComplexDiamond(v) ) {
-// foundNodesToMerge = simplifyComplexDiamond(graph, v);
-// if ( foundNodesToMerge )
-// break;
-// }
-// }
-// }
-// }
-//
-// private static boolean simplifyComplexDiamond(final DeBruijnAssemblyGraph graph, final DeBruijnVertex root) {
-// final Set outEdges = graph.outgoingEdgesOf(root);
-// final DeBruijnVertex diamondBottom = graph.getEdge(graph.getEdgeTarget(outEdges.iterator().next());
-// // all of the edges point to the same sink, so it's time to merge
-// final byte[] commonSuffix = commonSuffixOfEdgeTargets(outEdges, targetSink);
-// if ( commonSuffix != null ) {
-// final DeBruijnVertex suffixVertex = new DeBruijnVertex(commonSuffix, graph.getKmerSize());
-// graph.addVertex(suffixVertex);
-// graph.addEdge(suffixVertex, targetSink);
-//
-// for( final DeBruijnEdge edge : outEdges ) {
-// final DeBruijnVertex target = graph.getEdgeTarget(edge);
-// final DeBruijnVertex prefix = target.withoutSuffix(commonSuffix);
-// graph.addEdge(prefix, suffixVertex, new DeBruijnEdge(edge.isRef(), edge.getMultiplicity()));
-// graph.removeVertex(graph.getEdgeTarget(edge));
-// graph.removeAllEdges(root, target);
-// graph.removeAllEdges(target, targetSink);
-// }
-//
-// graph.removeAllEdges(outEdges);
-// graph.removeVertex(targetSink);
-//
-// return true;
-// } else {
-// return false;
-// }
-// }
-
- protected static void cleanNonRefPaths( final DeBruijnAssemblyGraph graph ) {
- if( graph.getReferenceSourceVertex() == null || graph.getReferenceSinkVertex() == null ) {
- return;
- }
- // Remove non-ref edges connected before and after the reference path
- final Set edgesToCheck = new HashSet();
- edgesToCheck.addAll(graph.incomingEdgesOf(graph.getReferenceSourceVertex()));
- while( !edgesToCheck.isEmpty() ) {
- final DeBruijnEdge e = edgesToCheck.iterator().next();
- if( !e.isRef() ) {
- edgesToCheck.addAll( graph.incomingEdgesOf(graph.getEdgeSource(e)) );
- graph.removeEdge(e);
- }
- edgesToCheck.remove(e);
- }
- edgesToCheck.addAll(graph.outgoingEdgesOf(graph.getReferenceSinkVertex()));
- while( !edgesToCheck.isEmpty() ) {
- final DeBruijnEdge e = edgesToCheck.iterator().next();
- if( !e.isRef() ) {
- edgesToCheck.addAll( graph.outgoingEdgesOf(graph.getEdgeTarget(e)) );
- graph.removeEdge(e);
- }
- edgesToCheck.remove(e);
- }
-
- // Run through the graph and clean up singular orphaned nodes
- final List verticesToRemove = new ArrayList();
- for( final DeBruijnVertex v : graph.vertexSet() ) {
- if( graph.inDegreeOf(v) == 0 && graph.outDegreeOf(v) == 0 ) {
- verticesToRemove.add(v);
- }
- }
- graph.removeAllVertices(verticesToRemove);
- }
-
- protected static void pruneGraph( final DeBruijnAssemblyGraph graph, final int pruneFactor ) {
- final List edgesToRemove = new ArrayList();
- for( final DeBruijnEdge e : graph.edgeSet() ) {
- if( e.getMultiplicity() <= pruneFactor && !e.isRef() ) { // remove non-reference edges with weight less than or equal to the pruning factor
- edgesToRemove.add(e);
- }
- }
- graph.removeAllEdges(edgesToRemove);
-
- // Run through the graph and clean up singular orphaned nodes
- final List verticesToRemove = new ArrayList();
- for( final DeBruijnVertex v : graph.vertexSet() ) {
- if( graph.inDegreeOf(v) == 0 && graph.outDegreeOf(v) == 0 ) {
- verticesToRemove.add(v);
- }
- }
- graph.removeAllVertices(verticesToRemove);
- }
-
- protected static void sanityCheckReferenceGraph(final DeBruijnAssemblyGraph graph, final Haplotype refHaplotype) {
+ protected void sanityCheckReferenceGraph(final BaseGraph graph, final Haplotype refHaplotype) {
if( graph.getReferenceSourceVertex() == null ) {
throw new IllegalStateException("All reference graphs must have a reference source vertex.");
}
@@ -338,9 +222,9 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
@Requires({"reads != null", "KMER_LENGTH > 0", "refHaplotype != null"})
- protected static DeBruijnAssemblyGraph createGraphFromSequences( final List reads, final int KMER_LENGTH, final Haplotype refHaplotype, final boolean DEBUG ) {
+ protected DeBruijnGraph createGraphFromSequences( final List reads, final int KMER_LENGTH, final Haplotype refHaplotype, final boolean DEBUG ) {
- final DeBruijnAssemblyGraph graph = new DeBruijnAssemblyGraph(KMER_LENGTH);
+ final DeBruijnGraph graph = new DeBruijnGraph(KMER_LENGTH);
// First pull kmers from the reference haplotype and add them to the graph
//logger.info("Adding reference sequence to graph " + refHaplotype.getBaseString());
@@ -370,7 +254,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
// if the qualities of all the bases in the kmers are high enough
boolean badKmer = false;
for( int jjj = iii; jjj < iii + KMER_LENGTH + 1; jjj++) {
- if( qualities[jjj] < MIN_QUALITY ) {
+ if( qualities[jjj] < minBaseQualityToUseInAssembly ) {
badKmer = true;
break;
}
@@ -397,11 +281,11 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
return graph;
}
- protected void printGraphs() {
+ protected void printGraphs(final List<SeqGraph> graphs) {
final int writeFirstGraphWithSizeSmallerThan = 50;
graphWriter.println("digraph assemblyGraphs {");
- for( final DeBruijnAssemblyGraph graph : graphs ) {
+ for( final SeqGraph graph : graphs ) {
if ( debugGraphTransformations && graph.getKmerSize() >= writeFirstGraphWithSizeSmallerThan ) {
logger.info("Skipping writing of graph with kmersize " + graph.getKmerSize());
continue;
@@ -418,7 +302,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
@Requires({"refWithPadding.length > refHaplotype.getBases().length", "refLoc.containsP(activeRegionWindow)"})
@Ensures({"result.contains(refHaplotype)"})
- private List findBestPaths( final Haplotype refHaplotype, final byte[] refWithPadding, final GenomeLoc refLoc, final List activeAllelesToGenotype, final GenomeLoc activeRegionWindow ) {
+ private List findBestPaths( final List graphs, final Haplotype refHaplotype, final byte[] refWithPadding, final GenomeLoc refLoc, final List activeAllelesToGenotype, final GenomeLoc activeRegionWindow ) {
// add the reference haplotype separately from all the others to ensure that it is present in the list of haplotypes
// TODO -- this use of an array with contains lower may be a performance problem returning in an O(N^2) algorithm
@@ -440,8 +324,8 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
}
}
- for( final DeBruijnAssemblyGraph graph : graphs ) {
- for ( final KBestPaths.Path path : KBestPaths.getKBestPaths(graph, NUM_BEST_PATHS_PER_KMER_GRAPH) ) {
+ for( final SeqGraph graph : graphs ) {
+ for ( final KBestPaths.Path path : new KBestPaths().getKBestPaths(graph, NUM_BEST_PATHS_PER_KMER_GRAPH) ) {
Haplotype h = new Haplotype( path.getBases() );
if( !returnHaplotypes.contains(h) ) {
final Cigar cigar = path.calculateCigar();
@@ -466,6 +350,9 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
h.setScore(path.getScore());
returnHaplotypes.add(h);
+ if ( debug )
+ logger.info("Adding haplotype " + h.getCigar() + " from debruijn graph with kmer " + graph.getKmerSize());
+
// for GGA mode, add the desired allele into the haplotype if it isn't already present
if( !activeAllelesToGenotype.isEmpty() ) {
final Map eventMap = GenotypingEngine.generateVCsFromAlignment( h, h.getAlignmentStartHapwrtRef(), h.getCigar(), refWithPadding, h.getBases(), refLoc, "HCassembly" ); // BUGBUG: need to put this function in a shared place
@@ -599,7 +486,7 @@ public class DeBruijnAssembler extends LocalAssemblyEngine {
* @return the left-aligned cigar
*/
@Ensures({"cigar != null", "refSeq != null", "readSeq != null", "refIndex >= 0", "readIndex >= 0"})
- protected static Cigar leftAlignCigarSequentially(final Cigar cigar, final byte[] refSeq, final byte[] readSeq, int refIndex, int readIndex) {
+ protected Cigar leftAlignCigarSequentially(final Cigar cigar, final byte[] refSeq, final byte[] readSeq, int refIndex, int readIndex) {
final Cigar cigarToReturn = new Cigar();
Cigar cigarToAlign = new Cigar();
for (int i = 0; i < cigar.numCigarElements(); i++) {
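Note on the kmer-size loop in createDeBruijnGraphs above: it starts at the largest kmer the reads can support and steps down to minKmer, returning an empty list when the reads are too short. A minimal standalone sketch of that iteration follows. The class name, method name and the numeric values in main are illustrative assumptions (the real KMER_OVERLAP and GRAPH_KMER_STEP values are not shown in this patch), and the debug-only kmer filter is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of the patch: lists the kmer sizes the assembler loop would visit.
final class KmerSizeSketch {
    static List<Integer> kmerSizesToTry(final int maxReadLength, final int kmerOverlap,
                                        final int minKmer, final int step) {
        final int maxKmer = maxReadLength - kmerOverlap - 1;
        if (maxKmer < minKmer) {
            // Reads are too small for assembly, so no graph sizes are attempted
            return Collections.emptyList();
        }
        final List<Integer> sizes = new ArrayList<Integer>();
        for (int kmer = maxKmer; kmer >= minKmer; kmer -= step) {
            sizes.add(kmer);
        }
        return sizes;
    }

    public static void main(final String[] args) {
        // Illustrative values only: read length 40, overlap 5, minKmer 11, step 6
        System.out.println(kmerSizesToTry(40, 5, 11, 6));
    }
}
```

Returning Collections.emptyList() rather than null lets the caller iterate over the result without a null check, which matches the new return-value style of createDeBruijnGraphs.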
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnGraph.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnGraph.java
new file mode 100644
index 000000000..d9df03539
--- /dev/null
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnGraph.java
@@ -0,0 +1,179 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
+
+import com.google.java.contract.Ensures;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+
+/**
+ * A DeBruijn kmer graph
+ *
+ * User: rpoplin
+ * Date: 2/6/13
+ */
+public class DeBruijnGraph extends BaseGraph {
+ /**
+ * Create an empty DeBruijnGraph with default kmer size
+ */
+ public DeBruijnGraph() {
+ super();
+ }
+
+ /**
+ * Create an empty DeBruijnGraph with kmer size
+ * @param kmerSize kmer size, must be >= 1
+ */
+ public DeBruijnGraph(int kmerSize) {
+ super(kmerSize);
+ }
+
+ /**
+ * Pull kmers out of the given long sequence and add them to the graph
+ * @param sequence byte array holding the sequence with which to build the assembly graph
+ * @param KMER_LENGTH the desired kmer length to use
+ * @param isRef if true the kmers added to the graph will have reference edges linking them
+ */
+ public void addSequenceToGraph( final byte[] sequence, final int KMER_LENGTH, final boolean isRef ) {
+ if( sequence.length < KMER_LENGTH + 1 ) { throw new IllegalArgumentException("Provided sequence is too small for the given kmer length"); }
+ final int kmersInSequence = sequence.length - KMER_LENGTH + 1;
+ for( int iii = 0; iii < kmersInSequence - 1; iii++ ) {
+ addKmersToGraph(Arrays.copyOfRange(sequence, iii, iii + KMER_LENGTH), Arrays.copyOfRange(sequence, iii + 1, iii + 1 + KMER_LENGTH), isRef, 1);
+ }
+ }
+
+ /**
+ * Error correct the kmers in this graph, returning a new graph built from those error corrected kmers
+ * @return a freshly allocated graph
+ */
+ protected DeBruijnGraph errorCorrect() {
+ final KMerErrorCorrector corrector = new KMerErrorCorrector(getKmerSize(), 1, 1, 5); // TODO -- should be static variables
+
+ for( final BaseEdge e : edgeSet() ) {
+ for ( final byte[] kmer : Arrays.asList(getEdgeSource(e).getSequence(), getEdgeTarget(e).getSequence())) {
+ // TODO -- need a cleaner way to deal with the ref weight
+ corrector.addKmer(kmer, e.isRef() ? 1000 : e.getMultiplicity());
+ }
+ }
+ corrector.computeErrorCorrectionMap();
+
+ final DeBruijnGraph correctedGraph = new DeBruijnGraph(getKmerSize());
+
+ for( final BaseEdge e : edgeSet() ) {
+ final byte[] source = corrector.getErrorCorrectedKmer(getEdgeSource(e).getSequence());
+ final byte[] target = corrector.getErrorCorrectedKmer(getEdgeTarget(e).getSequence());
+ if ( source != null && target != null ) {
+ correctedGraph.addKmersToGraph(source, target, e.isRef(), e.getMultiplicity());
+ }
+ }
+
+ return correctedGraph;
+ }
+
+ /**
+ * Add edge to assembly graph connecting the two kmers
+ * @param kmer1 the source kmer for the edge
+ * @param kmer2 the target kmer for the edge
+ * @param isRef true if the added edge is a reference edge
+ * @return will return false if trying to add a reference edge which creates a cycle in the assembly graph
+ */
+ public boolean addKmersToGraph( final byte[] kmer1, final byte[] kmer2, final boolean isRef, final int multiplicity ) {
+ if( kmer1 == null ) { throw new IllegalArgumentException("Attempting to add a null kmer to the graph."); }
+ if( kmer2 == null ) { throw new IllegalArgumentException("Attempting to add a null kmer to the graph."); }
+ if( kmer1.length != kmer2.length ) { throw new IllegalArgumentException("Attempting to add kmers with different lengths to the graph."); }
+
+ final int numVertexBefore = vertexSet().size();
+ final DeBruijnVertex v1 = new DeBruijnVertex( kmer1 );
+ addVertex(v1);
+ final DeBruijnVertex v2 = new DeBruijnVertex( kmer2 );
+ addVertex(v2);
+ if( isRef && vertexSet().size() == numVertexBefore ) { return false; }
+
+ final BaseEdge targetEdge = getEdge(v1, v2);
+ if ( targetEdge == null ) {
+ addEdge(v1, v2, new BaseEdge( isRef, multiplicity ));
+ } else {
+ if( isRef ) {
+ targetEdge.setIsRef( true );
+ }
+ targetEdge.setMultiplicity(targetEdge.getMultiplicity() + multiplicity);
+ }
+ return true;
+ }
+
+ /**
+ * Convert this kmer graph to a simple sequence graph.
+ *
+ * Each kmer suffix shows up as a distinct SeqVertex, attached in the same structure as in the kmer
+ * graph. Nodes that are sources are mapped to SeqVertex nodes that contain all of their sequence.
+ *
+ * @return a newly allocated SeqGraph
+ */
+ @Ensures({"result != null"})
+ protected SeqGraph convertToSequenceGraph() {
+ final SeqGraph seqGraph = new SeqGraph(getKmerSize());
+ final Map<DeBruijnVertex, SeqVertex> vertexMap = new HashMap<DeBruijnVertex, SeqVertex>();
+
+ // create all of the equivalent seq graph vertices
+ for ( final DeBruijnVertex dv : vertexSet() ) {
+ final SeqVertex sv = new SeqVertex(dv.getAdditionalSequence(isSource(dv)));
+ vertexMap.put(dv, sv);
+ seqGraph.addVertex(sv);
+ }
+
+ // walk through the nodes and connect them to their equivalent seq vertices
+ for( final BaseEdge e : edgeSet() ) {
+ final SeqVertex seqOutV = vertexMap.get(getEdgeTarget(e));
+ final SeqVertex seqInV = vertexMap.get(getEdgeSource(e));
+ seqGraph.addEdge(seqInV, seqOutV, e);
+ }
+
+ return seqGraph;
+ }
+}
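Note on addSequenceToGraph above: each iteration links the kmer starting at position iii to the kmer starting at iii + 1, so a sequence containing n kmers yields n - 1 edges. The sketch below reproduces only that windowing arithmetic on plain strings. The class and method names are hypothetical, and no graph is built.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the patch: enumerates the (source, target) kmer pairs
// that addSequenceToGraph would hand to addKmersToGraph.
final class KmerPairSketch {
    static List<String[]> adjacentKmerPairs(final byte[] sequence, final int kmerLength) {
        if (sequence.length < kmerLength + 1) {
            throw new IllegalArgumentException("Provided sequence is too small for the given kmer length");
        }
        final int kmersInSequence = sequence.length - kmerLength + 1;
        final List<String[]> pairs = new ArrayList<String[]>();
        for (int i = 0; i < kmersInSequence - 1; i++) {
            // Source and target kmers overlap by kmerLength - 1 bases
            final String source = new String(Arrays.copyOfRange(sequence, i, i + kmerLength));
            final String target = new String(Arrays.copyOfRange(sequence, i + 1, i + 1 + kmerLength));
            pairs.add(new String[] { source, target });
        }
        return pairs;
    }

    public static void main(final String[] args) {
        for (final String[] pair : adjacentKmerPairs("ACGTA".getBytes(), 3)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

For the sequence ACGTA with a kmer length of 3, this yields the two pairs ACG -> CGT and CGT -> GTA.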
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java
index aa8e24576..47716b7c5 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/DeBruijnVertex.java
@@ -52,59 +52,50 @@ import com.google.java.contract.Invariant;
import java.util.Arrays;
/**
- * Created by IntelliJ IDEA.
+ * simple node class for storing kmer sequences
+ *
* User: ebanks
* Date: Mar 23, 2011
*/
-// simple node class for storing kmer sequences
-@Invariant("kmer > 0")
-public class DeBruijnVertex {
-
- protected final byte[] sequence;
- public final int kmer;
-
- public DeBruijnVertex( final byte[] sequence, final int kmer ) {
- this.sequence = sequence.clone();
- this.kmer = kmer;
- }
-
- protected DeBruijnVertex( final String sequence, final int kmer ) {
- this(sequence.getBytes(), kmer);
+public class DeBruijnVertex extends BaseVertex {
+ public DeBruijnVertex( final byte[] sequence ) {
+ super(sequence);
}
+ /**
+ * For testing purposes only
+ * @param sequence the kmer sequence as a String
+ */
protected DeBruijnVertex( final String sequence ) {
- this(sequence.getBytes(), sequence.length());
+ this(sequence.getBytes());
}
+ /**
+ * Get the kmer size for this DeBruijnVertex
+ * @return integer >= 1
+ */
+ @Ensures("result >= 1")
public int getKmer() {
- return kmer;
+ return sequence.length;
}
- @Override
- public boolean equals( Object v ) {
- return v instanceof DeBruijnVertex && Arrays.equals(sequence, ((DeBruijnVertex) v).sequence);
- }
-
- @Override
- public int hashCode() { // necessary to override here so that graph.containsVertex() works the same way as vertex.equals() as one might expect
- return Arrays.hashCode(sequence);
- }
-
- public String toString() {
- return new String(sequence);
- }
-
+ /**
+ * Get the string representation of the suffix of this DeBruijnVertex
+ * @return a non-null non-empty string
+ */
+ @Ensures({"result != null", "result.length() >= 1"})
public String getSuffixString() {
return new String(getSuffix());
}
@Ensures("result != null")
- public byte[] getSequence() {
- return sequence.clone();
+ // TODO this could be replaced with byte as the suffix is guaranteed to be exactly 1 base
+ public byte[] getSuffix() {
+ return Arrays.copyOfRange( sequence, getKmer() - 1, sequence.length );
}
- @Ensures("result != null")
- public byte[] getSuffix() {
- return Arrays.copyOfRange( sequence, kmer - 1, sequence.length );
+ @Override
+ public byte[] getAdditionalSequence(boolean source) {
+ return source ? super.getAdditionalSequence(source) : getSuffix();
}
}
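Note on getAdditionalSequence above: a source vertex contributes its whole kmer while every other vertex contributes only its one-base suffix, which is what lets convertToSequenceGraph spell the original sequence along a path. The sketch below illustrates this on plain byte arrays. It assumes that the BaseVertex implementation (not shown in this hunk) returns the full sequence, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper, not part of the patch: mimics the suffix logic of DeBruijnVertex.
final class SuffixSketch {
    /** Last base of the kmer, as in getSuffix() where getKmer() == sequence.length. */
    static byte[] suffix(final byte[] kmer) {
        return Arrays.copyOfRange(kmer, kmer.length - 1, kmer.length);
    }

    /** Assumption: the superclass version returns the complete kmer sequence. */
    static byte[] additionalSequence(final byte[] kmer, final boolean source) {
        return source ? kmer.clone() : suffix(kmer);
    }

    /** Spell the sequence along a path of kmers; the first kmer is treated as the source. */
    static String spell(final String... kmersAlongPath) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < kmersAlongPath.length; i++) {
            final byte[] bases = additionalSequence(kmersAlongPath[i].getBytes(), i == 0);
            out.write(bases, 0, bases.length);
        }
        return out.toString();
    }

    public static void main(final String[] args) {
        System.out.println(spell("ACG", "CGT", "GTA"));
    }
}
```

Walking the kmers ACG, CGT, GTA this way gives ACG + T + A, that is, the original sequence ACGTA.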
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
index d5f283475..7bec4bee5 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
@@ -284,8 +284,11 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
@Argument(fullName="debug", shortName="debug", doc="If specified, print out very verbose debug information about each triggering active region", required = false)
protected boolean DEBUG;
- @Argument(fullName="debugGraphTransformations", shortName="debugGraphTransformations", doc="If specified, we will write DOT formatted graph files out of the assembler", required = false)
- protected boolean debugGraphTransformations = false;
+ @Argument(fullName="debugGraphTransformations", shortName="debugGraphTransformations", doc="If specified, we will write DOT formatted graph files out of the assembler for only this graph size", required = false)
+ protected int debugGraphTransformations = -1;
+
+ @Argument(fullName="useLowQualityBasesForAssembly", shortName="useLowQualityBasesForAssembly", doc="If specified, we will include low quality bases when doing the assembly", required = false)
+ protected boolean useLowQualityBasesForAssembly = false;
// the UG engines
private UnifiedGenotyperEngine UG_engine = null;
@@ -389,7 +392,8 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
throw new UserException.CouldNotReadInputFile(getToolkit().getArguments().referenceFile, e);
}
- assemblyEngine = new DeBruijnAssembler( DEBUG, debugGraphTransformations, graphWriter, minKmer, maxHaplotypesToConsider );
+ final byte minBaseQualityToUseInAssembly = useLowQualityBasesForAssembly ? (byte)1 : DeBruijnAssembler.DEFAULT_MIN_BASE_QUALITY_TO_USE;
+ assemblyEngine = new DeBruijnAssembler( DEBUG, debugGraphTransformations, graphWriter, minKmer, maxHaplotypesToConsider, minBaseQualityToUseInAssembly );
likelihoodCalculationEngine = new LikelihoodCalculationEngine( (byte)gcpHMM, DEBUG, pairHMM );
genotypingEngine = new GenotypingEngine( DEBUG, annotationEngine, USE_FILTERED_READ_MAP_FOR_ANNOTATIONS );
@@ -610,7 +614,7 @@ public class HaplotypeCaller extends ActiveRegionWalker implem
for( final GATKSAMRecord myRead : finalizedReadList ) {
final GATKSAMRecord postAdapterRead = ( myRead.getReadUnmappedFlag() ? myRead : ReadClipper.hardClipAdaptorSequence( myRead ) );
if( postAdapterRead != null && !postAdapterRead.isEmpty() && postAdapterRead.getCigar().getReadLength() > 0 ) {
- GATKSAMRecord clippedRead = ReadClipper.hardClipLowQualEnds( postAdapterRead, MIN_TAIL_QUALITY );
+ GATKSAMRecord clippedRead = useLowQualityBasesForAssembly ? postAdapterRead : ReadClipper.hardClipLowQualEnds( postAdapterRead, MIN_TAIL_QUALITY );
// revert soft clips so that we see the alignment start and end assuming the soft clips are all matches
// TODO -- WARNING -- still possibility that unclipping the soft clips will introduce bases that aren't
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KBestPaths.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KBestPaths.java
index e97fdb3cb..8c29cfa98 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KBestPaths.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KBestPaths.java
@@ -52,13 +52,8 @@ import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.CigarOperator;
import org.apache.commons.lang.ArrayUtils;
-import org.broadinstitute.sting.utils.GenomeLoc;
-import org.broadinstitute.sting.utils.Haplotype;
import org.broadinstitute.sting.utils.SWPairwiseAlignment;
-import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import org.broadinstitute.sting.utils.sam.AlignmentUtils;
-import org.broadinstitute.variant.variantcontext.Allele;
-import org.broadinstitute.variant.variantcontext.VariantContext;
import java.io.Serializable;
import java.util.*;
@@ -70,28 +65,27 @@ import java.util.*;
*/
// Class for finding the K best paths (as determined by the sum of multiplicities of the edges) in a graph.
// This is different from most graph traversals because we want to test paths from any source node to any sink node.
-public class KBestPaths {
-
+public class KBestPaths {
// static access only
- protected KBestPaths() { }
+ public KBestPaths() { }
+
private static int MAX_PATHS_TO_HOLD = 100;
protected static class MyInt { public int val = 0; }
// class to keep track of paths
- protected static class Path {
-
+ protected static class Path {
// the last vertex seen in the path
- private final DeBruijnVertex lastVertex;
+ private final T lastVertex;
// the list of edges comprising the path
- private final List<DeBruijnEdge> edges;
+ private final List<BaseEdge> edges;
// the scores for the path
private final int totalScore;
// the graph from which this path originated
- private final DeBruijnAssemblyGraph graph;
+ private final BaseGraph graph;
// used in the bubble state machine to apply Smith-Waterman to the bubble sequence
// these values were chosen via optimization against the NA12878 knowledge base
@@ -101,19 +95,19 @@ public class KBestPaths {
private static final double SW_GAP_EXTEND = -1.1;
private static final byte[] STARTING_SW_ANCHOR_BYTES = "XXXXXXXXX".getBytes();
- public Path( final DeBruijnVertex initialVertex, final DeBruijnAssemblyGraph graph ) {
+ public Path( final T initialVertex, final BaseGraph graph ) {
lastVertex = initialVertex;
- edges = new ArrayList<DeBruijnEdge>(0);
+ edges = new ArrayList<BaseEdge>(0);
totalScore = 0;
this.graph = graph;
}
- public Path( final Path p, final DeBruijnEdge edge ) {
+ public Path( final Path p, final BaseEdge edge ) {
if( !p.graph.getEdgeSource(edge).equals(p.lastVertex) ) { throw new IllegalStateException("Edges added to path must be contiguous."); }
graph = p.graph;
lastVertex = p.graph.getEdgeTarget(edge);
- edges = new ArrayList<DeBruijnEdge>(p.edges);
+ edges = new ArrayList<BaseEdge>(p.edges);
edges.add(edge);
totalScore = p.totalScore + edge.getMultiplicity();
}
@@ -123,10 +117,10 @@ public class KBestPaths {
* @param edge the given edge to test
* @return true if the edge is found in this path
*/
- public boolean containsEdge( final DeBruijnEdge edge ) {
+ public boolean containsEdge( final BaseEdge edge ) {
if( edge == null ) { throw new IllegalArgumentException("Attempting to test null edge."); }
- for( final DeBruijnEdge e : edges ) {
+ for( final BaseEdge e : edges ) {
if( e.equals(graph, edge) ) {
return true;
}
@@ -140,11 +134,11 @@ public class KBestPaths {
* @param edge the given edge to test
* @return number of times this edge appears in the path
*/
- public int numInPath( final DeBruijnEdge edge ) {
+ public int numInPath( final BaseEdge edge ) {
if( edge == null ) { throw new IllegalArgumentException("Attempting to test null edge."); }
int numInPath = 0;
- for( final DeBruijnEdge e : edges ) {
+ for( final BaseEdge e : edges ) {
if( e.equals(graph, edge) ) {
numInPath++;
}
@@ -153,22 +147,11 @@ public class KBestPaths {
return numInPath;
}
- /**
- * Does this path contain a reference edge?
- * @return true if the path contains a reference edge
- */
- public boolean containsRefEdge() {
- for( final DeBruijnEdge e : edges ) {
- if( e.isRef() ) { return true; }
- }
- return false;
- }
-
- public List<DeBruijnEdge> getEdges() { return edges; }
+ public List<BaseEdge> getEdges() { return edges; }
public int getScore() { return totalScore; }
- public DeBruijnVertex getLastVertexInPath() { return lastVertex; }
+ public T getLastVertexInPath() { return lastVertex; }
/**
* The base sequence for this path. Pull the full sequence for source nodes and then the suffix for all subsequent nodes
@@ -179,7 +162,7 @@ public class KBestPaths {
if( edges.size() == 0 ) { return graph.getAdditionalSequence(lastVertex); }
byte[] bases = graph.getAdditionalSequence(graph.getEdgeSource(edges.get(0)));
- for( final DeBruijnEdge e : edges ) {
+ for( final BaseEdge e : edges ) {
bases = ArrayUtils.addAll(bases, graph.getAdditionalSequence(graph.getEdgeTarget(e)));
}
return bases;
@@ -201,9 +184,9 @@ public class KBestPaths {
}
// reset the bubble state machine
- final BubbleStateMachine bsm = new BubbleStateMachine(cigar);
+ final BubbleStateMachine bsm = new BubbleStateMachine(cigar);
- for( final DeBruijnEdge e : edges ) {
+ for( final BaseEdge e : edges ) {
if( e.equals(graph, edges.get(0)) ) {
advanceBubbleStateMachine( bsm, graph.getEdgeSource(e), null );
}
@@ -231,7 +214,7 @@ public class KBestPaths {
* @param e the edge which generated this node in the path
*/
@Requires({"bsm != null", "graph != null", "node != null"})
- private void advanceBubbleStateMachine( final BubbleStateMachine bsm, final DeBruijnVertex node, final DeBruijnEdge e ) {
+ private void advanceBubbleStateMachine( final BubbleStateMachine bsm, final T node, final BaseEdge e ) {
if( graph.isReferenceNode( node ) ) {
if( !bsm.inBubble ) { // just add the ref bases as M's in the Cigar string, and don't do anything else
if( e !=null && !e.isRef() ) {
@@ -283,7 +266,7 @@ public class KBestPaths {
*/
@Requires({"graph != null"})
@Ensures({"result != null"})
- private Cigar calculateCigarForCompleteBubble( final byte[] bubbleBytes, final DeBruijnVertex fromVertex, final DeBruijnVertex toVertex ) {
+ private Cigar calculateCigarForCompleteBubble( final byte[] bubbleBytes, final T fromVertex, final T toVertex ) {
final byte[] refBytes = graph.getReferenceBytes(fromVertex == null ? graph.getReferenceSourceVertex() : fromVertex, toVertex == null ? graph.getReferenceSinkVertex() : toVertex, fromVertex == null, toVertex == null);
final Cigar returnCigar = new Cigar();
@@ -328,10 +311,10 @@ public class KBestPaths {
}
// class to keep track of the bubble state machine
- protected static class BubbleStateMachine {
+ protected static class BubbleStateMachine {
public boolean inBubble = false;
public byte[] bubbleBytes = null;
- public DeBruijnVertex lastSeenReferenceNode = null;
+ public T lastSeenReferenceNode = null;
public Cigar cigar = null;
public BubbleStateMachine( final Cigar initialCigar ) {
@@ -358,14 +341,14 @@ public class KBestPaths {
* @return a list with at most k top-scoring paths from the graph
*/
@Ensures({"result != null", "result.size() <= k"})
- public static List getKBestPaths( final DeBruijnAssemblyGraph graph, final int k ) {
+ public List getKBestPaths( final BaseGraph graph, final int k ) {
if( graph == null ) { throw new IllegalArgumentException("Attempting to traverse a null graph."); }
if( k > MAX_PATHS_TO_HOLD/2 ) { throw new IllegalArgumentException("Asked for more paths than internal parameters allow for."); }
final ArrayList bestPaths = new ArrayList();
// run a DFS for best paths
- for( final DeBruijnVertex v : graph.vertexSet() ) {
+ for( final T v : graph.vertexSet() ) {
if( graph.inDegreeOf(v) == 0 ) {
findBestPaths(new Path(v, graph), bestPaths);
}
@@ -376,31 +359,28 @@ public class KBestPaths {
return bestPaths.subList(0, Math.min(k, bestPaths.size()));
}
- private static void findBestPaths( final Path path, final List bestPaths ) {
+ private void findBestPaths( final Path path, final List bestPaths ) {
findBestPaths(path, bestPaths, new MyInt());
}
- private static void findBestPaths( final Path path, final List bestPaths, final MyInt n ) {
+ private void findBestPaths( final Path path, final List bestPaths, final MyInt n ) {
// did we hit the end of a path?
if ( allOutgoingEdgesHaveBeenVisited(path) ) {
- if( path.containsRefEdge() ) {
- if ( bestPaths.size() >= MAX_PATHS_TO_HOLD ) {
- // clean out some low scoring paths
- Collections.sort(bestPaths, new PathComparatorTotalScore() );
- for(int iii = 0; iii < 20; iii++) { bestPaths.remove(0); } // BUGBUG: assumes MAX_PATHS_TO_HOLD >> 20
- }
- bestPaths.add(path);
+ if ( bestPaths.size() >= MAX_PATHS_TO_HOLD ) {
+ // clean out some low scoring paths
+ Collections.sort(bestPaths, new PathComparatorTotalScore() );
+ for(int iii = 0; iii < 20; iii++) { bestPaths.remove(0); } // BUGBUG: assumes MAX_PATHS_TO_HOLD >> 20
}
+ bestPaths.add(path);
} else if( n.val > 10000) {
// do nothing, just return
} else {
// recursively run DFS
- final ArrayList<DeBruijnEdge> edgeArrayList = new ArrayList<DeBruijnEdge>();
+ final ArrayList<BaseEdge> edgeArrayList = new ArrayList<BaseEdge>();
edgeArrayList.addAll(path.graph.outgoingEdgesOf(path.lastVertex));
- Collections.sort(edgeArrayList, new DeBruijnEdge.EdgeWeightComparator());
- Collections.reverse(edgeArrayList);
- for ( final DeBruijnEdge edge : edgeArrayList ) {
+ Collections.sort(edgeArrayList, new BaseEdge.EdgeWeightComparator());
+ for ( final BaseEdge edge : edgeArrayList ) {
// make sure the edge is not already in the path
if ( path.containsEdge(edge) )
continue;
@@ -416,8 +396,8 @@ public class KBestPaths {
* @param path the path to test
* @return true if all the outgoing edges at the end of this path have already been visited
*/
- private static boolean allOutgoingEdgesHaveBeenVisited( final Path path ) {
- for( final DeBruijnEdge edge : path.graph.outgoingEdgesOf(path.lastVertex) ) {
+ private boolean allOutgoingEdgesHaveBeenVisited( final Path path ) {
+ for( final BaseEdge edge : path.graph.outgoingEdgesOf(path.lastVertex) ) {
if( !path.containsEdge(edge) ) { // TODO -- investigate allowing numInPath < 2 to allow cycles
return false;
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java
index 66ea8a078..05bd1b881 100644
--- a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/KMerErrorCorrector.java
@@ -226,28 +226,16 @@ public class KMerErrorCorrector {
@Override
public String toString() {
final StringBuilder b = new StringBuilder("KMerErrorCorrector{");
- for ( Map.Entry<String, String> toCorrect : rawToErrorCorrectedMap.entrySet() ) {
- final boolean correcting = ! toCorrect.getKey().equals(toCorrect.getValue());
- if ( correcting )
- b.append(String.format("%n\t%s / %d -> %s / %d [correcting? %b]",
- toCorrect.getKey(), getCounts(toCorrect.getKey()),
- toCorrect.getValue(), getCounts(toCorrect.getValue()),
- correcting));
+ if ( rawToErrorCorrectedMap == null ) {
+ b.append("counting ").append(countsByKMer.size()).append(" distinct kmers");
+ } else {
+ for ( Map.Entry<String, String> toCorrect : rawToErrorCorrectedMap.entrySet() ) {
+ final boolean correcting = ! toCorrect.getKey().equals(toCorrect.getValue());
+ if ( correcting )
+ b.append(String.format("%n\tCorrecting %s -> %s", toCorrect.getKey(), toCorrect.getValue()));
+ }
}
b.append("\n}");
return b.toString();
}
-
- /**
- * Get a simple count estimate for printing for kmer
- * @param kmer the kmer
- * @return an integer count for kmer
- */
- private int getCounts(final String kmer) {
- if ( kmer == null ) return 0;
- final Integer count = countsByKMer == null ? -1 : countsByKMer.get(kmer);
- if ( count == null )
- throw new IllegalArgumentException("kmer not found in counts -- bug " + kmer);
- return count;
- }
}
diff --git a/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/SeqGraph.java b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/SeqGraph.java
new file mode 100644
index 000000000..960f2cdd7
--- /dev/null
+++ b/protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/SeqGraph.java
@@ -0,0 +1,280 @@
+/*
+* By downloading the PROGRAM you agree to the following terms of use:
+*
+* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
+*
+* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
+*
+* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
+* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
+* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
+*
+* 1. DEFINITIONS
+* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
+*
+* 2. LICENSE
+* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
+* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
+* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
+* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
+*
+* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
+* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
+* Copyright 2012 Broad Institute, Inc.
+* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
+* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
+*
+* 4. INDEMNIFICATION
+* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
+*
+* 5. NO REPRESENTATIONS OR WARRANTIES
+* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
+* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
+*
+* 6. ASSIGNMENT
+* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
+*
+* 7. MISCELLANEOUS
+* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
+* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
+* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
+* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
+* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
+* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
+* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
+*/
+
+package org.broadinstitute.sting.gatk.walkers.haplotypecaller;
+
+import org.apache.commons.lang.ArrayUtils;
+import org.apache.commons.lang.StringUtils;
+
+import java.util.*;
+
+/**
+ * A graph that contains base sequence at each node
+ *
+ * @author: depristo
+ * @since 03/2013
+ */
+public class SeqGraph extends BaseGraph<SeqVertex> {
+ /**
+ * Construct an empty SeqGraph
+ */
+ public SeqGraph() {
+ super();
+ }
+
+ /**
+ * Construct an empty SeqGraph where we'll add nodes based on a kmer size of kmer
+ *
+ * The kmer size is purely informational. When converting a de Bruijn graph to a SeqGraph, it lets
+ * us track the kmer size used to make the transformation.
+ *
+ * @param kmer the kmer size associated with this graph
+ */
+ public SeqGraph(final int kmer) {
+ super(kmer);
+ }
+
+ protected void mergeNodes() {
+ zipLinearChains();
+ }
+
+ protected void zipLinearChains() {
+ boolean foundNodesToMerge = true;
+ while( foundNodesToMerge ) {
+ foundNodesToMerge = false;
+
+ for( final BaseEdge e : edgeSet() ) {
+ final SeqVertex outgoingVertex = getEdgeTarget(e);
+ final SeqVertex incomingVertex = getEdgeSource(e);
+ if( !outgoingVertex.equals(incomingVertex)
+ && outDegreeOf(incomingVertex) == 1 && inDegreeOf(outgoingVertex) == 1
+ && isReferenceNode(incomingVertex) == isReferenceNode(outgoingVertex) ) {
+
+ final Set<BaseEdge> outEdges = outgoingEdgesOf(outgoingVertex);
+ final Set<BaseEdge> inEdges = incomingEdgesOf(incomingVertex);
+ if( inEdges.size() == 1 && outEdges.size() == 1 ) {
+ inEdges.iterator().next().setMultiplicity( inEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() / 2 ) );
+ outEdges.iterator().next().setMultiplicity( outEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() / 2 ) );
+ } else if( inEdges.size() == 1 ) {
+ inEdges.iterator().next().setMultiplicity( inEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() - 1 ) );
+ } else if( outEdges.size() == 1 ) {
+ outEdges.iterator().next().setMultiplicity( outEdges.iterator().next().getMultiplicity() + ( e.getMultiplicity() - 1 ) );
+ }
+
+ final SeqVertex addedVertex = new SeqVertex( ArrayUtils.addAll(incomingVertex.getSequence(), outgoingVertex.getSequence()) );
+ addVertex(addedVertex);
+ for( final BaseEdge edge : outEdges ) {
+ addEdge(addedVertex, getEdgeTarget(edge), new BaseEdge(edge.isRef(), edge.getMultiplicity()));
+ }
+ for( final BaseEdge edge : inEdges ) {
+ addEdge(getEdgeSource(edge), addedVertex, new BaseEdge(edge.isRef(), edge.getMultiplicity()));
+ }
+
+ removeVertex(incomingVertex);
+ removeVertex(outgoingVertex);
+ foundNodesToMerge = true;
+ break;
+ }
+ }
+ }
+ }
+
+ //
+ // X -> ABC -> Y
+ // -> aBC -> Y
+ //
+ // becomes
+ //
+ // X -> A -> BCY
+ // -> a -> BCY
+ //
+ public void mergeBranchingNodes() {
+ boolean foundNodesToMerge = true;
+ while( foundNodesToMerge ) {
+ foundNodesToMerge = false;
+
+ for( final SeqVertex v : vertexSet() ) {
+ foundNodesToMerge = simplifyDiamond(v);
+ if ( foundNodesToMerge )
+ break;
+ }
+ }
+ }
+
+ /**
+ * A simple structure that looks like:
+ *
+ * v
+ * / | \ \
+ * m1 m2 m3 ... mn
+ * \ | / /
+ * b
+ *
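+ * @param v the candidate root vertex
+ * @return true if v has two or more children, each with v as its only parent and all leading only to a single shared bottom vertex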
+ */
+ protected boolean isRootOfDiamond(final SeqVertex v) {
+ final Set<BaseEdge> ve = outgoingEdgesOf(v);
+ if ( ve.size() <= 1 )
+ return false;
+
+ SeqVertex bottom = null;
+ for ( final BaseEdge e : ve ) {
+ final SeqVertex mi = getEdgeTarget(e);
+
+ // every middle node must have at least one outgoing edge
+ if ( outDegreeOf(mi) < 1 )
+ return false;
+
+ // each middle node can have only one incoming edge, the one from the root vertex
+ if ( inDegreeOf(mi) != 1 )
+ return false;
+
+ for ( final SeqVertex mt : outgoingVerticesOf(mi) ) {
+ if ( bottom == null )
+ bottom = mt;
+ else if ( ! bottom.equals(mt) )
+ return false;
+ }
+ }
+
+ return true;
+ }
+
+ private byte[] commonSuffixOfEdgeTargets(final Set<SeqVertex> middleVertices) {
+ final String[] kmers = new String[middleVertices.size()];
+
+ int i = 0;
+ for ( final SeqVertex v : middleVertices ) {
+ kmers[i++] = (StringUtils.reverse(v.getSequenceString()));
+ }
+
+ final String commonPrefix = StringUtils.getCommonPrefix(kmers);
+ return commonPrefix.equals("") ? null : StringUtils.reverse(commonPrefix).getBytes();
+ }
+
+ private SeqVertex getDiamondBottom(final SeqVertex top) {
+ final BaseEdge topEdge = outgoingEdgesOf(top).iterator().next();
+ final SeqVertex middle = getEdgeTarget(topEdge);
+ final BaseEdge middleEdge = outgoingEdgesOf(middle).iterator().next();
+ return getEdgeTarget(middleEdge);
+ }
+
+ final Set getMiddleVertices(final SeqVertex top) {
+ final Set