Merge remote-tracking branch 'unstable/master'
commit ee63b59b52
@ -24,7 +24,7 @@ LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic
|
|||
|
||||
4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
Copyright 2012-2014 Broad Institute, Inc.
|
||||
Copyright 2012-2015 Broad Institute, Inc.
|
||||
Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
|
||||
|
|
|
|||
|
|
@ -1,4 +1,4 @@
|
|||
Copyright (c) 2012 The Broad Institute
|
||||
Copyright 2012-2015 Broad Institute, Inc.
|
||||
|
||||
Permission is hereby granted, free of charge, to any person
|
||||
obtaining a copy of this software and associated documentation
|
||||
|
|
|
|||
1
pom.xml
1
pom.xml
|
|
@ -161,6 +161,7 @@
|
|||
<configuration>
|
||||
<outputDirectory>${gatk.executable.directory}/lib</outputDirectory>
|
||||
<includeScope>runtime</includeScope>
|
||||
<useBaseVersion>false</useBaseVersion>
|
||||
</configuration>
|
||||
</execution>
|
||||
</executions>
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -1,44 +1,44 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
|
|
|
|||
|
|
@ -46,6 +46,11 @@
|
|||
<artifactId>fastutil</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>com.github.broadinstitute</groupId>
|
||||
<artifactId>picard</artifactId>
|
||||
</dependency>
|
||||
|
||||
<dependency>
|
||||
<groupId>${project.groupId}</groupId>
|
||||
<artifactId>gatk-utils</artifactId>
|
||||
|
|
|
|||
|
|
@ -70,22 +70,16 @@ public class GenotypeCalculationArgumentCollection implements Cloneable{
|
|||
/**
|
||||
* The expected heterozygosity value used to compute prior probability that a locus is non-reference.
|
||||
*
|
||||
* The default priors are for provided for humans:
|
||||
* From the heterozygosity we calculate the probability of N samples being hom-ref at a site as 1 - sum_i_2N (hets / i)
|
||||
* where hets is this case is analogous to the parameter theta from population genetics. See https://en.wikipedia.org/wiki/Coalescent_theory for more details.
|
||||
*
|
||||
* het = 1e-3
|
||||
* Note that heterozygosity as used here is the population genetics concept. (See http://en.wikipedia.org/wiki/Zygosity#Heterozygosity_in_population_genetics.
|
||||
* We also suggest the book "Population Genetics: A Concise Guide" by John H. Gillespie for further details on the theory.) That is, a hets value of 0.001
|
||||
* implies that two randomly chosen chromosomes from the population of organisms would differ from each other at a rate of 1 in 1000 bp.
|
||||
*
|
||||
* which means that the probability of N samples being hom-ref at a site is:
|
||||
* The default priors provided for humans (hets = 1e-3)
|
||||
*
|
||||
* 1 - sum_i_2N (het / i)
|
||||
*
|
||||
* Note that heterozygosity as used here is the population genetics concept:
|
||||
*
|
||||
* http://en.wikipedia.org/wiki/Zygosity#Heterozygosity_in_population_genetics
|
||||
*
|
||||
* That is, a hets value of 0.01 implies that two randomly chosen chromosomes from the population of organisms
|
||||
* would differ from each other (one being A and the other B) at a rate of 1 in 100 bp.
|
||||
*
|
||||
* Note that this quantity has nothing to do with the likelihood of any given sample having a heterozygous genotype,
|
||||
* Also note that this quantity has nothing to do with the likelihood of any given sample having a heterozygous genotype,
|
||||
* which in the GATK is purely determined by the probability of the observed data P(D | AB) under the model that there
|
||||
* may be a AB het genotype. The posterior probability of this AB genotype would use the het prior, but the GATK
|
||||
* only uses this posterior probability in determining the prob. that a site is polymorphic. So changing the
|
||||
|
|
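The Javadoc in this hunk states the hom-ref prior as 1 - sum_i_2N (hets / i). As a rough illustration, the sketch below evaluates that expression, reading the sum as running over i = 1..2N (an assumption about the notation). The class and method names are hypothetical and are not part of GATK; only the formula and the human default of 1e-3 come from the comment.

```java
// Hypothetical helper, not GATK code: evaluates the prior quoted in the Javadoc.
final class HomRefPriorSketch {

    /**
     * Returns 1 - sum_{i=1..2N} (hets / i), i.e. the prior probability that
     * all N diploid samples are hom-ref at a site.
     */
    static double homRefPrior(final double hets, final int nSamples) {
        if (hets <= 0.0 || nSamples < 1) {
            throw new IllegalArgumentException("hets must be > 0 and nSamples >= 1");
        }
        double nonRef = 0.0;
        // One term per possible non-zero alternate allele count among 2N chromosomes.
        for (int i = 1; i <= 2 * nSamples; i++) {
            nonRef += hets / i;
        }
        return 1.0 - nonRef;
    }

    public static void main(final String[] args) {
        // Human default from the comment: hets = 1e-3.
        System.out.println(homRefPrior(1e-3, 1));
        System.out.println(homRefPrior(1e-3, 100));
    }
}
```

Because the harmonic sum grows only logarithmically with N, the hom-ref prior stays close to 1 even for large cohorts under this formula.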
@ -95,13 +89,13 @@ public class GenotypeCalculationArgumentCollection implements Cloneable{
|
|||
* The quantity that changes whether the GATK considers the possibility of a het genotype at all is the ploidy,
|
||||
* which determines how many chromosomes each individual in the species carries.
|
||||
*/
|
||||
@Argument(fullName = "heterozygosity", shortName = "hets", doc = "Heterozygosity value used to compute prior likelihoods for any locus. See the GATKDocs for full details on the meaning of this population genetics concept", required = false)
|
||||
@Argument(fullName = "heterozygosity", shortName = "hets", doc = "Heterozygosity value used to compute prior likelihoods for any locus", required = false)
|
||||
public Double snpHeterozygosity = HomoSapiensConstants.SNP_HETEROZYGOSITY;
|
||||
|
||||
/**
|
||||
* This argument informs the prior probability of having an indel at a site.
|
||||
*/
|
||||
@Argument(fullName = "indel_heterozygosity", shortName = "indelHeterozygosity", doc = "Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on the meaning of this population genetics concept", required = false)
|
||||
@Argument(fullName = "indel_heterozygosity", shortName = "indelHeterozygosity", doc = "Heterozygosity for indel calling", required = false)
|
||||
public double indelHeterozygosity = HomoSapiensConstants.INDEL_HETEROZYGOSITY;
|
||||
|
||||
/**
|
||||
|
|
@ -135,12 +129,13 @@ public class GenotypeCalculationArgumentCollection implements Cloneable{
|
|||
* see e.g. Waterson (1975) or Tajima (1996).
|
||||
* This model asserts that the probability of having a population of k variant sites in N chromosomes is proportional to theta/k, for 1=1:N
|
||||
*
|
||||
* There are instances where using this prior might not be desireable, e.g. for population studies where prior might not be appropriate,
|
||||
* There are instances where using this prior might not be desirable, e.g. for population studies where prior might not be appropriate,
|
||||
* as for example when the ancestral status of the reference allele is not known.
|
||||
* By using this argument, user can manually specify priors to be used for calling as a vector for doubles, with the following restriciotns:
|
||||
* By using this argument, the user can manually specify a list of probabilities for each AC>1 to be used as priors for genotyping,
|
||||
* with the following restrictions:
|
||||
* a) User must specify 2N values, where N is the number of samples.
|
||||
* b) Only diploid calls supported.
|
||||
* c) Probability values are specified in double format, in linear space.
|
||||
* c) Probability values are specified in Double format, in linear space (not log10 space or Phred-scale).
|
||||
* d) No negative values allowed.
|
||||
* e) Values will be added and Pr(AC=0) will be 1-sum, so that they sum up to one.
|
||||
* f) If user-defined values add to more than one, an error will be produced.
|
||||
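Restrictions (a) through (f) above can be read as a small validation routine. The following is a minimal sketch under the assumption that the 2N user values correspond to AC = 1..2N; the class and method names are hypothetical and do not reflect how GATK actually parses this argument.

```java
// Hypothetical helper, not GATK code: applies restrictions (a), (d), (e) and (f).
final class InputPriorSketch {

    /**
     * Builds [Pr(AC=0), Pr(AC=1), ..., Pr(AC=2N)] from 2N user-supplied
     * probabilities given in linear space (not log10 or Phred-scaled).
     */
    static double[] fullPriors(final double[] userPriors, final int nSamples) {
        // (a) exactly 2N values for N diploid samples
        if (userPriors.length != 2 * nSamples) {
            throw new IllegalArgumentException("expected " + (2 * nSamples) + " prior values");
        }
        double sum = 0.0;
        for (final double p : userPriors) {
            // (d) no negative values (NaN is rejected as well)
            if (Double.isNaN(p) || p < 0.0) {
                throw new IllegalArgumentException("priors must be non-negative");
            }
            sum += p;
        }
        // (f) user values may not add up to more than one
        if (sum > 1.0) {
            throw new IllegalArgumentException("priors sum to more than one: " + sum);
        }
        // (e) Pr(AC=0) takes the remaining probability mass
        final double[] full = new double[userPriors.length + 1];
        full[0] = 1.0 - sum;
        System.arraycopy(userPriors, 0, full, 1, userPriors.length);
        return full;
    }

    public static void main(final String[] args) {
        System.out.println(java.util.Arrays.toString(fullPriors(new double[] {0.01, 0.005}, 1)));
    }
}
```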
|
|
|
|||
|
|
@ -82,7 +82,7 @@ public class BQSRReadTransformer extends ReadTransformer {
|
|||
// Although we could add this check to the apply() method below, it's kind of ugly and inefficient.
|
||||
// The call here would be: RecalUtils.checkForInvalidRecalBams(engine.getSAMFileHeaders(), engine.getArguments().ALLOW_BQSR_ON_REDUCED_BAMS);
|
||||
final BQSRArgumentSet args = engine.getBQSRArgumentSet();
|
||||
this.bqsr = new BaseRecalibration(args.getRecalFile(), args.getQuantizationLevels(), args.shouldDisableIndelQuals(), args.getPreserveQscoresLessThan(), args.shouldEmitOriginalQuals(), args.getGlobalQScorePrior());
|
||||
this.bqsr = new BaseRecalibration(args.getRecalFile(), args.getQuantizationLevels(), args.shouldDisableIndelQuals(), args.getPreserveQscoresLessThan(), args.shouldEmitOriginalQuals(), args.getGlobalQScorePrior(), args.getStaticQuantizedQuals(), args.getRoundDown());
|
||||
}
|
||||
final BQSRMode mode = WalkerManager.getWalkerAnnotation(walker, BQSRMode.class);
|
||||
return mode.ApplicationTime();
|
||||
|
|
|
|||
|
|
@ -62,6 +62,8 @@ import org.broadinstitute.gatk.utils.recalibration.EventType;
|
|||
import org.broadinstitute.gatk.engine.recalibration.covariates.Covariate;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
|
||||
import java.util.Collections;
|
||||
import java.util.Iterator;
|
||||
import java.io.File;
|
||||
import java.util.ArrayList;
|
||||
import java.util.List;
|
||||
|
|
@ -86,6 +88,8 @@ public class BaseRecalibration {
|
|||
private final double globalQScorePrior;
|
||||
private final boolean emitOriginalQuals;
|
||||
|
||||
private byte[] staticQuantizedMapping = null;
|
||||
|
||||
/**
|
||||
* Constructor using a GATK Report file
|
||||
*
|
||||
|
|
@ -93,8 +97,9 @@ public class BaseRecalibration {
|
|||
* @param quantizationLevels number of bins to quantize the quality scores
|
||||
* @param disableIndelQuals if true, do not emit base indel qualities
|
||||
* @param preserveQLessThan preserve quality scores less than this value
|
||||
* @param staticQuantizedQuals static quantized bins for quality scores
|
||||
*/
|
||||
public BaseRecalibration(final File RECAL_FILE, final int quantizationLevels, final boolean disableIndelQuals, final int preserveQLessThan, final boolean emitOriginalQuals, final double globalQScorePrior) {
|
||||
public BaseRecalibration(final File RECAL_FILE, final int quantizationLevels, final boolean disableIndelQuals, final int preserveQLessThan, final boolean emitOriginalQuals, final double globalQScorePrior, final List<Integer> staticQuantizedQuals, final boolean roundDown) {
|
||||
RecalibrationReport recalibrationReport = new RecalibrationReport(RECAL_FILE);
|
||||
|
||||
recalibrationTables = recalibrationReport.getRecalibrationTables();
|
||||
|
|
@ -109,6 +114,15 @@ public class BaseRecalibration {
|
|||
this.preserveQLessThan = preserveQLessThan;
|
||||
this.globalQScorePrior = globalQScorePrior;
|
||||
this.emitOriginalQuals = emitOriginalQuals;
|
||||
|
||||
// staticQuantizedQuals is entirely separate from the dynamic binning that quantizationLevels, and
|
||||
// staticQuantizedQuals does not make use of quantizationInfo
|
||||
if(staticQuantizedQuals != null) {
|
||||
if(staticQuantizedQuals.isEmpty()) {
|
||||
throw new IllegalStateException("List of static quantized quals is empty.");
|
||||
}
|
||||
staticQuantizedMapping = constructStaticQuantizedMapping(staticQuantizedQuals, roundDown);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
@ -184,7 +198,13 @@ public class BaseRecalibration {
|
|||
// return the quantized version of the recalibrated quality
|
||||
final byte recalibratedQualityScore = quantizationInfo.getQuantizedQuals().get(recalibratedQual);
|
||||
|
||||
quals[offset] = recalibratedQualityScore;
|
||||
// Bin to static quals
|
||||
if(staticQuantizedMapping != null) {
|
||||
quals[offset] = staticQuantizedMapping[recalibratedQualityScore];
|
||||
}
|
||||
else {
|
||||
quals[offset] = recalibratedQualityScore;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -194,6 +214,67 @@ public class BaseRecalibration {
        }
    }

    /**
     * Constructs an array that maps particular quantized values to a rounded value in staticQuantizedQuals
     *
     * Rounding is done in probability space. When roundDown is true, we simply round down to the nearest
     * available qual in staticQuantizedQuals
     *
     * @param staticQuantizedQuals the list of qualities to round to
     * @param roundDown round down if true, round to nearest (in probability space) otherwise
     * @return Array where index representing the quality score to be mapped and the value is the rounded quality score
     */
    protected static byte[] constructStaticQuantizedMapping(List<Integer> staticQuantizedQuals, boolean roundDown) {
        // Create array mapping that maps quals to their rounded value.
        byte[] mapping = new byte[QualityUtils.MAX_QUAL];

        Collections.sort(staticQuantizedQuals);
        Iterator<Integer> quantizationIterator = staticQuantizedQuals.iterator();

        // Fill mapping with one-to-one mappings for values between 0 and MIN_USABLE_Q_SCORE
        // This ensures that quals used as special codes will be preserved
        for(int i = 0 ; i < QualityUtils.MIN_USABLE_Q_SCORE ; i++) {
            mapping[i] = (byte) i;
        }

        // If only one staticQuantizedQual is given, fill mappings larger than QualityUtils.MAX_QUAL with that value
        if(staticQuantizedQuals.size() == 1) {
            int onlyQual = quantizationIterator.next();
            for(int i = QualityUtils.MIN_USABLE_Q_SCORE ; i < QualityUtils.MAX_QUAL ; i++) {
                mapping[i] = (byte) onlyQual;
            }
            return mapping;
        }

        int firstQual = QualityUtils.MIN_USABLE_Q_SCORE;
        int previousQual = firstQual;
        double previousProb = QualityUtils.qualToProb(previousQual);
        while(quantizationIterator.hasNext()) {
            final int nextQual = quantizationIterator.next();
            final double nextProb = QualityUtils.qualToProb(nextQual);

            for (int i = previousQual ; i < nextQual ; i++) {
                if (roundDown) {
                    mapping[i] = (byte) previousQual;
                } else {
                    final double iProb = QualityUtils.qualToProb(i);
                    if ((iProb - previousProb) > (nextProb - iProb)) {
                        mapping[i] = (byte) nextQual;
                    } else {
                        mapping[i] = (byte) previousQual;
                    }
                }
            }
            previousQual = nextQual;
            previousProb = nextProb;
        }
        // Round all quals larger than the largest static qual down to the largest static qual
        for(int j = previousQual ; j < QualityUtils.MAX_QUAL ; j++) {
            mapping[j] = (byte) previousQual;
        }
        return mapping;
    }

    @Ensures("result > 0.0")
    protected static double hierarchicalBayesianQualityEstimate( final double epsilon, final RecalDatum empiricalQualRG, final RecalDatum empiricalQualQS, final List<RecalDatum> empiricalQualCovs ) {
        final double globalDeltaQ = ( empiricalQualRG == null ? 0.0 : empiricalQualRG.getEmpiricalQuality(epsilon) - epsilon );
|
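To make the "rounding in probability space" behaviour of constructStaticQuantizedMapping easier to follow, here is a simplified, self-contained sketch of the same idea. It is not the commit's code. The constants MIN_USABLE_Q and MAX_Q and the local qualToProb helper are assumed stand-ins for the QualityUtils members, and unlike the method in this hunk the sketch does not special-case a single-bin list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not GATK code.
final class StaticQuantizationSketch {
    static final int MIN_USABLE_Q = 6;  // assumed stand-in for QualityUtils.MIN_USABLE_Q_SCORE
    static final int MAX_Q = 93;        // assumed upper bound, chosen for this sketch only

    /** Probability that a base with Phred quality q is correct: 1 - 10^(-q/10). */
    static double qualToProb(final int q) {
        return 1.0 - Math.pow(10.0, -q / 10.0);
    }

    /** Rounds a usable quality q to one of the sorted bins. */
    static int roundToBin(final List<Integer> sortedBins, final int q, final boolean roundDown) {
        int lower = MIN_USABLE_Q;   // floor used when q lies below every bin
        Integer upper = null;       // smallest bin strictly above q, if any
        for (final int bin : sortedBins) {
            if (bin <= q) {
                lower = Math.max(lower, bin);
            } else {
                upper = bin;
                break;
            }
        }
        if (roundDown || upper == null) {
            return lower;
        }
        // Compare distances between probabilities, not between Phred scores.
        final double pq = qualToProb(q);
        return (pq - qualToProb(lower)) > (qualToProb(upper) - pq) ? upper : lower;
    }

    /** Builds a lookup table: index is the input quality, value is the binned quality. */
    static byte[] buildMapping(final List<Integer> bins, final boolean roundDown) {
        if (bins.isEmpty()) {
            throw new IllegalArgumentException("List of static quantized quals is empty.");
        }
        final List<Integer> sorted = new ArrayList<>(bins); // copy so the caller's list is untouched
        Collections.sort(sorted);
        final byte[] mapping = new byte[MAX_Q + 1];
        for (int q = 0; q <= MAX_Q; q++) {
            // Qualities below the usable minimum map to themselves.
            mapping[q] = (byte) (q < MIN_USABLE_Q ? q : roundToBin(sorted, q, roundDown));
        }
        return mapping;
    }

    public static void main(final String[] args) {
        final byte[] nearest = buildMapping(List.of(10, 20, 30, 40), false);
        System.out.println("Q12 -> " + nearest[12] + ", Q13 -> " + nearest[13]);
    }
}
```

With bins 10 and 20, the switch from the lower to the upper bin happens between Q12 and Q13 rather than at the Phred midpoint of 15, because the comparison is made on the probabilities and those are compressed toward 1 at higher qualities.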
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -75,18 +75,16 @@ public class RecalibrationArgumentCollection implements Cloneable {
|
|||
|
||||
/**
|
||||
* This algorithm treats every reference mismatch as an indication of error. However, real genetic variation is expected to mismatch the reference,
|
||||
* so it is critical that a database of known polymorphic sites is given to the tool in order to skip over those sites. This tool accepts any number of RodBindings (VCF, Bed, etc.)
|
||||
* for use as this database. For users wishing to exclude an interval list of known variation simply use -XL my.interval.list to skip over processing those sites.
|
||||
* Please note however that the statistics reported by the tool will not accurately reflected those sites skipped by the -XL argument.
|
||||
* so it is critical that a database of known polymorphic sites (e.g. dbSNP) is given to the tool in order to mask out those sites.
|
||||
*/
|
||||
@Input(fullName = "knownSites", shortName = "knownSites", doc = "A database of known polymorphic sites to skip over in the recalibration algorithm", required = false)
|
||||
@Input(fullName = "knownSites", shortName = "knownSites", doc = "A database of known polymorphic sites", required = false)
|
||||
public List<RodBinding<Feature>> knownSites = Collections.emptyList();
|
||||
|
||||
/**
|
||||
* After the header, data records occur one per line until the end of the file. The first several items on a line are the
|
||||
* values of the individual covariates and will change depending on which covariates were specified at runtime. The last
|
||||
* three items are the data- that is, number of observations for this combination of covariates, number of reference mismatches,
|
||||
* and the raw empirical quality score calculated by phred-scaling the mismatch rate. Use '/dev/stdout' to print to standard out.
|
||||
* and the raw empirical quality score calculated by phred-scaling the mismatch rate.
|
||||
*/
|
||||
@Gather(BQSRGatherer.class)
|
||||
@Output(doc = "The output recalibration table file to create", required = true)
|
||||
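The Javadoc above describes the last column of each record as the mismatch rate expressed on the Phred scale. As a rough illustration only, the sketch below computes Q = -10 * log10(mismatches / observations). The class name `PhredScaleSketch`, the method name and the `MAX_REPORTED_QUAL` cap used when no mismatches were seen are hypothetical, and the actual tool may smooth or cap the estimate differently.

```java
class PhredScaleSketch {
    // Hypothetical cap returned when no mismatches were observed; not taken from the source.
    static final double MAX_REPORTED_QUAL = 93.0;

    // Phred-scales the raw mismatch rate: Q = -10 * log10(mismatches / observations).
    static double rawEmpiricalQuality(final long observations, final long mismatches) {
        if (observations <= 0 || mismatches < 0 || mismatches > observations) {
            throw new IllegalArgumentException("need 0 <= mismatches <= observations and observations > 0");
        }
        if (mismatches == 0) {
            // log10(0) is undefined, so this sketch falls back to the assumed cap.
            return MAX_REPORTED_QUAL;
        }
        final double mismatchRate = (double) mismatches / (double) observations;
        return -10.0 * Math.log10(mismatchRate);
    }
}
```

For instance, 10 mismatches in 1000 observations is a mismatch rate of 0.01, which corresponds to roughly Q20.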
|
|
@ -107,7 +105,7 @@ public class RecalibrationArgumentCollection implements Cloneable {
|
|||
@Argument(fullName = "covariate", shortName = "cov", doc = "One or more covariates to be used in the recalibration. Can be specified multiple times", required = false)
|
||||
public String[] COVARIATES = null;
|
||||
|
||||
/*
|
||||
/**
|
||||
* The Cycle and Context covariates are standard and are included by default unless this argument is provided.
|
||||
* Note that the ReadGroup and QualityScore covariates are required and cannot be excluded.
|
||||
*/
|
||||
|
|
|
|||
|
|
|
|
@ -113,6 +113,11 @@ public abstract class RepeatCovariate implements ExperimentalCovariate {
|
|||
|
||||
}
|
||||
|
||||
/**
|
||||
* Please use {@link org.broadinstitute.gatk.utils.variant.TandemRepeatFinder#findMostRelevantTandemRepeatUnitAt(int)}
|
||||
* @deprecated
|
||||
*/
|
||||
@Deprecated
|
||||
public Pair<byte[], Integer> findTandemRepeatUnits(byte[] readBases, int offset) {
|
||||
int maxBW = 0;
|
||||
byte[] bestBWRepeatUnit = new byte[]{readBases[offset]};
|
||||
|
|
|
|||
|
|
@ -0,0 +1,128 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools;
|
||||
|
||||
import htsjdk.samtools.util.IOUtil;
|
||||
import org.broadinstitute.gatk.engine.recalibration.BQSRGatherer;
|
||||
import picard.cmdline.CommandLineProgram;
|
||||
import picard.cmdline.CommandLineProgramProperties;
|
||||
import picard.cmdline.Option;
|
||||
import picard.cmdline.StandardOptionDefinitions;
|
||||
|
||||
import java.io.File;
|
||||
import java.util.List;
|
||||
|
||||
/**
|
||||
* Gather recalibration reports from parallelized base recalibration runs
|
||||
*
|
||||
* This tool is intended to be used to combine recalibration tables from runs of BaseRecalibrator parallelized per-interval.
|
||||
* The combination is done simply by adding up all observations and errors.
|
||||
*
|
||||
* <h3>Usage</h3>
|
||||
* <p>Note that this is a command-line utility that bypasses the GATK engine. As a result, the command line you must use to
|
||||
* invoke it is a little different from other GATK tools (see example below), and it does not accept any of the
|
||||
* classic "CommandLineGATK" arguments.</p>
|
||||
*
|
||||
* <h4>Input</h4>
|
||||
* List of scattered BQSR files
|
||||
*
|
||||
* <h4>Output</h4>
|
||||
* Combined recalibration table in GATKReport format.
|
||||
*
|
||||
* <h4>Command</h4>
|
||||
* <pre>
|
||||
* java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.GatherBqsrReports \
|
||||
* -I input.list \
|
||||
* -O output.grp
|
||||
* </pre>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>This method DOES NOT recalculate the empirical qualities and quantized qualities. You have to recalculate
|
||||
* them after combining. They are not calculated here because this tool is intended for combining a
|
||||
* series of recalibration reports, and it only makes sense to calculate the empirical qualities and quantized
|
||||
* qualities after all the recalibration reports have been combined. This is done to make the tool faster.
|
||||
* </li>
|
||||
* <li>The reported empirical quality is recalculated (because it is so simple to do).</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
|
||||
@CommandLineProgramProperties(
|
||||
usage = "Gathers scattered BQSR recalibration reports into a single file",
|
||||
usageShort = "Gathers scattered BQSR recalibration reports into a single file"
|
||||
)
|
||||
public class GatherBqsrReports extends CommandLineProgram {
|
||||
@Option(shortName = StandardOptionDefinitions.INPUT_SHORT_NAME, doc="List of scattered BQSR files")
|
||||
public List<File> INPUT;
|
||||
|
||||
@Option(shortName = StandardOptionDefinitions.OUTPUT_SHORT_NAME, doc="File to output the gathered file to")
|
||||
public File OUTPUT;
|
||||
|
||||
public static void main(final String[] args) {
|
||||
new GatherBqsrReports().instanceMainWithExit(args);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected int doWork() {
|
||||
for (final File report : INPUT) {
|
||||
IOUtil.assertFileIsReadable(report);
|
||||
}
|
||||
|
||||
IOUtil.assertFileIsWritable(OUTPUT);
|
||||
|
||||
new BQSRGatherer().gather(INPUT, OUTPUT);
|
||||
|
||||
return 0;
|
||||
}
|
||||
}
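The Javadoc above says the combination is done "simply by adding up all observations and errors". A minimal sketch of that idea follows. It is an illustration only and not the `BQSRGatherer` implementation: the `Counts` type, the string bin keys, and the plain `-10 * log10(errors / observations)` quality formula are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of gathering scattered recalibration counts; not GATK code. */
final class RecalTableGatherSketch {

    /** Observation and error counts for one covariate bin (hypothetical type). */
    static final class Counts {
        final long observations;
        final double errors;

        Counts(final long observations, final double errors) {
            this.observations = observations;
            this.errors = errors;
        }

        /** Combine two bins by summing their counts. */
        Counts plus(final Counts other) {
            return new Counts(observations + other.observations, errors + other.errors);
        }

        /**
         * Naive Phred-scaled quality from the summed counts (assumed formula).
         * Returns NaN for an empty bin and Infinity when there are no errors;
         * a production tool would smooth or cap these cases.
         */
        double empiricalQuality() {
            if (observations == 0) {
                return Double.NaN;
            }
            return -10.0 * Math.log10(errors / observations);
        }
    }

    /** Merge per-interval tables by adding the counts of bins that share a key. */
    static Map<String, Counts> gather(final List<Map<String, Counts>> tables) {
        final Map<String, Counts> combined = new LinkedHashMap<>();
        for (final Map<String, Counts> table : tables) {
            for (final Map.Entry<String, Counts> entry : table.entrySet()) {
                combined.merge(entry.getKey(), entry.getValue(), Counts::plus);
            }
        }
        return combined;
    }

    public static void main(final String[] args) {
        final Map<String, Counts> interval1 = new LinkedHashMap<>();
        interval1.put("RG1:Q30", new Counts(600, 6.0));
        final Map<String, Counts> interval2 = new LinkedHashMap<>();
        interval2.put("RG1:Q30", new Counts(400, 4.0));
        interval2.put("RG1:Q20", new Counts(100, 1.0));

        // Qualities are derived only after every table has been summed.
        for (final Map.Entry<String, Counts> e : gather(Arrays.asList(interval1, interval2)).entrySet()) {
            System.out.println(e.getKey() + " obs=" + e.getValue().observations
                    + " err=" + e.getValue().errors + " Q=" + e.getValue().empiricalQuality());
        }
    }
}
```

Deriving the quality only from the summed counts mirrors the caveat in the Javadoc: a quality computed per interval and then averaged would generally not match one computed from the pooled counts.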
|
||||
|
||||
|
|
@ -0,0 +1,109 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.sam.ReadUtils;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.Arrays;
|
||||
import java.util.List;
|
||||
|
||||
|
||||
/**
|
||||
* Allele-specific Rank Sum Test of REF versus ALT base quality scores
|
||||
*
|
||||
* <p>This variant-level annotation compares the base qualities of the data supporting the reference allele with those supporting each alternate allele. To be clear, it does so separately for each alternate allele. </p>
|
||||
*
|
||||
* <p>The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the bases supporting the alternate allele have lower quality scores than those supporting the reference allele. Conversely, a positive value indicates that the bases supporting the alternate allele have higher quality scores than those supporting the reference allele. Finding a statistically significant difference either way suggests that the sequencing process may have been biased or affected by an artifact.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for base qualities (bases supporting REF vs. bases supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>Uninformative reads are not used in these calculations.</li>
|
||||
* <li>The base quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_BaseQualityRankSumTest.php">BaseQualityRankSumTest</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class AS_BaseQualityRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation {
|
||||
@Override
|
||||
public List<String> getKeyNames() {
|
||||
return Arrays.asList(GATKVCFConstants.AS_BASE_QUAL_RANK_SUM_KEY);
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getRawKeyName() { return GATKVCFConstants.AS_RAW_BASE_QUAL_RANK_SUM_KEY;}
|
||||
|
||||
/**
|
||||
* Get the element for the given read at the given reference position
|
||||
*
|
||||
* @param read the read
|
||||
* @param refLoc the reference position
|
||||
* @return a Double representing the element to be used in the rank sum test, or null if it should not be used
|
||||
*/
|
||||
@Override
|
||||
protected Double getElementForRead(final GATKSAMRecord read, final int refLoc) {
|
||||
return (double) read.getBaseQualities()[ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, refLoc, ReadUtils.ClippingTail.RIGHT_TAIL)];
|
||||
}
|
||||
|
||||
}
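The statistical note above describes the annotation value as a z-approximation derived from the Mann-Whitney U statistic. The following self-contained sketch shows the textbook normal approximation under stated assumptions: it is not the GATK implementation, the class and method names are hypothetical, ties contribute one half to U, and no tie correction is applied to the variance.

```java
/** Hypothetical sketch of a Mann-Whitney U z-approximation; not GATK code. */
final class RankSumSketch {

    /**
     * Returns z = (U - mean) / sd, where U counts (alt, ref) pairs in which the
     * ALT value is larger, plus one half for each tied pair. A negative result
     * means the ALT values tend to be lower than the REF values.
     */
    static double zScore(final double[] alt, final double[] ref) {
        final int n1 = alt.length;
        final int n2 = ref.length;
        if (n1 == 0 || n2 == 0) {
            throw new IllegalArgumentException("both samples must be non-empty");
        }

        double u = 0.0;
        for (final double a : alt) {
            for (final double r : ref) {
                if (a > r) {
                    u += 1.0;
                } else if (a == r) {
                    u += 0.5;
                }
            }
        }

        // Mean and standard deviation of U under the null hypothesis (no tie correction).
        final double mean = n1 * (double) n2 / 2.0;
        final double sd = Math.sqrt(n1 * (double) n2 * (n1 + n2 + 1) / 12.0);
        return (u - mean) / sd;
    }

    public static void main(final String[] args) {
        final double[] altQuals = {10, 11, 12};
        final double[] refQuals = {30, 31, 32};
        // ALT qualities are uniformly lower here, so the printed z is negative.
        System.out.println(zScore(altQuals, refQuals));
    }
}
```

The pairwise loop is quadratic in the number of reads, which keeps the sketch short; a rank-based computation would scale better for deep coverage.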
|
||||
|
|
@ -0,0 +1,154 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
|
||||
import java.util.Collections;
|
||||
import java.util.HashMap;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
|
||||
/**
|
||||
* Allele-specific strand bias estimated using Fisher's Exact Test
|
||||
*
|
||||
* <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other.</p>
|
||||
*
|
||||
* <p>The AS_FisherStrand annotation is one of several methods that aim to evaluate whether there is strand bias in the data. It uses Fisher's Exact Test to determine if there is strand bias between forward and reverse strands for the reference or alternate allele, and does so separately for each alternate allele.</p>
|
||||
* <p>The output is a Phred-scaled p-value. The higher the output value, the more likely there is to be bias. More bias is indicative of false positive calls.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this application of Fisher's Exact Test.</p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>The FisherStrand test may not be calculated for certain complex indel cases or for multi-allelic sites.</li>
|
||||
* <li>FisherStrand is best suited for low coverage situations. For testing strand bias in higher coverage situations, see the StrandOddsRatio annotation.</li>
|
||||
* </ul>
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_FisherStrand.php">FisherStrand</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandBiasBySample.php">StrandBiasBySample</a></b> outputs counts of read depth per allele for each strand orientation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandOddsRatio.php">StrandOddsRatio</a></b> is an updated form of FisherStrand that uses a symmetric odds ratio calculation.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class AS_FisherStrand extends AS_StrandBiasTest implements AS_StandardAnnotation {
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() {
|
||||
return Collections.singletonList(GATKVCFConstants.AS_FISHER_STRAND_KEY);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Map<String, Object> calculateAnnotationFromLikelihoodMap(final Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap,
|
||||
final VariantContext vc) {
|
||||
// either SNP with no alignment context, or indels: per-read likelihood map needed
|
||||
final int[][] table = getContingencyTable(stratifiedPerReadAlleleLikelihoodMap, vc, MIN_COUNT);
|
||||
//logger.info("VC " + vc);
|
||||
//printTable(table, 0.0);
|
||||
return pValueAnnotationForBestTable(table, null);
|
||||
}
|
||||
|
||||
/**
|
||||
* Create an annotation for the highest (i.e., least significant) p-value of table1 and table2
|
||||
*
|
||||
* @param table1 a contingency table, may be null
|
||||
* @param table2 a contingency table, may be null
|
||||
* @return annotation result for FS given tables
|
||||
*/
|
||||
private Map<String, Object> pValueAnnotationForBestTable(final int[][] table1, final int[][] table2) {
|
||||
if ( table2 == null )
|
||||
return table1 == null ? null : annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1));
|
||||
else if (table1 == null)
|
||||
return annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2));
|
||||
else { // take the one with the best (i.e., least significant) p-value
|
||||
double pvalue1 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1);
|
||||
double pvalue2 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2);
|
||||
return annotationForOneTable(Math.max(pvalue1, pvalue2));
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns an annotation result given a pValue
|
||||
*
|
||||
* @param pValue the p-value to report; values below MIN_PVALUE are clamped to MIN_PVALUE before Phred scaling
|
||||
* @return a singleton map from the annotation key (see getKeyNames()) to the Phred-scaled p-value, formatted to three decimal places
|
||||
*/
|
||||
protected Map<String, Object> annotationForOneTable(final double pValue) {
|
||||
final Object value = String.format("%.3f", QualityUtils.phredScaleErrorRate(Math.max(pValue, MIN_PVALUE))); // prevent INFINITYs
|
||||
return Collections.singletonMap(getKeyNames().get(0), value);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Map<Allele,Double> calculateReducedData(AlleleSpecificAnnotationData<List<Integer>> combinedData) {
|
||||
final Map<Allele,Double> annotationMap = new HashMap<>();
|
||||
final Map<Allele,List<Integer>> perAlleleData = combinedData.getAttributeMap();
|
||||
final List<Integer> refStrandCounts = perAlleleData.get(combinedData.getRefAllele());
|
||||
for (final Allele a : perAlleleData.keySet()) {
|
||||
if(a.equals(combinedData.getRefAllele(),true))
|
||||
continue;
|
||||
final List<Integer> altStrandCounts = combinedData.getAttribute(a);
|
||||
final int[][] refAltTable = new int[][] {new int[]{refStrandCounts.get(0),refStrandCounts.get(1)},new int[]{altStrandCounts.get(0),altStrandCounts.get(1)}};
|
||||
annotationMap.put(a,QualityUtils.phredScaleErrorRate(Math.max(StrandBiasTableUtils.FisherExactPValueForContingencyTable(refAltTable), MIN_PVALUE)));
|
||||
}
|
||||
return annotationMap;
|
||||
}
|
||||
|
||||
|
||||
}
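The two steps the class above relies on, a Fisher's Exact Test on a 2x2 strand table followed by Phred scaling, can be sketched in isolation. This is a minimal textbook version, not the code behind `StrandBiasTableUtils.FisherExactPValueForContingencyTable` or `QualityUtils.phredScaleErrorRate`, whose internals are not shown in this diff. The class name `FisherStrandSketch`, its method names, and the `MIN_PVALUE` value used here are all assumptions made for illustration.

```java
// Hypothetical, self-contained sketch; names and the MIN_PVALUE value are assumptions.
final class FisherStrandSketch {
    // Assumed floor that keeps the Phred-scaled value finite.
    static final double MIN_PVALUE = 1e-320;

    // Natural log of n!, computed by direct summation (fine for small read counts).
    static double logFactorial(final int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) sum += Math.log(i);
        return sum;
    }

    // Log hypergeometric probability of the 2x2 table [[a, b], [c, d]] with fixed margins.
    static double logTableProbability(final int a, final int b, final int c, final int d) {
        return logFactorial(a + b) + logFactorial(c + d) + logFactorial(a + c) + logFactorial(b + d)
                - logFactorial(a) - logFactorial(b) - logFactorial(c) - logFactorial(d)
                - logFactorial(a + b + c + d);
    }

    // Two-sided p-value: sum the probabilities of every table with the same margins
    // that is no more likely than the observed one.
    static double twoSidedPValue(final int[][] table) {
        final int a = table[0][0], b = table[0][1], c = table[1][0], d = table[1][1];
        final int row1 = a + b, col1 = a + c, n = a + b + c + d;
        final double observed = logTableProbability(a, b, c, d);
        final int lo = Math.max(0, col1 - (n - row1));
        final int hi = Math.min(row1, col1);
        double p = 0.0;
        for (int x = lo; x <= hi; x++) {
            final double lp = logTableProbability(x, row1 - x, col1 - x, n - row1 - col1 + x);
            if (lp <= observed + 1e-7) p += Math.exp(lp); // small tolerance for ties
        }
        return Math.min(1.0, p);
    }

    // Phred scaling with a floor on the p-value; Math.abs avoids a negative zero when p is 1.0.
    static double phredScale(final double pValue) {
        return Math.abs(-10.0 * Math.log10(Math.max(pValue, MIN_PVALUE)));
    }

    public static void main(final String[] args) {
        // Rows: ref, alt. Columns: forward, reverse strand counts.
        final int[][] biased = {{5, 0}, {0, 5}};
        final int[][] balanced = {{10, 10}, {10, 10}};
        System.out.println(String.format("%.3f", phredScale(twoSidedPValue(biased))));
        System.out.println(String.format("%.3f", phredScale(twoSidedPValue(balanced))));
    }
}
```

For the fully segregated table `{{5, 0}, {0, 5}}`, only the two extreme tables are as unlikely as the observed one, so the p-value is 2/252. A balanced table gives a p-value of 1 and a Phred-scaled score of 0.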
|
||||
|
|
@ -0,0 +1,179 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFConstants;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.engine.walkers.Walker;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Allele-specific likelihood-based test for inbreeding among samples
|
||||
*
|
||||
* <p>This annotation estimates whether there is evidence of inbreeding in a population. The higher the score, the higher the chance that there is inbreeding.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The calculation is a continuous generalization of the Hardy-Weinberg test for disequilibrium that works well with limited coverage per sample. The output is the F statistic from running the HW test for disequilibrium with PL values. See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this statistical test.</p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>The inbreeding coefficient can only be calculated for cohorts containing at least 10 founder samples.</li>
|
||||
* <li>This annotation can take a valid pedigree file to specify founders. If not specified, all samples will be considered as founders.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_InbreedingCoeff.php">InbreedingCoeff</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_ExcessHet.php">ExcessHet</a></b> estimates excess heterozygosity in a population of samples.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
//TODO: this can't extend InbreedingCoeff because that one is Standard and it would force this to be output all the time; should fix code duplication nonetheless
|
||||
public class AS_InbreedingCoeff extends InfoFieldAnnotation implements AS_StandardAnnotation {
|
||||
|
||||
private final static Logger logger = Logger.getLogger(AS_InbreedingCoeff.class);
|
||||
protected static final int MIN_SAMPLES = 10;
|
||||
private Set<String> founderIds;
|
||||
private boolean didUniquifiedSampleNameCheck = false;
|
||||
final private boolean RETURN_ROUNDED = false;
|
||||
protected HeterozygosityUtils heterozygosityUtils;
|
||||
|
||||
@Override
|
||||
public void initialize ( AnnotatorCompatible walker, GenomeAnalysisEngine toolkit, Set<VCFHeaderLine> headerLines ) {
|
||||
// If available, get the founder IDs and cache them; the IC will then be computed on founders only.
|
||||
if(founderIds == null && walker != null) {
|
||||
founderIds = ((Walker) walker).getSampleDB().getFounderIds();
|
||||
}
|
||||
if(walker != null && (((Walker) walker).getSampleDB().getSamples().size() < MIN_SAMPLES || (!founderIds.isEmpty() && founderIds.size() < MIN_SAMPLES)))
|
||||
logger.warn("Annotation will not be calculated. InbreedingCoeff requires at least " + MIN_SAMPLES + " unrelated samples.");
|
||||
// initialize a HeterozygosityUtils before annotating for use in unit tests
|
||||
heterozygosityUtils = new HeterozygosityUtils(RETURN_ROUNDED);
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() { return Collections.singletonList(GATKVCFConstants.AS_INBREEDING_COEFFICIENT_KEY); }
|
||||
|
||||
@Override
|
||||
public List<VCFInfoHeaderLine> getDescriptions() { return Collections.singletonList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0))); }
|
||||
|
||||
@Override
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
|
||||
//create a new HeterozygosityUtils to store data for each VariantContext, i.e. each annotate() call
|
||||
heterozygosityUtils = new HeterozygosityUtils(RETURN_ROUNDED);
|
||||
|
||||
//if none of the "founders" are in the vc samples, assume we uniquified the samples upstream and they are all founders
|
||||
if (!didUniquifiedSampleNameCheck) {
|
||||
founderIds = AnnotationUtils.validateFounderIDs(founderIds, vc);
|
||||
didUniquifiedSampleNameCheck = true;
|
||||
}
|
||||
return makeCoeffAnnotation(vc);
|
||||
}
|
||||
|
||||
protected Map<String, Object> makeCoeffAnnotation(final VariantContext vc) {
|
||||
final List<Allele> altAlleles = vc.getAlternateAlleles();
|
||||
final List<Double> ICvalues = new ArrayList<>();
|
||||
|
||||
for (final Allele a : altAlleles) {
|
||||
ICvalues.add(calculateIC(vc, a));
|
||||
}
|
||||
if (heterozygosityUtils.getSampleCount() < MIN_SAMPLES)
|
||||
return null;
|
||||
return Collections.singletonMap(getKeyNames().get(0), (Object) AnnotationUtils.encodeValueList(ICvalues, "%.4f"));
|
||||
}
|
||||
|
||||
protected double calculateIC(final VariantContext vc, final Allele altAllele) {
|
||||
final int AN = vc.getCalledChrCount();
|
||||
final double altAF;
|
||||
|
||||
final double hetCount = heterozygosityUtils.getHetCount(vc, altAllele);
|
||||
|
||||
final double F;
|
||||
// shortcut to get a value closer to the non-allele-specific value for biallelic sites
|
||||
if (vc.isBiallelic()) {
|
||||
double refAC = heterozygosityUtils.getAlleleCount(vc, vc.getReference());
|
||||
double altAC = heterozygosityUtils.getAlleleCount(vc, altAllele);
|
||||
double refAF = refAC/(altAC+refAC);
|
||||
altAF = 1 - refAF;
|
||||
F = 1.0 - (hetCount / (2.0 * refAF * altAF * (double) heterozygosityUtils.getSampleCount())); // inbreeding coefficient
|
||||
}
|
||||
else {
|
||||
//compare number of hets for this allele (and any other second allele) with the expectation based on AFs
|
||||
//derive the altAF from the likelihoods to account for any accumulation of fractional counts from non-primary likelihoods,
|
||||
//e.g. for a GQ10 variant, the probability of the call will be ~0.9 and the second best call will be ~0.1 so adding up those 0.1s for het counts can dramatically change the AF compared with integer counts
|
||||
altAF = heterozygosityUtils.getAlleleCount(vc, altAllele)/ (double) AN;
|
||||
F = 1.0 - (hetCount / (2.0 * (1 - altAF) * altAF * (double) heterozygosityUtils.getSampleCount())); // inbreeding coefficient
|
||||
}
|
||||
|
||||
return F;
|
||||
}
|
||||
}
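The biallelic branch of `calculateIC` above reduces to F = 1 - (observed hets) / (2 * refAF * altAF * N). The sketch below works through that arithmetic with plain numbers. It assumes the het count and allele counts are already available as doubles, which is what `HeterozygosityUtils` appears to provide in the code above; the class name `InbreedingCoeffSketch` and its method are hypothetical.

```java
// Hypothetical, self-contained sketch of the biallelic formula; not part of GATK.
final class InbreedingCoeffSketch {

    // F = 1 - observedHets / expectedHets, where expectedHets = 2 * p * q * N
    // under Hardy-Weinberg equilibrium.
    static double inbreedingCoeff(final double hetCount, final double refAC,
                                  final double altAC, final int sampleCount) {
        final double refAF = refAC / (refAC + altAC);
        final double altAF = 1.0 - refAF;
        final double expectedHets = 2.0 * refAF * altAF * sampleCount;
        return 1.0 - hetCount / expectedHets;
    }

    public static void main(final String[] args) {
        // 100 samples, both alleles at frequency 0.5, so 50 hets are expected.
        System.out.println(inbreedingCoeff(50.0, 100.0, 100.0, 100)); // as expected: F = 0
        System.out.println(inbreedingCoeff(25.0, 100.0, 100.0, 100)); // half the expected hets: F = 0.5
    }
}
```

A deficit of heterozygotes relative to the Hardy-Weinberg expectation gives a positive F, and an excess gives a negative F. This matches the Javadoc statement that higher scores indicate a higher chance of inbreeding.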
|
||||
|
|
@ -0,0 +1,104 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.pileup.PileupElement;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.Arrays;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
|
||||
/**
|
||||
* Allele-specific Rank Sum Test for mapping qualities of REF versus ALT reads
|
||||
*
|
||||
* <p>This variant-level annotation compares the mapping qualities of the reads supporting the reference allele with those supporting each alternate allele. To be clear, it does so separately for each alternate allele. </p>
|
||||
*
|
||||
* <p>The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the reads supporting the alternate allele have lower mapping quality scores than those supporting the reference allele. Conversely, a positive value indicates that the reads supporting the alternate allele have higher mapping quality scores than those supporting the reference allele.</p>
|
||||
* <p>Finding a statistically significant difference in quality either way suggests that the sequencing and/or mapping process may have been biased or affected by an artifact. In practice, we only filter out low negative values when evaluating variant quality because the idea is to filter out variants for which the quality of the data supporting the alternate allele is comparatively low. The reverse case, where it is the quality of data supporting the reference allele that is lower (resulting in positive ranksum scores), is not really informative for filtering variants.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for mapping qualities (MAPQ of reads supporting REF vs. MAPQ of reads supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul><li>The mapping quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
|
||||
* <li>Uninformative reads are not used in these annotations.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityRankSumTest.php">MappingQualityRankSumTest</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_RMSMappingQuality.php">RMSMappingQuality</a></b> gives an estimate of the overall read mapping quality supporting a variant call.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class AS_MappingQualityRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation {
|
||||
@Override
|
||||
public List<String> getKeyNames() { return Arrays.asList(GATKVCFConstants.AS_MAP_QUAL_RANK_SUM_KEY); }
|
||||
|
||||
@Override
|
||||
public String getRawKeyName() { return GATKVCFConstants.AS_RAW_MAP_QUAL_RANK_SUM_KEY; }
|
||||
|
||||
@Override
|
||||
protected Double getElementForRead(final GATKSAMRecord read, final int refLoc) {
|
||||
return (double)read.getMappingQuality();
|
||||
}
|
||||
}
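
The Javadoc for AS_MappingQualityRankSumTest describes the annotation value as the u-based z-approximation from the Mann-Whitney-Wilcoxon rank sum test on REF versus ALT mapping qualities. The following is a minimal sketch of that statistic, not GATK's implementation: the class name `RankSumSketch` and its methods are hypothetical, tied values receive averaged ranks, and no tie correction is applied to the variance, so the numbers may differ from what GATK reports. The sign convention is chosen to match the Javadoc, with negative values meaning the ALT reads have lower mapping qualities than the REF reads.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of GATK.
final class RankSumSketch {

    /** z-approximation of the Mann-Whitney U statistic, computed for ALT relative to REF. */
    static double rankSumZ(final double[] ref, final double[] alt) {
        final int nRef = ref.length;
        final int nAlt = alt.length;
        if (nRef == 0 || nAlt == 0) {
            // Mirrors the caveat: the test needs reads for both alleles.
            throw new IllegalArgumentException("need at least one REF and one ALT value");
        }
        // Pool and sort all observations so that ranks can be assigned.
        final double[] all = new double[nRef + nAlt];
        System.arraycopy(ref, 0, all, 0, nRef);
        System.arraycopy(alt, 0, all, nRef, nAlt);
        Arrays.sort(all);

        // Sum the tie-averaged ranks of the ALT observations.
        double altRankSum = 0.0;
        for (final double v : alt) {
            altRankSum += averageRank(all, v);
        }
        final double u = altRankSum - nAlt * (nAlt + 1) / 2.0;
        final double mean = nRef * (double) nAlt / 2.0;
        final double sd = Math.sqrt(nRef * (double) nAlt * (nRef + nAlt + 1) / 12.0);
        return (u - mean) / sd;
    }

    /** Average 1-based rank of value v within the sorted pooled array. */
    static double averageRank(final double[] sorted, final double v) {
        int lo = 0;
        while (lo < sorted.length && sorted[lo] < v) lo++;
        int hi = lo;
        while (hi < sorted.length && sorted[hi] == v) hi++;
        // v occupies ranks lo+1 through hi; return their mean.
        return (lo + 1 + hi) / 2.0;
    }

    public static void main(final String[] args) {
        final double[] refMapq = {60, 60, 60};
        final double[] altMapq = {20, 20, 20};
        System.out.println("z = " + rankSumZ(refMapq, altMapq));
    }
}
```

In the class shown in the diff, the per-read element fed into the test is simply `read.getMappingQuality()`; the ranking itself is inherited from `AS_RankSumTest`.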
|
||||
|
|
@ -0,0 +1,204 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.GenotypesContext;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ReducibleAnnotation;
|
||||
import org.broadinstitute.gatk.utils.MathUtils;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Allele-specific call confidence normalized by depth of sample reads supporting the allele
|
||||
*
|
||||
* <p>This annotation puts the variant confidence QUAL score into perspective by normalizing for the amount of coverage available. Because each read contributes a little to the QUAL score, variants in regions with deep coverage can have artificially inflated QUAL scores, giving the impression that the call is supported by more evidence than it really is. To compensate for this, we normalize the variant confidence by depth, which gives us a more objective picture of how well supported the call is.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The QD is the QUAL score normalized by allele depth (AD) for a variant. For a single sample, the HaplotypeCaller calculates the QD by taking QUAL/AD. For multiple samples, HaplotypeCaller and GenotypeGVCFs calculate the QD by taking QUAL/AD of samples with a non-hom-ref genotype call. We leave out the samples with a hom-ref call so that their depth does not penalize the QUAL contributed by the samples that carry the variant.</p>
|
||||
* <h4>Here is a single-sample example:</h4>
|
||||
* <pre>2 37629 . C G 1063.77 . AC=2;AF=1.00;AN=2;DP=31;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=58.50;QD=34.32;SOR=2.376 GT:AD:DP:GQ:PL:QSS 1/1:0,31:31:93:1092,93,0:0,960</pre>
|
||||
* <p>QUAL/AD = 1063.77/31 = 34.32 = QD</p>
|
||||
* <h4>Here is a multi-sample example:</h4>
|
||||
* <pre>10 8046 . C T 4107.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=-3.717;DP=1063;FS=1.616;MLEAC=1;MLEAF=0.167;QD=11.54
|
||||
* GT:AD:DP:GQ:PL:QSS 0/0:369,4:373:99:0,1007,12207:10548,98 0/0:331,1:332:99:0,967,11125:9576,27 0/1:192,164:356:99:4138,0,5291:5501,4505</pre>
|
||||
* <p>QUAL/AD = 4107.13/356 = 11.54 = QD</p>
|
||||
* <p>Note that currently, when HaplotypeCaller is run with `-ERC GVCF`, the QD calculation is invoked before AD itself has been calculated, due to a technical constraint. In that case, HaplotypeCaller uses the number of overlapping reads from the haplotype likelihood calculation in place of AD to calculate QD, which generally yields a very similar number. This does not cause any measurable problems, but can cause some confusion since the number may be slightly different than what you would expect to get if you did the calculation manually. For that reason, this behavior will be modified in an upcoming version.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>This annotation can only be calculated for sites for which at least one sample was genotyped as carrying a variant allele.</p>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_QualByDepth.php">QualByDepth</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_Coverage.php">Coverage</a></b> gives the filtered depth of coverage for each sample and the unfiltered depth across all samples.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_DepthPerAlleleBySample.php">DepthPerAlleleBySample</a></b> calculates depth of coverage for each allele per sample (AD).</li>
|
||||
* </ul>
|
||||
*/
|
||||
public class AS_QualByDepth extends InfoFieldAnnotation implements ReducibleAnnotation, AS_StandardAnnotation {
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() { return Arrays.asList(GATKVCFConstants.AS_QUAL_BY_DEPTH_KEY); }
|
||||
|
||||
@Override
|
||||
public String getRawKeyName() { return GATKVCFConstants.AS_QUAL_KEY; }
|
||||
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
//We only have the finalized key name here because the raw key is internal to GenotypeGVCFs and won't get output in any VCF
|
||||
return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
return null;
|
||||
}
|
||||
|
||||
private List<Integer> getAlleleDepths(final GenotypesContext genotypes) {
|
||||
int numAlleles = -1;
|
||||
for (final Genotype genotype : genotypes) {
|
||||
if (genotype.hasAD()) {
|
||||
numAlleles = genotype.getAD().length;
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (numAlleles == -1) //no genotypes have AD
|
||||
return null;
|
||||
Integer[] alleleDepths = new Integer[numAlleles];
|
||||
for (int i = 0; i < alleleDepths.length; i++) {
|
||||
alleleDepths[i] = 0;
|
||||
}
|
||||
for (final Genotype genotype : genotypes) {
|
||||
// we care only about genotypes with variant alleles
|
||||
if ( !genotype.isHet() && !genotype.isHomVar() )
|
||||
continue;
|
||||
|
||||
// if we have the AD values for this sample, let's make sure that the variant depth is greater than 1!
|
||||
if ( genotype.hasAD() ) {
|
||||
final int[] AD = genotype.getAD();
|
||||
final int totalADdepth = (int) MathUtils.sum(AD);
|
||||
if ( totalADdepth - AD[0] > 1 ) {
|
||||
for (int i = 0; i < AD.length; i++) {
|
||||
alleleDepths[i] += AD[i];
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return Arrays.asList(alleleDepths);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> annotateRawData(RefMetaDataTracker tracker, AnnotatorCompatible walker, ReferenceContext ref, Map<String, AlignmentContext> stratifiedContexts, VariantContext vc, Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap) {
|
||||
return null;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> combineRawData(List<Allele> allelesList, List<? extends ReducibleAnnotationData> listOfRawData) {
|
||||
return null;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> finalizeRawData(VariantContext vc, VariantContext originalVC) {
|
||||
//we need to use the AS_QUAL value that was added to the VC by the GenotypingEngine
|
||||
if ( !vc.hasAttribute(GATKVCFConstants.AS_QUAL_KEY) )
|
||||
return null;
|
||||
|
||||
final GenotypesContext genotypes = vc.getGenotypes();
|
||||
if ( genotypes == null || genotypes.isEmpty() )
|
||||
return null;
|
||||
|
||||
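final List<Integer> standardDepth = getAlleleDepths(genotypes);
if ( standardDepth == null ) //no genotype has AD, so the depths are unavailable and QD cannot be computed
return null;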
|
||||
|
||||
//Parse the VC's allele-specific qual values
|
||||
List<Object> alleleQualObjList = vc.getAttributeAsList(GATKVCFConstants.AS_QUAL_KEY);
|
||||
if (alleleQualObjList.size() != vc.getNAlleles() -1)
|
||||
throw new IllegalStateException("Number of AS_QUAL values doesn't match the number of alternate alleles.");
|
||||
List<Double> alleleQualList = new ArrayList<>();
|
||||
for (final Object obj : alleleQualObjList) {
|
||||
alleleQualList.add(Double.parseDouble(obj.toString()));
|
||||
}
|
||||
|
||||
// Don't normalize indel length for AS_QD because it will only be called from GenotypeGVCFs, never UG
|
||||
List<Double> QDlist = new ArrayList<>();
|
||||
double refDepth = (double)standardDepth.get(0);
|
||||
for (int i = 0; i < alleleQualList.size(); i++) {
|
||||
double AS_QD = -10.0 * alleleQualList.get(i) / ((double)standardDepth.get(i+1) + refDepth); //+1 to skip the reference field of the AD, add ref counts to each to match biallelic case
|
||||
// Hack: see note in the QualByDepth.fixTooHighQD method
|
||||
AS_QD = QualByDepth.fixTooHighQD(AS_QD);
|
||||
QDlist.add(AS_QD);
|
||||
}
|
||||
|
||||
final Map<String, Object> map = new HashMap<>();
|
||||
map.put(getKeyNames().get(0), AnnotationUtils.encodeValueList(QDlist, "%.2f"));
|
||||
return map;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void calculateRawData(VariantContext vc, Map<String, PerReadAlleleLikelihoodMap> pralm, ReducibleAnnotationData rawAnnotations) {
|
||||
//note that the "raw data" used here is calculated by the GenotypingEngine in GenotypeGVCFs and stored in the AS_QUAL info field
|
||||
}
|
||||
}
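
The arithmetic in `finalizeRawData` above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the GATK code path: the class `AsQdSketch` and its method are hypothetical, the AD values are assumed to be already summed over the variant-carrying samples (as `getAlleleDepths` does), the `QualByDepth.fixTooHighQD` adjustment is omitted, and the `-10.0` factor is taken directly from the diff. That factor suggests AS_QUAL holds log10-scaled values, but that reading is an inference rather than something the source states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of GATK.
final class AsQdSketch {

    /**
     * Per-allele QD: each alternate allele's quality divided by
     * (that allele's depth + reference depth).
     *
     * @param asQual   one value per alternate allele (assumed log10-scaled, hence the -10 factor)
     * @param summedAD depths summed over samples; index 0 is the reference allele
     */
    static List<Double> alleleSpecificQD(final double[] asQual, final int[] summedAD) {
        if (asQual.length != summedAD.length - 1) {
            throw new IllegalArgumentException("expected one AS_QUAL value per alternate allele");
        }
        final double refDepth = summedAD[0];
        final List<Double> qd = new ArrayList<>();
        for (int i = 0; i < asQual.length; i++) {
            // i + 1 skips the reference entry of AD; the reference depth is
            // added to each denominator to match the biallelic case.
            qd.add(-10.0 * asQual[i] / (summedAD[i + 1] + refDepth));
        }
        return qd;
    }

    public static void main(final String[] args) {
        final double[] asQual = {-100.0, -50.0};
        final int[] summedAD = {10, 40, 10};
        System.out.println(alleleSpecificQD(asQual, summedAD));
    }
}
```

Adding the reference depth to every denominator means a site with a single alternate allele gives the same denominator as the site-level QUAL/AD calculation described in the Javadoc.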
|
||||
|
|
@ -0,0 +1,192 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.GenotypesContext;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Allele-specific implementation of root-mean-squared annotations
|
||||
*/
|
||||
public abstract class AS_RMSAnnotation extends RMSAnnotation {
|
||||
protected final static Logger logger = Logger.getLogger(AS_RMSAnnotation.class);
|
||||
protected final String splitDelim = "\\|"; //String.split takes a regex, so we need to escape the pipe
|
||||
protected final String printDelim = "|";
|
||||
protected AnnotatorCompatible callingWalker;
|
||||
|
||||
|
||||
@Override
|
||||
public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set<VCFHeaderLine> headerLines) {
|
||||
if (!AnnotationUtils.walkerSupportsAlleleSpecificAnnotations(walker))
|
||||
logger.warn("Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output");
|
||||
callingWalker = walker;
|
||||
}
|
||||
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
if (AnnotationUtils.walkerRequiresRawData(callingWalker))
|
||||
return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getRawKeyName()));
|
||||
else
|
||||
return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
||||
//For the raw data here, we're only keeping track of the sum of the squares of our values
|
||||
//When we go to reduce, we'll use the AD info to get the number of reads
|
||||
public void calculateRawData(final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap,
|
||||
final ReducibleAnnotationData myData) {
|
||||
|
||||
//must use perReadAlleleLikelihoodMap for allele-specific annotations
|
||||
if (perReadAlleleLikelihoodMap != null) {
|
||||
if ( perReadAlleleLikelihoodMap.size() == 0 )
|
||||
return;
|
||||
getRMSDataFromPRALM(perReadAlleleLikelihoodMap, myData);
|
||||
}
|
||||
else
|
||||
return;
|
||||
}
|
||||
|
||||
abstract void getRMSDataFromPRALM(final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap, final ReducibleAnnotationData<Number> myData);
|
||||
|
||||
@Override
|
||||
public Map<String, Object> finalizeRawData(final VariantContext vc, final VariantContext originalVC) {
|
||||
if (!vc.hasAttribute(getRawKeyName()))
|
||||
return new HashMap<>();
|
||||
final String rawMQdata = vc.getAttributeAsString(getRawKeyName(),null);
|
||||
if (rawMQdata == null)
|
||||
return new HashMap<>();
|
||||
|
||||
final Map<String,Object> annotations = new HashMap<>();
|
||||
final ReducibleAnnotationData myData = new AlleleSpecificAnnotationData<Double>(originalVC.getAlleles(), rawMQdata);
|
||||
parseRawDataString(myData);
|
||||
|
||||
final String annotationString = makeFinalizedAnnotationString(vc, myData.getAttributeMap());
|
||||
annotations.put(getKeyNames().get(0), annotationString);
|
||||
return annotations;
|
||||
}
|
||||
|
||||
@Override
|
||||
protected void parseRawDataString(final ReducibleAnnotationData<Number> myData) {
|
||||
final String rawDataString = myData.getRawData();
|
||||
//get per-allele data by splitting on allele delimiter
|
||||
final String[] rawDataPerAllele = rawDataString.split(splitDelim);
|
||||
for (int i=0; i<rawDataPerAllele.length; i++) {
|
||||
final String alleleData = rawDataPerAllele[i];
|
||||
myData.putAttribute(myData.getAlleles().get(i), Double.parseDouble(alleleData));
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@Override
|
||||
public Map<String, Object> combineRawData(final List<Allele> vcAlleles, final List<? extends ReducibleAnnotationData> annotationList) {
|
||||
//VC already contains merged alleles from ReferenceConfidenceVariantContextMerger
|
||||
ReducibleAnnotationData combinedData = new AlleleSpecificAnnotationData(vcAlleles, null);
|
||||
|
||||
for (final ReducibleAnnotationData currentValue : annotationList) {
|
||||
parseRawDataString(currentValue);
|
||||
combineAttributeMap(currentValue, combinedData);
|
||||
|
||||
}
|
||||
final Map<String, Object> annotations = new HashMap<>();
|
||||
String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap());
|
||||
annotations.put(getRawKeyName(), annotationString);
|
||||
return annotations;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void combineAttributeMap(final ReducibleAnnotationData<Number> toAdd, final ReducibleAnnotationData<Number> combined) {
|
||||
//add the per-allele values from toAdd into combined
|
||||
for (final Allele currentAllele : combined.getAlleles()){
|
||||
//combined is initialized with all alleles, but toAdd might have only a subset
|
||||
if(toAdd.getAttribute(currentAllele) == null)
|
||||
continue;
|
||||
if (combined.getAttribute(currentAllele) != null) { //toAdd's value is known to be non-null here
|
||||
combined.putAttribute(currentAllele, (double) combined.getAttribute(currentAllele) + (double) toAdd.getAttribute(currentAllele));
|
||||
}
|
||||
else
|
||||
combined.putAttribute(currentAllele, toAdd.getAttribute(currentAllele));
|
||||
}
|
||||
}
|
||||
|
||||
protected Map<Allele, Integer> getADcounts(final VariantContext vc) {
|
||||
final GenotypesContext genotypes = vc.getGenotypes();
|
||||
if ( genotypes == null || genotypes.size() == 0 ) {
|
||||
logger.warn("VC does not have genotypes -- annotations were calculated in wrong order");
|
||||
return null;
|
||||
}
|
||||
|
||||
final Map<Allele, Integer> variantADs = new HashMap<>();
|
||||
for(final Allele a : vc.getAlleles())
|
||||
variantADs.put(a,0);
|
||||
|
||||
for (final Genotype gt : vc.getGenotypes()) {
|
||||
if(!gt.hasAD()) {
|
||||
continue;
|
||||
}
|
||||
final int[] ADs = gt.getAD();
|
||||
for(int i = 1; i < vc.getNAlleles(); i++) {
|
||||
variantADs.put(vc.getAlternateAllele(i-1), variantADs.get(vc.getAlternateAllele(i-1))+ADs[i]); //here -1 is to reconcile allele index with alt allele index
|
||||
}
|
||||
}
|
||||
return variantADs;
|
||||
}
|
||||
}
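A minimal, standalone sketch of the parse-and-combine flow in the class above, assuming plain strings in place of htsjdk `Allele` objects and plain maps in place of `ReducibleAnnotationData`. The names `RawRmsCombineSketch`, `parse`, `combine`, and `toRawString` are hypothetical and not part of GATK. It only illustrates how pipe-delimited per-allele sums of squares might be split, added across samples, and printed again.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the raw-data handling; alleles are plain strings here.
final class RawRmsCombineSketch {
    // String.split takes a regex, so the pipe has to be escaped.
    static final String SPLIT_DELIM = "\\|";
    static final String PRINT_DELIM = "|";

    /** Parses "x|y|z" into allele -> sum of squares, pairing values with alleles by position. */
    static Map<String, Double> parse(List<String> alleles, String raw) {
        Map<String, Double> out = new LinkedHashMap<>();
        String[] fields = raw.split(SPLIT_DELIM);
        // Assumption: extra fields without a matching allele are ignored.
        for (int i = 0; i < fields.length && i < alleles.size(); i++) {
            out.put(alleles.get(i), Double.parseDouble(fields[i]));
        }
        return out;
    }

    /** Sums each sample's per-allele values; a sample may carry only a subset of the merged alleles. */
    static Map<String, Double> combine(List<String> mergedAlleles, List<Map<String, Double>> samples) {
        Map<String, Double> combined = new LinkedHashMap<>();
        for (Map<String, Double> sample : samples) {
            for (String allele : mergedAlleles) {
                Double toAdd = sample.get(allele);
                if (toAdd != null) {
                    combined.merge(allele, toAdd, Double::sum);
                }
            }
        }
        return combined;
    }

    /** Prints one value per merged allele, using 0.00 for alleles with no data. */
    static String toRawString(List<String> mergedAlleles, Map<String, Double> combined) {
        StringBuilder sb = new StringBuilder();
        for (String allele : mergedAlleles) {
            if (sb.length() > 0) {
                sb.append(PRINT_DELIM);
            }
            // Locale.ROOT keeps the decimal separator stable across machines (a choice made for this sketch).
            sb.append(String.format(Locale.ROOT, "%.2f", combined.getOrDefault(allele, 0.0)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> merged = Arrays.asList("A", "C", "T");
        Map<String, Double> sample1 = parse(Arrays.asList("A", "C"), "7200.00|3600.00");
        Map<String, Double> sample2 = parse(Arrays.asList("A", "T"), "3600.00|1600.00");
        System.out.println(toRawString(merged, combine(merged, Arrays.asList(sample1, sample2))));
    }
}
```

Keeping sums of squares rather than finished RMS values is what makes the combine step a plain addition; the division by read count can wait until the final genotyping step.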
|
||||
|
|
@ -0,0 +1,152 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
|
||||
/**
|
||||
* Allele-specific Root Mean Square of the mapping quality of reads across all samples.
|
||||
*
|
||||
* <p>This annotation provides an estimate of the mapping quality of the reads supporting each alternate allele in a variant call. Depending on the tool it is called from, it produces either raw data (the sum of squared MQs) or the calculated root mean square.</p>
|
||||
*
|
||||
* The raw data is used to accurately calculate the root mean square when combining more than one sample.
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The root mean square reflects both the mean and the spread of the mapping qualities: its square equals the square of the mean plus the variance of the mapping qualities.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>Uninformative reads are not used in this annotation.</p>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_RMSMappingQuality.php">RMSMappingQuality</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityRankSumTest.php">MappingQualityRankSumTest</a></b> compares the mapping quality of reads supporting the REF and ALT alleles.</li>
|
||||
* </ul>
|
||||
*/
|
||||
public class AS_RMSMappingQuality extends AS_RMSAnnotation implements AS_StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
|
||||
protected final String printFormat = "%.2f";
|
||||
|
||||
public List<String> getKeyNames() { return Arrays.asList(GATKVCFConstants.AS_RMS_MAPPING_QUALITY_KEY); }
|
||||
|
||||
public String getRawKeyName() { return GATKVCFConstants.AS_RAW_RMS_MAPPING_QUALITY_KEY; }
|
||||
|
||||
public void getRMSDataFromPRALM(Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap, ReducibleAnnotationData<Number> myData) {
|
||||
//over all the samples in the Map...
|
||||
for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : perReadAlleleLikelihoodMap.values() ) {
|
||||
//for each read...
|
||||
for ( final Map.Entry<GATKSAMRecord,Map<Allele,Double>> readLikelihoods : perReadLikelihoods.getLikelihoodReadMap().entrySet() ) {
|
||||
final int mq = readLikelihoods.getKey().getMappingQuality();
|
||||
if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) {
|
||||
if (!PerReadAlleleLikelihoodMap.getMostLikelyAllele(readLikelihoods.getValue()).isInformative())
|
||||
continue;
|
||||
final Allele bestAllele = PerReadAlleleLikelihoodMap.getMostLikelyAllele(readLikelihoods.getValue()).getMostLikelyAllele();
|
||||
double currSquareSum = 0;
|
||||
if (myData.hasAttribute(bestAllele))
|
||||
currSquareSum += (double)myData.getAttribute(bestAllele);
|
||||
myData.putAttribute(bestAllele, currSquareSum + mq * mq);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public String makeRawAnnotationString(final List<Allele> vcAlleles, final Map<Allele, Number> perAlleleValues) {
|
||||
String annotationString = "";
|
||||
for (final Allele current : vcAlleles) {
|
||||
if (!annotationString.isEmpty())
|
||||
annotationString += printDelim;
|
||||
if(perAlleleValues.get(current) != null)
|
||||
annotationString += String.format(printFormat,perAlleleValues.get(current));
|
||||
else
|
||||
annotationString += String.format(printFormat, 0.0);
|
||||
}
|
||||
return annotationString;
|
||||
}
|
||||
|
||||
//this just overrides the RMSAnnotation function that's used for UG -- we don't do allele-specific annotations for UG
|
||||
@Override
|
||||
public String makeFinalizedAnnotationString(final VariantContext vc, final Map<Allele, Number> perAlleleData, final Map<String, AlignmentContext> stratifiedContexts, final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap) {
|
||||
return makeFinalizedAnnotationString(vc, perAlleleData);
|
||||
}
|
||||
|
||||
@Override
|
||||
public String makeFinalizedAnnotationString(final VariantContext vc, final Map<Allele, Number> perAlleleValues) {
|
||||
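//per-allele read counts from AD; getADcounts returns null when the VC has no genotypes, and the lookup below assumes a non-null map
final Map<Allele, Integer> variantADs = getADcounts(vc);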
|
||||
String annotationString = "";
|
||||
for (final Allele current : vc.getAlternateAlleles()) {
|
||||
if (!annotationString.isEmpty())
|
||||
annotationString += ",";
|
||||
if (perAlleleValues.containsKey(current))
|
||||
annotationString += String.format(printFormat, Math.sqrt((double) perAlleleValues.get(current) / variantADs.get(current)));
|
||||
else {
|
||||
logger.warn("ERROR: VC allele is not found in annotation alleles -- maybe there was trimming?");
|
||||
}
|
||||
}
|
||||
return annotationString;
|
||||
}
|
||||
}
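The accumulate-then-finalize arithmetic in the class above can be sketched without any GATK or htsjdk types. This is an illustration under stated assumptions, not the tool's implementation: `AlleleRmsSketch`, `addRead`, `rms`, and `finalizedString` are hypothetical names, alleles are plain strings, and the read counts are supplied directly instead of being taken from the AD field. Returning NaN for an allele with no supporting reads is a choice made for this sketch only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in: per-allele sum of squared mapping qualities, then RMS.
final class AlleleRmsSketch {
    /** Adds mq^2 to the running sum for the allele the read best supports. */
    static void addRead(Map<String, Double> sumSquares, String bestAllele, int mq) {
        sumSquares.merge(bestAllele, (double) mq * mq, Double::sum);
    }

    /** RMS = sqrt(sum of squares / read count); NaN when no reads support the allele. */
    static double rms(double sumSquares, int readCount) {
        return readCount > 0 ? Math.sqrt(sumSquares / readCount) : Double.NaN;
    }

    /** Comma-separated RMS values, one per alternate allele, formatted to two decimals. */
    static String finalizedString(List<String> altAlleles,
                                  Map<String, Double> sumSquares,
                                  Map<String, Integer> readCounts) {
        StringBuilder sb = new StringBuilder();
        for (String allele : altAlleles) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            double value = rms(sumSquares.getOrDefault(allele, 0.0), readCounts.getOrDefault(allele, 0));
            sb.append(String.format(Locale.ROOT, "%.2f", value));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> sumSquares = new LinkedHashMap<>();
        // Made-up reads: (best allele, mapping quality)
        addRead(sumSquares, "C", 60);
        addRead(sumSquares, "C", 60);
        addRead(sumSquares, "C", 20);
        addRead(sumSquares, "T", 40);

        Map<String, Integer> readCounts = new LinkedHashMap<>();
        readCounts.put("C", 3);
        readCounts.put("T", 1);

        System.out.println(finalizedString(Arrays.asList("C", "T"), sumSquares, readCounts));
    }
}
```

For the three made-up reads supporting `C`, the mean MQ is about 46.67 and the population variance about 355.56, so mean squared plus variance gives about 2533.33, which is the same quantity whose square root the sketch reports as the RMS.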
|
||||
|
|
@ -0,0 +1,329 @@
|
|||
package org.broadinstitute.gatk.tools.walkers.annotator;

import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFHeaderLine;
import htsjdk.variant.vcf.VCFInfoHeaderLine;
import org.apache.log4j.Logger;
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ReducibleAnnotation;
import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller;
import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs;
import org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs;
import org.broadinstitute.gatk.utils.MannWhitneyU;
import org.broadinstitute.gatk.utils.collections.Pair;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.exceptions.GATKException;
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;

import java.util.*;

/**
 * Allele-specific implementation of rank sum test annotations
 */
public abstract class AS_RankSumTest extends RankSumTest implements ReducibleAnnotation {
    private final static Logger logger = Logger.getLogger(AS_RankSumTest.class);
    protected final String splitDelim = "\\|"; //String.split takes a regex, so we need to escape the pipe
    protected final String printDelim = "|";
    protected final String reducedDelim = ",";
    protected AnnotatorCompatible callingWalker;

    @Override
    public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set<VCFHeaderLine> headerLines) {
        if (!AnnotationUtils.walkerSupportsAlleleSpecificAnnotations(walker))
            logger.warn("Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output");
        callingWalker = walker;
        super.initialize(walker, toolkit, headerLines);
    }

    public List<VCFInfoHeaderLine> getDescriptions() {
        if (AnnotationUtils.walkerRequiresRawData(callingWalker))
            return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getRawKeyName()));
        else
            return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
    }

    public Map<String, Object> annotateRawData(final RefMetaDataTracker tracker,
                                               final AnnotatorCompatible walker,
                                               final ReferenceContext ref,
                                               final Map<String, AlignmentContext> stratifiedContexts,
                                               final VariantContext vc,
                                               final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {

        if ( perReadAlleleLikelihoodMap == null)
            return new HashMap<>();

        final Map<String, Object> annotations = new HashMap<>();
        final AlleleSpecificAnnotationData<CompressedDataList<Integer>> myData = initializeNewAnnotationData(vc.getAlleles());
        calculateRawData(vc, perReadAlleleLikelihoodMap, myData);
        final String annotationString = makeRawAnnotationString(vc.getAlleles(), myData.getAttributeMap());
        annotations.put(getRawKeyName(), annotationString);
        return annotations;
    }

    protected void parseRawDataString(final ReducibleAnnotationData<CompressedDataList<Integer>> myData) {
        final String rawDataString = myData.getRawData();
        String rawDataNoBrackets;
        final Map<Allele, CompressedDataList<Integer>> perAlleleValues = new HashMap<>();
        //Initialize maps
        for (final Allele current : myData.getAlleles()) {
            perAlleleValues.put(current, new CompressedDataList<Integer>());
        }
        //Map gives back list with []
        if (rawDataString.charAt(0) == '[') {
            rawDataNoBrackets = rawDataString.substring(1, rawDataString.length() - 1);
        }
        else {
            rawDataNoBrackets = rawDataString;
        }
        //rawDataPerAllele is the list of values for each allele (each of variable length)
        final String[] rawDataPerAllele = rawDataNoBrackets.split(splitDelim);
        for (int i=0; i<rawDataPerAllele.length; i++) {
            final String alleleData = rawDataPerAllele[i];
            if (alleleData.isEmpty())
                continue;
            final CompressedDataList<Integer> alleleList = perAlleleValues.get(myData.getAlleles().get(i));
            final String[] rawListEntriesAsStringVector = alleleData.split(",");
            if (rawListEntriesAsStringVector.length %2 != 0)
                throw new GATKException("ERROR: rank sum test raw annotation data must occur in <value,count> pairs");
            for (int j=0; j<rawListEntriesAsStringVector.length; j+=2) {
                int value, count;
                if (!rawListEntriesAsStringVector[j].isEmpty()) {
                    value = Integer.parseInt(rawListEntriesAsStringVector[j].trim());
                    if (!rawListEntriesAsStringVector[j + 1].isEmpty()) {
                        count = Integer.parseInt(rawListEntriesAsStringVector[j + 1].trim());
                        alleleList.add(value,count);
                    }
                }
            }
        }
        myData.setAttributeMap(perAlleleValues);

        //check the alleles list
        boolean foundRef = false;
        for (final Allele a : myData.getAlleles()) {
            if (a.isReference()) {
                if (foundRef)
                    throw new GATKException("ERROR: multiple reference alleles found in annotation data\n");
                foundRef = true;
            }
        }
        if (!foundRef)
            throw new GATKException("ERROR: no reference alleles found in annotation data\n");
    }

    @Override
    public Map<String, Object> combineRawData(final List<Allele> vcAlleles, final List<? extends ReducibleAnnotationData> annotationList) {
        //VC already contains merged alleles from ReferenceConfidenceVariantContextMerger
        final ReducibleAnnotationData combinedData = initializeNewAnnotationData(vcAlleles);

        for (final ReducibleAnnotationData currentValue : annotationList) {
            parseRawDataString(currentValue);
            combineAttributeMap(currentValue, combinedData);
        }
        final Map<String, Object> annotations = new HashMap<>();
        final String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap());
        annotations.put(getRawKeyName(), annotationString);
        return annotations;
    }

    protected AlleleSpecificAnnotationData initializeNewAnnotationData(final List<Allele> vcAlleles) {
        Map<Allele, CompressedDataList<Integer>> perAlleleValues = new HashMap<>();
        for (Allele a : vcAlleles) {
            perAlleleValues.put(a, new CompressedDataList<Integer>());
        }
        AlleleSpecificAnnotationData ret = new AlleleSpecificAnnotationData(vcAlleles, perAlleleValues.toString());
        ret.setAttributeMap(perAlleleValues);
        return ret;
    }

    protected void combineAttributeMap(final ReducibleAnnotationData<CompressedDataList<Integer>> toAdd, final ReducibleAnnotationData<CompressedDataList<Integer>> combined) {
        for (final Allele a : combined.getAlleles()) {
            if (toAdd.hasAttribute(a)) {
                final CompressedDataList<Integer> alleleData = combined.getAttribute(a);
                alleleData.add(toAdd.getAttribute(a));
                combined.putAttribute(a, alleleData);
            }
        }
    }

    protected String makeRawAnnotationString(final List<Allele> vcAlleles, final Map<Allele, CompressedDataList<Integer>> perAlleleValues) {
        String annotationString = "";
        for (int i =0; i< vcAlleles.size(); i++) {
            if (i!=0)
                annotationString += printDelim;
            CompressedDataList<Integer> alleleValues = perAlleleValues.get(vcAlleles.get(i));
            annotationString += alleleValues.toString();
        }
        return annotationString;
    }

    protected String makeReducedAnnotationString(VariantContext vc, Map<Allele,Double> perAltRankSumResults) {
        String annotationString = "";
        for (final Allele a : vc.getAlternateAlleles()) {
            if (!annotationString.isEmpty())
                annotationString += reducedDelim;
            if (!perAltRankSumResults.containsKey(a))
                logger.warn("ERROR: VC allele not found in annotation alleles -- maybe there was trimming?");
            else
                annotationString += String.format("%.3f", perAltRankSumResults.get(a));
        }
        return annotationString;
    }

    /**
     * Computes the finalized per-alternate-allele rank sum annotation from the raw data stored on the VariantContext.
     *
     * @param vc -- contains the final set of alleles, possibly subset by GenotypeGVCFs
     * @param originalVC -- used to get all the alleles for all gVCFs
     * @return a map from the finalized annotation key to the comma-delimited per-alternate-allele rank sum values, or an empty map if the raw data is missing or no rank sum could be calculated
     */
    public Map<String, Object> finalizeRawData(final VariantContext vc, final VariantContext originalVC) {
        if (!vc.hasAttribute(getRawKeyName()))
            return new HashMap<>();

        final String rawRankSumData = vc.getAttributeAsString(getRawKeyName(),null);
        if (rawRankSumData == null)
            return new HashMap<>();

        final Map<String,Object> annotations = new HashMap<>();
        final AlleleSpecificAnnotationData<CompressedDataList<Integer>> myData = new AlleleSpecificAnnotationData(originalVC.getAlleles(), rawRankSumData);
        parseRawDataString(myData);

        final Map<Allele, Double> perAltRankSumResults = calculateReducedData(myData.getAttributeMap(), myData.getRefAllele());
        //shortcut for no ref values
        if (perAltRankSumResults.isEmpty())
            return annotations;
        final String annotationString = makeReducedAnnotationString(vc, perAltRankSumResults);
        annotations.put(getKeyNames().get(0), annotationString);
        return annotations;
    }

    public void calculateRawData(VariantContext vc, Map<String, PerReadAlleleLikelihoodMap> pralm, ReducibleAnnotationData myData) {
        if(pralm == null)
            return;

        final Map<Allele, CompressedDataList<Integer>> perAlleleValues = myData.getAttributeMap();
        for ( final PerReadAlleleLikelihoodMap likelihoodMap : pralm.values() ) {
            if ( likelihoodMap != null && !likelihoodMap.isEmpty() ) {
                fillQualsFromLikelihoodMap(vc.getAlleles(), vc.getStart(), likelihoodMap, perAlleleValues);
            }
        }
    }

    private void fillQualsFromLikelihoodMap(final List<Allele> alleles,
                                            final int refLoc,
                                            final PerReadAlleleLikelihoodMap likelihoodMap,
                                            final Map<Allele, CompressedDataList<Integer>> perAlleleValues) {
        for ( final Map.Entry<GATKSAMRecord, Map<Allele,Double>> el : likelihoodMap.getLikelihoodReadMap().entrySet() ) {
            final MostLikelyAllele a = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue());
            if ( ! a.isInformative() )
                continue; // read is non-informative

            final GATKSAMRecord read = el.getKey();
            if ( isUsableRead(read, refLoc) ) {
                final Double value = getElementForRead(read, refLoc, a);
                if ( value == null )
                    continue;

                if(perAlleleValues.containsKey(a.getMostLikelyAllele()))
                    perAlleleValues.get(a.getMostLikelyAllele()).add(value.intValue());
            }
        }
    }

    public Map<Allele, Double> calculateReducedData(final Map<Allele, CompressedDataList<Integer>> perAlleleValues, final Allele ref) {
        final Map<Allele, Double> perAltRankSumResults = new HashMap<>();
        //shortcut to not try to calculate rank sum if there are no reads that unambiguously support the ref
        if (perAlleleValues.get(ref).isEmpty())
            return perAltRankSumResults;
        for (final Allele alt : perAlleleValues.keySet()) {
            if (alt.equals(ref, false))
                continue;
            final MannWhitneyU mannWhitneyU = new MannWhitneyU(useDithering);
            //load alts
            for (final Number qual : perAlleleValues.get(alt)) {
                mannWhitneyU.add(qual, MannWhitneyU.USet.SET1);
            }
            //load refs
            for (final Number qual : perAlleleValues.get(ref)) {
                mannWhitneyU.add(qual, MannWhitneyU.USet.SET2);
            }

            if (DEBUG) {
                System.out.format("%s, REF QUALS:", this.getClass().getName());
                for (final Number qual : perAlleleValues.get(ref))
                    System.out.format("%d ", qual);
                System.out.println();
                System.out.format("%s, ALT QUALS:", this.getClass().getName());
                for (final Number qual : perAlleleValues.get(alt))
                    System.out.format("%d ", qual);
                System.out.println();
            }
            // we are testing whether set1 (the values from alt-supporting reads) ranks lower than set2 (the values from ref-supporting reads)
            final Pair<Double, Double> testResults = mannWhitneyU.runOneSidedTest(MannWhitneyU.USet.SET1);
            perAltRankSumResults.put(alt, testResults.first);
        }
        return perAltRankSumResults;
    }

}
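The raw annotation handled by `parseRawDataString` and `makeRawAnnotationString` above is a pipe-delimited list with one entry per allele, where each entry is a comma-separated run of value,count pairs. The following is a minimal, self-contained sketch of that round trip, not the GATK implementation. It swaps in a plain `TreeMap<Integer, Integer>` for `CompressedDataList<Integer>`, assumes each per-allele entry serializes as `value,count,...`, and uses hypothetical names (`RawRankSumSketch`, `parse`, `format`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the raw rank sum string handling; not GATK code.
class RawRankSumSketch {

    // Parses "v,c,v,c|v,c" into one value-to-count map per allele, in allele order.
    static List<Map<Integer, Integer>> parse(final String raw) {
        final List<Map<Integer, Integer>> perAllele = new ArrayList<>();
        // String.split takes a regex, so the pipe is escaped.
        // The -1 limit keeps trailing empty entries (alleles with no data).
        for (final String alleleData : raw.split("\\|", -1)) {
            final Map<Integer, Integer> counts = new TreeMap<>();
            if (!alleleData.isEmpty()) {
                final String[] entries = alleleData.split(",");
                if (entries.length % 2 != 0)
                    throw new IllegalArgumentException("raw data must occur in <value,count> pairs");
                for (int j = 0; j < entries.length; j += 2) {
                    final int value = Integer.parseInt(entries[j].trim());
                    final int count = Integer.parseInt(entries[j + 1].trim());
                    counts.merge(value, count, Integer::sum);
                }
            }
            perAllele.add(counts);
        }
        return perAllele;
    }

    // Serializes the per-allele maps back into the pipe-delimited form.
    static String format(final List<Map<Integer, Integer>> perAllele) {
        final StringBuilder sb = new StringBuilder();
        for (int i = 0; i < perAllele.size(); i++) {
            if (i != 0)
                sb.append('|');
            boolean first = true;
            for (final Map.Entry<Integer, Integer> e : perAllele.get(i).entrySet()) {
                if (!first)
                    sb.append(',');
                sb.append(e.getKey()).append(',').append(e.getValue());
                first = false;
            }
        }
        return sb.toString();
    }

    public static void main(final String[] args) {
        final List<Map<Integer, Integer>> parsed = parse("1,2,3,1|5,1");
        System.out.println(parsed);
        System.out.println(format(parsed));
    }
}
```

Unlike the method above, the sketch passes a limit of -1 to `split` so that an allele with no observations at the end of the string still yields an (empty) entry. The original does not need this because it pre-populates an empty list for every allele before parsing.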
|
||||
|
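`calculateReducedData` above delegates the statistics to GATK's `MannWhitneyU` class and stores the first element of the returned pair as the per-allele result. As a rough illustration of what a U-based z-approximation looks like, the sketch below computes U for the alt set by direct pair counting and standardizes it with the normal approximation. It is a stand-in under stated assumptions rather than the `MannWhitneyU` implementation: it applies no dithering and no tie or continuity correction, and the class and method names are hypothetical. The sign convention (negative when alt values tend to be lower than ref values) is chosen here to agree with the interpretation given in the annotation documentation.

```java
// Hypothetical illustration of a U-based z-approximation; not GATK's MannWhitneyU.
class RankSumZSketch {

    // Returns the z-score of U for the alt set under the normal approximation.
    // Negative values mean alt observations tend to rank below ref observations.
    static double zScore(final int[] alt, final int[] ref) {
        if (alt.length == 0 || ref.length == 0)
            throw new IllegalArgumentException("both sets must be non-empty");
        // U counts the (alt, ref) pairs in which alt is larger; ties contribute one half.
        double u = 0.0;
        for (final int a : alt) {
            for (final int r : ref) {
                if (a > r)
                    u += 1.0;
                else if (a == r)
                    u += 0.5;
            }
        }
        final double n1 = alt.length;
        final double n2 = ref.length;
        // Mean and standard deviation of U under the null hypothesis (no tie correction).
        final double mean = n1 * n2 / 2.0;
        final double sd = Math.sqrt(n1 * n2 * (n1 + n2 + 1.0) / 12.0);
        return (u - mean) / sd;
    }

    public static void main(final String[] args) {
        final int[] altReadPos = {1, 2, 3};
        final int[] refReadPos = {4, 5, 6};
        System.out.printf("%.3f%n", zScore(altReadPos, refReadPos));
    }
}
```

Swapping the two arguments flips the sign of the result, which mirrors the symmetric reading of positive and negative values described for the annotation.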
|
@@ -0,0 +1,116 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;

import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation;
import org.broadinstitute.gatk.utils.pileup.PileupElement;
import org.broadinstitute.gatk.utils.sam.AlignmentUtils;
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
import org.broadinstitute.gatk.utils.sam.ReadUtils;
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;

import java.util.Arrays;
import java.util.List;

/**
 * Allele-specific Rank Sum Test for relative positioning of REF versus ALT allele within reads
 *
 * <p>This variant-level annotation tests whether there is evidence of bias in the position of alleles within the reads that support them, between the reference and each alternate allele. To be clear, it does so separately for each alternate allele.</p>
 *
 * <p>Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.</p>
 *
 * <p>The ideal result is a value close to zero, which indicates there is little to no difference in where the alleles are found relative to the ends of reads. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele. Conversely, a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele.</p>
 *
 * <p>This annotation can be used to evaluate confidence in a variant call and is a recommended covariate for variant recalibration (VQSR). Finding a statistically significant difference in relative position either way suggests that the sequencing process may have been biased or affected by an artifact. In practice, we only filter out low negative values when evaluating variant quality because the idea is to filter out variants for which the quality of the data supporting the alternate allele is comparatively low. The reverse case, where it is the quality of data supporting the reference allele that is lower (resulting in positive ranksum scores), is not really informative for filtering variants.</p>
 *
 * <h3>Statistical notes</h3>
 * <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for site position within reads (position within reads supporting REF vs. position within reads supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
 *
 * <h3>Caveats</h3>
 * <ul>
 * <li>The read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
 * <li>Uninformative reads are not used in these annotations.</li>
 * </ul>
 *
 * <h3>Related annotations</h3>
 * <ul>
 * <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_ReadPosRankSumTest.php">ReadPosRankSumTest</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
 * </ul>
 *
 */
public class AS_ReadPosRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation {

    @Override
    public List<String> getKeyNames() { return Arrays.asList(GATKVCFConstants.AS_READ_POS_RANK_SUM_KEY); }

    @Override
    public String getRawKeyName() { return GATKVCFConstants.AS_RAW_READ_POS_RANK_SUM_KEY;}

    @Override
    protected Double getElementForRead(final GATKSAMRecord read, final int refLoc) {
        final int offset = ReadUtils.getReadCoordinateForReferenceCoordinate(read.getSoftStart(), read.getCigar(), refLoc, ReadUtils.ClippingTail.RIGHT_TAIL, true);
        if ( offset == ReadUtils.CLIPPING_GOAL_NOT_REACHED )
            return null;

        int readPos = AlignmentUtils.calcAlignmentByteArrayOffset(read.getCigar(), offset, false, 0, 0);
        final int numAlignedBases = AlignmentUtils.getNumAlignedBasesCountingSoftClips( read );
        // fold positions past the midpoint so the value is the distance to the nearer end of the read
        if (readPos > numAlignedBases / 2)
            readPos = numAlignedBases - (readPos + 1);
        return (double)readPos;
    }

    @Override
    protected boolean isUsableRead(final GATKSAMRecord read, final int refLoc) {
        // in addition to the base-class checks, the read (from its soft start) must extend past refLoc
        return super.isUsableRead(read, refLoc) && read.getSoftStart() + read.getCigar().getReadLength() > refLoc;
    }
}
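The folding step in `getElementForRead` above turns an offset within the read into a distance from the nearer end of the read, so that an allele seen at either end of a read gets a similarly low value. A minimal sketch of just that arithmetic follows, with plain integers in place of the values derived from `GATKSAMRecord`. The class and method names are hypothetical, and the offset is treated as 0-based, which is an assumption consistent with the `readPos + 1` term in the original.

```java
// Hypothetical helper isolating the read-position folding arithmetic; not GATK code.
class ReadPosFoldSketch {

    // readPos is assumed to be a 0-based offset into the read;
    // numAlignedBases is the aligned length of the read.
    static int fold(final int readPos, final int numAlignedBases) {
        // Positions past the midpoint are measured from the far end instead.
        if (readPos > numAlignedBases / 2)
            return numAlignedBases - (readPos + 1);
        return readPos;
    }

    public static void main(final String[] args) {
        final int numAlignedBases = 100;
        for (final int readPos : new int[] {0, 10, 50, 51, 99}) {
            System.out.println(readPos + " -> " + fold(readPos, numAlignedBases));
        }
    }
}
```

With a length of 100, offsets 0 and 99 both fold to 0, offset 50 stays at 50, and offset 51 folds to 48.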
|
||||
|
|
@@ -0,0 +1,379 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.GenotypesContext;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ReducibleAnnotation;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.exceptions.GATKException;
|
||||
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Allele-specific implementation of strand bias annotations
|
||||
*/
|
||||
public abstract class AS_StrandBiasTest extends StrandBiasTest implements ReducibleAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(AS_StrandBiasTest.class);
|
||||
protected final String splitDelim = "\\|"; //String.split takes a regex, so we need to escape the pipe
|
||||
protected final String printDelim = "|";
|
||||
protected final String reducedDelim = ",";
|
||||
protected AnnotatorCompatible callingWalker;
|
||||
protected final int MIN_COUNT = 2;
|
||||
protected static final double MIN_PVALUE = 1E-320;
|
||||
protected final int FORWARD = 0;
|
||||
protected final int REVERSE = 1;
|
||||
protected final ArrayList<Integer> ZERO_LIST = new ArrayList<>();
|
||||
|
||||
@Override
|
||||
public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set<VCFHeaderLine> headerLines) {
|
||||
if (!AnnotationUtils.walkerSupportsAlleleSpecificAnnotations(walker))
|
||||
logger.warn("Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output");
|
||||
callingWalker = walker;
|
||||
ZERO_LIST.add(0,0);
|
||||
ZERO_LIST.add(1,0);
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
if (AnnotationUtils.walkerRequiresRawData(callingWalker))
|
||||
return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getRawKeyName()));
|
||||
else
|
||||
return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
||||
@Override
|
||||
public String getRawKeyName() { return GATKVCFConstants.AS_SB_TABLE_KEY; }
|
||||
|
||||
public Map<String, Object> annotateRawData(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
|
||||
//for allele-specific annotations we only call from HC and we only use perReadAlleleLikelihoodMap
|
||||
if ( perReadAlleleLikelihoodMap == null)
|
||||
return new HashMap<>();
|
||||
|
||||
// calculate the annotation from the stratified per read likelihood map
|
||||
// stratifiedPerReadAllelelikelihoodMap can come from HaplotypeCaller call to VariantAnnotatorEngine
|
||||
else {
    final HashMap<String, Object> annotations = new HashMap<>();
    final ReducibleAnnotationData<List<Integer>> myData = new AlleleSpecificAnnotationData<>(vc.getAlleles(),null);
    calculateRawData(vc, perReadAlleleLikelihoodMap, myData);
    final Map<Allele, List<Integer>> perAlleleValues = myData.getAttributeMap();
    final String annotationString = makeRawAnnotationString(vc.getAlleles(), perAlleleValues);
    annotations.put(getRawKeyName(), annotationString);
    return annotations;
}
|
||||
}
|
||||
|
||||
protected void parseRawDataString(ReducibleAnnotationData<List<Integer>> myData) {
|
||||
final String rawDataString = myData.getRawData();
|
||||
String[] rawDataPerAllele;
|
||||
String[] rawListEntriesAsStringVector;
|
||||
Map<Allele, List<Integer>> perAlleleValues = new HashMap<>();
|
||||
//Initialize maps
|
||||
for (Allele current : myData.getAlleles()) {
|
||||
perAlleleValues.put(current, new LinkedList<Integer>());
|
||||
}
|
||||
//rawDataPerAllele is the list of values for each allele (each of variable length)
|
||||
rawDataPerAllele = rawDataString.split(splitDelim);
|
||||
for (int i=0; i<rawDataPerAllele.length; i++) {
|
||||
String alleleData = rawDataPerAllele[i];
|
||||
if (alleleData.isEmpty())
|
||||
continue;
|
||||
List<Integer> alleleList = perAlleleValues.get(myData.getAlleles().get(i));
|
||||
rawListEntriesAsStringVector = alleleData.split(",");
|
||||
//Read counts will only ever be integers
|
||||
for (String s : rawListEntriesAsStringVector) {
|
||||
if (!s.isEmpty())
|
||||
alleleList.add(Integer.parseInt(s.trim()));
|
||||
}
|
||||
}
|
||||
myData.setAttributeMap(perAlleleValues);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> combineRawData(final List<Allele> vcAlleles, final List<? extends ReducibleAnnotationData> annotationList) {
|
||||
//VC already contains merged alleles from ReferenceConfidenceVariantContextMerger
|
||||
ReducibleAnnotationData combinedData = new AlleleSpecificAnnotationData(vcAlleles, null);
|
||||
|
||||
for (final ReducibleAnnotationData currentValue : annotationList) {
|
||||
parseRawDataString(currentValue);
|
||||
combineAttributeMap(currentValue, combinedData);
|
||||
}
|
||||
final Map<String, Object> annotations = new HashMap<>();
|
||||
final String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap());
|
||||
annotations.put(getRawKeyName(), annotationString);
|
||||
return annotations;
|
||||
}
|
||||
|
||||
protected void combineAttributeMap(final ReducibleAnnotationData<List<Integer>> toAdd, final ReducibleAnnotationData<List<Integer>> combined) {
|
||||
for (final Allele a : combined.getAlleles()) {
|
||||
if (toAdd.hasAttribute(a) && toAdd.getAttribute(a) != null) {
|
||||
if (combined.getAttribute(a) != null) {
|
||||
combined.getAttribute(a).set(0, (int) combined.getAttribute(a).get(0) + (int) toAdd.getAttribute(a).get(0));
|
||||
combined.getAttribute(a).set(1, (int) combined.getAttribute(a).get(1) + (int) toAdd.getAttribute(a).get(1));
|
||||
}
|
||||
else {
|
||||
List<Integer> alleleData = new ArrayList<>();
|
||||
alleleData.add(0, toAdd.getAttribute(a).get(0));
|
||||
alleleData.add(1, toAdd.getAttribute(a).get(1));
|
||||
combined.putAttribute(a,alleleData);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
protected String makeRawAnnotationString(final List<Allele> vcAlleles, final Map<Allele, List<Integer>> perAlleleValues) {
|
||||
String annotationString = "";
|
||||
for (final Allele a : vcAlleles) {
|
||||
if (!annotationString.isEmpty())
|
||||
annotationString += printDelim;
|
||||
List<Integer> alleleValues = perAlleleValues.get(a);
|
||||
if (alleleValues == null)
|
||||
alleleValues = ZERO_LIST;
|
||||
annotationString += encode(alleleValues);
|
||||
}
|
||||
return annotationString;
|
||||
}
|
||||
|
||||
protected String encode(List<Integer> alleleValues) {
|
||||
String annotationString = "";
|
||||
for (int j =0; j < alleleValues.size(); j++) {
|
||||
annotationString += alleleValues.get(j);
|
||||
if (j < alleleValues.size()-1)
|
||||
annotationString += ",";
|
||||
}
|
||||
return annotationString;
|
||||
}
|
||||
|
||||
|
||||
|
||||
protected String makeReducedAnnotationString(VariantContext vc, Map<Allele,Double> perAltsStrandCounts) {
|
||||
String annotationString = "";
|
||||
for (Allele a : vc.getAlternateAlleles()) {
|
||||
if (!annotationString.isEmpty())
|
||||
annotationString += reducedDelim;
|
||||
if (!perAltsStrandCounts.containsKey(a))
|
||||
logger.warn("ERROR: VC allele not found in annotation alleles -- maybe there was trimming?");
|
||||
else
|
||||
annotationString += String.format("%.3f", perAltsStrandCounts.get(a));
|
||||
}
|
||||
return annotationString;
|
||||
}
|
||||
|
||||
/**
|
||||
*
|
||||
* @param vc -- contains the final set of alleles, possibly subset by GenotypeGVCFs
|
||||
* @param originalVC -- used to get all the alleles for all gVCFs
|
||||
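* @return a map from the reduced annotation key to the comma-separated per-alt-allele values, or an empty map if vc carries no raw data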
|
||||
*/
|
||||
@Override
|
||||
public Map<String, Object> finalizeRawData(final VariantContext vc, final VariantContext originalVC) {
|
||||
Map<String, Object> annotations = new HashMap<>();
|
||||
if (!vc.hasAttribute(getRawKeyName()))
|
||||
return new HashMap<>();
|
||||
String rawRankSumData = vc.getAttributeAsString(getRawKeyName(),null);
|
||||
if (rawRankSumData == null)
|
||||
return new HashMap<>();
|
||||
|
||||
AlleleSpecificAnnotationData<List<Integer>> myData = new AlleleSpecificAnnotationData<>(originalVC.getAlleles(), rawRankSumData);
|
||||
parseRawDataString(myData);
|
||||
|
||||
Map<Allele, Double> perAltRankSumResults = calculateReducedData(myData);
|
||||
|
||||
String annotationString = makeReducedAnnotationString(vc, perAltRankSumResults);
|
||||
annotations.put(getKeyNames().get(0), annotationString);
|
||||
return annotations;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void calculateRawData(final VariantContext vc, Map<String, PerReadAlleleLikelihoodMap> pralm, final ReducibleAnnotationData rawAnnotations) {
|
||||
if(pralm == null)
|
||||
return;
|
||||
|
||||
getStrandCountsFromLikelihoodMap(vc, pralm, rawAnnotations, MIN_COUNT);
|
||||
}
|
||||
|
||||
protected abstract Map<Allele,Double> calculateReducedData(final AlleleSpecificAnnotationData<List<Integer>> combinedData );
|
||||
|
||||
/**
|
||||
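* Fill a per-allele strand count table from the per-read allele likelihoods, with one row per allele. In the end, it'll look something like this:
*          fw      rc
* allele1   #       #
* allele2   #       #
* Counts from each sample that passes minCount are accumulated into perAlleleValues; the method itself returns nothing.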
|
||||
*/
|
||||
public void getStrandCountsFromLikelihoodMap( final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap,
|
||||
final ReducibleAnnotationData perAlleleValues,
|
||||
final int minCount) {
|
||||
if( stratifiedPerReadAlleleLikelihoodMap == null )
|
||||
return;
|
||||
if( vc == null )
|
||||
return;
|
||||
|
||||
final Allele ref = vc.getReference();
|
||||
final List<Allele> allAlts = vc.getAlternateAlleles();
|
||||
|
||||
for (final PerReadAlleleLikelihoodMap maps : stratifiedPerReadAlleleLikelihoodMap.values() ) {
|
||||
final ReducibleAnnotationData<List<Integer>> sampleTable = new AlleleSpecificAnnotationData<>(vc.getAlleles(),null);
|
||||
for (final Map.Entry<GATKSAMRecord,Map<Allele,Double>> el : maps.getLikelihoodReadMap().entrySet()) {
|
||||
final MostLikelyAllele mostLikelyAllele = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue());
|
||||
final GATKSAMRecord read = el.getKey();
|
||||
updateTable(mostLikelyAllele.getAlleleIfInformative(), read, ref, allAlts, sampleTable);
|
||||
}
|
||||
//for each sample (value in stratified PRALM), only include it if there are >minCount informative reads
|
||||
if ( passesMinimumThreshold(sampleTable, minCount) )
|
||||
combineAttributeMap(sampleTable, perAlleleValues);
|
||||
}
|
||||
}
|
||||
|
||||
private void updateTable(final Allele bestAllele, final GATKSAMRecord read, final Allele ref, final List<Allele> allAlts, final ReducibleAnnotationData<List<Integer>> perAlleleValues) {
|
||||
|
||||
final boolean matchesRef = bestAllele.equals(ref, true);
|
||||
final boolean matchesAnyAlt = allAlts.contains(bestAllele);
|
||||
|
||||
//for uninformative reads
|
||||
if(bestAllele.isNoCall())
|
||||
return;
|
||||
|
||||
//can happen if a read's most likely allele has been removed when --max_alternate_alleles is exceeded
|
||||
if (!( matchesRef || matchesAnyAlt ))
|
||||
return;
|
||||
|
||||
final List<Integer> alleleStrandCounts;
|
||||
if (perAlleleValues.hasAttribute(bestAllele) && perAlleleValues.getAttribute(bestAllele) != null)
|
||||
alleleStrandCounts = perAlleleValues.getAttribute(bestAllele);
|
||||
else {
|
||||
alleleStrandCounts = new ArrayList<>();
|
||||
alleleStrandCounts.add(0,0);
|
||||
alleleStrandCounts.add(1,0);
|
||||
}
|
||||
if (read.isStrandless()) {
|
||||
// a strandless read counts as one observation on each strand
|
||||
// (the 1 is to ensure that a strandless read always counts as an observation on both strands, even
|
||||
// if the read is only seen once, because it's a merged read or other)
|
||||
alleleStrandCounts.set(FORWARD, alleleStrandCounts.get(FORWARD)+1);
|
||||
alleleStrandCounts.set(REVERSE, alleleStrandCounts.get(REVERSE)+1);
|
||||
} else {
|
||||
// a normal read with an actual strand
|
||||
final boolean isFW = !read.getReadNegativeStrandFlag();
|
||||
if (isFW)
|
||||
alleleStrandCounts.set(FORWARD, alleleStrandCounts.get(FORWARD)+1);
|
||||
else
|
||||
alleleStrandCounts.set(REVERSE, alleleStrandCounts.get(REVERSE)+1);
|
||||
}
|
||||
perAlleleValues.putAttribute(bestAllele, alleleStrandCounts);
|
||||
}
|
||||
|
||||
/**
|
||||
* Does this strand data array pass the minimum threshold for inclusion?
|
||||
*
|
||||
* @param sampleTable the per-allele fwd/rev read counts for a single sample
|
||||
* @param minCount The minimum threshold of counts in the array
|
||||
* @return true if it passes the minimum threshold, false otherwise
|
||||
*/
|
||||
protected boolean passesMinimumThreshold(final ReducibleAnnotationData<List<Integer>> sampleTable, final int minCount) {
|
||||
// the read total must be greater than minCount
|
||||
int readTotal = 0;
|
||||
for (final List<Integer> alleleValues : sampleTable.getAttributeMap().values()) {
|
||||
if (alleleValues != null) {
|
||||
readTotal += alleleValues.get(FORWARD);
|
||||
readTotal += alleleValues.get(REVERSE);
|
||||
}
|
||||
}
|
||||
return readTotal > minCount;
|
||||
}
|
||||
|
||||
|
||||
@Override
|
||||
//Allele-specific annotations cannot be called from walkers other than HaplotypeCaller
|
||||
protected Map<String, Object> calculateAnnotationFromGTfield(final GenotypesContext genotypes){
|
||||
return new HashMap<>();
|
||||
}
|
||||
|
||||
@Override
|
||||
//Allele-specific annotations cannot be called from walkers other than HaplotypeCaller
|
||||
protected Map<String, Object> calculateAnnotationFromStratifiedContexts(final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc){
|
||||
return new HashMap<>();
|
||||
}
|
||||
|
||||
@Override
|
||||
//This just calls the non-allele-specific code in StrandBiasTest.java
|
||||
protected abstract Map<String, Object> calculateAnnotationFromLikelihoodMap(final Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap,
|
||||
final VariantContext vc);
|
||||
|
||||
}
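
Review note: the raw annotation built by `makeRawAnnotationString` and read back by `parseRawDataString` is a pipe-delimited list with one entry per allele, each entry holding comma-separated forward and reverse counts. The sketch below illustrates that round trip and the element-wise merge done in `combineAttributeMap`. It is a minimal standalone illustration, not the GATK code: the class `RawStrandCountsSketch` and its methods are hypothetical, and it assumes both inputs list alleles in the same order, whereas the real class keys counts by `Allele`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch; only the delimiters mirror AS_StrandBiasTest.
final class RawStrandCountsSketch {
    static final String SPLIT_DELIM = "\\|"; // String.split takes a regex, so the pipe is escaped
    static final String PRINT_DELIM = "|";

    // Encode one {forward, reverse} pair per allele as "fw,rv|fw,rv|...".
    static String encode(List<int[]> perAlleleCounts) {
        StringBuilder sb = new StringBuilder();
        for (int[] counts : perAlleleCounts) {
            if (sb.length() > 0) {
                sb.append(PRINT_DELIM);
            }
            sb.append(counts[0]).append(',').append(counts[1]);
        }
        return sb.toString();
    }

    // Parse "fw,rv|fw,rv|..." back into {forward, reverse} pairs.
    static List<int[]> parse(String raw) {
        List<int[]> result = new ArrayList<>();
        for (String alleleData : raw.split(SPLIT_DELIM)) {
            String[] parts = alleleData.split(",");
            result.add(new int[] {
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim())
            });
        }
        return result;
    }

    // Sum counts allele by allele (assumes identical allele ordering in both lists).
    static List<int[]> combine(List<int[]> a, List<int[]> b) {
        List<int[]> combined = new ArrayList<>();
        for (int i = 0; i < a.size(); i++) {
            combined.add(new int[] { a.get(i)[0] + b.get(i)[0], a.get(i)[1] + b.get(i)[1] });
        }
        return combined;
    }

    public static void main(String[] args) {
        List<int[]> first = parse("10,1|9,2");
        List<int[]> second = parse("3,4|0,5");
        System.out.println(encode(combine(first, second)));
    }
}
```

Running `main` merges two samples' tables for a ref and one alt allele and prints the combined raw string in the same pipe-and-comma layout.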
|
||||
|
|
@ -0,0 +1,163 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
|
||||
import java.util.Collections;
|
||||
import java.util.HashMap;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
/**
|
||||
* Allele-specific strand bias estimated by the Symmetric Odds Ratio test
|
||||
*
|
||||
* <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other. </p>
|
||||
*
|
||||
* <p>The AS_StrandOddsRatio annotation is one of several methods that aim to evaluate whether there is strand bias in the data. It is an updated form of the Fisher Strand Test that is better at taking into account large amounts of data in high coverage situations. It is used to determine if there is strand bias between forward and reverse strands for the reference or alternate allele, and it does so separately for each allele. The reported value is ln-scaled.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>For the 2x2 contingency table below, the odds ratio is</p>
|
||||
*
|
||||
* $$ R = \frac{X[0][0] * X[1][1]}{X[0][1] * X[1][0]} $$
|
||||
*
|
||||
* <p>and its inverse:</p>
|
||||
*
|
||||
* <table>
|
||||
* <tr><td> </td><td>+ strand </td><td>- strand</td></tr>
|
||||
* <tr><td>REF;</td><td>X[0][0]</td><td>X[0][1]</td></tr>
|
||||
* <tr><td>ALT;</td><td>X[1][0]</td><td>X[1][1]</td></tr>
|
||||
* </table>
|
||||
*
|
||||
 * <p>The sum R + 1/R is used to detect a difference in strand bias for REF and for ALT (the sum makes it symmetric). A high value is indicative of a large difference where one entry is very small compared to the others. A scale factor of refRatio/altRatio where</p>
|
||||
*
|
||||
 * $$ refRatio = \frac{min(X[0][0], X[0][1])}{max(X[0][0], X[0][1])} $$
|
||||
 *
|
||||
 * <p>and </p>
|
||||
 *
|
||||
 * $$ altRatio = \frac{min(X[1][0], X[1][1])}{max(X[1][0], X[1][1])} $$
|
||||
 *
|
||||
 * <p>ensures that the annotation value is large only when the ALT strand counts are skewed; it stays low when the REF +/- read count ratio is skewed but the ALT ratio is not. </p>
|
||||
*
|
||||
* <p>See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this statistical test.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>
|
||||
* The name AS_StrandOddsRatio is not entirely appropriate because the implementation was changed somewhere between the start of development and release of this annotation. Now SOR isn't really an odds ratio anymore. The goal was to separate certain cases of data without penalizing variants that occur at the ends of exons because they tend to only be covered by reads in one direction (depending on which end of the exon they're on), so if a variant has 10 ref reads in the + direction, 1 ref read in the - direction, 9 alt reads in the + direction and 2 alt reads in the - direction, it's actually not strand biased, but the FS score is pretty bad. The implementation that resulted derived in part from empirically testing some read count tables of various sizes with various ratios and deciding from there.</p>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandOddsRatio.php">StrandOddsRatio</a></b> outputs a version of this annotation that includes all alternate alleles in a single calculation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandBiasBySample.php">StrandBiasBySample</a></b> outputs counts of read depth per allele for each strand orientation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_FisherStrand.php">FisherStrand</a></b> uses Fisher's Exact Test to evaluate strand bias.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class AS_StrandOddsRatio extends AS_StrandBiasTest implements AS_StandardAnnotation {
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() {
|
||||
return Collections.singletonList(GATKVCFConstants.AS_STRAND_ODDS_RATIO_KEY);
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Map<String, Object> calculateAnnotationFromLikelihoodMap(Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap,
|
||||
final VariantContext vc){
|
||||
// either SNP with no alignment context, or indels: per-read likelihood map needed
|
||||
final int[][] table = getContingencyTable(stratifiedPerReadAlleleLikelihoodMap, vc, MIN_COUNT);
|
||||
final double ratio = calculateSOR(table);
|
||||
return Collections.singletonMap(getKeyNames().get(0), (Object)String.format("%.3f",ratio));
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Map<Allele,Double> calculateReducedData(AlleleSpecificAnnotationData<List<Integer>> combinedData) {
|
||||
final Map<Allele,Double> annotationMap = new HashMap<>();
|
||||
final Map<Allele, List<Integer>> perAlleleData = combinedData.getAttributeMap();
|
||||
final List<Integer> refStrandCounts = perAlleleData.get(combinedData.getRefAllele());
|
||||
for (final Allele a : perAlleleData.keySet()) {
|
||||
List<Integer> altStrandCounts = perAlleleData.get(a);
|
||||
int[][] refAltTable = new int[][] {new int[]{refStrandCounts.get(0),refStrandCounts.get(1)},new int[]{altStrandCounts.get(0),altStrandCounts.get(1)}};
|
||||
annotationMap.put(a,calculateSOR(refAltTable));
|
||||
}
|
||||
return annotationMap;
|
||||
}
|
||||
|
||||
/**
|
||||
* Computes the SOR value of a table after augmentation (adding pseudocounts). Based on the symmetric odds ratio but modified to take on
|
||||
* low values when the reference +/- read count ratio is skewed but the alt count ratio is not. Natural log is taken
|
||||
* to keep values within roughly the same range as other annotations.
|
||||
*
|
||||
     * Adding pseudocounts prevents divide-by-zero.
|
||||
*
|
||||
* @param originalTable The table before augmentation
|
||||
* @return the SOR annotation value
|
||||
*/
|
||||
final protected double calculateSOR(final int[][] originalTable) {
|
||||
final double[][] augmentedTable = StrandBiasTableUtils.augmentContingencyTable(originalTable);
|
||||
|
||||
double ratio = 0;
|
||||
|
||||
ratio += (augmentedTable[0][0] / augmentedTable[0][1]) * (augmentedTable[1][1] / augmentedTable[1][0]);
|
||||
ratio += (augmentedTable[0][1] / augmentedTable[0][0]) * (augmentedTable[1][0] / augmentedTable[1][1]);
|
||||
|
||||
final double refRatio = (Math.min(augmentedTable[0][0], augmentedTable[0][1])/Math.max(augmentedTable[0][0], augmentedTable[0][1]));
|
||||
final double altRatio = (Math.min(augmentedTable[1][0], augmentedTable[1][1])/Math.max(augmentedTable[1][0], augmentedTable[1][1]));
|
||||
|
||||
ratio = ratio*refRatio/altRatio;
|
||||
|
||||
return Math.log(ratio);
|
||||
}
|
||||
}
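
The arithmetic in `calculateSOR` above can be tried outside GATK with a minimal standalone sketch. The class name `SorSketch` and the pseudocount of 1.0 are assumptions made for illustration; the real augmentation is done by `StrandBiasTableUtils.augmentContingencyTable`, which is not shown in this diff. The first table is the exon-edge example from the Caveat section (10 and 1 REF reads, 9 and 2 ALT reads).

```java
// Hypothetical standalone sketch; not part of the GATK sources.
final class SorSketch {

    // Assumed pseudocount; the actual value is defined in StrandBiasTableUtils.
    static final double PSEUDOCOUNT = 1.0;

    // Adds the pseudocount to every cell so that no division by zero can occur.
    static double[][] augment(final int[][] table) {
        final double[][] out = new double[2][2];
        for (int i = 0; i < 2; i++) {
            for (int j = 0; j < 2; j++) {
                out[i][j] = table[i][j] + PSEUDOCOUNT;
            }
        }
        return out;
    }

    // Mirrors the steps of calculateSOR: R + 1/R, scaled by refRatio/altRatio, then ln.
    static double calculateSor(final int[][] table) {
        final double[][] t = augment(table);
        double ratio = (t[0][0] / t[0][1]) * (t[1][1] / t[1][0]);
        ratio += (t[0][1] / t[0][0]) * (t[1][0] / t[1][1]);
        final double refRatio = Math.min(t[0][0], t[0][1]) / Math.max(t[0][0], t[0][1]);
        final double altRatio = Math.min(t[1][0], t[1][1]) / Math.max(t[1][0], t[1][1]);
        return Math.log(ratio * refRatio / altRatio);
    }

    public static void main(final String[] args) {
        final int[][] exonEdge = {{10, 1}, {9, 2}};    // REF and ALT skewed the same way
        final int[][] altSkewed = {{50, 50}, {99, 1}}; // only ALT is skewed
        System.out.printf("exon-edge table: %.3f%n", calculateSor(exonEdge));
        System.out.printf("alt-skewed table: %.3f%n", calculateSor(altSkewed));
    }
}
```

Under these assumptions the exon-edge table scores far lower than the table in which only the ALT reads are skewed, which is the behavior the Caveat section describes.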
|
||||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -1,44 +1,44 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
|
|
@ -51,18 +51,64 @@
|
|||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.samtools.Cigar;
|
||||
import htsjdk.samtools.CigarElement;
|
||||
import htsjdk.samtools.CigarOperator;
|
||||
import htsjdk.samtools.SAMRecord;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import org.apache.commons.lang.StringUtils;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel;
|
||||
import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs;
|
||||
import org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs;
|
||||
import org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.HashSet;
|
||||
import java.util.List;
|
||||
import java.util.Set;
|
||||
|
||||
public class AnnotationUtils {
|
||||
|
||||
public static final String ANNOTATION_HC_WARN_MSG = " annotation will not be calculated, must be called from HaplotypeCaller";
|
||||
public static final int WARNINGS_LOGGED_SIZE = 3;
|
||||
|
||||
/**
|
||||
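     * Helper function to format a list of values as the annotation string
|
||||
     * @param valueList the list of values to encode, e.g. the list returned from StrandBiasBySample.annotate()
|
||||
     * @param precisionFormat the String.format pattern applied to each value
|
||||
     * @return the comma-separated string used by the per-sample Strand Bias annotation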
|
||||
*/
|
||||
protected static String encodeValueList( final List<Double> valueList, final String precisionFormat ) {
|
||||
List<String> outputList = new ArrayList<>();
|
||||
for (Double d : valueList) {
|
||||
outputList.add(String.format(precisionFormat, d));
|
||||
}
|
||||
return StringUtils.join(outputList, ",");
|
||||
}
|
||||
|
||||
/**
|
||||
* Checks if the walker is compatible with allele-specific annotations
|
||||
*/
|
||||
public static boolean walkerSupportsAlleleSpecificAnnotations(final AnnotatorCompatible walker) {
|
||||
return ((walker instanceof HaplotypeCaller) || (walker instanceof CombineGVCFs) || (walker instanceof GenotypeGVCFs));
|
||||
}
|
||||
|
||||
/**
|
||||
* Checks if the walker should get raw annotation data
|
||||
*/
|
||||
public static boolean walkerRequiresRawData(final AnnotatorCompatible walker) {
|
||||
return ((walker instanceof HaplotypeCaller && ((HaplotypeCaller) walker).emitReferenceConfidence()) || walker instanceof CombineGVCFs);
|
||||
}
|
||||
|
||||
/**
|
||||
* Checks if the input data is appropriate
|
||||
*
|
||||
* @param annotation the input genotype annotation key name(s)
|
||||
* @param walker input walker
|
||||
* @param map input map for each read, holds underlying alleles represented by an aligned read, and corresponding relative likelihood.
|
||||
* @param g input genotype
|
||||
|
|
@ -70,20 +116,38 @@ public class AnnotationUtils {
|
|||
* @param logger logger specific for each caller
|
||||
*
|
||||
* @return true if the walker is a HaplotypeCaller, the likelihood map is non-null and the genotype is non-null and called, false otherwise
|
||||
* @throws ReviewedGATKException if the size of warningsLogged is less than 4.
|
||||
* @throws IllegalArgumentException if annotation, walker, g, warningsLogged, or logger are null.
|
||||
* @throws ReviewedGATKException if the size of warningsLogged is less than 3.
|
||||
*/
|
||||
public static boolean isAppropriateInput(final AnnotatorCompatible walker, final PerReadAlleleLikelihoodMap map, final Genotype g, final boolean[] warningsLogged, final Logger logger) {
|
||||
public static boolean isAppropriateInput(final String annotation, final AnnotatorCompatible walker, final PerReadAlleleLikelihoodMap map, final Genotype g, final boolean[] warningsLogged, final Logger logger) {
|
||||
|
||||
if ( warningsLogged.length < 4 ){
|
||||
throw new ReviewedGATKException("Warnings logged array must have at last 4 elements, but has " + warningsLogged.length);
|
||||
if ( annotation == null ){
|
||||
throw new IllegalArgumentException("The input annotation cannot be null");
|
||||
}
|
||||
|
||||
if ( walker == null ) {
|
||||
throw new IllegalArgumentException("The input walker cannot be null");
|
||||
}
|
||||
|
||||
if ( g == null ) {
|
||||
throw new IllegalArgumentException("The input genotype cannot be null");
|
||||
}
|
||||
|
||||
if ( warningsLogged == null ){
|
||||
throw new IllegalArgumentException("The input warnings logged cannot be null");
|
||||
}
|
||||
|
||||
if ( logger == null ){
|
||||
throw new IllegalArgumentException("The input logger cannot be null");
|
||||
}
|
||||
|
||||
if ( warningsLogged.length < WARNINGS_LOGGED_SIZE ){
|
||||
throw new ReviewedGATKException("Warnings logged array must have at least " + WARNINGS_LOGGED_SIZE + " elements, but has " + warningsLogged.length);
|
||||
}
|
||||
|
||||
if ( !(walker instanceof HaplotypeCaller) ) {
|
||||
if ( !warningsLogged[0] ) {
|
||||
if ( walker != null )
|
||||
logger.warn("Annotation will not be calculated, must be called from HaplotyepCaller, not " + walker.getClass().getName());
|
||||
else
|
||||
logger.warn("Annotation will not be calculated, must be called from HaplotyepCaller");
|
||||
logger.warn(annotation + ANNOTATION_HC_WARN_MSG + ", not " + walker.getClass().getSimpleName());
|
||||
warningsLogged[0] = true;
|
||||
}
|
||||
return false;
|
||||
|
|
@ -97,22 +161,126 @@ public class AnnotationUtils {
|
|||
return false;
|
||||
}
|
||||
|
||||
if ( g == null ){
|
||||
if ( !warningsLogged[2] ) {
|
||||
logger.warn("Annotation will not be calculated, missing genotype");
|
||||
warningsLogged[2]= true;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
if ( !g.isCalled() ){
|
||||
if ( !warningsLogged[3] ) {
|
||||
if ( !warningsLogged[2] ) {
|
||||
logger.warn("Annotation will not be calculated, genotype is not called");
|
||||
warningsLogged[3] = true;
|
||||
warningsLogged[2] = true;
|
||||
}
|
||||
return false;
|
||||
}
|
||||
|
||||
return true;
|
||||
}
|
||||
|
||||
|
||||
//this method is intended to reconcile uniquified sample names
|
||||
// it comes into play when calling this annotation from GenotypeGVCFs with --uniquifySamples because founderIds
|
||||
// is derived from the sampleDB, which comes from the input sample names, but vc will have uniquified (i.e. different)
|
||||
// sample names. Without this check, the founderIds won't be found in the vc and the annotation won't be calculated.
|
||||
protected static Set<String> validateFounderIDs(final Set<String> founderIds, final VariantContext vc) {
|
||||
Set<String> vcSamples = new HashSet<>();
|
||||
Set<String> returnIDs = founderIds;
|
||||
vcSamples.addAll(vc.getSampleNames());
|
||||
if (!vcSamples.isEmpty()) {
|
||||
if (founderIds != null) {
|
||||
vcSamples.removeAll(founderIds);
|
||||
if (vcSamples.equals(vc.getSampleNames()))
|
||||
returnIDs = vc.getSampleNames();
|
||||
}
|
||||
}
|
||||
return returnIDs;
|
||||
}
|
||||
|
||||
/**
|
||||
     * Get the position of a variant within a read with respect to the closer end, accounting for hard clipped bases and low quality ends.
|
||||
* Used by ReadPosRankSum annotations
|
||||
*
|
||||
* @param read a read containing the variant
|
||||
* @param initialReadPosition the position based on the modified, post-hard-clipped CIGAR
|
||||
* @return read position
|
||||
*/
|
||||
public static int getFinalVariantReadPosition(final GATKSAMRecord read, final int initialReadPosition) {
|
||||
final int numAlignedBases = getNumAlignedBases(read);
|
||||
|
||||
int readPos = initialReadPosition;
|
||||
//TODO: this doesn't work for the middle-right position if we index from zero
|
||||
if (initialReadPosition > numAlignedBases / 2) {
|
||||
readPos = numAlignedBases - (initialReadPosition + 1);
|
||||
}
|
||||
return readPos;
|
||||
|
||||
}
|
||||
|
||||
/**
|
||||
*
|
||||
* @param read a read containing the variant
|
||||
* @return the number of hard clipped and low qual bases at the read start (where start is the leftmost end w.r.t. the reference)
|
||||
*/
|
||||
public static int getNumClippedBasesAtStart(final SAMRecord read) {
|
||||
// check for hard clips (never consider these bases):
|
||||
final Cigar c = read.getCigar();
|
||||
final CigarElement first = c.getCigarElement(0);
|
||||
|
||||
int numStartClippedBases = 0;
|
||||
if (first.getOperator() == CigarOperator.H) {
|
||||
numStartClippedBases = first.getLength();
|
||||
}
|
||||
final byte[] unclippedReadBases = read.getReadBases();
|
||||
final byte[] unclippedReadQuals = read.getBaseQualities();
|
||||
|
||||
// Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative,
|
||||
// and may leave a string of Q2 bases still hanging off the reads.
|
||||
//TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails
|
||||
for (int i = numStartClippedBases; i < unclippedReadBases.length; i++) {
|
||||
if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD)
|
||||
numStartClippedBases++;
|
||||
else
|
||||
break;
|
||||
|
||||
}
|
||||
|
||||
return numStartClippedBases;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
*
|
||||
* @param read a read containing the variant
|
||||
* @return number of non-hard clipped, aligned bases (excluding low quality bases at either end)
|
||||
*/
|
||||
//TODO: this is bizarre -- this code counts hard clips, but then subtracts them from the read length, which already doesn't count hard clips
|
||||
public static int getNumAlignedBases(final GATKSAMRecord read) {
|
||||
return read.getReadLength() - getNumClippedBasesAtStart(read) - getNumClippedBasesAtEnd(read);
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
*
|
||||
* @param read a read containing the variant
|
||||
* @return number of hard clipped and low qual bases at the read end (where end is right end w.r.t. the reference)
|
||||
*/
|
||||
public static int getNumClippedBasesAtEnd(final GATKSAMRecord read) {
|
||||
// check for hard clips (never consider these bases):
|
||||
final Cigar c = read.getCigar();
|
||||
CigarElement last = c.getCigarElement(c.numCigarElements() - 1);
|
||||
|
||||
int numEndClippedBases = 0;
|
||||
if (last.getOperator() == CigarOperator.H) {
|
||||
numEndClippedBases = last.getLength();
|
||||
}
|
||||
final byte[] unclippedReadBases = read.getReadBases();
|
||||
final byte[] unclippedReadQuals = read.getBaseQualities();
|
||||
|
||||
// Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative,
|
||||
// and may leave a string of Q2 bases still hanging off the reads.
|
||||
//TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails
|
||||
for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) {
|
||||
if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD)
|
||||
numEndClippedBases++;
|
||||
else
|
||||
break;
|
||||
}
|
||||
|
||||
return numEndClippedBases;
|
||||
}
|
||||
}
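
The read-position helpers above (`getNumClippedBasesAtStart`, `getNumClippedBasesAtEnd`, `getFinalVariantReadPosition`) can be exercised without htsjdk by working on plain arrays. The sketch below is hypothetical: the class `ReadPositionSketch` and its method names are invented for illustration, the hard-clip length is passed in instead of being read from a CIGAR, and the quality threshold is a parameter rather than `PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD`. It mirrors the loops in the diff as written, including the indexing that the TODO comments question.

```java
// Hypothetical standalone sketch; not part of the GATK sources.
final class ReadPositionSketch {

    // Counts hard-clipped bases plus the run of low-quality bases that follows them at the start.
    static int clippedAtStart(final byte[] quals, final int hardClipped, final int threshold) {
        int clipped = hardClipped;
        for (int i = hardClipped; i < quals.length; i++) {
            if (quals[i] < threshold) {
                clipped++;
            } else {
                break;
            }
        }
        return clipped;
    }

    // Counts hard-clipped bases plus the run of low-quality bases that precedes them at the end.
    static int clippedAtEnd(final byte[] quals, final int hardClipped, final int threshold) {
        int clipped = hardClipped;
        for (int i = quals.length - hardClipped - 1; i >= 0; i--) {
            if (quals[i] < threshold) {
                clipped++;
            } else {
                break;
            }
        }
        return clipped;
    }

    // Folds a zero-based position so that it is measured from the closer end of the read.
    static int foldPosition(final int initialPosition, final int numAlignedBases) {
        if (initialPosition > numAlignedBases / 2) {
            return numAlignedBases - (initialPosition + 1);
        }
        return initialPosition;
    }

    public static void main(final String[] args) {
        final byte[] quals = {2, 2, 30, 30, 30, 2};
        final int threshold = 20; // assumed value, for illustration only
        final int aligned = quals.length
                - clippedAtStart(quals, 0, threshold)
                - clippedAtEnd(quals, 0, threshold);
        System.out.println("aligned bases: " + aligned);
        System.out.println("folded position of 2: " + foldPosition(2, aligned));
    }
}
```

Passing the threshold as an argument keeps the sketch independent of any GATK constant; a caller that wants the production behavior would substitute the value used by the real code.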
|
||||
|
|
|
|||
|
|
@ -0,0 +1,153 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.GenotypeBuilder;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFFormatHeaderLine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.GenotypeAnnotation;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
import org.broadinstitute.gatk.utils.BaseUtils;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Count of A, C, G, T bases for each sample
|
||||
*
|
||||
* <p> This annotation returns the counts of A, C, G, and T bases for each sample, in that order.</p>
|
||||
* <h3>Example:</h3>
|
||||
*
|
||||
* <pre>BCS=3,0,3,0</pre>
|
||||
*
|
||||
* <p>
|
||||
* This means the number of A bases seen is 3, the number of C bases seen is 0, the number of G bases seen is 3, and the number of T bases seen is 0.
|
||||
* </p>
|
||||
*
|
||||
* <p>
|
||||
* BaseCountsBySample is intended to provide insight into the pileup of bases used by HaplotypeCaller in the calling process, which may differ from the pileup
|
||||
* observed in the original bam file because of the local realignment and additional filtering performed internally by HaplotypeCaller.
|
||||
* </p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>Can only be requested from HaplotypeCaller, not VariantAnnotator.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_BaseCounts.php">BaseCounts</a></b> counts the number of A, C, G, and T bases across all samples.</li>
|
||||
* </ul>
|
||||
*/
|
||||
|
||||
public class BaseCountsBySample extends GenotypeAnnotation {
|
||||
|
||||
@Override
|
||||
public void annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final AlignmentContext stratifiedContext,
|
||||
final VariantContext vc,
|
||||
final Genotype g,
|
||||
final GenotypeBuilder gb,
|
||||
final PerReadAlleleLikelihoodMap alleleLikelihoodMap) {
|
||||
|
||||
if ( alleleLikelihoodMap != null && !alleleLikelihoodMap.isEmpty() )
|
||||
gb.attribute(GATKVCFConstants.BASE_COUNTS_BY_SAMPLE_KEY, getBaseCounts(alleleLikelihoodMap, vc));
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() { return Collections.singletonList(GATKVCFConstants.BASE_COUNTS_BY_SAMPLE_KEY); }
|
||||
|
||||
@Override
|
||||
public List<VCFFormatHeaderLine> getDescriptions() {
|
||||
return Collections.singletonList(GATKVCFHeaderLines.getFormatLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
||||
/**
|
||||
* Base counts given for the most likely allele
|
||||
*
|
||||
* @param perReadAlleleLikelihoodMap for each read, the underlying alleles represented by an aligned read, and corresponding relative likelihood.
|
||||
* @param vc variant context
|
||||
* @return count of A, C, G, T bases
|
||||
* @throws IllegalStateException if alleles in vc are not in perReadAlleleLikelihoodMap
|
||||
*/
|
||||
private int[] getBaseCounts(final PerReadAlleleLikelihoodMap perReadAlleleLikelihoodMap, final VariantContext vc) {
|
||||
final Set<Allele> alleles = new HashSet<>(vc.getAlleles());
|
||||
|
||||
// make sure that there's a meaningful relationship between the alleles in the perReadAlleleLikelihoodMap and our VariantContext
|
||||
if ( !perReadAlleleLikelihoodMap.getAllelesSet().containsAll(alleles) )
|
||||
throw new IllegalStateException("VC alleles " + alleles + " not a subset of per read allele map alleles " + perReadAlleleLikelihoodMap.getAllelesSet());
|
||||
|
||||
final int[] counts = new int[4];
|
||||
for ( final Map.Entry<GATKSAMRecord,Map<Allele,Double>> el : perReadAlleleLikelihoodMap.getLikelihoodReadMap().entrySet()) {
|
||||
final MostLikelyAllele a = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue(), alleles);
|
||||
if (! a.isInformative() ) continue; // read is non-informative
|
||||
for (final byte base : el.getKey().getReadBases() ){
|
||||
int index = BaseUtils.simpleBaseToBaseIndex(base);
|
||||
if ( index != -1 )
|
||||
counts[index]++;
|
||||
}
|
||||
}
|
||||
|
||||
return counts;
|
||||
}
|
||||
}
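The tallying loop in getBaseCounts can be illustrated in isolation. The following is a minimal sketch, not the GATK implementation: the class BaseCountSketch and the helper baseToIndex are hypothetical stand-ins (baseToIndex plays the role of BaseUtils.simpleBaseToBaseIndex and is only assumed to behave similarly), and reads are modeled as plain byte arrays rather than GATKSAMRecord objects. The filtering of non-informative reads is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names here are assumptions, not GATK API.
class BaseCountSketch {

    // Hypothetical stand-in for BaseUtils.simpleBaseToBaseIndex:
    // maps A, C, G, T (either case) to 0..3 and anything else to -1.
    static int baseToIndex(final byte base) {
        switch (base) {
            case 'A': case 'a': return 0;
            case 'C': case 'c': return 1;
            case 'G': case 'g': return 2;
            case 'T': case 't': return 3;
            default: return -1;
        }
    }

    // Tally A, C, G, T over the bases of every read; other symbols (e.g. N) are skipped.
    static int[] countBases(final List<byte[]> reads) {
        final int[] counts = new int[4];
        for (final byte[] read : reads) {
            for (final byte base : read) {
                final int index = baseToIndex(base);
                if (index != -1)
                    counts[index]++;
            }
        }
        return counts;
    }

    public static void main(final String[] args) {
        final List<byte[]> reads = Arrays.asList(
                "AAG".getBytes(StandardCharsets.US_ASCII),
                "AGGN".getBytes(StandardCharsets.US_ASCII));
        // Counts are reported in A, C, G, T order, as in the BCS example.
        System.out.println(Arrays.toString(countBases(reads)));
    }
}
```

With these two toy reads the tally is 3 A, 0 C, 3 G and 0 T, which corresponds to the BCS=3,0,3,0 example in the class documentation; the N base is ignored.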
|
||||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -65,13 +65,23 @@ import java.util.*;
|
|||
/**
|
||||
* Rank Sum Test of REF versus ALT base quality scores
|
||||
*
|
||||
* <p>This variant-level annotation tests compares the base qualities of the data supporting the reference allele with those supporting the alternate allele. The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the bases supporting the alternate allele have lower quality scores than those supporting the reference allele. Conversely, a positive value indicates that the bases supporting the alternate allele have higher quality scores than those supporting the reference allele. Finding a statistically significant difference either way suggests that the sequencing process may have been biased or affected by an artifact.</p>
|
||||
* <p>This variant-level annotation compares the base qualities of the data supporting the reference allele with those supporting any alternate allele.</p>
|
||||
*
|
||||
* <p>The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the bases supporting the alternate allele have lower quality scores than those supporting the reference allele. Conversely, a positive value indicates that the bases supporting the alternate allele have higher quality scores than those supporting the reference allele. Finding a statistically significant difference either way suggests that the sequencing process may have been biased or affected by an artifact.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for base qualities (bases supporting REF vs. bases supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>The base quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</p>
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>Uninformative reads are not used in these calculations.</li>
|
||||
* <li>The base quality rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_BaseQualityRankSumTest.php">AS_BaseQualityRankSumTest</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class BaseQualityRankSumTest extends RankSumTest implements StandardAnnotation {
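To make the "u-based z-approximation" in the documentation above concrete, here is a minimal sketch of a Mann-Whitney U statistic and its normal approximation. It is not the GATK implementation: the class RankSumSketch and its method names are hypothetical, qualities are plain int arrays, and the variance omits the tie correction that a production implementation would likely apply. The sign convention is chosen to match the text: a negative value means the ALT-supporting bases have lower qualities than the REF-supporting ones.

```java
// Illustrative sketch only; names here are assumptions, not GATK API.
class RankSumSketch {

    // U statistic for the ALT sample: each (alt, ref) pair contributes
    // 1 if alt > ref, 0.5 if they are tied, and 0 otherwise.
    static double uStatistic(final int[] alt, final int[] ref) {
        double u = 0.0;
        for (final int a : alt) {
            for (final int r : ref) {
                if (a > r) u += 1.0;
                else if (a == r) u += 0.5;
            }
        }
        return u;
    }

    // Normal approximation: z = (U - n1*n2/2) / sqrt(n1*n2*(n1+n2+1)/12).
    // No tie correction is applied to the variance (a simplification).
    static double zApproximation(final int[] alt, final int[] ref) {
        final double n1 = alt.length;
        final double n2 = ref.length;
        final double mean = n1 * n2 / 2.0;
        final double sd = Math.sqrt(n1 * n2 * (n1 + n2 + 1.0) / 12.0);
        return (uStatistic(alt, ref) - mean) / sd;
    }

    public static void main(final String[] args) {
        final int[] altQuals = {10, 12, 11};
        final int[] refQuals = {30, 31, 32};
        // ALT qualities are uniformly lower than REF, so z is negative.
        System.out.println(zApproximation(altQuals, refQuals));
    }
}
```

If either array is empty the result is NaN in this sketch, which mirrors the caveat that the test cannot be calculated without reads supporting both the reference and an alternate allele.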
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -90,6 +90,7 @@ import java.util.*;
|
|||
public class ChromosomeCounts extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
|
||||
private Set<String> founderIds = new HashSet<String>();
|
||||
private boolean didUniquifiedSampleNameCheck = false;
|
||||
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
|
|
@ -99,6 +100,11 @@ public class ChromosomeCounts extends InfoFieldAnnotation implements StandardAnn
|
|||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
if ( ! vc.hasGenotypes() )
|
||||
return null;
|
||||
//if none of the "founders" are in the vc samples, assume we uniquified the samples upstream and they are all founders
|
||||
if (!didUniquifiedSampleNameCheck) {
|
||||
checkSampleNames(vc);
|
||||
didUniquifiedSampleNameCheck = true;
|
||||
}
|
||||
|
||||
return VariantContextUtils.calculateChromosomeCounts(vc, new HashMap<String, Object>(), true,founderIds);
|
||||
}
|
||||
|
|
@ -113,4 +119,21 @@ public class ChromosomeCounts extends InfoFieldAnnotation implements StandardAnn
|
|||
}
|
||||
|
||||
public List<VCFInfoHeaderLine> getDescriptions() { return Arrays.asList(ChromosomeCountConstants.descriptions); }
|
||||
|
||||
//this method is intended to reconcile uniquified sample names
|
||||
// it comes into play when calling this annotation from GenotypeGVCFs with --uniquifySamples because founderIds
|
||||
// is derived from the sampleDB, which comes from the input sample names, but vc will have uniquified (i.e. different)
|
||||
// sample names. Without this check, the founderIds won't be found in the vc and the annotation won't be calculated.
|
||||
protected void checkSampleNames(final VariantContext vc) {
|
||||
Set<String> vcSamples = new HashSet<>();
|
||||
vcSamples.addAll(vc.getSampleNames());
|
||||
if (!vcSamples.isEmpty()) {
|
||||
if (founderIds!=null) {
|
||||
vcSamples.retainAll(founderIds);
|
||||
if (vcSamples.isEmpty())
|
||||
founderIds = vc.getSampleNames();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
}
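The fallback performed by checkSampleNames can be restated as a pure function over sets. This is a minimal sketch under stated assumptions: FounderIdSketch and reconcile are hypothetical names, and the sample names are passed in directly instead of being read from a VariantContext. It mirrors the logic described in the comments above: if no founder ID appears among the VC sample names, all VC samples are treated as founders.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; names here are assumptions, not GATK API.
class FounderIdSketch {

    // Returns the founder set to use for the given VC sample names.
    static Set<String> reconcile(final Set<String> founderIds, final Set<String> vcSampleNames) {
        // Nothing to reconcile without sample names or a founder set.
        if (founderIds == null || vcSampleNames.isEmpty())
            return founderIds;

        // Work on a copy so neither input set is modified.
        final Set<String> overlap = new HashSet<>(vcSampleNames);
        overlap.retainAll(founderIds);

        // No overlap: assume the names were uniquified upstream and all samples are founders.
        return overlap.isEmpty() ? vcSampleNames : founderIds;
    }

    public static void main(final String[] args) {
        final Set<String> founders = new HashSet<>(Arrays.asList("NA12878"));
        final Set<String> uniquified = new HashSet<>(Arrays.asList("NA12878.variant", "NA12878.variant2"));
        System.out.println(reconcile(founders, uniquified));
    }
}
```

Copying the sample names before calling retainAll keeps the check free of side effects, which matters here because retainAll mutates the set it is called on.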
|
||||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -52,6 +52,7 @@
|
|||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardHCAnnotation;
|
||||
import org.broadinstitute.gatk.utils.sam.AlignmentUtils;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
|
|
@ -71,7 +72,7 @@ import java.util.*;
|
|||
* <p>The clipping rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</p>
|
||||
*
|
||||
*/
|
||||
public class ClippingRankSumTest extends RankSumTest {
|
||||
public class ClippingRankSumTest extends RankSumTest implements StandardHCAnnotation{
|
||||
@Override
|
||||
public List<String> getKeyNames() { return Arrays.asList(GATKVCFConstants.CLIPPING_RANK_SUM_KEY); }
|
||||
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -51,9 +51,12 @@
|
|||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import org.apache.commons.lang.StringUtils;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.GenotypeAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardHCAnnotation;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
|
||||
|
|
@ -91,10 +94,10 @@ import java.util.*;
|
|||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class DepthPerSampleHC extends GenotypeAnnotation {
|
||||
public class DepthPerSampleHC extends GenotypeAnnotation implements StandardHCAnnotation{
|
||||
private final static Logger logger = Logger.getLogger(DepthPerSampleHC.class);
|
||||
private boolean alleleLikelihoodMapSubsetWarningLogged = false;
|
||||
boolean[] warningsLogged = new boolean[4];
|
||||
private final boolean[] warningsLogged = new boolean[AnnotationUtils.WARNINGS_LOGGED_SIZE];
|
||||
|
||||
@Override
|
||||
public void annotate(final RefMetaDataTracker tracker,
|
||||
|
|
@ -106,7 +109,7 @@ public class DepthPerSampleHC extends GenotypeAnnotation {
|
|||
final GenotypeBuilder gb,
|
||||
final PerReadAlleleLikelihoodMap alleleLikelihoodMap){
|
||||
|
||||
if ( !AnnotationUtils.isAppropriateInput(walker, alleleLikelihoodMap, g, warningsLogged, logger) ) {
|
||||
if ( !AnnotationUtils.isAppropriateInput(VCFConstants.DEPTH_KEY , walker, alleleLikelihoodMap, g, warningsLogged, logger) ) {
|
||||
return;
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -0,0 +1,276 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import org.apache.commons.math.stat.StatUtils;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.engine.walkers.Walker;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.MathUtils;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.GenotypesContext;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
|
||||
/**
|
||||
* Phred-scaled p-value for exact test of excess heterozygosity
|
||||
*
|
||||
* This annotation estimates excess heterozygosity in a population of samples. It is related to but distinct from InbreedingCoeff, which estimates evidence for inbreeding in a population. ExcessHet scales more reliably to large cohort sizes.
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>This annotation is a one-sided phred-scaled p-value using an exact test of the Hardy-Weinberg Equilibrium. The null hypothesis is that the number of heterozygotes follows the Hardy-Weinberg Equilibrium. The p-value is the probability of getting the same or more heterozygotes as was observed, given the null hypothesis. </p>
|
||||
* <p>The implementation used is adapted from Wigginton JE, Cutler DJ, Abecasis GR. A Note on Exact Tests of Hardy-Weinberg Equilibrium. American Journal of Human Genetics. 2005;76(5):887-893.</p>
|
||||
* <p>The p-value is calculated exactly by using the Levene-Haldane distribution. This implementation also uses a mid-p correction as described by Graffelman, J. & Moreno, V. (2013). The mid p-value in exact tests for Hardy-Weinberg equilibrium. Statistical Applications in Genetics and Molecular Biology, 12(4), pp. 433-448. </p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>The annotation is not accurate for very small p-values. Beyond 1.0E-16 there is no guarantee that the p-value is accurate, just that it is in fact smaller than 1.0E-16. </li>
|
||||
* <li>For multiallelic sites, all non-reference alleles are treated as a single alternate allele.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_InbreedingCoeff.php">InbreedingCoeff</a></b> estimates whether there is evidence of inbreeding in a population</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_InbreedingCoeff.php">AS_InbreedingCoeff</a></b> outputs an allele-specific version of the InbreedingCoeff annotation.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class ExcessHet extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(ExcessHet.class);
|
||||
private final double minNeededValue = 1.0E-16;
|
||||
private Set<String> founderIds;
|
||||
private final boolean RETURN_ROUNDED = true;
|
||||
private int sampleCount = -1;
|
||||
|
||||
@Override
|
||||
public void initialize ( AnnotatorCompatible walker, GenomeAnalysisEngine toolkit, Set<VCFHeaderLine> headerLines ) {
|
||||
//If available, get the founder IDs and cache them. The ExcessHet value will only be computed on founders then.
|
||||
//excessHet respects pedigree files, but doesn't require a minimum number of samples
|
||||
if(founderIds == null && walker != null) {
|
||||
founderIds = ((Walker) walker).getSampleDB().getFounderIds();
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap) {
|
||||
|
||||
return makeEHAnnotation(vc);
|
||||
}
|
||||
|
||||
protected double calculateEH(final VariantContext vc, final GenotypesContext genotypes) {
|
||||
HeterozygosityUtils heterozygosityUtils = new HeterozygosityUtils(RETURN_ROUNDED);
|
||||
final double[] genotypeCountsDoubles = heterozygosityUtils.getGenotypeCountsForRefVsAllAlts(vc, genotypes);
|
||||
sampleCount = heterozygosityUtils.getSampleCount();
|
||||
final int[] genotypeCounts = new int[genotypeCountsDoubles.length];
|
||||
for(int i = 0; i < genotypeCountsDoubles.length; i++) {
|
||||
genotypeCounts[i] = (int)genotypeCountsDoubles[i];
|
||||
}
|
||||
|
||||
double pval = exactTest(genotypeCounts);
|
||||
|
||||
//If the actual phredPval would be infinity we will probably still filter out just a very large number
|
||||
if (pval == 0) {
|
||||
return Integer.MAX_VALUE;
|
||||
}
|
||||
double phredPval = -10.0 * Math.log10(pval);
|
||||
|
||||
return phredPval;
|
||||
}
|
||||
|
||||
/**
|
||||
* Note that this method is not accurate for very small p-values. Beyond 1.0E-16 there is no guarantee that the
|
||||
* p-value is accurate, just that it is in fact smaller than 1.0E-16 (and therefore we should filter it). It would
|
||||
* be more computationally expensive to calculate accuracy beyond a given threshold. Here we have enough accuracy
|
||||
* to filter anything below a p-value of 10E-6.
|
||||
*
|
||||
* @param genotypeCounts Number of observed genotypes (n_aa, n_ab, n_bb)
|
||||
* @return Right sided p-value or the probability of getting the observed or higher number of hets given the sample
|
||||
* size (N) and the observed number of allele a (rareCopies)
|
||||
*/
|
||||
protected double exactTest(final int[] genotypeCounts) {
|
||||
if (genotypeCounts.length != 3) {
|
||||
throw new IllegalStateException("Input genotype counts must be length 3 for the number of genotypes with {2, 1, 0} ref alleles.");
|
||||
}
|
||||
final int REF_INDEX = 0;
|
||||
final int HET_INDEX = 1;
|
||||
final int VAR_INDEX = 2;
|
||||
|
||||
final int refCount = genotypeCounts[REF_INDEX];
|
||||
final int hetCount = genotypeCounts[HET_INDEX];
|
||||
final int homCount = genotypeCounts[VAR_INDEX];
|
||||
|
||||
if (hetCount < 0 || refCount < 0 || homCount < 0) {
|
||||
throw new IllegalArgumentException("Genotype counts cannot be less than 0");
|
||||
}
|
||||
|
||||
//Split into observed common allele and rare allele
|
||||
final int obsHomR;
|
||||
final int obsHomC;
|
||||
if (refCount < homCount) {
|
||||
obsHomR = refCount;
|
||||
obsHomC = homCount;
|
||||
} else {
|
||||
obsHomR = homCount;
|
||||
obsHomC = refCount;
|
||||
}
|
||||
|
||||
final int rareCopies = 2 * obsHomR + hetCount;
|
||||
final int N = hetCount + obsHomC + obsHomR;
|
||||
|
||||
//If the probability distribution has only 1 point, then the mid p-value is .5
|
||||
if (rareCopies <= 1) {
|
||||
return .5;
|
||||
}
|
||||
|
||||
double[] probs = new double[rareCopies + 1];
|
||||
|
||||
//Find (something close to the) mode for the midpoint
|
||||
int mid = (int) Math.floor(((double) rareCopies * (2.0 * (double) N - (double) rareCopies)) / (2.0 * (double) N - 1.0));
|
||||
if ((mid % 2) != (rareCopies % 2)) {
|
||||
mid++;
|
||||
}
|
||||
|
||||
probs[mid] = 1.0;
|
||||
double mysum = 1.0;
|
||||
|
||||
//Calculate probabilities from midpoint down
|
||||
int currHets = mid;
|
||||
int currHomR = (rareCopies - mid) / 2;
|
||||
int currHomC = N - currHets - currHomR;
|
||||
|
||||
while (currHets >= 2) {
|
||||
double potentialProb = probs[currHets] * (double) currHets * ((double) currHets - 1.0) / (4.0 * ((double) currHomR + 1.0) * ((double) currHomC + 1.0));
|
||||
if (potentialProb < minNeededValue) {
|
||||
break;
|
||||
}
|
||||
|
||||
probs[currHets - 2] = potentialProb;
|
||||
mysum = mysum + probs[currHets - 2];
|
||||
|
||||
//2 fewer hets means one additional homR and homC each
|
||||
currHets = currHets - 2;
|
||||
currHomR = currHomR + 1;
|
||||
currHomC = currHomC + 1;
|
||||
}
|
||||
|
||||
//Calculate probabilities from midpoint up
|
||||
currHets = mid;
|
||||
currHomR = (rareCopies - mid) / 2;
|
||||
currHomC = N - currHets - currHomR;
|
||||
|
||||
while (currHets <= rareCopies - 2) {
|
||||
double potentialProb = probs[currHets] * 4.0 * (double) currHomR * (double) currHomC / (((double) currHets + 2.0) * ((double) currHets + 1.0));
|
||||
if (potentialProb < minNeededValue) {
|
||||
break;
|
||||
}
|
||||
|
||||
probs[currHets + 2] = potentialProb;
|
||||
mysum = mysum + probs[currHets + 2];
|
||||
|
||||
//2 more hets means 1 fewer homR and homC each
|
||||
currHets = currHets + 2;
|
||||
currHomR = currHomR - 1;
|
||||
currHomC = currHomC - 1;
|
||||
}
|
||||
|
||||
double rightPval = probs[hetCount] / (2.0 * mysum);
|
||||
//Check if we observed the highest possible number of hets
|
||||
if (hetCount == rareCopies) {
|
||||
return rightPval;
|
||||
}
|
||||
rightPval = rightPval + StatUtils.sum(Arrays.copyOfRange(probs, hetCount + 1, probs.length)) / mysum;
|
||||
|
||||
return (rightPval);
|
||||
}
|
||||
|
||||
protected Map<String, Object> makeEHAnnotation(final VariantContext vc) {
|
||||
final GenotypesContext genotypes = (founderIds == null || founderIds.isEmpty()) ? vc.getGenotypes() : vc.getGenotypes(founderIds);
|
||||
if (genotypes == null || !vc.isVariant())
|
||||
return null;
|
||||
double EH = calculateEH(vc, genotypes);
|
||||
if (sampleCount < 1)
|
||||
return null;
|
||||
return Collections.singletonMap(getKeyNames().get(0), (Object) String.format("%.4f", EH));
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() {
|
||||
return Collections.singletonList(GATKVCFConstants.EXCESS_HET_KEY);
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
return Collections.singletonList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
||||
}
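The exact test in `ExcessHet.exactTest` can be hard to follow because it builds the distribution outward from the mode. The sketch below is a simplified, stand-alone illustration of the same mid-p calculation, not the GATK implementation. It assumes small genotype counts, so it starts from the lowest possible het count and skips the `minNeededValue` cutoff. The class and method names (`ExcessHetSketch`, `midPExcessHet`, `phred`) are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical stand-alone sketch; not part of GATK. Intended for small counts only. */
final class ExcessHetSketch {

    /**
     * Right-sided mid-p value for excess heterozygotes.
     * genotypeCounts = {homRef, het, homVar}, the same layout ExcessHet.exactTest expects.
     */
    static double midPExcessHet(final int[] genotypeCounts) {
        final int refCount = genotypeCounts[0];
        final int hetCount = genotypeCounts[1];
        final int homCount = genotypeCounts[2];

        // Split homozygotes into rare-allele and common-allele classes
        final int obsHomR = Math.min(refCount, homCount);
        final int obsHomC = Math.max(refCount, homCount);
        final int rareCopies = 2 * obsHomR + hetCount;
        final int n = hetCount + obsHomC + obsHomR;

        // Unnormalized Levene-Haldane weights, indexed by het count.
        // Only indices with the same parity as rareCopies are reachable.
        final double[] weights = new double[rareCopies + 1];
        int hets = rareCopies % 2;
        int homR = (rareCopies - hets) / 2;
        int homC = n - hets - homR;
        weights[hets] = 1.0;
        while (hets <= rareCopies - 2) {
            // Same upward recurrence as the source: 2 more hets, 1 fewer homR and homC
            weights[hets + 2] = weights[hets] * 4.0 * homR * homC / ((hets + 2.0) * (hets + 1.0));
            hets += 2;
            homR--;
            homC--;
        }

        final double total = Arrays.stream(weights).sum();
        // Mid-p correction: count only half of the observed point, plus everything above it
        double tail = 0.5 * weights[hetCount];
        for (int h = hetCount + 1; h <= rareCopies; h++) {
            tail += weights[h];
        }
        return tail / total;
    }

    /** Phred scaling as in calculateEH: a p-value of 0 maps to Integer.MAX_VALUE. */
    static double phred(final double pValue) {
        return pValue == 0 ? Integer.MAX_VALUE : -10.0 * Math.log10(pValue);
    }

    public static void main(final String[] args) {
        final double p = midPExcessHet(new int[]{5, 4, 1});
        System.out.printf("mid-p = %.4f, ExcessHet = %.4f%n", p, phred(p));
    }
}
```

For counts {5, 4, 1} the reachable het counts are 0, 2, 4 and 6 with relative weights 1, 42, 168 and 112, so the mid-p value is (84 + 112) / 323. Starting from the mode, as the source does, keeps the largest weight at 1.0 and avoids overflow for large cohorts; this sketch trades that robustness for brevity.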
|
||||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as stated above for attribution purposes.
|
||||
*
|
||||
|
|
@ -51,9 +51,7 @@
|
|||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import cern.jet.math.Arithmetic;
|
||||
import htsjdk.variant.variantcontext.GenotypesContext;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
|
|
@ -70,7 +68,10 @@ import java.util.*;
|
|||
/**
|
||||
* Strand bias estimated using Fisher's Exact Test
|
||||
*
|
||||
* <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other. The FisherStrand annotation is one of several methods that aims to evaluate whether there is strand bias in the data. It uses Fisher's Exact Test to determine if there is strand bias between forward and reverse strands for the reference or alternate allele.”</p>
|
||||
* <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other.</p>
|
||||
*
|
||||
* <p>The FisherStrand annotation is one of several methods that aims to evaluate whether there is strand bias in the data. It uses Fisher's Exact Test to determine if there is strand bias between forward and reverse strands for the reference or alternate allele.</p>
|
||||
*
|
||||
* <p>The output is a Phred-scaled p-value. The higher the output value, the more likely there is to be bias. More bias is indicative of false positive calls.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
|
|
@ -83,6 +84,7 @@ import java.util.*;
|
|||
* </ul>
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_FisherStrand.php">AS_FisherStrand</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandBiasBySample.php">StrandBiasBySample</a></b> outputs counts of read depth per allele for each strand orientation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandOddsRatio.php">StrandOddsRatio</a></b> is an updated form of FisherStrand that uses a symmetric odds ratio calculation.</li>
|
||||
* </ul>
|
||||
|
|
@ -90,16 +92,25 @@ import java.util.*;
|
|||
*/
|
||||
public class FisherStrand extends StrandBiasTest implements StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
private final static boolean ENABLE_DEBUGGING = false;
|
||||
private final static Logger logger = Logger.getLogger(FisherStrand.class);
|
||||
|
||||
private static final double MIN_PVALUE = 1E-320;
|
||||
private static final int MIN_QUAL_FOR_FILTERED_TEST = 17;
|
||||
private static final int MIN_COUNT = ARRAY_DIM;
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() {
|
||||
return Collections.singletonList(GATKVCFConstants.FISHER_STRAND_KEY);
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
return Collections.singletonList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
||||
@Override
|
||||
protected Map<String, Object> calculateAnnotationFromGTfield(final GenotypesContext genotypes){
|
||||
final int[][] tableFromPerSampleAnnotations = getTableFromSamples( genotypes, MIN_COUNT );
|
||||
return ( tableFromPerSampleAnnotations != null )? pValueForBestTable(tableFromPerSampleAnnotations, null) : null;
|
||||
return ( tableFromPerSampleAnnotations != null )? pValueAnnotationForBestTable(tableFromPerSampleAnnotations, null) : null;
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
@ -107,9 +118,11 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation,
|
|||
final VariantContext vc){
|
||||
final int[][] tableNoFiltering = getSNPContingencyTable(stratifiedContexts, vc.getReference(), vc.getAlternateAlleles(), -1, MIN_COUNT);
|
||||
final int[][] tableFiltering = getSNPContingencyTable(stratifiedContexts, vc.getReference(), vc.getAlternateAlleles(), MIN_QUAL_FOR_FILTERED_TEST, MIN_COUNT);
|
||||
printTable("unfiltered", tableNoFiltering);
|
||||
printTable("filtered", tableFiltering);
|
||||
return pValueForBestTable(tableFiltering, tableNoFiltering);
|
||||
if (ENABLE_DEBUGGING) {
|
||||
StrandBiasTableUtils.printTable("unfiltered", tableNoFiltering);
|
||||
StrandBiasTableUtils.printTable("filtered", tableFiltering);
|
||||
}
|
||||
return pValueAnnotationForBestTable(tableFiltering, tableNoFiltering);
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
@ -119,11 +132,9 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation,
|
|||
final int[][] table = getContingencyTable(stratifiedPerReadAlleleLikelihoodMap, vc, MIN_COUNT);
|
||||
//logger.info("VC " + vc);
|
||||
//printTable(table, 0.0);
|
||||
return pValueForBestTable(table, null);
|
||||
return pValueAnnotationForBestTable(table, null);
|
||||
}
|
||||
|
||||
|
||||
|
||||
/**
|
||||
* Create an annotation for the highest (i.e., least significant) p-value of table1 and table2
|
||||
*
|
||||
|
|
@ -131,14 +142,14 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation,
|
|||
* @param table2 a contingency table, may be null
|
||||
* @return annotation result for FS given tables
|
||||
*/
|
||||
private Map<String, Object> pValueForBestTable(final int[][] table1, final int[][] table2) {
|
||||
private Map<String, Object> pValueAnnotationForBestTable(final int[][] table1, final int[][] table2) {
|
||||
if ( table2 == null )
|
||||
return table1 == null ? null : annotationForOneTable(pValueForContingencyTable(table1));
|
||||
return table1 == null ? null : annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1));
|
||||
else if (table1 == null)
|
||||
return annotationForOneTable(pValueForContingencyTable(table2));
|
||||
return annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2));
|
||||
else { // take the one with the best (i.e., least significant pvalue)
|
||||
double pvalue1 = pValueForContingencyTable(table1);
|
||||
double pvalue2 = pValueForContingencyTable(table2);
|
||||
double pvalue1 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1);
|
||||
double pvalue2 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2);
|
||||
return annotationForOneTable(Math.max(pvalue1, pvalue2));
|
||||
}
|
||||
}
|
||||
|
|
@ -153,185 +164,4 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation,
|
|||
final Object value = String.format("%.3f", QualityUtils.phredScaleErrorRate(Math.max(pValue, MIN_PVALUE))); // prevent INFINITYs
|
||||
return Collections.singletonMap(getKeyNames().get(0), value);
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() {
|
||||
return Collections.singletonList(GATKVCFConstants.FISHER_STRAND_KEY);
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
return Collections.singletonList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
||||
/**
|
||||
* Helper function to turn the FisherStrand table into the SB annotation array
|
||||
* @param table the table used by the FisherStrand annotation
|
||||
* @return the array used by the per-sample Strand Bias annotation
|
||||
*/
|
||||
public static List<Integer> getContingencyArray( final int[][] table ) {
|
||||
if(table.length != ARRAY_DIM || table[0].length != ARRAY_DIM) {
|
||||
logger.warn("Expecting a " + ARRAY_DIM + "x" + ARRAY_DIM + " strand bias table.");
|
||||
return null;
|
||||
}
|
||||
|
||||
final List<Integer> list = new ArrayList<>(ARRAY_SIZE); // TODO - if we ever want to do something clever with multi-allelic sites this will need to change
|
||||
list.add(table[0][0]);
|
||||
list.add(table[0][1]);
|
||||
list.add(table[1][0]);
|
||||
list.add(table[1][1]);
|
||||
return list;
|
||||
}
|
||||
|
||||
public static Double pValueForContingencyTable(int[][] originalTable) {
|
||||
final int[][] normalizedTable = normalizeContingencyTable(originalTable);
|
||||
|
||||
int[][] table = copyContingencyTable(normalizedTable);
|
||||
|
||||
double pCutoff = computePValue(table);
|
||||
//printTable(table, pCutoff);
|
||||
|
||||
double pValue = pCutoff;
|
||||
while (rotateTable(table)) {
|
||||
double pValuePiece = computePValue(table);
|
||||
|
||||
//printTable(table, pValuePiece);
|
||||
|
||||
if (pValuePiece <= pCutoff) {
|
||||
pValue += pValuePiece;
|
||||
}
|
||||
}
|
||||
|
||||
table = copyContingencyTable(normalizedTable);
|
||||
while (unrotateTable(table)) {
|
||||
double pValuePiece = computePValue(table);
|
||||
|
||||
//printTable(table, pValuePiece);
|
||||
|
||||
if (pValuePiece <= pCutoff) {
|
||||
pValue += pValuePiece;
|
||||
}
|
||||
}
|
||||
|
||||
//System.out.printf("P-cutoff: %f\n", pCutoff);
|
||||
//System.out.printf("P-value: %f\n\n", pValue);
|
||||
|
||||
// min is necessary as numerical precision can result in pValue being slightly greater than 1.0
|
||||
return Math.min(pValue, 1.0);
|
||||
}
|
||||
|
||||
// how large do we want the normalized table to be?
|
||||
private static final double TARGET_TABLE_SIZE = 200.0;
|
||||
|
||||
/**
|
||||
* Normalize the table so that the entries are not too large.
|
||||
* Note that this method does NOT necessarily make a copy of the table being passed in!
|
||||
*
|
||||
* @param table the original table
|
||||
* @return a normalized version of the table or the original table if it is already normalized
|
||||
*/
|
||||
private static int[][] normalizeContingencyTable(final int[][] table) {
|
||||
final int sum = table[0][0] + table[0][1] + table[1][0] + table[1][1];
|
||||
if ( sum <= TARGET_TABLE_SIZE * 2 )
|
||||
return table;
|
||||
|
||||
final double normalizationFactor = (double)sum / TARGET_TABLE_SIZE;
|
||||
|
||||
final int[][] normalized = new int[ARRAY_DIM][ARRAY_DIM];
|
||||
for ( int i = 0; i < ARRAY_DIM; i++ ) {
|
||||
for ( int j = 0; j < ARRAY_DIM; j++ )
|
||||
normalized[i][j] = (int)(table[i][j] / normalizationFactor);
|
||||
}
|
||||
|
||||
return normalized;
|
||||
}
|
||||
|
||||
private static int [][] copyContingencyTable(int [][] t) {
|
||||
int[][] c = new int[ARRAY_DIM][ARRAY_DIM];
|
||||
|
||||
for ( int i = 0; i < ARRAY_DIM; i++ )
|
||||
for ( int j = 0; j < ARRAY_DIM; j++ )
|
||||
c[i][j] = t[i][j];
|
||||
|
||||
return c;
|
||||
}
|
||||
|
||||
|
||||
private static void printTable(int[][] table, double pValue) {
|
||||
logger.info(String.format("%d %d; %d %d : %f", table[0][0], table[0][1], table[1][0], table[1][1], pValue));
|
||||
}
|
||||
|
||||
/**
|
||||
* Printing information to logger.info for debugging purposes
|
||||
*
|
||||
* @param name the name of the table
|
||||
* @param table the table itself
|
||||
*/
|
||||
private void printTable(final String name, final int[][] table) {
|
||||
if ( ENABLE_DEBUGGING ) {
|
||||
final String pValue = (String)annotationForOneTable(pValueForContingencyTable(table)).get(getKeyNames().get(0));
|
||||
logger.info(String.format("FS %s (REF+, REF-, ALT+, ALT-) = (%d, %d, %d, %d) = %s",
|
||||
name, table[0][0], table[0][1], table[1][0], table[1][1], pValue));
|
||||
}
|
||||
}
|
||||
|
||||
private static boolean rotateTable(int[][] table) {
|
||||
table[0][0]--;
|
||||
table[1][0]++;
|
||||
|
||||
table[0][1]++;
|
||||
table[1][1]--;
|
||||
|
||||
return (table[0][0] >= 0 && table[1][1] >= 0);
|
||||
}
|
||||
|
||||
private static boolean unrotateTable(int[][] table) {
|
||||
table[0][0]++;
|
||||
table[1][0]--;
|
||||
|
||||
table[0][1]--;
|
||||
table[1][1]++;
|
||||
|
||||
return (table[0][1] >= 0 && table[1][0] >= 0);
|
||||
}
|
||||
|
||||
private static double computePValue(int[][] table) {
|
||||
|
||||
int[] rowSums = { sumRow(table, 0), sumRow(table, 1) };
|
||||
int[] colSums = { sumColumn(table, 0), sumColumn(table, 1) };
|
||||
int N = rowSums[0] + rowSums[1];
|
||||
|
||||
// calculate in log space so we don't die with high numbers
|
||||
double pCutoff = Arithmetic.logFactorial(rowSums[0])
|
||||
+ Arithmetic.logFactorial(rowSums[1])
|
||||
+ Arithmetic.logFactorial(colSums[0])
|
||||
+ Arithmetic.logFactorial(colSums[1])
|
||||
- Arithmetic.logFactorial(table[0][0])
|
||||
- Arithmetic.logFactorial(table[0][1])
|
||||
- Arithmetic.logFactorial(table[1][0])
|
||||
- Arithmetic.logFactorial(table[1][1])
|
||||
- Arithmetic.logFactorial(N);
|
||||
return Math.exp(pCutoff);
|
||||
}
|
||||
|
||||
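// Note: despite its name, this sums the entries of one column (fixed column index, all rows).
// computePValue is unaffected because its formula is symmetric in the row and column margins.
private static int sumRow(int[][] table, int column) {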
|
||||
int sum = 0;
|
||||
for (int r = 0; r < table.length; r++) {
|
||||
sum += table[r][column];
|
||||
}
|
||||
|
||||
return sum;
|
||||
}
|
||||
|
||||
// Note: despite its name, this sums the entries of one row (fixed row index, all columns).
private static int sumColumn(int[][] table, int row) {
|
||||
int sum = 0;
|
||||
for (int c = 0; c < table[row].length; c++) {
|
||||
sum += table[row][c];
|
||||
}
|
||||
|
||||
return sum;
|
||||
}
|
||||
|
||||
|
||||
|
||||
}
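The p-value code removed from `FisherStrand` above (now reached through `StrandBiasTableUtils.FisherExactPValueForContingencyTable`) walks the table in both directions with `rotateTable` and `unrotateTable`. As a rough, self-contained sketch of the same two-sided Fisher's Exact Test, the version below enumerates every table that shares the observed margins. The class name `FisherExactSketch`, its local `logFactorial` helper (standing in for `cern.jet.math.Arithmetic.logFactorial`), and the small tie tolerance are assumptions made for illustration. It also omits the `TARGET_TABLE_SIZE` normalization step.

```java
/** Hypothetical stand-alone sketch; not part of GATK. */
final class FisherExactSketch {

    // Floor applied before Phred scaling, mirroring MIN_PVALUE in FisherStrand
    private static final double MIN_PVALUE = 1E-320;

    /** log(n!) by direct summation; adequate for the small tables used here. */
    private static double logFactorial(final int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    /** Hypergeometric probability of the 2x2 table {{a, b}, {c, d}}, computed in log space. */
    static double tableProbability(final int a, final int b, final int c, final int d) {
        final double logP = logFactorial(a + b) + logFactorial(c + d)
                + logFactorial(a + c) + logFactorial(b + d)
                - logFactorial(a) - logFactorial(b)
                - logFactorial(c) - logFactorial(d)
                - logFactorial(a + b + c + d);
        return Math.exp(logP);
    }

    /** Two-sided p-value: total probability of all tables no more likely than the observed one. */
    static double twoSidedPValue(final int a, final int b, final int c, final int d) {
        final int row1 = a + b;
        final int col1 = a + c;
        final int n = a + b + c + d;
        final double cutoff = tableProbability(a, b, c, d);

        double pValue = 0.0;
        final int lo = Math.max(0, row1 + col1 - n);
        final int hi = Math.min(row1, col1);
        for (int x = lo; x <= hi; x++) {
            final double piece = tableProbability(x, row1 - x, col1 - x, n - row1 - col1 + x);
            // Assumption: a small relative tolerance so exact ties are not lost to rounding;
            // the source compares with a plain <=
            if (piece <= cutoff * (1.0 + 1e-7)) {
                pValue += piece;
            }
        }
        // Numerical precision can push the sum slightly above 1.0
        return Math.min(pValue, 1.0);
    }

    /** Phred-scale a p-value after flooring it at MIN_PVALUE to avoid infinities. */
    static double phred(final double pValue) {
        return Math.abs(-10.0 * Math.log10(Math.max(pValue, MIN_PVALUE)));
    }

    public static void main(final String[] args) {
        // Table layout follows the debug output in the source: (REF+, REF-, ALT+, ALT-)
        final double filtered = twoSidedPValue(3, 1, 1, 3);
        final double unfiltered = twoSidedPValue(10, 0, 0, 10);
        // As in pValueAnnotationForBestTable, keep the least significant p-value
        System.out.printf("FS = %.3f%n", phred(Math.max(filtered, unfiltered)));
    }
}
```

Enumerating the first cell from its minimum to its maximum visits the same set of tables as rotating and unrotating from the observed one, so the sum should match for tables that need no normalization.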
|
||||
|
|
@ -0,0 +1,236 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.*;
|
||||
import org.broadinstitute.gatk.utils.MathUtils;
|
||||
import java.util.HashMap;
|
||||
import java.util.Map;
|
||||
|
||||
/**
|
||||
* A class containing utility methods used in the calculation of annotations related to cohort heterozygosity, e.g. InbreedingCoefficient and ExcessHet
|
||||
* Stores sample count to make sure we never have to iterate the genotypes more than once
|
||||
* Should be reinitialized for each VariantContext
|
||||
*/
|
||||
public class HeterozygosityUtils {
|
||||
|
||||
final public static int REF_INDEX = 0;
|
||||
final public static int HET_INDEX = 1;
|
||||
final public static int VAR_INDEX = 2;
|
||||
|
||||
protected int sampleCount = -1;
|
||||
private Map<Allele, Double> hetCounts;
|
||||
private Map<Allele, Double> alleleCounts;
|
||||
boolean returnRounded = false;
|
||||
|
||||
/**
|
||||
* Create a new HeterozygosityUtils -- a new instance should be created for each VariantContext to store data for that VC
|
||||
* @param returnRounded round the likelihoods to return integer numbers of counts (as doubles)
|
||||
*/
|
||||
protected HeterozygosityUtils(final boolean returnRounded) {
|
||||
this.returnRounded = returnRounded;
|
||||
}
|
||||
|
||||
/**
|
||||
* Get the genotype counts for A/A, A/B, and B/B where A is the reference and B is any alternate allele
|
||||
* @param vc
|
||||
* @param genotypes may be subset to just founders if a pedigree file is provided
|
||||
* @return may be null, otherwise length-3 double[] representing homRef, het, and homVar counts
|
||||
*/
|
||||
protected double[] getGenotypeCountsForRefVsAllAlts(final VariantContext vc, final GenotypesContext genotypes) {
|
||||
if (genotypes == null || !vc.isVariant())
|
||||
return null;
|
||||
|
||||
final boolean doMultiallelicMapping = !vc.isBiallelic();
|
||||
|
||||
int idxAA = 0, idxAB = 1, idxBB = 2;
|
||||
|
||||
double refCount = 0;
|
||||
double hetCount = 0;
|
||||
double homCount = 0;
|
||||
|
||||
sampleCount = 0;
|
||||
for (final Genotype g : genotypes) {
|
||||
if (g.isCalled() && g.hasLikelihoods() && g.getPloidy() == 2) // only works for diploid samples
|
||||
sampleCount++;
|
||||
else
|
||||
continue;
|
||||
|
||||
//Need to round the likelihoods to deal with small numerical deviations due to normalizing
|
||||
final double[] normalizedLikelihoodsUnrounded = MathUtils.normalizeFromLog10(g.getLikelihoods().getAsVector());
|
||||
double[] normalizedLikelihoods = new double[normalizedLikelihoodsUnrounded.length];
|
||||
if (returnRounded) {
|
||||
for (int i = 0; i < normalizedLikelihoodsUnrounded.length; i++) {
|
||||
normalizedLikelihoods[i] = Math.round(normalizedLikelihoodsUnrounded[i]);
|
||||
}
|
||||
} else {
|
||||
normalizedLikelihoods = normalizedLikelihoodsUnrounded;
|
||||
}
|
||||
|
||||
if (doMultiallelicMapping) {
|
||||
if (g.isHetNonRef()) {
|
||||
//all likelihoods go to homCount
|
||||
homCount++;
|
||||
continue;
|
||||
}
|
||||
|
||||
if (!g.isHomRef()) {
|
||||
//get alternate allele for each sample
|
||||
final Allele a1 = g.getAllele(0);
|
||||
final Allele a2 = g.getAllele(1);
|
||||
final int[] idxVector = vc.getGLIndecesOfAlternateAllele(a2.isNonReference() ? a2 : a1);
|
||||
idxAA = idxVector[0];
|
||||
idxAB = idxVector[1];
|
||||
idxBB = idxVector[2];
|
||||
}
|
||||
}
|
||||
|
||||
refCount += normalizedLikelihoods[idxAA];
|
||||
hetCount += normalizedLikelihoods[idxAB];
|
||||
homCount += normalizedLikelihoods[idxBB];
|
||||
}
|
||||
return new double[]{refCount, hetCount, homCount};
|
||||
}
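The accumulation in `getGenotypeCountsForRefVsAllAlts` can be sketched without htsjdk. What follows is a minimal, hypothetical illustration for the biallelic case only: `GenotypeCountSketch`, `normalizeFromLog10`, and `expectedCounts` are names invented here, and the normalization is a plain re-implementation assumed to behave like `MathUtils.normalizeFromLog10`, not the GATK code itself.

```java
// Hypothetical sketch: expected genotype counts from log10 likelihoods (biallelic case only).
final class GenotypeCountSketch {

    // Convert log10 likelihoods to probabilities that sum to 1.
    // Assumed to mirror MathUtils.normalizeFromLog10; this is not the GATK implementation.
    static double[] normalizeFromLog10(final double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (final double v : log10Values) {
            max = Math.max(max, v);
        }
        final double[] normalized = new double[log10Values.length];
        double sum = 0.0;
        for (int i = 0; i < log10Values.length; i++) {
            // Subtract the maximum first so the largest term is exactly 1.0.
            normalized[i] = Math.pow(10.0, log10Values[i] - max);
            sum += normalized[i];
        }
        for (int i = 0; i < normalized.length; i++) {
            normalized[i] /= sum;
        }
        return normalized;
    }

    // Sum per-sample genotype probabilities into {homRef, het, homVar} counts.
    static double[] expectedCounts(final double[][] log10LikelihoodsPerSample) {
        final double[] counts = new double[3];
        for (final double[] sample : log10LikelihoodsPerSample) {
            final double[] p = normalizeFromLog10(sample);
            counts[0] += p[0]; // A/A
            counts[1] += p[1]; // A/B
            counts[2] += p[2]; // B/B
        }
        return counts;
    }
}
```

Because each sample contributes probabilities that sum to 1, the three counts add up to the number of samples processed, which is why fractional "counts" are possible when rounding is disabled.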
|
||||
|
||||
/**
|
||||
* Compute and cache the heterozygote counts and allele counts for every allele in vc (counting both reference and non-reference hets, e.g. 1/2)
|
||||
* @param vc
|
||||
*/
|
||||
protected void doGenotypeCalculations(final VariantContext vc) {
|
||||
final GenotypesContext genotypes = vc.getGenotypes();
|
||||
if (genotypes == null || !vc.isVariant())
|
||||
return;
|
||||
|
||||
final int numAlleles = vc.getNAlleles();
|
||||
|
||||
sampleCount = 0;
|
||||
if (hetCounts == null && alleleCounts == null) {
|
||||
hetCounts = new HashMap<>();
|
||||
alleleCounts = new HashMap<>();
|
||||
for (final Allele a : vc.getAlleles()) {
|
||||
if (a.isNonReference())
|
||||
hetCounts.put(a, 0.0);
|
||||
alleleCounts.put(a, 0.0);
|
||||
}
|
||||
|
||||
int idxAB;
|
||||
|
||||
//for each sample
|
||||
for (final Genotype g : genotypes) {
|
||||
if (g.isCalled() && g.hasLikelihoods() && g.getPloidy() == 2) // only works for diploid samples
|
||||
sampleCount++;
|
||||
else
|
||||
continue;
|
||||
|
||||
int altIndex = 0;
|
||||
for(final Allele a : vc.getAlternateAlleles()) {
|
||||
//for each alt allele index from 1 to N
|
||||
altIndex++;
|
||||
|
||||
final double[] normalizedLikelihoodsUnrounded = MathUtils.normalizeFromLog10(g.getLikelihoods().getAsVector());
|
||||
double[] normalizedLikelihoods = new double[normalizedLikelihoodsUnrounded.length];
|
||||
if (returnRounded) {
|
||||
for (int i = 0; i < normalizedLikelihoodsUnrounded.length; i++) {
|
||||
normalizedLikelihoods[i] = Math.round(normalizedLikelihoodsUnrounded[i]);
|
||||
}
|
||||
} else {
|
||||
normalizedLikelihoods = normalizedLikelihoodsUnrounded;
|
||||
}
|
||||
//iterate over the other alleles
|
||||
for (int i = 0; i < numAlleles; i++) {
|
||||
//only add homozygotes to alleleCounts, not hetCounts
|
||||
if (i == altIndex) {
|
||||
final double currentAlleleCounts = alleleCounts.get(a);
|
||||
alleleCounts.put(a, currentAlleleCounts + 2*normalizedLikelihoods[GenotypeLikelihoods.calculatePLindex(altIndex,altIndex)]);
|
||||
continue;
|
||||
}
|
||||
//pull out the heterozygote PL index, ensuring that the first allele index < second allele index
|
||||
idxAB = GenotypeLikelihoods.calculatePLindex(Math.min(i,altIndex),Math.max(i,altIndex));
|
||||
final double aHetCounts = hetCounts.get(a);
|
||||
hetCounts.put(a, aHetCounts + normalizedLikelihoods[idxAB]);
|
||||
final double currentAlleleCounts = alleleCounts.get(a);
|
||||
//these are guaranteed to be hets
|
||||
alleleCounts.put(a, currentAlleleCounts + normalizedLikelihoods[idxAB]);
|
||||
final double refAlleleCounts = alleleCounts.get(vc.getReference());
|
||||
alleleCounts.put(vc.getReference(), refAlleleCounts + normalizedLikelihoods[idxAB]);
|
||||
}
|
||||
//add in ref/ref likelihood
|
||||
final double refAlleleCounts = alleleCounts.get(vc.getReference());
|
||||
alleleCounts.put(vc.getReference(), refAlleleCounts + 2*normalizedLikelihoods[0]);
|
||||
}
|
||||
|
||||
}
|
||||
}
|
||||
}
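The het and hom lookups in `doGenotypeCalculations` depend on the ordering of diploid genotype likelihoods. In the VCF convention, the genotype j/k with j <= k sits at index k*(k+1)/2 + j. The sketch below is a hypothetical helper (`PlIndexSketch.plIndex` is a name made up for illustration). It is assumed to match what `GenotypeLikelihoods.calculatePLindex` returns for diploid genotypes and is not the htsjdk source.

```java
// Hypothetical sketch of the diploid PL/GL index ordering used by VCF.
final class PlIndexSketch {

    // Index of the unordered genotype (allele1, allele2) in the likelihood vector.
    // Allele 0 is the reference; alternates are numbered from 1.
    static int plIndex(final int allele1, final int allele2) {
        final int lo = Math.min(allele1, allele2);
        final int hi = Math.max(allele1, allele2);
        return hi * (hi + 1) / 2 + lo;
    }
}
```

With two alternate alleles this gives the order 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, so the 1/2 het lands at index 4 and the 2/2 homozygote at index 5.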
|
||||
|
||||
/**
|
||||
* Get the count of heterozygotes in vc for a specific altAllele (both reference and non-reference hets, e.g. 1/2)
|
||||
* @param vc
|
||||
* @param altAllele the alternate allele of interest
|
||||
* @return number of hets
|
||||
*/
|
||||
protected double getHetCount(final VariantContext vc, final Allele altAllele) {
|
||||
if (hetCounts == null)
|
||||
doGenotypeCalculations(vc);
|
||||
return (hetCounts != null && hetCounts.containsKey(altAllele)) ? hetCounts.get(altAllele) : 0;
|
||||
}
|
||||
|
||||
protected double getAlleleCount(final VariantContext vc, final Allele allele) {
|
||||
if (alleleCounts == null)
|
||||
doGenotypeCalculations(vc);
|
||||
return (alleleCounts != null && alleleCounts.containsKey(allele)) ? alleleCounts.get(allele) : 0;
|
||||
}
|
||||
|
||||
protected int getSampleCount() {
|
||||
return sampleCount;
|
||||
}
|
||||
}
|
||||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -52,19 +52,16 @@
|
|||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.engine.walkers.Walker;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.MathUtils;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.GenotypesContext;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
|
|
@ -79,24 +76,42 @@ import java.util.*;
|
|||
* <p>This annotation estimates whether there is evidence of inbreeding in a population. The higher the score, the higher the chance that there is inbreeding.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The calculation is a continuous generalization of the Hardy-Weinberg test for disequilibrium that works well with limited coverage per sample. The output is a Phred-scaled p-value derived from running the HW test for disequilibrium with PL values. See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this statistical test.</p>
|
||||
* <p>The calculation is a continuous generalization of the Hardy-Weinberg test for disequilibrium that works well with limited coverage per sample. The output is the F statistic from running the HW test for disequilibrium with PL values. See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this statistical test.</p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>The Inbreeding Coefficient can only be calculated for cohorts containing at least 10 founder samples.</li>
|
||||
* <li>This annotation is used in variant recalibration, but may not be appropriate for that purpose if the cohort being analyzed contains many closely related individuals.</li>
|
||||
* <li>This annotation requires a valid pedigree file.</li>
|
||||
* <li>The inbreeding coefficient can only be calculated for cohorts containing at least 10 founder samples.</li>
|
||||
* <li>This annotation is used in variant filtering, but may not be appropriate for that purpose if the cohort being analyzed contains many closely related individuals.</li>
|
||||
* <li>This annotation can take a valid pedigree file to specify founders.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_InbreedingCoeff.php">AS_InbreedingCoeff</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_ExcessHet.php">ExcessHet</a></b> estimates excess heterozygosity in a population of samples.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class InbreedingCoeff extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
public class InbreedingCoeff extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation, ReducibleAnnotation {
|
||||
|
||||
private final static Logger logger = Logger.getLogger(InbreedingCoeff.class);
|
||||
private static final int MIN_SAMPLES = 10;
|
||||
protected static final int MIN_SAMPLES = 10;
|
||||
private Set<String> founderIds;
|
||||
private int sampleCount;
|
||||
private boolean pedigreeCheckWarningLogged = false;
|
||||
private boolean didUniquifiedSampleNameCheck = false;
|
||||
protected HeterozygosityUtils heterozygosityUtils;
|
||||
final private boolean RETURN_ROUNDED = false;
|
||||
|
||||
@Override
|
||||
public void initialize (final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set<VCFHeaderLine> headerLines ) {
|
||||
//If available, get the founder IDs and cache them. The IC will then be computed on founders only.
|
||||
if(founderIds == null && walker != null) {
|
||||
founderIds = ((Walker) walker).getSampleDB().getFounderIds();
|
||||
}
|
||||
if(walker != null && (((Walker) walker).getSampleDB().getSamples().size() < MIN_SAMPLES || (!founderIds.isEmpty() && founderIds.size() < MIN_SAMPLES)))
|
||||
logger.warn("Annotation will not be calculated. InbreedingCoeff requires at least " + MIN_SAMPLES + " unrelated samples.");
|
||||
//initialize a HeterozygosityUtils before annotating for use in unit tests
|
||||
heterozygosityUtils = new HeterozygosityUtils(RETURN_ROUNDED);
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
|
|
@ -105,78 +120,59 @@ public class InbreedingCoeff extends InfoFieldAnnotation implements StandardAnno
|
|||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
//If available, get the founder IDs and cache them. the IC will only be computed on founders then.
|
||||
if(founderIds == null && walker != null) {
|
||||
founderIds = ((Walker) walker).getSampleDB().getFounderIds();
|
||||
}
|
||||
|
||||
heterozygosityUtils = new HeterozygosityUtils(RETURN_ROUNDED);
|
||||
|
||||
//if none of the "founders" are in the vc samples, assume we uniquified the samples upstream and they are all founders
|
||||
if (!didUniquifiedSampleNameCheck) {
|
||||
checkSampleNames(vc);
|
||||
founderIds = AnnotationUtils.validateFounderIDs(founderIds, vc);
|
||||
didUniquifiedSampleNameCheck = true;
|
||||
}
|
||||
if ( founderIds == null || founderIds.isEmpty() ) {
|
||||
if ( !pedigreeCheckWarningLogged ) {
|
||||
logger.warn("Annotation will not be calculated, must provide a valid PED file (-ped) from the command line.");
|
||||
pedigreeCheckWarningLogged = true;
|
||||
}
|
||||
return null;
|
||||
}
|
||||
else{
|
||||
return makeCoeffAnnotation(vc);
|
||||
return makeCoeffAnnotation(vc);
|
||||
}
|
||||
|
||||
//Inbreeding coeff doesn't need raw data -- it's calculated from the final genotypes
|
||||
@Override
|
||||
public String getRawKeyName() { return null; }
|
||||
|
||||
@Override
|
||||
public Map<String, Object> annotateRawData(final RefMetaDataTracker tracker, final AnnotatorCompatible walker, final ReferenceContext ref, final Map<String, AlignmentContext> stratifiedContexts, final VariantContext vc, final Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap) {
|
||||
return null;
|
||||
}
|
||||
|
||||
@Override
|
||||
public void calculateRawData(final VariantContext vc, final Map<String, PerReadAlleleLikelihoodMap> pralm, final ReducibleAnnotationData rawAnnotations) { }
|
||||
|
||||
@Override
|
||||
public Map<String, Object> combineRawData(final List<Allele> allelesList, final List<? extends ReducibleAnnotationData> listOfRawData) {
|
||||
return null;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> finalizeRawData(final VariantContext vc, final VariantContext originalVC) {
|
||||
heterozygosityUtils = new HeterozygosityUtils(RETURN_ROUNDED);
|
||||
|
||||
//if none of the "founders" are in the vc samples, assume we uniquified the samples upstream and they are all founders
|
||||
if (!didUniquifiedSampleNameCheck) {
|
||||
founderIds = AnnotationUtils.validateFounderIDs(founderIds, vc);
|
||||
didUniquifiedSampleNameCheck = true;
|
||||
}
|
||||
return makeCoeffAnnotation(vc);
|
||||
}
|
||||
|
||||
protected double calculateIC(final VariantContext vc, final GenotypesContext genotypes) {
|
||||
|
||||
final boolean doMultiallelicMapping = !vc.isBiallelic();
|
||||
|
||||
int idxAA = 0, idxAB = 1, idxBB = 2;
|
||||
|
||||
double refCount = 0.0;
|
||||
double hetCount = 0.0;
|
||||
double homCount = 0.0;
|
||||
sampleCount = 0; // number of samples that have likelihoods
|
||||
|
||||
for ( final Genotype g : genotypes ) {
|
||||
if ( g.isCalled() && g.hasLikelihoods() && g.getPloidy() == 2) // only work for diploid samples
|
||||
sampleCount++;
|
||||
else
|
||||
continue;
|
||||
final double[] normalizedLikelihoods = MathUtils.normalizeFromLog10( g.getLikelihoods().getAsVector() );
|
||||
if (doMultiallelicMapping)
|
||||
{
|
||||
if (g.isHetNonRef()) {
|
||||
//all likelihoods go to homCount
|
||||
homCount++;
|
||||
continue;
|
||||
}
|
||||
|
||||
//get alternate allele for each sample
|
||||
final Allele a1 = g.getAllele(0);
|
||||
final Allele a2 = g.getAllele(1);
|
||||
if (a2.isNonReference()) {
|
||||
final int[] idxVector = vc.getGLIndecesOfAlternateAllele(a2);
|
||||
idxAA = idxVector[0];
|
||||
idxAB = idxVector[1];
|
||||
idxBB = idxVector[2];
|
||||
}
|
||||
//I expect hets to be reference first, but there are no guarantees (e.g. phasing)
|
||||
else if (a1.isNonReference()) {
|
||||
final int[] idxVector = vc.getGLIndecesOfAlternateAllele(a1);
|
||||
idxAA = idxVector[0];
|
||||
idxAB = idxVector[1];
|
||||
idxBB = idxVector[2];
|
||||
}
|
||||
}
|
||||
|
||||
refCount += normalizedLikelihoods[idxAA];
|
||||
hetCount += normalizedLikelihoods[idxAB];
|
||||
homCount += normalizedLikelihoods[idxBB];
|
||||
final double[] genotypeCounts = heterozygosityUtils.getGenotypeCountsForRefVsAllAlts(vc, genotypes); //guarantees that sampleCount is set
|
||||
if (genotypeCounts.length != 3) {
|
||||
throw new IllegalStateException("Input genotype counts must be length 3 for the number of genotypes with {2, 1, 0} ref alleles.");
|
||||
}
|
||||
final double refCount = genotypeCounts[HeterozygosityUtils.REF_INDEX];
|
||||
final double hetCount = genotypeCounts[HeterozygosityUtils.HET_INDEX];
|
||||
final double homCount = genotypeCounts[HeterozygosityUtils.VAR_INDEX];
|
||||
|
||||
final double p = ( 2.0 * refCount + hetCount ) / ( 2.0 * (refCount + hetCount + homCount) ); // expected reference allele frequency
|
||||
final double q = 1.0 - p; // expected alternative allele frequency
|
||||
final double F = 1.0 - ( hetCount / ( 2.0 * p * q * (double) sampleCount) ); // inbreeding coefficient
|
||||
final double F = 1.0 - ( hetCount / ( 2.0 * p * q * (double) heterozygosityUtils.getSampleCount()) ); // inbreeding coefficient
|
||||
|
||||
return F;
|
||||
}
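As a rough, self-contained illustration of the formula in `calculateIC`, the sketch below computes F from the three genotype counts. `InbreedingCoeffSketch` and its method name are hypothetical, and the counts in the usage note are made up rather than taken from any dataset.

```java
// Hypothetical sketch of the inbreeding coefficient F = 1 - observedHets / expectedHets.
final class InbreedingCoeffSketch {

    // refCount, hetCount, homCount: (possibly fractional) genotype counts for A/A, A/B, B/B.
    // sampleCount: number of diploid samples that contributed likelihoods.
    static double inbreedingCoefficient(final double refCount, final double hetCount,
                                        final double homCount, final int sampleCount) {
        // Reference allele frequency estimated from the genotype counts.
        final double p = (2.0 * refCount + hetCount) / (2.0 * (refCount + hetCount + homCount));
        final double q = 1.0 - p; // alternate allele frequency
        // Under Hardy-Weinberg equilibrium the expected number of hets is 2pq * N.
        // If p or q is 0, the denominator is 0 and the result is NaN.
        return 1.0 - (hetCount / (2.0 * p * q * (double) sampleCount));
    }
}
```

For 100 samples at Hardy-Weinberg proportions (25, 50, 25) this returns 0; with no heterozygotes (50, 0, 50) it returns 1; and with only heterozygotes (0, 100, 0) it returns -1, the signature of excess heterozygosity.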
|
||||
|
|
@ -185,27 +181,13 @@ public class InbreedingCoeff extends InfoFieldAnnotation implements StandardAnno
|
|||
final GenotypesContext genotypes = (founderIds == null || founderIds.isEmpty()) ? vc.getGenotypes() : vc.getGenotypes(founderIds);
|
||||
if (genotypes == null || genotypes.size() < MIN_SAMPLES || !vc.isVariant())
|
||||
return null;
|
||||
double F = calculateIC(vc, genotypes);
|
||||
if (sampleCount < MIN_SAMPLES)
|
||||
final double F = calculateIC(vc, genotypes);
|
||||
if (heterozygosityUtils.getSampleCount() < MIN_SAMPLES)
|
||||
return null;
|
||||
return Collections.singletonMap(getKeyNames().get(0), (Object)String.format("%.4f", F));
|
||||
}
|
||||
|
||||
//this method is intended to reconcile uniquified sample names
|
||||
// it comes into play when calling this annotation from GenotypeGVCFs with --uniquifySamples because founderIds
|
||||
// is derived from the sampleDB, which comes from the input sample names, but vc will have uniquified (i.e. different)
|
||||
// sample names. Without this check, the founderIds won't be found in the vc and the annotation won't be calculated.
|
||||
protected void checkSampleNames(final VariantContext vc) {
|
||||
Set<String> vcSamples = new HashSet<>();
|
||||
vcSamples.addAll(vc.getSampleNames());
|
||||
if (!vcSamples.isEmpty()) {
|
||||
if (founderIds!=null) {
|
||||
vcSamples.removeAll(founderIds);
|
||||
if (vcSamples.equals(vc.getSampleNames()))
|
||||
founderIds = vc.getSampleNames();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() { return Collections.singletonList(GATKVCFConstants.INBREEDING_COEFFICIENT_KEY); }
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -70,11 +70,14 @@ import java.util.*;
|
|||
* <h3>Statistical notes</h3>
|
||||
* <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for mapping qualities (MAPQ of reads supporting REF vs. MAPQ of reads supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>The mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</p>
|
||||
* <h3>Caveats</h3>
|
||||
* <ul><li>The mapping quality rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
|
||||
* <li>Uninformative reads are not used in these annotations.</li>
|
||||
* </ul>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_MappingQualityRankSumTest.php">AS_MappingQualityRankSumTest</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_RMSMappingQuality.php">RMSMappingQuality</a></b> gives an estimation of the overall read mapping quality supporting a variant call.</li>
|
||||
* </ul>
|
||||
*
|
||||
|
|
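The "u-based z-approximation" named in the statistical notes above can be illustrated with a small standalone sketch. This is not the GATK implementation: the class name `RankSumSketch`, the method `zScore`, and the sign convention (negative when ALT-supporting reads have lower MAPQ than REF-supporting reads) are assumptions made for illustration, and the variance term omits the tie correction that a production implementation might apply.

```java
import java.util.Arrays;

// Hypothetical illustration of a Mann-Whitney U statistic with a normal (z) approximation.
final class RankSumSketch {

    /**
     * @param alt mapping qualities of reads supporting ALT (finite values)
     * @param ref mapping qualities of reads supporting REF (finite values)
     * @return the z-score of U for the alt sample, or NaN if either sample is empty
     */
    static double zScore(final double[] alt, final double[] ref) {
        final int n1 = alt.length;
        final int n2 = ref.length;
        if (n1 == 0 || n2 == 0) {
            return Double.NaN; // the test needs a mixture of REF and ALT reads
        }

        final double[] all = new double[n1 + n2];
        System.arraycopy(alt, 0, all, 0, n1);
        System.arraycopy(ref, 0, all, n1, n2);
        Arrays.sort(all);

        // Sum of midranks of the alt values (ties share the average of their 1-based positions).
        double rankSumAlt = 0.0;
        for (final double v : alt) {
            int lo = 0;
            while (all[lo] < v) lo++;
            int hi = lo;
            while (hi + 1 < all.length && all[hi + 1] == v) hi++;
            rankSumAlt += ((lo + 1) + (hi + 1)) / 2.0;
        }

        final double u = rankSumAlt - n1 * (n1 + 1) / 2.0;
        final double mean = n1 * n2 / 2.0;
        final double sd = Math.sqrt(n1 * (double) n2 * (n1 + n2 + 1) / 12.0); // no tie correction
        return (u - mean) / sd;
    }

    public static void main(final String[] args) {
        System.out.println(zScore(new double[]{20, 25, 30}, new double[]{60, 60, 60}));
    }
}
```

With an empty REF or ALT sample the sketch returns NaN, which mirrors the caveat that the test cannot be calculated without a mixture of reads showing both alleles.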
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -73,26 +73,26 @@ import htsjdk.variant.variantcontext.VariantContext;
|
|||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Variant confidence normalized by unfiltered depth of variant samples
|
||||
* Variant call confidence normalized by depth of sample reads supporting a variant
|
||||
*
|
||||
* <p>This annotation puts the variant confidence QUAL score into perspective by normalizing for the amount of coverage available. Because each read contributes a little to the QUAL score, variants in regions with deep coverage can have artificially inflated QUAL scores, giving the impression that the call is supported by more evidence than it really is. To compensate for this, we normalize the variant confidence by depth, which gives us a more objective picture of how well supported the call is.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p>The QD is the QUAL score normalized by allele depth (AD) for a variant. For a single sample, the HaplotypeCaller calculates the QD by taking QUAL/AD. For multiple samples, HaplotypeCaller and GenotypeGVCFs calculate the QD by taking QUAL/AD of samples with a non hom-ref genotype call. The reason we leave out the samples with a hom-ref call is to not penalize the QUAL for the other samples with the variant call.</p>
|
||||
* <p>Here is a single sample example:</p>
|
||||
* <h4>Here is a single-sample example:</h4>
|
||||
* <pre>2 37629 . C G 1063.77 . AC=2;AF=1.00;AN=2;DP=31;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=58.50;QD=34.32;SOR=2.376 GT:AD:DP:GQ:PL:QSS 1/1:0,31:31:93:1092,93,0:0,960</pre>
|
||||
<p>QUAL/AD = 1063.77/31 = 34.32 = QD</p>
|
||||
* <p>Here is a multi-sample example:</p>
|
||||
* <h4>Here is a multi-sample example:</h4>
|
||||
* <pre>10 8046 . C T 4107.13 . AC=1;AF=0.167;AN=6;BaseQRankSum=-3.717;DP=1063;FS=1.616;MLEAC=1;MLEAF=0.167;QD=11.54
|
||||
GT:AD:DP:GQ:PL:QSS 0/0:369,4:373:99:0,1007,12207:10548,98 0/0:331,1:332:99:0,967,11125:9576,27 0/1:192,164:356:99:4138,0,5291:5501,4505</pre>
|
||||
* <p>QUAL/AD = 4107.13/356 = 11.54 = QD</p>
|
||||
* <p>Note that currently, when HaplotypeCaller is run with `-ERC GVCF`, the QD calculation is invoked before AD itself has been calculated, due to a technical constraint. In that case, HaplotypeCaller uses the number of overlapping reads from the haplotype likelihood calculation in place of AD to calculate QD, which generally yields a very similar number. This does not cause any measurable problems, but can cause some confusion since the number may be slightly different than what you would expect to get if you did the calculation manually. For that reason, this behavior will be modified in an upcoming version.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>This annotation can only be calculated for sites for which at least one sample was genotyped as carrying a variant allele.</p>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_QualByDepth.php">AS_QualByDepth</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_Coverage.php">Coverage</a></b> gives the filtered depth of coverage for each sample and the unfiltered depth across all samples.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_DepthPerAlleleBySample.php">DepthPerAlleleBySample</a></b> calculates depth of coverage for each allele per sample (AD).</li>
|
||||
* </ul>
|
||||
|
|
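The two worked examples in the QualByDepth Javadoc above can be reproduced with a short standalone sketch. It assumes the simplified rule stated in the Javadoc (QUAL divided by the summed AD of samples that are not hom-ref); the class name `QdExample` and its method signature are hypothetical and do not correspond to any GATK API.

```java
import java.util.Locale;

// Hypothetical illustration of QD = QUAL / AD over samples carrying the variant.
final class QdExample {

    /**
     * @param qual        the site QUAL score
     * @param adPerSample per-sample allele depths (AD), one int[] per sample
     * @param isHomRef    per-sample flag: true if the sample was genotyped hom-ref
     * @return QD formatted to two decimals, or null if no variant-carrying depth is available
     */
    static String qd(final double qual, final int[][] adPerSample, final boolean[] isHomRef) {
        int depth = 0;
        for (int i = 0; i < adPerSample.length; i++) {
            if (isHomRef[i]) {
                continue; // hom-ref samples do not contribute to the denominator
            }
            for (final int d : adPerSample[i]) {
                depth += d;
            }
        }
        if (depth == 0) {
            return null; // QD is undefined without a variant-carrying sample
        }
        return String.format(Locale.ROOT, "%.2f", qual / depth);
    }

    public static void main(final String[] args) {
        // Single-sample example: 1/1 genotype with AD = 0,31
        System.out.println(qd(1063.77, new int[][]{{0, 31}}, new boolean[]{false})); // 34.32
        // Multi-sample example: two hom-ref samples and one 0/1 sample with AD = 192,164
        System.out.println(qd(4107.13,
                new int[][]{{369, 4}, {331, 1}, {192, 164}},
                new boolean[]{true, true, false})); // 11.54
    }
}
```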
@ -100,6 +100,7 @@ import java.util.*;
|
|||
public class QualByDepth extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
// private final static Logger logger = Logger.getLogger(QualByDepth.class);
|
||||
|
||||
@Override
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
|
|
@ -113,6 +114,25 @@ public class QualByDepth extends InfoFieldAnnotation implements StandardAnnotati
|
|||
if ( genotypes == null || genotypes.size() == 0 )
|
||||
return null;
|
||||
|
||||
final int standardDepth = getDepth(genotypes, stratifiedContexts, perReadAlleleLikelihoodMap);
|
||||
|
||||
if ( standardDepth == 0 )
|
||||
return null;
|
||||
|
||||
final double altAlleleLength = GATKVariantContextUtils.getMeanAltAlleleLength(vc);
|
||||
// Hack: UnifiedGenotyper (but not HaplotypeCaller or GenotypeGVCFs) over-estimates the quality of long indels
|
||||
// Penalize the QD calculation for UG indels to compensate for this
|
||||
double QD = -10.0 * vc.getLog10PError() / ((double)standardDepth * indelNormalizationFactor(altAlleleLength, walker instanceof UnifiedGenotyper));
|
||||
|
||||
// Hack: see note in the fixTooHighQD method below
|
||||
QD = fixTooHighQD(QD);
|
||||
|
||||
final Map<String, Object> map = new HashMap<>();
|
||||
map.put(getKeyNames().get(0), String.format("%.2f", QD));
|
||||
return map;
|
||||
}
|
||||
|
||||
protected int getDepth(final GenotypesContext genotypes, final Map<String, AlignmentContext> stratifiedContexts, final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap) {
|
||||
int standardDepth = 0;
|
||||
int ADrestrictedDepth = 0;
|
||||
|
||||
|
|
@ -123,10 +143,6 @@ public class QualByDepth extends InfoFieldAnnotation implements StandardAnnotati
|
|||
continue;
|
||||
|
||||
// if we have the AD values for this sample, let's make sure that the variant depth is greater than 1!
|
||||
// TODO -- If we like how this is working and want to apply it to a situation other than the single sample HC pipeline,
|
||||
// TODO -- then we will need to modify the annotateContext() - and related - routines in the VariantAnnotatorEngine
|
||||
// TODO -- so that genotype-level annotations are run first (to generate AD on the samples) and then the site-level
|
||||
// TODO -- annotations must come afterwards (so that QD can use the AD).
|
||||
if ( genotype.hasAD() ) {
|
||||
final int[] AD = genotype.getAD();
|
||||
final int totalADdepth = (int)MathUtils.sum(AD);
|
||||
|
|
@ -157,20 +173,7 @@ public class QualByDepth extends InfoFieldAnnotation implements StandardAnnotati
|
|||
if ( ADrestrictedDepth > 0 )
|
||||
standardDepth = ADrestrictedDepth;
|
||||
|
||||
if ( standardDepth == 0 )
|
||||
return null;
|
||||
|
||||
final double altAlleleLength = GATKVariantContextUtils.getMeanAltAlleleLength(vc);
|
||||
// Hack: UnifiedGenotyper (but not HaplotypeCaller or GenotypeGVCFs) over-estimates the quality of long indels
|
||||
// Penalize the QD calculation for UG indels to compensate for this
|
||||
double QD = -10.0 * vc.getLog10PError() / ((double)standardDepth * indelNormalizationFactor(altAlleleLength, walker instanceof UnifiedGenotyper));
|
||||
|
||||
// Hack: see note in the fixTooHighQD method below
|
||||
QD = fixTooHighQD(QD);
|
||||
|
||||
final Map<String, Object> map = new HashMap<>();
|
||||
map.put(getKeyNames().get(0), String.format("%.2f", QD));
|
||||
return map;
|
||||
return standardDepth;
|
||||
}
|
||||
|
||||
/**
|
||||
|
|
@ -178,7 +181,7 @@ public class QualByDepth extends InfoFieldAnnotation implements StandardAnnotati
|
|||
*
|
||||
* @param altAlleleLength the average alternate allele length for the call
|
||||
* @param increaseNormalizationAsLengthIncreases should we apply a normalization factor based on the allele length?
|
||||
* @return a possitive double
|
||||
* @return a positive double
|
||||
*/
|
||||
private double indelNormalizationFactor(final double altAlleleLength, final boolean increaseNormalizationAsLengthIncreases) {
|
||||
return ( increaseNormalizationAsLengthIncreases ? Math.max(altAlleleLength / 3.0, 1.0) : 1.0);
|
||||
|
|
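To make the effect of the indel normalization concrete, the following standalone sketch copies the two expressions shown in the diff (the QD formula from `annotate` and the body of `indelNormalizationFactor`) into a hypothetical class, `IndelNormalizationSketch`. The boolean parameter stands in for the `walker instanceof UnifiedGenotyper` check; the input values are made up for illustration.

```java
// Hypothetical standalone copy of the QD arithmetic shown in the diff.
final class IndelNormalizationSketch {

    /** Grows with the mean alt allele length (one unit per 3 bases), but is never below 1. */
    static double indelNormalizationFactor(final double altAlleleLength,
                                           final boolean increaseNormalizationAsLengthIncreases) {
        return increaseNormalizationAsLengthIncreases ? Math.max(altAlleleLength / 3.0, 1.0) : 1.0;
    }

    /** Raw QD before any high-value adjustment: phred-scaled error divided by normalized depth. */
    static double rawQd(final double log10PError, final int depth,
                        final double altAlleleLength, final boolean fromUnifiedGenotyper) {
        return -10.0 * log10PError
                / ((double) depth * indelNormalizationFactor(altAlleleLength, fromUnifiedGenotyper));
    }

    public static void main(final String[] args) {
        // A 9-base alt allele at depth 20 with log10 error of -60 (QUAL 600).
        System.out.println(rawQd(-60.0, 20, 9.0, true));  // penalized: divisor is 20 * 3
        System.out.println(rawQd(-60.0, 20, 9.0, false)); // unpenalized: divisor is 20 * 1
    }
}
```

Alleles of three bases or fewer get a factor of 1, so SNPs and short indels are unaffected even when the penalty is enabled.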
@ -190,12 +193,10 @@ public class QualByDepth extends InfoFieldAnnotation implements StandardAnnotati
|
|||
* and VQSR will filter these out. This code looks at the QD value, and if it is above
|
||||
* threshold we map it down to the mean high QD value, with some jittering
|
||||
*
|
||||
* // TODO -- remove me when HaplotypeCaller bubble caller is live
|
||||
*
|
||||
* @param QD the raw QD score
|
||||
* @return a QD value
|
||||
*/
|
||||
private double fixTooHighQD(final double QD) {
|
||||
protected static double fixTooHighQD(final double QD) {
|
||||
if ( QD < MAX_QD_BEFORE_FIXING ) {
|
||||
return QD;
|
||||
} else {
|
||||
|
|
@ -203,12 +204,14 @@ public class QualByDepth extends InfoFieldAnnotation implements StandardAnnotati
|
|||
}
|
||||
}
|
||||
|
||||
private final static double MAX_QD_BEFORE_FIXING = 35;
|
||||
private final static double IDEAL_HIGH_QD = 30;
|
||||
private final static double JITTER_SIGMA = 3;
|
||||
protected final static double MAX_QD_BEFORE_FIXING = 35;
|
||||
protected final static double IDEAL_HIGH_QD = 30;
|
||||
protected final static double JITTER_SIGMA = 3;
|
||||
|
||||
@Override
|
||||
public List<String> getKeyNames() { return Arrays.asList(GATKVCFConstants.QUAL_BY_DEPTH_KEY); }
|
||||
|
||||
@Override
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
}
|
||||
|
|
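The `else` branch of `fixTooHighQD` is not visible in this diff, so the sketch below is only one plausible reading of the Javadoc ("map it down to the mean high QD value, with some jittering"). It assumes the replacement value is drawn from a Gaussian centred on `IDEAL_HIGH_QD` with standard deviation `JITTER_SIGMA`; the class name `HighQdJitterSketch` and the explicit `Random` parameter are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of capping very high QD values with Gaussian jitter.
final class HighQdJitterSketch {
    static final double MAX_QD_BEFORE_FIXING = 35;
    static final double IDEAL_HIGH_QD = 30;
    static final double JITTER_SIGMA = 3;

    /**
     * @param qd  the raw QD score
     * @param rng source of randomness (passed in so that results are reproducible)
     * @return qd unchanged if below the threshold, otherwise a jittered value near IDEAL_HIGH_QD
     */
    static double fixTooHighQD(final double qd, final Random rng) {
        if (qd < MAX_QD_BEFORE_FIXING) {
            return qd;
        }
        // Assumption: Gaussian jitter around the ideal high value.
        return IDEAL_HIGH_QD + rng.nextGaussian() * JITTER_SIGMA;
    }

    public static void main(final String[] args) {
        final Random rng = new Random(42L);
        System.out.println(fixTooHighQD(34.32, rng)); // below threshold, returned as-is
        System.out.println(fixTooHighQD(50.0, rng));  // above threshold, replaced by a jittered value
    }
}
```

Passing the random generator explicitly keeps the sketch deterministic under a fixed seed, which is convenient for the kind of unit testing that the change to `protected static` visibility appears intended to allow.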
|
|||
|
|
@ -0,0 +1,248 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFConstants;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ReducibleAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller;
|
||||
import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.pileup.PileupElement;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Abstract root for all RMS-based annotations
|
||||
*/
|
||||
public abstract class RMSAnnotation extends InfoFieldAnnotation implements ReducibleAnnotation {
|
||||
protected AnnotatorCompatible callingWalker;
|
||||
|
||||
@Override
|
||||
public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set<VCFHeaderLine> headerLines) {
|
||||
callingWalker = walker;
|
||||
}
|
||||
|
||||
@Override
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
final List<VCFInfoHeaderLine> headerLines = new ArrayList<>();
|
||||
//ideally only HC in GVCF mode would get the raw header line, but that's a little more complicated
|
||||
if (callingWalker instanceof HaplotypeCaller || callingWalker instanceof CombineGVCFs)
|
||||
headerLines.add(GATKVCFHeaderLines.getInfoLine(getRawKeyName()));
|
||||
headerLines.add(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
return headerLines;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
|
||||
if ( (stratifiedContexts == null || stratifiedContexts.isEmpty()) && perReadAlleleLikelihoodMap == null)
|
||||
return null;
|
||||
|
||||
final Map<String, Object> annotations = new HashMap<>();
|
||||
final ReducibleAnnotationData<Number> myData = new ReducibleAnnotationData<>(null);
|
||||
calculateRawData(stratifiedContexts, perReadAlleleLikelihoodMap, myData);
|
||||
final String annotationString = makeFinalizedAnnotationString(vc, myData.getAttributeMap(), stratifiedContexts, perReadAlleleLikelihoodMap);
|
||||
annotations.put(getKeyNames().get(0), annotationString);
|
||||
return annotations;
|
||||
}
|
||||
|
||||
public Map<String, Object> annotateRawData(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
|
||||
if ( perReadAlleleLikelihoodMap == null)
|
||||
return new HashMap<>();
|
||||
|
||||
final Map<String, Object> annotations = new HashMap<>();
|
||||
ReducibleAnnotationData<Number> myData = new ReducibleAnnotationData<>(null);
|
||||
calculateRawData(vc, perReadAlleleLikelihoodMap, myData);
|
||||
String annotationString = makeRawAnnotationString(vc.getAlleles(), myData.getAttributeMap());
|
||||
annotations.put(getRawKeyName(), annotationString);
|
||||
return annotations;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> combineRawData(final List<Allele> vcAlleles, final List<? extends ReducibleAnnotationData> annotationList) {
|
||||
//VC already contains merged alleles from ReferenceConfidenceVariantContextMerger
|
||||
ReducibleAnnotationData combinedData = new ReducibleAnnotationData(null);
|
||||
|
||||
for (final ReducibleAnnotationData currentValue : annotationList) {
|
||||
parseRawDataString(currentValue);
|
||||
combineAttributeMap(currentValue, combinedData);
|
||||
|
||||
}
|
||||
final Map<String, Object> annotations = new HashMap<>();
|
||||
String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap());
|
||||
annotations.put(getRawKeyName(), annotationString);
|
||||
return annotations;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Map<String, Object> finalizeRawData(final VariantContext vc, final VariantContext originalVC) {
|
||||
if (!vc.hasAttribute(getRawKeyName()))
|
||||
return new HashMap<>();
|
||||
String rawMQdata = vc.getAttributeAsString(getRawKeyName(),null);
|
||||
if (rawMQdata == null)
|
||||
return new HashMap<>();
|
||||
|
||||
ReducibleAnnotationData myData = new ReducibleAnnotationData(rawMQdata);
|
||||
parseRawDataString(myData);
|
||||
|
||||
String annotationString = makeFinalizedAnnotationString(vc, myData.getAttributeMap());
|
||||
return Collections.singletonMap(getKeyNames().get(0), (Object)annotationString);
|
||||
}
|
||||
|
||||
protected void parseRawDataString(ReducibleAnnotationData<Number> myData) {
|
||||
final String rawDataString = myData.getRawData();
|
||||
String[] rawMQdataAsStringVector;
|
||||
rawMQdataAsStringVector = rawDataString.split(",");
|
||||
double squareSum = Double.parseDouble(rawMQdataAsStringVector[0]);
|
||||
myData.putAttribute(Allele.NO_CALL, squareSum);
|
||||
}
|
||||
|
||||
public void combineAttributeMap(ReducibleAnnotationData<Number> toAdd, ReducibleAnnotationData<Number> combined) {
|
||||
if (combined.getAttribute(Allele.NO_CALL) != null)
|
||||
combined.putAttribute(Allele.NO_CALL, (Double) combined.getAttribute(Allele.NO_CALL) + (Double) toAdd.getAttribute(Allele.NO_CALL));
|
||||
else
|
||||
combined.putAttribute(Allele.NO_CALL, toAdd.getAttribute(Allele.NO_CALL));
|
||||
|
||||
}
|
||||
|
||||
//Implementations of this method should return a string consisting of the sum of the squared values for the attribute being annotated (or a delimited list of those if allele-specific)
|
||||
abstract protected String makeRawAnnotationString(List<Allele> vcAlleles, Map<Allele,Number> sumOfSquares);
|
||||
|
||||
//Implementations of this method should return a string with the finalized annotation value as will appear in the INFO field
|
||||
abstract protected String makeFinalizedAnnotationString(VariantContext vc, Map<Allele, Number> sumOfSquares);
|
||||
|
||||
//Implementations of this method should return a string with the finalized annotation value as will appear in the INFO field
|
||||
abstract protected String makeFinalizedAnnotationString(VariantContext vc, Map<Allele, Number> sumOfSquares, Map<String, AlignmentContext> stratifiedContexts, final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap);
|
||||
|
||||
protected void calculateRawData(final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap,
|
||||
final ReducibleAnnotationData myData) {
|
||||
if (perReadAlleleLikelihoodMap != null) {
|
||||
calculateRawData((VariantContext) null, perReadAlleleLikelihoodMap, myData);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
*
|
||||
* @param vc the variant context at the site; its DP attribute is used when present
|
||||
* @param perReadAlleleLikelihoodMap per-sample read likelihoods, used to count reads when neither DP nor pileups are available
|
||||
* @param stratifiedContexts per-sample alignment contexts, used to count reads when the variant context has no DP attribute
|
||||
* @return the number of reads at the vc position (-1 if all read data is null)
|
||||
*/
|
||||
public int getNumOfReads(final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap,
|
||||
final Map<String, AlignmentContext> stratifiedContexts) {
|
||||
//don't use the full depth because we don't calculate MQ for reference blocks
|
||||
int numOfReads = 0;
|
||||
if(vc.hasAttribute(VCFConstants.DEPTH_KEY)) {
|
||||
numOfReads += Integer.parseInt(vc.getAttributeAsString(VCFConstants.DEPTH_KEY, "-1"));
|
||||
if(vc.hasGenotypes()) {
|
||||
for(Genotype gt : vc.getGenotypes()) {
|
||||
if(gt.isHomRef() && gt.hasExtendedAttribute("MIN_DP")) //site-level DP contribution will come from MIN_DP for gVCF-called reference variants
|
||||
numOfReads -= Integer.parseInt(gt.getExtendedAttribute("MIN_DP").toString());
|
||||
}
|
||||
}
|
||||
return numOfReads;
|
||||
}
|
||||
else if (stratifiedContexts != null && !stratifiedContexts.isEmpty()) {
|
||||
for ( final Map.Entry<String, AlignmentContext> sample : stratifiedContexts.entrySet() ) {
|
||||
final AlignmentContext context = sample.getValue();
|
||||
for ( final PileupElement p : context.getBasePileup() ) {
|
||||
int mq = p.getRead().getMappingQuality();
|
||||
if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) {
|
||||
numOfReads++;
|
||||
}
|
||||
}
|
||||
}
|
||||
return numOfReads;
|
||||
}
|
||||
else if (perReadAlleleLikelihoodMap != null && !perReadAlleleLikelihoodMap.isEmpty())
|
||||
{
|
||||
for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : perReadAlleleLikelihoodMap.values() ) {
|
||||
for ( final GATKSAMRecord read : perReadLikelihoods.getStoredElements() ) {
|
||||
int mq = read.getMappingQuality();
|
||||
if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) {
|
||||
numOfReads++;
|
||||
}
|
||||
}
|
||||
}
|
||||
return numOfReads;
|
||||
}
|
||||
return -1;
|
||||
}
|
||||
|
||||
}
|
||||
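The raw-data approach in RMSAnnotation can be illustrated outside GATK with a minimal sketch. It assumes that each sample contributes a sum of squared mapping qualities, that combining samples adds those sums (as combineAttributeMap does for the Allele.NO_CALL entry), and that the finalized value is the square root of the combined sum divided by the read count. The class and method names below are hypothetical and are not part of GATK.

```java
import java.util.Locale;

public class RmsCombineSketch {

    // Hypothetical helper: sum of squared mapping qualities for one sample.
    static double sumOfSquares(final int[] mappingQualities) {
        double sum = 0.0;
        for (final int mq : mappingQualities) {
            sum += (double) mq * mq;
        }
        return sum;
    }

    // Hypothetical helper: turn a combined sum of squares into an RMS value.
    static double finalizeRms(final double combinedSumOfSquares, final int numReads) {
        if (numReads <= 0) {
            throw new IllegalArgumentException("numReads must be positive");
        }
        return Math.sqrt(combinedSumOfSquares / numReads);
    }

    public static void main(final String[] args) {
        final int[] sampleA = {60, 60, 20};
        final int[] sampleB = {40};

        // Combining raw data is a plain addition of the per-sample sums.
        final double combined = sumOfSquares(sampleA) + sumOfSquares(sampleB);

        // The read count comes from all samples together.
        final double rms = finalizeRms(combined, sampleA.length + sampleB.length);
        System.out.println(String.format(Locale.ROOT, "%.2f", rms)); // prints 47.96
    }
}
```

Keeping the sum of squares rather than a per-sample RMS is what allows the combined value to be exact; averaging already-finalized RMS values would in general give a different number.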
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -51,22 +51,21 @@
|
|||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.MathUtils;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFConstants;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFStandardHeaderLines;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*;
|
||||
import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller;
|
||||
import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.pileup.PileupElement;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
|
|
@ -74,62 +73,106 @@ import java.util.*;
|
|||
/**
|
||||
* Root Mean Square of the mapping quality of reads across all samples.
|
||||
*
|
||||
* <p>This annotation provides an estimation of the overall mapping quality of reads supporting a variant call, averaged over all samples in a cohort.</p>
|
||||
 * <p>This annotation provides an estimation of the overall mapping quality of reads supporting a variant call. It produces both raw data (the sum of squares and the total number of reads) and the calculated root mean square.</p>
|
||||
*
|
||||
* The raw data is used to accurately calculate the root mean square when combining more than one sample.
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
 * <p>The square of the root mean square equals the square of the mean of the mapping qualities plus their (population) variance.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>Uninformative reads are not used in this annotation.</p>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_RMSMappingQuality.php">AS_RMSMappingQuality</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityRankSumTest.php">MappingQualityRankSumTest</a></b> compares the mapping quality of reads supporting the REF and ALT alleles.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class RMSMappingQuality extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
public class RMSMappingQuality extends RMSAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation, ReducibleAnnotation {
|
||||
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final VariantContext vc,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap ) {
|
||||
@Override //this needs an override because MQ is a VCF standard annotation, so its header line is in a different place
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
final List<VCFInfoHeaderLine> headerLines = new ArrayList<>();
|
||||
//only HC in GVCF mode should get the raw header line
|
||||
if ((callingWalker instanceof HaplotypeCaller && ((HaplotypeCaller) callingWalker).emitReferenceConfidence()) || callingWalker instanceof CombineGVCFs)
|
||||
headerLines.add(GATKVCFHeaderLines.getInfoLine(getRawKeyName()));
|
||||
headerLines.add(VCFStandardHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
return headerLines;
|
||||
}
|
||||
|
||||
final List<Integer> qualities = new ArrayList<>();
|
||||
public List<String> getKeyNames() { return Arrays.asList(
|
||||
VCFConstants.RMS_MAPPING_QUALITY_KEY);
|
||||
}
|
||||
|
||||
public String getRawKeyName() { return GATKVCFConstants.RAW_RMS_MAPPING_QUALITY_KEY;}
|
||||
|
||||
@Override
|
||||
public void calculateRawData(final VariantContext vc, final Map<String, PerReadAlleleLikelihoodMap> pralm, final ReducibleAnnotationData rawAnnotations) {
|
||||
Double squareSum = 0.0;
|
||||
if ( pralm.size() == 0 )
|
||||
return;
|
||||
|
||||
for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : pralm.values() ) {
|
||||
for ( final GATKSAMRecord read : perReadLikelihoods.getStoredElements() ) {
|
||||
int mq = read.getMappingQuality();
|
||||
if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) {
|
||||
squareSum += mq * mq;
|
||||
}
|
||||
}
|
||||
}
|
||||
rawAnnotations.putAttribute(Allele.NO_CALL,squareSum);
|
||||
}
|
||||
|
||||
//this version applies to non-HaplotypeCaller annotators
|
||||
@Override
|
||||
protected void calculateRawData(final Map<String, AlignmentContext> stratifiedContexts,
|
||||
final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap,
|
||||
final ReducibleAnnotationData myData) {
|
||||
|
||||
Double squareSum = 0.0;
|
||||
if ( stratifiedContexts != null ) {
|
||||
if ( stratifiedContexts.size() == 0 )
|
||||
return null;
|
||||
return;
|
||||
|
||||
for ( final Map.Entry<String, AlignmentContext> sample : stratifiedContexts.entrySet() ) {
|
||||
final AlignmentContext context = sample.getValue();
|
||||
for ( final PileupElement p : context.getBasePileup() )
|
||||
fillMappingQualitiesFromPileup(p.getRead().getMappingQuality(), qualities);
|
||||
for ( final PileupElement p : context.getBasePileup() ) {
|
||||
int mq = p.getRead().getMappingQuality();
|
||||
if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) {
|
||||
squareSum += mq * mq;
|
||||
}
|
||||
}
|
||||
}
|
||||
myData.putAttribute(Allele.NO_CALL,squareSum);
|
||||
}
|
||||
else if (perReadAlleleLikelihoodMap != null) {
|
||||
if ( perReadAlleleLikelihoodMap.size() == 0 )
|
||||
return null;
|
||||
|
||||
for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : perReadAlleleLikelihoodMap.values() ) {
|
||||
for ( final GATKSAMRecord read : perReadLikelihoods.getStoredElements() )
|
||||
fillMappingQualitiesFromPileup(read.getMappingQuality(), qualities);
|
||||
}
|
||||
}
|
||||
else
|
||||
return null;
|
||||
|
||||
final double rms = MathUtils.rms(qualities);
|
||||
return Collections.singletonMap(getKeyNames().get(0), (Object)String.format("%.2f", rms));
|
||||
}
|
||||
|
||||
private static void fillMappingQualitiesFromPileup(final int mq, final List<Integer> qualities) {
|
||||
if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) {
|
||||
qualities.add(mq);
|
||||
calculateRawData((VariantContext) null, perReadAlleleLikelihoodMap, myData);
|
||||
}
|
||||
}
|
||||
|
||||
public List<String> getKeyNames() { return Arrays.asList(VCFConstants.RMS_MAPPING_QUALITY_KEY); }
|
||||
|
||||
public List<VCFInfoHeaderLine> getDescriptions() {
|
||||
return Arrays.asList(VCFStandardHeaderLines.getInfoLine(getKeyNames().get(0)));
|
||||
@Override
|
||||
public String makeRawAnnotationString(final List<Allele> vcAlleles, final Map<Allele, Number> perAlleleData) {
|
||||
return String.format("%.2f", perAlleleData.get(Allele.NO_CALL));
|
||||
}
|
||||
|
||||
@Override
|
||||
public String makeFinalizedAnnotationString(final VariantContext vc, final Map<Allele, Number> perAlleleData, final Map<String, AlignmentContext> stratifiedContexts, final Map<String, PerReadAlleleLikelihoodMap> perReadAlleleLikelihoodMap) {
|
||||
if ((stratifiedContexts != null && !stratifiedContexts.isEmpty()) || perReadAlleleLikelihoodMap != null) {
|
||||
int numOfReads = getNumOfReads(vc, perReadAlleleLikelihoodMap, stratifiedContexts);
|
||||
return String.format("%.2f", Math.sqrt((double) perAlleleData.get(Allele.NO_CALL) / numOfReads));
|
||||
}
|
||||
else {
|
||||
return makeFinalizedAnnotationString(vc, perAlleleData);
|
||||
}
|
||||
}
|
||||
|
||||
@Override
|
||||
public String makeFinalizedAnnotationString(final VariantContext vc, final Map<Allele, Number> perAlleleData) {
|
||||
int numOfReads = getNumOfReads(vc, null, null);
|
||||
return String.format("%.2f", Math.sqrt((double)perAlleleData.get(Allele.NO_CALL)/numOfReads));
|
||||
}
|
||||
|
||||
|
||||
}
|
||||
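The statistical note in the RMSMappingQuality Javadoc relates the root mean square to the mean and spread of the mapping qualities. As a hedged illustration, the sketch below checks numerically that the mean of the squared values equals the squared mean plus the population variance. The class and method names are hypothetical, and the mapping qualities are made-up example values.

```java
public class RmsIdentitySketch {

    static double mean(final int[] values) {
        double sum = 0.0;
        for (final int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Population variance: mean squared deviation from the mean.
    static double populationVariance(final int[] values) {
        final double m = mean(values);
        double sum = 0.0;
        for (final int v : values) {
            sum += (v - m) * (v - m);
        }
        return sum / values.length;
    }

    // Mean of the squared values, i.e. the square of the RMS.
    static double meanOfSquares(final int[] values) {
        double sum = 0.0;
        for (final int v : values) {
            sum += (double) v * v;
        }
        return sum / values.length;
    }

    public static void main(final String[] args) {
        final int[] mappingQualities = {60, 60, 20, 40};
        final double m = mean(mappingQualities);
        final double lhs = meanOfSquares(mappingQualities);
        final double rhs = m * m + populationVariance(mappingQualities);
        System.out.println("RMS^2 = " + lhs + ", mean^2 + variance = " + rhs);
        System.out.println("RMS = " + Math.sqrt(lhs));
    }
}
```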
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -52,12 +52,10 @@
|
|||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
|
||||
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.MannWhitneyU;
|
||||
|
|
@ -76,11 +74,12 @@ import java.util.*;
|
|||
|
||||
|
||||
/**
|
||||
* Abstract root for all RankSum based annotations
|
||||
* Abstract root for all RankSum-based annotations
|
||||
*/
|
||||
//TODO: will eventually implement ReducibleAnnotation in order to preserve accuracy for CombineGVCFs and GenotypeGVCFs -- see RMSAnnotation.java for an example of an abstract ReducibleAnnotation
|
||||
public abstract class RankSumTest extends InfoFieldAnnotation implements ActiveRegionBasedAnnotation {
|
||||
static final boolean DEBUG = false;
|
||||
private boolean useDithering = true;
|
||||
protected boolean useDithering = true;
|
||||
|
||||
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -51,12 +51,7 @@
|
|||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import htsjdk.samtools.Cigar;
|
||||
import htsjdk.samtools.CigarElement;
|
||||
import htsjdk.samtools.CigarOperator;
|
||||
import htsjdk.samtools.SAMRecord;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel;
|
||||
import htsjdk.variant.vcf.VCFInfoHeaderLine;
|
||||
import org.broadinstitute.gatk.utils.pileup.PileupElement;
|
||||
import org.broadinstitute.gatk.utils.sam.AlignmentUtils;
|
||||
|
|
@ -70,7 +65,9 @@ import java.util.*;
|
|||
/**
|
||||
* Rank Sum Test for relative positioning of REF versus ALT alleles within reads
|
||||
*
|
||||
* <p>This variant-level annotation tests whether there is evidence of bias in the position of alleles within the reads that support them, between the reference and alternate alleles. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.</p>
|
||||
* <p>This variant-level annotation tests whether there is evidence of bias in the position of alleles within the reads that support them, between the reference and alternate alleles.</p>
|
||||
*
|
||||
* <p>Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.</p>
|
||||
*
|
||||
* <p>The ideal result is a value close to zero, which indicates there is little to no difference in where the alleles are found relative to the ends of reads. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele. Conversely, a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele. </p>
|
||||
*
|
||||
|
|
@ -80,7 +77,15 @@ import java.util.*;
|
|||
* <p>The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for site position within reads (position within reads supporting REF vs. position within reads supporting ALT). See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of the ranksum test.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>The read position rank sum test can not be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</p>
|
||||
* <ul>
|
||||
 * <li>The read position rank sum test cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles.</li>
|
||||
* <li>Uninformative reads are not used in these annotations.</li>
|
||||
* </ul>
|
||||
*
|
||||
 * <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
 * <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_ReadPosRankSumTest.php">AS_ReadPosRankSumTest</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class ReadPosRankSumTest extends RankSumTest implements StandardAnnotation {
|
||||
|
|
@ -109,7 +114,7 @@ public class ReadPosRankSumTest extends RankSumTest implements StandardAnnotatio
|
|||
@Override
|
||||
protected Double getElementForPileupElement(final PileupElement p) {
|
||||
final int offset = AlignmentUtils.calcAlignmentByteArrayOffset(p.getRead().getCigar(), p, 0, 0);
|
||||
return (double)getFinalReadPosition(p.getRead(), offset);
|
||||
return (double)AnnotationUtils.getFinalVariantReadPosition(p.getRead(), offset);
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
@ -122,69 +127,5 @@ public class ReadPosRankSumTest extends RankSumTest implements StandardAnnotatio
|
|||
return super.isUsableRead(read, refLoc) && read.getSoftStart() + read.getCigar().getReadLength() > refLoc;
|
||||
}
|
||||
|
||||
private int getFinalReadPosition(final GATKSAMRecord read, final int initialReadPosition) {
|
||||
final int numAlignedBases = getNumAlignedBases(read);
|
||||
|
||||
int readPos = initialReadPosition;
|
||||
if (initialReadPosition > numAlignedBases / 2) {
|
||||
readPos = numAlignedBases - (initialReadPosition + 1);
|
||||
}
|
||||
return readPos;
|
||||
|
||||
}
|
||||
|
||||
private int getNumClippedBasesAtStart(final SAMRecord read) {
|
||||
// compute total number of clipped bases (soft or hard clipped)
|
||||
// check for hard clips (never consider these bases):
|
||||
final Cigar c = read.getCigar();
|
||||
final CigarElement first = c.getCigarElement(0);
|
||||
|
||||
int numStartClippedBases = 0;
|
||||
if (first.getOperator() == CigarOperator.H) {
|
||||
numStartClippedBases = first.getLength();
|
||||
}
|
||||
final byte[] unclippedReadBases = read.getReadBases();
|
||||
final byte[] unclippedReadQuals = read.getBaseQualities();
|
||||
|
||||
// Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative,
|
||||
// and may leave a string of Q2 bases still hanging off the reads.
|
||||
for (int i = numStartClippedBases; i < unclippedReadBases.length; i++) {
|
||||
if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD)
|
||||
numStartClippedBases++;
|
||||
else
|
||||
break;
|
||||
|
||||
}
|
||||
|
||||
return numStartClippedBases;
|
||||
}
|
||||
|
||||
private int getNumAlignedBases(final GATKSAMRecord read) {
|
||||
return read.getReadLength() - getNumClippedBasesAtStart(read) - getNumClippedBasesAtEnd(read);
|
||||
}
|
||||
|
||||
private int getNumClippedBasesAtEnd(final GATKSAMRecord read) {
|
||||
// compute total number of clipped bases (soft or hard clipped)
|
||||
// check for hard clips (never consider these bases):
|
||||
final Cigar c = read.getCigar();
|
||||
CigarElement last = c.getCigarElement(c.numCigarElements() - 1);
|
||||
|
||||
int numEndClippedBases = 0;
|
||||
if (last.getOperator() == CigarOperator.H) {
|
||||
numEndClippedBases = last.getLength();
|
||||
}
|
||||
final byte[] unclippedReadBases = read.getReadBases();
|
||||
final byte[] unclippedReadQuals = read.getBaseQualities();
|
||||
|
||||
// Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative,
|
||||
// and may leave a string of Q2 bases still hanging off the reads.
|
||||
for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) {
|
||||
if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD)
|
||||
numEndClippedBases++;
|
||||
else
|
||||
break;
|
||||
}
|
||||
|
||||
return numEndClippedBases;
|
||||
}
|
||||
}
|
||||
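The removed getFinalReadPosition method folds a position in the second half of a read back toward the nearer end, so the value fed to the rank sum test is a distance from a read end rather than an absolute offset. The sketch below reproduces only that folding step with plain integers. It is an assumption that AnnotationUtils.getFinalVariantReadPosition behaves the same way; the class and method names here are hypothetical, and the clipping logic that produces the aligned-base count is left out.

```java
public class ReadPositionFoldSketch {

    // Mirrors the folding in the removed getFinalReadPosition:
    // positions past the midpoint are measured from the far end instead.
    static int distanceFromNearestEnd(final int initialReadPosition, final int numAlignedBases) {
        if (initialReadPosition > numAlignedBases / 2) {
            return numAlignedBases - (initialReadPosition + 1);
        }
        return initialReadPosition;
    }

    public static void main(final String[] args) {
        final int numAlignedBases = 10;
        for (int pos = 0; pos < numAlignedBases; pos++) {
            System.out.println(pos + " -> " + distanceFromNearestEnd(pos, numAlignedBases));
        }
    }
}
```

With 10 aligned bases, positions 0 through 5 are returned unchanged, while positions 6 through 9 map to 3, 2, 1 and 0.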
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -56,6 +56,7 @@ import htsjdk.variant.variantcontext.Genotype;
|
|||
import htsjdk.variant.variantcontext.GenotypeBuilder;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFFormatHeaderLine;
|
||||
import org.apache.commons.lang.StringUtils;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.GenotypeAnnotation;
|
||||
|
|
@ -105,7 +106,7 @@ import java.util.Map;
|
|||
|
||||
public class StrandAlleleCountsBySample extends GenotypeAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(StrandAlleleCountsBySample.class);
|
||||
boolean[] warningsLogged = new boolean[4];
|
||||
private final boolean[] warningsLogged = new boolean[AnnotationUtils.WARNINGS_LOGGED_SIZE];
|
||||
|
||||
@Override
|
||||
public void annotate(final RefMetaDataTracker tracker,
|
||||
|
|
@ -117,7 +118,7 @@ public class StrandAlleleCountsBySample extends GenotypeAnnotation {
|
|||
final GenotypeBuilder gb,
|
||||
final PerReadAlleleLikelihoodMap alleleLikelihoodMap) {
|
||||
|
||||
if ( !AnnotationUtils.isAppropriateInput(walker, alleleLikelihoodMap, g, warningsLogged, logger) ) {
|
||||
if ( !AnnotationUtils.isAppropriateInput(GATKVCFConstants.STRAND_COUNT_BY_SAMPLE_KEY, walker, alleleLikelihoodMap, g, warningsLogged, logger) ) {
|
||||
return;
|
||||
}
|
||||
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -99,7 +99,7 @@ import java.util.*;
|
|||
|
||||
public class StrandBiasBySample extends GenotypeAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(StrandBiasBySample.class);
|
||||
boolean[] warningsLogged = new boolean[4];
|
||||
private final boolean[] warningsLogged = new boolean[AnnotationUtils.WARNINGS_LOGGED_SIZE];
|
||||
|
||||
@Override
|
||||
public void annotate(final RefMetaDataTracker tracker,
|
||||
|
|
@ -110,14 +110,13 @@ public class StrandBiasBySample extends GenotypeAnnotation {
|
|||
final Genotype g,
|
||||
final GenotypeBuilder gb,
|
||||
final PerReadAlleleLikelihoodMap alleleLikelihoodMap) {
|
||||
|
||||
if (!AnnotationUtils.isAppropriateInput(walker, alleleLikelihoodMap, g, warningsLogged, logger)) {
|
||||
if (!AnnotationUtils.isAppropriateInput(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY, walker, alleleLikelihoodMap, g, warningsLogged, logger)) {
|
||||
return;
|
||||
}
|
||||
|
||||
final int[][] table = FisherStrand.getContingencyTable(Collections.singletonMap(g.getSampleName(), alleleLikelihoodMap), vc, 0);
|
||||
|
||||
gb.attribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY, FisherStrand.getContingencyArray(table));
|
||||
gb.attribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY, StrandBiasTableUtils.getContingencyArray(table));
|
||||
}
|
||||
|
||||
@Override
|
||||
|
|
|
|||
|
|
@ -0,0 +1,250 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator;
|
||||
|
||||
import cern.jet.math.Arithmetic;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.List;
|
||||
|
||||
/**
|
||||
* A class containing many convenience methods used in the strand bias annotation calculations
|
||||
*/
|
||||
public class StrandBiasTableUtils {
|
||||
|
||||
private final static Logger logger = Logger.getLogger(StrandBiasTableUtils.class);
|
||||
|
||||
//For now this is only for 2x2 contingency tables
|
||||
protected static final int ARRAY_DIM = 2;
|
||||
protected static final int ARRAY_SIZE = ARRAY_DIM * ARRAY_DIM;
|
||||
private static final double MIN_PVALUE = 1E-320;
|
||||
// how large do we want the normalized table to be?
|
||||
private static final double TARGET_TABLE_SIZE = 200.0;
|
||||
private final static double AUGMENTATION_CONSTANT = 1.0;
|
||||
|
||||
/**
|
||||
* Computes a two-sided p-value for Fisher's exact test on the contingency table. Tables whose counts sum to more than twice {@value org.broadinstitute.gatk.tools.walkers.annotator.StrandBiasTableUtils#TARGET_TABLE_SIZE} are first scaled down so that the sum is approximately that target.
|
||||
* @param originalTable the 2x2 contingency table of read counts (not modified by this method)
|
||||
* @return the two-sided p-value, capped at 1.0
|
||||
*/
|
||||
public static Double FisherExactPValueForContingencyTable(int[][] originalTable) {
|
||||
final int[][] normalizedTable = normalizeContingencyTable(originalTable);
|
||||
|
||||
int[][] table = copyContingencyTable(normalizedTable);
|
||||
|
||||
double pCutoff = computePValue(table);
|
||||
|
||||
double pValue = pCutoff;
|
||||
while (rotateTable(table)) {
|
||||
double pValuePiece = computePValue(table);
|
||||
|
||||
if (pValuePiece <= pCutoff) {
|
||||
pValue += pValuePiece;
|
||||
}
|
||||
}
|
||||
|
||||
table = copyContingencyTable(normalizedTable);
|
||||
while (unrotateTable(table)) {
|
||||
double pValuePiece = computePValue(table);
|
||||
|
||||
if (pValuePiece <= pCutoff) {
|
||||
pValue += pValuePiece;
|
||||
}
|
||||
}
|
||||
|
||||
// min is necessary as numerical precision can result in pValue being slightly greater than 1.0
|
||||
return Math.min(pValue, 1.0);
|
||||
}
|
||||
|
||||
/**
|
||||
* Helper function to turn the FisherStrand table into the SB annotation array
|
||||
* @param table the table used by the FisherStrand annotation
|
||||
* @return the array used by the per-sample Strand Bias annotation
|
||||
*/
|
||||
public static List<Integer> getContingencyArray( final int[][] table ) {
|
||||
if(table.length != ARRAY_DIM || table[0].length != ARRAY_DIM) {
|
||||
logger.warn("Expecting a " + ARRAY_DIM + "x" + ARRAY_DIM + " strand bias table.");
|
||||
return null;
|
||||
}
|
||||
|
||||
final List<Integer> list = new ArrayList<>(ARRAY_SIZE);
|
||||
list.add(table[0][0]);
|
||||
list.add(table[0][1]);
|
||||
list.add(table[1][0]);
|
||||
list.add(table[1][1]);
|
||||
return list;
|
||||
}
|
||||
|
||||
/**
|
||||
* Printing information to logger.info for debugging purposes
|
||||
*
|
||||
* @param name the name of the table
|
||||
* @param table the table itself
|
||||
*/
|
||||
public static void printTable(final String name, final int[][] table) {
|
||||
final String pValue = String.format("%.3f", QualityUtils.phredScaleErrorRate(Math.max(FisherExactPValueForContingencyTable(table), MIN_PVALUE)));
|
||||
logger.info(String.format("FS %s (REF+, REF-, ALT+, ALT-) = (%d, %d, %d, %d) = %s",
|
||||
name, table[0][0], table[0][1], table[1][0], table[1][1], pValue));
|
||||
}
|
||||
|
||||
/**
|
||||
* Adds the small value AUGMENTATION_CONSTANT to all the entries of the table.
|
||||
*
|
||||
* @param table the table to augment
|
||||
* @return the augmented table
|
||||
*/
|
||||
protected static double[][] augmentContingencyTable(final int[][] table) {
|
||||
double[][] augmentedTable = new double[ARRAY_DIM][ARRAY_DIM];
|
||||
for ( int i = 0; i < ARRAY_DIM; i++ ) {
|
||||
for ( int j = 0; j < ARRAY_DIM; j++ )
|
||||
augmentedTable[i][j] = table[i][j] + AUGMENTATION_CONSTANT;
|
||||
}
|
||||
|
||||
return augmentedTable;
|
||||
}
|
||||
|
||||
/**
|
||||
* Normalize the table so that the entries are not too large.
|
||||
* Note that this method does NOT necessarily make a copy of the table being passed in!
|
||||
*
|
||||
* @param table the original table
|
||||
* @return a normalized version of the table or the original table if it is already normalized
|
||||
*/
|
||||
protected static int[][] normalizeContingencyTable(final int[][] table) {
|
||||
final int sum = table[0][0] + table[0][1] + table[1][0] + table[1][1];
|
||||
if ( sum <= TARGET_TABLE_SIZE * 2 )
|
||||
return table;
|
||||
|
||||
final double normalizationFactor = (double)sum / TARGET_TABLE_SIZE;
|
||||
|
||||
final int[][] normalized = new int[ARRAY_DIM][ARRAY_DIM];
|
||||
for ( int i = 0; i < ARRAY_DIM; i++ ) {
|
||||
for ( int j = 0; j < ARRAY_DIM; j++ )
|
||||
normalized[i][j] = (int)(table[i][j] / normalizationFactor);
|
||||
}
|
||||
|
||||
return normalized;
|
||||
}
|
||||
|
||||
public static int [][] copyContingencyTable(int [][] t) {
|
||||
int[][] c = new int[ARRAY_DIM][ARRAY_DIM];
|
||||
|
||||
for ( int i = 0; i < ARRAY_DIM; i++ ) {
|
||||
//System.arraycopy(t,0,c,0,ARRAY_DIM);
|
||||
for (int j = 0; j < ARRAY_DIM; j++) {
|
||||
c[i][j] = t[i][j];
|
||||
}
|
||||
}
|
||||
|
||||
return c;
|
||||
}
|
||||
|
||||
protected static boolean rotateTable(int[][] table) {
|
||||
table[0][0]--;
|
||||
table[1][0]++;
|
||||
|
||||
table[0][1]++;
|
||||
table[1][1]--;
|
||||
|
||||
return (table[0][0] >= 0 && table[1][1] >= 0);
|
||||
}
|
||||
|
||||
protected static boolean unrotateTable(int[][] table) {
|
||||
table[0][0]++;
|
||||
table[1][0]--;
|
||||
|
||||
table[0][1]--;
|
||||
table[1][1]++;
|
||||
|
||||
return (table[0][1] >= 0 && table[1][0] >= 0);
|
||||
}
|
||||
|
||||
protected static double computePValue(int[][] table) {
|
||||
|
||||
int[] rowSums = { sumRow(table, 0), sumRow(table, 1) };
|
||||
int[] colSums = { sumColumn(table, 0), sumColumn(table, 1) };
|
||||
int N = rowSums[0] + rowSums[1];
|
||||
|
||||
// calculate in log space for better precision
|
||||
double pCutoff = Arithmetic.logFactorial(rowSums[0])
|
||||
+ Arithmetic.logFactorial(rowSums[1])
|
||||
+ Arithmetic.logFactorial(colSums[0])
|
||||
+ Arithmetic.logFactorial(colSums[1])
|
||||
- Arithmetic.logFactorial(table[0][0])
|
||||
- Arithmetic.logFactorial(table[0][1])
|
||||
- Arithmetic.logFactorial(table[1][0])
|
||||
- Arithmetic.logFactorial(table[1][1])
|
||||
- Arithmetic.logFactorial(N);
|
||||
return Math.exp(pCutoff);
|
||||
}
|
||||
|
||||
private static int sumRow(int[][] table, int column) {
|
||||
int sum = 0;
|
||||
for (int r = 0; r < table.length; r++) {
|
||||
sum += table[r][column];
|
||||
}
|
||||
|
||||
return sum;
|
||||
}
|
||||
|
||||
private static int sumColumn(int[][] table, int row) {
|
||||
int sum = 0;
|
||||
for (int c = 0; c < table[row].length; c++) {
|
||||
sum += table[row][c];
|
||||
}
|
||||
|
||||
return sum;
|
||||
}
|
||||
|
||||
}
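The two-sided test in FisherExactPValueForContingencyTable above can be illustrated with a minimal, dependency-free sketch. It is not the GATK code: the class name FisherExactSketch, the loop-based logFactorial (standing in for cern.jet.math.Arithmetic.logFactorial) and the 1e-7 relative tolerance for floating-point ties are all assumptions made for this example. Instead of rotating the table in both directions, it enumerates every value of the top-left cell that is compatible with the fixed margins, which visits the same set of tables.

```java
/** Illustrative sketch only; not the GATK implementation. */
final class FisherExactSketch {

    // ln(n!) by direct summation; adequate for the small counts used here.
    static double logFactorial(final int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log(i);
        }
        return sum;
    }

    // Hypergeometric probability of one 2x2 table with its margins held fixed,
    // computed in log space as in computePValue above.
    static double tableProbability(final int a, final int b, final int c, final int d) {
        final double logP = logFactorial(a + b) + logFactorial(c + d)
                + logFactorial(a + c) + logFactorial(b + d)
                - logFactorial(a) - logFactorial(b)
                - logFactorial(c) - logFactorial(d)
                - logFactorial(a + b + c + d);
        return Math.exp(logP);
    }

    // Two-sided p-value: total probability of all tables with the same margins
    // that are no more likely than the observed table.
    static double twoSidedPValue(final int[][] t) {
        final int row0 = t[0][0] + t[0][1];
        final int row1 = t[1][0] + t[1][1];
        final int col0 = t[0][0] + t[1][0];
        final double observed = tableProbability(t[0][0], t[0][1], t[1][0], t[1][1]);
        // Assumption of this sketch: a small relative tolerance so that tables
        // with mathematically equal probability are not lost to rounding.
        final double cutoff = observed * (1.0 + 1e-7);

        double pValue = 0.0;
        final int minA = Math.max(0, col0 - row1);
        final int maxA = Math.min(row0, col0);
        for (int a = minA; a <= maxA; a++) {
            final double p = tableProbability(a, row0 - a, col0 - a, row1 - col0 + a);
            if (p <= cutoff) {
                pValue += p;
            }
        }
        // Rounding can push the sum slightly above 1.0.
        return Math.min(pValue, 1.0);
    }

    public static void main(final String[] args) {
        final int[][] table = { {3, 1}, {1, 3} };
        System.out.printf("two-sided p = %.4f%n", twoSidedPValue(table));
    }
}
```

For the table {{3, 1}, {1, 3}} the five possible tables have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided p-value is (1 + 16 + 16 + 1)/70 = 34/70.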
|
||||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -58,6 +58,7 @@ import htsjdk.variant.vcf.VCFFormatHeaderLine;
|
|||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
|
|
@ -77,7 +78,8 @@ import java.util.*;
|
|||
/**
|
||||
* Class of tests to detect strand bias.
|
||||
*/
|
||||
public abstract class StrandBiasTest extends InfoFieldAnnotation {
|
||||
//TODO: will eventually implement ReducibleAnnotation -- see RMSAnnotation.java for an example of an abstract ReducibleAnnotation
|
||||
public abstract class StrandBiasTest extends InfoFieldAnnotation implements ActiveRegionBasedAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(StrandBiasTest.class);
|
||||
private static boolean stratifiedPerReadAlleleLikelihoodMapWarningLogged = false;
|
||||
private static boolean inputVariantContextWarningLogged = false;
|
||||
|
|
@ -181,8 +183,16 @@ public abstract class StrandBiasTest extends InfoFieldAnnotation {
|
|||
continue;
|
||||
|
||||
foundData = true;
|
||||
final String sbbsString = (String) g.getAnyAttribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY);
|
||||
final int[] data = encodeSBBS(sbbsString);
|
||||
int[] data;
|
||||
if ( g.getAnyAttribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY).getClass().equals(String.class)) {
|
||||
final String sbbsString = (String) g.getAnyAttribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY);
|
||||
data = encodeSBBS(sbbsString);
|
||||
} else if (g.getAnyAttribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY).getClass().equals(ArrayList.class)) {
|
||||
ArrayList sbbsList = (ArrayList) g.getAnyAttribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY);
|
||||
data = encodeSBBS(sbbsList);
|
||||
} else
|
||||
throw new IllegalArgumentException("Unexpected " + GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY + " type");
|
||||
|
||||
if ( passesMinimumThreshold(data, minCount) ) {
|
||||
for( int index = 0; index < sbArray.length; index++ ) {
|
||||
sbArray[index] += data[index];
|
||||
|
|
@ -304,7 +314,6 @@ public abstract class StrandBiasTest extends InfoFieldAnnotation {
|
|||
private static void updateTable(final int[] table, final Allele allele, final GATKSAMRecord read, final Allele ref, final List<Allele> allAlts) {
|
||||
|
||||
final boolean matchesRef = allele.equals(ref, true);
|
||||
final boolean matchesAlt = allele.equals(allAlts.get(0), true);
|
||||
final boolean matchesAnyAlt = allAlts.contains(allele);
|
||||
|
||||
if ( matchesRef || matchesAnyAlt ) {
|
||||
|
|
@ -350,6 +359,20 @@ public abstract class StrandBiasTest extends InfoFieldAnnotation {
|
|||
return array;
|
||||
}
|
||||
|
||||
/**
|
||||
* Helper function to parse the genotype annotation into the SB annotation array
|
||||
* @param arrayList the ArrayList returned from StrandBiasBySample.annotate()
|
||||
* @return the array used by the per-sample Strand Bias annotation
|
||||
*/
|
||||
private static int[] encodeSBBS( final ArrayList<Integer> arrayList ) {
|
||||
final int[] array = new int[ARRAY_SIZE];
|
||||
int index = 0;
|
||||
for ( Integer item : arrayList )
|
||||
array[index++] = item.intValue();
|
||||
|
||||
return array;
|
||||
}
|
||||
|
||||
/**
|
||||
* Helper function to turn the SB annotation array into a contingency table
|
||||
* @param array the array used by the per-sample Strand Bias annotation
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -66,7 +66,9 @@ import java.util.*;
|
|||
/**
|
||||
* Strand bias estimated by the Symmetric Odds Ratio test
|
||||
*
|
||||
* <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other. The StrandOddsRatio annotation is one of several methods that aims to evaluate whether there is strand bias in the data. It is an updated form of the Fisher Strand Test that is better at taking into account large amounts of data in high coverage situations. It is used to determine if there is strand bias between forward and reverse strands for the reference or alternate allele.</p>
|
||||
* <p>Strand bias is a type of sequencing bias in which one DNA strand is favored over the other, which can result in incorrect evaluation of the amount of evidence observed for one allele vs. the other. </p>
|
||||
*
|
||||
* <p>The StrandOddsRatio annotation is one of several methods that aims to evaluate whether there is strand bias in the data. It is an updated form of the Fisher Strand Test that is better at taking into account large amounts of data in high coverage situations. It is used to determine if there is strand bias between forward and reverse strands for the reference or alternate allele. The reported value is ln-scaled.</p>
|
||||
*
|
||||
* <h3>Statistical notes</h3>
|
||||
* <p> Odds Ratios in the 2x2 contingency table below are</p>
|
||||
|
|
@ -93,15 +95,19 @@ import java.util.*;
|
|||
*
|
||||
* <p>See the <a href="http://www.broadinstitute.org/gatk/guide/article?id=4732">method document on statistical tests</a> for a more detailed explanation of this statistical test.</p>
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <p>
|
||||
* The name SOR is not entirely appropriate because the implementation changed between the start of development and the release of this annotation, so SOR is no longer strictly an odds ratio. The goal was to separate certain cases of data without penalizing variants that occur at the ends of exons, which tend to be covered by reads in only one direction (depending on which end of the exon they are on). For example, a variant with 10 ref reads in the + direction, 1 ref read in the - direction, 9 alt reads in the + direction and 2 alt reads in the - direction is not actually strand biased, yet its FS score is poor. The resulting implementation was derived in part from empirically testing read count tables of various sizes and ratios.</p>
|
||||
*
|
||||
* <h3>Related annotations</h3>
|
||||
* <ul>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_AS_StrandOddsRatio.php">AS_StrandOddsRatio</a></b> outputs an allele-specific version of this annotation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_StrandBiasBySample.php">StrandBiasBySample</a></b> outputs counts of read depth per allele for each strand orientation.</li>
|
||||
* <li><b><a href="https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_FisherStrand.php">FisherStrand</a></b> uses Fisher's Exact Test to evaluate strand bias.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class StrandOddsRatio extends StrandBiasTest implements StandardAnnotation, ActiveRegionBasedAnnotation {
|
||||
private final static double AUGMENTATION_CONSTANT = 1.0;
|
||||
private static final int MIN_COUNT = 0;
|
||||
|
||||
@Override
|
||||
|
|
@ -132,17 +138,17 @@ public class StrandOddsRatio extends StrandBiasTest implements StandardAnnotatio
|
|||
}
|
||||
|
||||
/**
|
||||
* Computes the SOR value of a table after augmentation. Based on the symmetric odds ratio but modified to take on
|
||||
* Computes the SOR value of a table after augmentation (adding pseudocounts). Based on the symmetric odds ratio but modified to take on
|
||||
* low values when the reference +/- read count ratio is skewed but the alt count ratio is not. Natural log is taken
|
||||
* to keep values within roughly the same range as other annotations.
|
||||
*
|
||||
* Augmentation avoids quotient by zero.
|
||||
* Adding pseudocounts prevents division by zero.
|
||||
*
|
||||
* @param originalTable The table before augmentation
|
||||
* @return the SOR annotation value
|
||||
*/
|
||||
final protected double calculateSOR(final int[][] originalTable) {
|
||||
final double[][] augmentedTable = augmentContingencyTable(originalTable);
|
||||
final double[][] augmentedTable = StrandBiasTableUtils.augmentContingencyTable(originalTable);
|
||||
|
||||
double ratio = 0;
|
||||
|
||||
|
|
@ -158,22 +164,6 @@ public class StrandOddsRatio extends StrandBiasTest implements StandardAnnotatio
|
|||
}
|
||||
|
||||
|
||||
/**
|
||||
* Adds the small value AUGMENTATION_CONSTANT to all the entries of the table.
|
||||
*
|
||||
* @param table the table to augment
|
||||
* @return the augmented table
|
||||
*/
|
||||
private static double[][] augmentContingencyTable(final int[][] table) {
|
||||
double[][] augmentedTable = new double[ARRAY_DIM][ARRAY_DIM];
|
||||
for ( int i = 0; i < ARRAY_DIM; i++ ) {
|
||||
for ( int j = 0; j < ARRAY_DIM; j++ )
|
||||
augmentedTable[i][j] = table[i][j] + AUGMENTATION_CONSTANT;
|
||||
}
|
||||
|
||||
return augmentedTable;
|
||||
}
|
||||
|
||||
/**
|
||||
* Returns an annotation result given a ratio
|
||||
*
|
||||
|
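The body of calculateSOR is elided in this diff, so the following is only a sketch of the unmodified symmetric odds ratio that the Javadoc says SOR is based on, not the GATK formula itself. The class name SymmetricOddsRatioSketch and the method lnSymmetricOddsRatio are hypothetical; the table layout (REF+, REF-, ALT+, ALT-) and the pseudocount of 1.0 follow printTable and AUGMENTATION_CONSTANT as shown above. The further adjustment for skewed reference counts described in the Javadoc is deliberately left out because its details are not visible here.

```java
/** Illustrative sketch only; omits the SOR-specific adjustment described in the Javadoc. */
final class SymmetricOddsRatioSketch {

    // Mirrors AUGMENTATION_CONSTANT from the diff; added to every cell.
    static final double PSEUDOCOUNT = 1.0;

    // Returns ln(R + 1/R), where R = (REF+ * ALT-) / (REF- * ALT+)
    // is computed on the pseudocount-augmented counts.
    static double lnSymmetricOddsRatio(final int[][] table) {
        final double refFwd = table[0][0] + PSEUDOCOUNT;
        final double refRev = table[0][1] + PSEUDOCOUNT;
        final double altFwd = table[1][0] + PSEUDOCOUNT;
        final double altRev = table[1][1] + PSEUDOCOUNT;

        // The pseudocounts guarantee a non-zero denominator.
        final double r = (refFwd * altRev) / (refRev * altFwd);
        return Math.log(r + 1.0 / r);
    }

    public static void main(final String[] args) {
        // Read counts from the caveat in the Javadoc: 10 REF+, 1 REF-, 9 ALT+, 2 ALT-.
        final int[][] table = { {10, 1}, {9, 2} };
        System.out.println(lnSymmetricOddsRatio(table));
    }
}
```

Because R and 1/R are summed, swapping the reference and alternate rows leaves the value unchanged, and a perfectly balanced table gives the minimum value ln(2).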
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -76,11 +76,6 @@ import java.util.*;
|
|||
*
|
||||
* <p>A tandem repeat unit is composed of one or more nucleotides that are repeated multiple times in series. Repetitive sequences are difficult to map to the reference because they are associated with multiple alignment possibilities. Knowing the number of repeat units in a set of tandem repeats tells you the number of different positions the tandem repeat can be placed in. The observation of many tandem repeat units multiplies the number of possible representations that can be made of the region.
|
||||
*
|
||||
* <h3>Caveat</h3>
|
||||
* <ul>
|
||||
* <li>This annotation is currently not compatible with HaplotypeCaller.</li>
|
||||
* </ul>
|
||||
*
|
||||
*/
|
||||
public class TandemRepeatAnnotator extends InfoFieldAnnotation implements StandardUGAnnotation, ActiveRegionBasedAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(TandemRepeatAnnotator.class);
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -0,0 +1,57 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.annotator.interfaces;
|
||||
|
||||
/**
|
||||
* Annotations implementing this interface are enabled by default in HaplotypeCaller
|
||||
*/
|
||||
public interface StandardHCAnnotation extends AnnotationType {}
|
||||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -79,7 +79,7 @@ import java.util.Map;
|
|||
* Create plots to visualize base recalibration results
|
||||
*
|
||||
* <p/>
|
||||
* This tool generates plots for visualizing the quality of a recalibration run.
|
||||
* This tool generates plots for visualizing the quality of a recalibration run (effected by <a href='http://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php'>BaseRecalibrator</a>).
|
||||
* </p>
|
||||
*
|
||||
* <h3>Input</h3>
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
@ -86,22 +86,35 @@ import java.util.Arrays;
|
|||
import java.util.List;
|
||||
|
||||
/**
|
||||
* Generate base recalibration table to compensate for systematic errors
|
||||
* Generate base recalibration table to compensate for systematic errors in basecalling confidences
|
||||
*
|
||||
* <p>
|
||||
* This tool is designed to work as the first pass in a two-pass processing step. It does a by-locus traversal operating
|
||||
* only at sites that are not in dbSNP. We assume that all reference mismatches we see are therefore errors and indicative
|
||||
* of poor base quality. This tool generates tables based on various user-specified covariates (such as read group,
|
||||
* reported quality score, cycle, and context). Since there is a large amount of data, one can then calculate an empirical
|
||||
* probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations.
|
||||
* The output file is a table (of the several covariate values, num observations, num mismatches, empirical quality score).
|
||||
* </p>
|
||||
* <p>
|
||||
* Note: ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added regardless of whether
|
||||
* or not they were specified.
|
||||
* Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence
|
||||
* read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores
|
||||
* produced by the machines are subject to various sources of systematic technical error, leading to over- or
|
||||
* under-estimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we
|
||||
* apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us
|
||||
* to get more accurate base qualities, which in turn improves the accuracy of our variant calls.
|
||||
*
|
||||
* The base recalibration process involves two key steps: first the program builds a model of covariation based on the
|
||||
* data and a set of known variants (which you can bootstrap if there is none available for your organism), then it
|
||||
* adjusts the base quality scores in the data based on the model.
|
||||
*
|
||||
* There is an optional but highly recommended step that involves building a second model and generating before/after
|
||||
* plots to visualize the effects of the recalibration process. This is useful for quality control purposes.
|
||||
*
|
||||
* This tool performs the first step described above: it builds the model of covariation and produces the recalibration
|
||||
* table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see are therefore
|
||||
* errors and indicative of poor base quality. This tool generates tables based on various user-specified covariates
|
||||
* (such as read group, reported quality score, cycle, and context). Assuming we are working with a large amount of data,
|
||||
* we can then calculate an empirical probability of error given the particular covariates seen at this site,
|
||||
* where p(error) = num mismatches / num observations.
|
||||
*
|
||||
* The output file is a table (of the several covariate values, number of observations, number of mismatches, empirical
|
||||
* quality score).
|
||||
* </p>
|
||||
*
|
||||
* <h3>Input</h3>
|
||||
* <h3>Inputs</h3>
|
||||
* <p>
|
||||
* A BAM file containing data that needs to be recalibrated.
|
||||
* <p>
|
||||
|
|
@ -131,6 +144,13 @@ import java.util.List;
|
|||
* -knownSites latest_dbsnp.vcf \
|
||||
* -o recal_data.table
|
||||
* </pre>
|
||||
*
|
||||
* <h3>Notes</h3>
|
||||
* <ul><li>This *base* recalibration process should not be confused with *variant* recalibration, which is a
|
||||
* sophisticated filtering technique applied on the variant callset produced in a later step of the analysis workflow.</li>
|
||||
* <li>ReadGroupCovariate and QualityScoreCovariate are required covariates and will be added regardless of whether
|
||||
* or not they were specified.</li></ul>
|
||||
*
|
||||
*/
|
||||
|
||||
@DocumentedGATKFeature(groupName = HelpConstants.DOCS_CAT_DATA, extraDocs = {CommandLineGATK.class})
|
||||
|
|
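As an illustration of the empirical error estimate described in the BaseRecalibrator Javadoc above, the following minimal sketch computes p(error) = num mismatches / num observations for one covariate bin and converts it to a Phred-scaled quality. The class and method names (`EmpiricalQualitySketch`, `errorProbability`, `phredScaled`) are hypothetical and are not part of GATK. The sketch uses only the plain ratio stated in the Javadoc, so it may differ from how GATK actually derives the empirical quality score written to the recalibration table.

```java
// Hypothetical illustration only; none of these names exist in GATK.
final class EmpiricalQualitySketch {

    /** Plain ratio from the Javadoc: p(error) = num mismatches / num observations. */
    static double errorProbability(long mismatches, long observations) {
        if (observations <= 0) {
            throw new IllegalArgumentException("observations must be positive");
        }
        if (mismatches < 0 || mismatches > observations) {
            throw new IllegalArgumentException("mismatches must be in [0, observations]");
        }
        return (double) mismatches / observations;
    }

    /**
     * Standard Phred scaling: Q = -10 * log10(p).
     * A probability of exactly 0 yields positive infinity, so a real
     * implementation would need some form of smoothing or capping.
     */
    static double phredScaled(double pError) {
        return -10.0 * Math.log10(pError);
    }

    public static void main(String[] args) {
        // One covariate bin: 10 reference mismatches among 10,000 observed bases.
        double p = errorProbability(10, 10_000);
        System.out.println("p(error) = " + p);
        System.out.println("empirical Q = " + phredScaled(p));
    }
}
```

With 10 mismatches in 10,000 observations the ratio is 0.001, which corresponds to a Phred-scaled quality of about 30.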
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@
|
|||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
|
|
|
|||
|
|
@ -0,0 +1,283 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer;
|
||||
|
||||
import org.broadinstitute.gatk.utils.commandline.Argument;
|
||||
import org.broadinstitute.gatk.utils.commandline.ArgumentCollection;
|
||||
import org.broadinstitute.gatk.utils.commandline.Output;
|
||||
import org.broadinstitute.gatk.engine.arguments.StandardVariantContextInputArgumentCollection;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.engine.walkers.RodWalker;
|
||||
import org.broadinstitute.gatk.engine.walkers.TreeReducible;
|
||||
import org.broadinstitute.gatk.utils.MathUtils;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import org.broadinstitute.gatk.engine.SampleUtils;
|
||||
import org.broadinstitute.gatk.utils.exceptions.UserException;
|
||||
import org.broadinstitute.gatk.engine.GATKVCFUtils;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.variantcontext.VariantContextBuilder;
|
||||
import htsjdk.variant.variantcontext.VariantContextUtils;
|
||||
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
|
||||
import htsjdk.variant.vcf.*;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Assigns somatic status to a set of calls
|
||||
*/
|
||||
public class AssignSomaticStatus extends RodWalker<Integer, Integer> implements TreeReducible<Integer> {
|
||||
@ArgumentCollection
|
||||
protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
|
||||
|
||||
@Argument(shortName="n", fullName="normalSample", required=true, doc="The normal sample")
|
||||
public String normalSample;
|
||||
|
||||
@Argument(shortName="t", fullName="tumorSample", required=true, doc="The tumor sample")
|
||||
public String tumorSample;
|
||||
|
||||
@Argument(shortName="somaticPriorQ", fullName="somaticPriorQ", required=false, doc="Phred-scaled probability that a site is a somatic mutation")
|
||||
public byte somaticPriorQ = 60;
|
||||
|
||||
@Argument(shortName="somaticMinLOD", fullName="somaticMinLOD", required=false, doc="Phred-scaled min probability that a site should be called somatic mutation")
|
||||
public byte somaticMinLOD = 1;
|
||||
|
||||
@Argument(shortName="minimalVCF", fullName="minimalVCF", required=false, doc="If provided, the attributes of the output VCF will only contain the somatic status fields")
|
||||
public boolean minimalVCF = false;
|
||||
|
||||
@Output
|
||||
protected VariantContextWriter vcfWriter = null;
|
||||
|
||||
private final String SOMATIC_LOD_TAG_NAME = "SOMATIC_LOD";
|
||||
private final String SOMATIC_AC_TAG_NAME = "SOMATIC_AC";
|
||||
private final String SOMATIC_NONREF_TAG_NAME = "SOMATIC_NNR";
|
||||
|
||||
private final Set<String> samples = new HashSet<String>(2);
|
||||
|
||||
/**
|
||||
* Check that the tumor and normal samples are present in the input VCF, and initialize the VCF writer
|
||||
*/
|
||||
public void initialize() {
|
||||
List<String> rodNames = new ArrayList<String>();
|
||||
rodNames.add(variantCollection.variants.getName());
|
||||
|
||||
Map<String, VCFHeader> vcfRods = GATKVCFUtils.getVCFHeadersFromRods(getToolkit(), rodNames);
|
||||
Set<String> vcfSamples = SampleUtils.getSampleList(vcfRods, GATKVariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE);
|
||||
|
||||
// set up tumor and normal samples
|
||||
if ( !vcfSamples.contains(normalSample) )
|
||||
throw new UserException.BadArgumentValue("--normalSample", "the normal sample " + normalSample + " doesn't match any sample from the input VCF");
|
||||
if ( !vcfSamples.contains(tumorSample) )
|
||||
throw new UserException.BadArgumentValue("--tumorSample", "the tumor sample " + tumorSample + " doesn't match any sample from the input VCF");
|
||||
|
||||
logger.info("Normal sample: " + normalSample);
|
||||
logger.info("Tumor sample: " + tumorSample);
|
||||
|
||||
Set<VCFHeaderLine> headerLines = new HashSet<VCFHeaderLine>();
|
||||
headerLines.addAll(GATKVCFUtils.getHeaderFields(this.getToolkit()));
|
||||
headerLines.add(new VCFInfoHeaderLine(VCFConstants.SOMATIC_KEY, 0, VCFHeaderLineType.Flag, "Is this a confidently called somatic mutation"));
|
||||
headerLines.add(new VCFInfoHeaderLine(SOMATIC_LOD_TAG_NAME, 1, VCFHeaderLineType.Float, "log10 probability that the site is a somatic mutation"));
|
||||
headerLines.add(new VCFInfoHeaderLine(SOMATIC_AC_TAG_NAME, 1, VCFHeaderLineType.Integer, "Allele count of samples with somatic event"));
|
||||
headerLines.add(new VCFInfoHeaderLine(SOMATIC_NONREF_TAG_NAME, 1, VCFHeaderLineType.Integer, "Number of samples with somatic event"));
|
||||
|
||||
samples.add(normalSample);
|
||||
samples.add(tumorSample);
|
||||
vcfWriter.writeHeader(new VCFHeader(headerLines, samples));
|
||||
}
|
||||
|
||||
private double log10pNonRefInSamples(final VariantContext vc, final String sample) {
|
||||
return log10PLFromSamples(vc, sample, false);
|
||||
}
|
||||
|
||||
private double log10pRefInSamples(final VariantContext vc, final String sample) {
|
||||
return log10PLFromSamples(vc, sample, true);
|
||||
}
|
||||
|
||||
private double log10PLFromSamples(final VariantContext vc, final String sample, boolean calcRefP) {
|
||||
|
||||
Genotype g = vc.getGenotype(sample);
|
||||
double log10pSample = -1000;
|
||||
if ( ! g.isNoCall() ) {
|
||||
final double[] gLikelihoods = MathUtils.normalizeFromLog10(g.getLikelihoods().getAsVector());
|
||||
log10pSample = Math.log10(calcRefP ? gLikelihoods[0] : 1 - gLikelihoods[0]);
|
||||
log10pSample = Double.isInfinite(log10pSample) ? -10000 : log10pSample;
|
||||
}
|
||||
return log10pSample;
|
||||
}
|
||||
|
||||
private int calculateTumorAC(final VariantContext vc) {
|
||||
int ac = 0;
|
||||
switch ( vc.getGenotype(tumorSample).getType() ) {
|
||||
case HET: ac += 1; break;
|
||||
case HOM_VAR: ac += 2; break;
|
||||
case NO_CALL: case UNAVAILABLE: case HOM_REF: break;
|
||||
}
|
||||
return ac;
|
||||
}
|
||||
|
||||
private int calculateTumorNNR(final VariantContext vc) {
|
||||
int nnr = 0;
|
||||
switch ( vc.getGenotype(tumorSample).getType() ) {
|
||||
case HET: case HOM_VAR: nnr += 1; break;
|
||||
case NO_CALL: case UNAVAILABLE: case HOM_REF: break;
|
||||
}
|
||||
return nnr;
|
||||
}
|
||||
|
||||
/**
|
||||
* P(somatic | D)
|
||||
* = P(somatic) * P(D | somatic)
|
||||
* = P(somatic) * P(D | normals are ref) * P(D | tumors are non-ref)
|
||||
*
|
||||
* P(! somatic | D)
|
||||
* = P(! somatic) * P(D | ! somatic)
|
||||
* = P(! somatic) *
|
||||
*   ( P(D | normals are non-ref) * P(D | tumors are non-ref) [germline]
|
||||
* + P(D | normals are ref) * P(D | tumors are ref)) [no-variant at all]
|
||||
*
|
||||
* @param vc the variant context holding the tumor and normal genotypes
|
||||
* @return the log10 odds of somatic versus not somatic, or -10000 if that value is infinite
|
||||
*/
|
||||
private double calcLog10pSomatic(final VariantContext vc) {
|
||||
// genotype probabilities for the tumor sample
|
||||
double log10pNonRefInTumors = log10pNonRefInSamples(vc, tumorSample);
|
||||
double log10pRefInTumors = log10pRefInSamples(vc, tumorSample);
|
||||
|
||||
// genotype probabilities for the normal sample
|
||||
double log10pNonRefInNormals = log10pNonRefInSamples(vc, normalSample);
|
||||
double log10pRefInNormals = log10pRefInSamples(vc, normalSample);
|
||||
|
||||
// priors
|
||||
double log10pSomaticPrior = QualityUtils.qualToErrorProbLog10(somaticPriorQ);
|
||||
double log10pNotSomaticPrior = Math.log10(1 - QualityUtils.qualToErrorProb(somaticPriorQ));
|
||||
|
||||
double log10pNotSomaticGermline = log10pNonRefInNormals + log10pNonRefInTumors;
|
||||
double log10pNotSomaticNoVariant = log10pRefInNormals + log10pRefInTumors;
|
||||
|
||||
double log10pNotSomatic = log10pNotSomaticPrior + MathUtils.log10sumLog10(new double[]{log10pNotSomaticGermline, log10pNotSomaticNoVariant});
|
||||
double log10pSomatic = log10pSomaticPrior + log10pNonRefInTumors + log10pRefInNormals;
|
||||
double lod = log10pSomatic - log10pNotSomatic;
|
||||
|
||||
return Double.isInfinite(lod) ? -10000 : lod;
|
||||
}
|
||||
|
||||
/**
|
||||
* For each variant at the current locus, compute the somatic LOD for the tumor/normal pair, annotate the record, and write it out
|
||||
*
|
||||
* @param tracker the reference meta-data tracker
|
||||
* @param ref the reference context
|
||||
* @param context the alignment context
|
||||
* @return null
|
||||
*/
|
||||
@Override
|
||||
public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
|
||||
if (tracker != null) {
|
||||
for ( VariantContext vc : tracker.getValues(variantCollection.variants, context.getLocation()) ) {
|
||||
vc = vc.subContextFromSamples(samples);
|
||||
if ( !vc.isPolymorphicInSamples() )
|
||||
continue;
|
||||
|
||||
double log10pSomatic = calcLog10pSomatic(vc);
|
||||
|
||||
// write in the somatic status probability
|
||||
Map<String, Object> attrs = new HashMap<String, Object>();
|
||||
if ( ! minimalVCF ) attrs.putAll(vc.getAttributes());
|
||||
attrs.put(SOMATIC_LOD_TAG_NAME, log10pSomatic);
|
||||
if ( log10pSomatic > somaticMinLOD ) {
|
||||
attrs.put(VCFConstants.SOMATIC_KEY, true);
|
||||
attrs.put(SOMATIC_NONREF_TAG_NAME, calculateTumorNNR(vc));
|
||||
attrs.put(SOMATIC_AC_TAG_NAME, calculateTumorAC(vc));
|
||||
|
||||
}
|
||||
final VariantContextBuilder builder = new VariantContextBuilder(vc).attributes(attrs);
|
||||
VariantContextUtils.calculateChromosomeCounts(builder, false);
|
||||
VariantContext newvc = builder.make();
|
||||
|
||||
vcfWriter.add(newvc);
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
return null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Provide an initial value for reduce computations.
|
||||
*
|
||||
* @return Initial value of reduce.
|
||||
*/
|
||||
@Override
|
||||
public Integer reduceInit() {
|
||||
return null;
|
||||
}
|
||||
|
||||
/**
|
||||
* Reduces a single map with the accumulator provided as the ReduceType.
|
||||
*
|
||||
* @param value result of the map.
|
||||
* @param sum accumulator for the reduce.
|
||||
* @return accumulator with result of the map taken into account.
|
||||
*/
|
||||
@Override
|
||||
public Integer reduce(Integer value, Integer sum) {
|
||||
return null;
|
||||
}
|
||||
|
||||
@Override
|
||||
public Integer treeReduce(Integer sum1, Integer sum2) {
|
||||
return reduce(sum1, sum2);
|
||||
}
|
||||
}
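The somatic LOD computed by `calcLog10pSomatic` above can be reproduced outside of GATK with plain arithmetic. The following is a minimal sketch, not the tool's implementation: the class `SomaticLodSketch` and its helpers are hypothetical stand-ins for `MathUtils.normalizeFromLog10`, `MathUtils.log10sumLog10`, and `QualityUtils`, and the prior of Q30 used in `main` is an assumed value, not a documented default. It assumes index 0 of each likelihood array is the hom-ref genotype, as in `log10PLFromSamples`.

```java
import java.util.Arrays;

// Hypothetical, self-contained illustration of the somatic LOD arithmetic.
final class SomaticLodSketch {

    // Turn log10 genotype likelihoods into linear-space probabilities that sum to 1.
    static double[] normalizeFromLog10(final double[] log10Likelihoods) {
        final double max = Arrays.stream(log10Likelihoods).max().getAsDouble();
        final double[] normalized = new double[log10Likelihoods.length];
        double sum = 0.0;
        for (int i = 0; i < log10Likelihoods.length; i++) {
            normalized[i] = Math.pow(10.0, log10Likelihoods[i] - max);
            sum += normalized[i];
        }
        for (int i = 0; i < normalized.length; i++) {
            normalized[i] /= sum;
        }
        return normalized;
    }

    // log10 of a probability, with the same -10000 floor the walker uses for log10(0).
    static double safeLog10(final double p) {
        final double value = Math.log10(p);
        return Double.isInfinite(value) ? -10000.0 : value;
    }

    // log10(10^a + 10^b), computed relative to the larger term to avoid underflow.
    static double log10SumLog10(final double a, final double b) {
        final double max = Math.max(a, b);
        return max + Math.log10(Math.pow(10.0, a - max) + Math.pow(10.0, b - max));
    }

    // LOD = log10 P(somatic, D) - log10 P(not somatic, D); priorQ is a Phred-scaled somatic prior.
    static double somaticLod(final double[] tumorLog10GL, final double[] normalLog10GL, final double priorQ) {
        final double pRefTumor = normalizeFromLog10(tumorLog10GL)[0];
        final double pRefNormal = normalizeFromLog10(normalLog10GL)[0];

        final double log10RefTumor = safeLog10(pRefTumor);
        final double log10NonRefTumor = safeLog10(1.0 - pRefTumor);
        final double log10RefNormal = safeLog10(pRefNormal);
        final double log10NonRefNormal = safeLog10(1.0 - pRefNormal);

        final double log10SomaticPrior = -priorQ / 10.0;
        final double log10NotSomaticPrior = Math.log10(1.0 - Math.pow(10.0, -priorQ / 10.0));

        final double log10Germline = log10NonRefNormal + log10NonRefTumor;
        final double log10NoVariant = log10RefNormal + log10RefTumor;

        final double log10Somatic = log10SomaticPrior + log10NonRefTumor + log10RefNormal;
        final double log10NotSomatic = log10NotSomaticPrior + log10SumLog10(log10Germline, log10NoVariant);
        final double lod = log10Somatic - log10NotSomatic;
        return Double.isInfinite(lod) ? -10000.0 : lod;
    }

    public static void main(final String[] args) {
        // Tumor looks heterozygous, normal looks hom-ref: the LOD should be positive.
        final double lod = somaticLod(new double[]{-10, 0, -10}, new double[]{0, -10, -20}, 30.0);
        System.out.println("somatic LOD = " + lod);
    }
}
```

With a confident het call in the tumor and a confident hom-ref call in the normal, both non-somatic explanations (germline, no variant) are penalized by one improbable genotype, so the somatic hypothesis wins despite its small prior. When both samples look hom-ref, the no-variant term dominates and the LOD goes strongly negative.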
|
||||
|
|
@ -0,0 +1,190 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.GenotypeBuilder;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFFormatHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFHeaderLineCount;
|
||||
import htsjdk.variant.vcf.VCFHeaderLineType;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.GenotypeAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardSomaticAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.exceptions.GATKException;
|
||||
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.sam.ReadUtils;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Arrays;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
|
||||
/**
|
||||
* Sum of evidence in reads supporting each allele for each sample
|
||||
*
|
||||
* <p>In the domain of somatic variants, a variant call can be supported by a few high quality reads. The
|
||||
* BaseQualitySumPerAlleleBySample annotation aims to give the user an estimate of the quality of the evidence supporting
|
||||
* a variant.</p>
|
||||
*
|
||||
* <h3>Notes</h3>
|
||||
* BaseQualitySumPerAlleleBySample is called and used by MuTect2 for variant filtering. This annotation is applied to SNPs
|
||||
* and INDELs. Qualities are not literal base qualities, but instead are derived from the per-allele likelihoods obtained
|
||||
* from the assembly engine.
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>At this time, BaseQualitySumPerAlleleBySample can only be called from MuTect2</li>
|
||||
* </ul>
|
||||
*/
|
||||
public class BaseQualitySumPerAlleleBySample extends GenotypeAnnotation implements StandardSomaticAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(BaseQualitySumPerAlleleBySample.class);
|
||||
private boolean walkerIdentityCheckWarningLogged = false;
|
||||
|
||||
public List<String> getKeyNames() { return Arrays.asList(GATKVCFConstants.QUALITY_SCORE_SUM_KEY); }
|
||||
|
||||
|
||||
public void annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final AlignmentContext stratifiedContext,
|
||||
final VariantContext vc,
|
||||
final Genotype g,
|
||||
final GenotypeBuilder gb,
|
||||
final PerReadAlleleLikelihoodMap alleleLikelihoodMap) {
|
||||
|
||||
// Can only call from MuTect2
|
||||
if ( !(walker instanceof MuTect2) ) {
|
||||
if ( !walkerIdentityCheckWarningLogged ) {
|
||||
if ( walker != null )
|
||||
logger.warn("Annotation will not be calculated, can only be called from MuTect2, not " + walker.getClass().getName());
|
||||
else
|
||||
logger.warn("Annotation will not be calculated, can only be called from MuTect2");
|
||||
walkerIdentityCheckWarningLogged = true;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
if ( g == null || !g.isCalled() || ( stratifiedContext == null && alleleLikelihoodMap == null) )
|
||||
return;
|
||||
|
||||
if (alleleLikelihoodMap != null) {
|
||||
annotateWithLikelihoods(alleleLikelihoodMap, vc, gb);
|
||||
}
|
||||
}
|
||||
|
||||
protected void annotateWithLikelihoods(final PerReadAlleleLikelihoodMap perReadAlleleLikelihoodMap, final VariantContext vc, final GenotypeBuilder gb) {
|
||||
final ArrayList<Double> refQuals = new ArrayList<>();
|
||||
final ArrayList<Double> altQuals = new ArrayList<>();
|
||||
|
||||
// collect one quality value per informative, usable read, split by ref-supporting and alt-supporting reads
|
||||
fillQualsFromLikelihoodMap(vc.getAlleles(), vc.getStart(), perReadAlleleLikelihoodMap, refQuals, altQuals);
|
||||
double refQualSum = 0;
|
||||
for(Double d : refQuals) { refQualSum += d; }
|
||||
|
||||
double altQualSum = 0;
|
||||
for(Double d : altQuals) { altQualSum += d; }
|
||||
|
||||
gb.attribute(GATKVCFConstants.QUALITY_SCORE_SUM_KEY, new Integer[]{ (int) refQualSum, (int) altQualSum});
|
||||
}
|
||||
|
||||
public List<VCFFormatHeaderLine> getDescriptions() {
|
||||
return Arrays.asList(new VCFFormatHeaderLine(getKeyNames().get(0), VCFHeaderLineCount.A, VCFHeaderLineType.Integer, "Sum of base quality scores for each allele"));
|
||||
}
|
||||
|
||||
// adapted from the rank sum test annotations
|
||||
protected void fillQualsFromLikelihoodMap(final List<Allele> alleles,
|
||||
final int refLoc,
|
||||
final PerReadAlleleLikelihoodMap likelihoodMap,
|
||||
final List<Double> refQuals,
|
||||
final List<Double> altQuals) {
|
||||
for ( final Map.Entry<GATKSAMRecord, Map<Allele,Double>> el : likelihoodMap.getLikelihoodReadMap().entrySet() ) {
|
||||
final MostLikelyAllele a = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue());
|
||||
if ( ! a.isInformative() )
|
||||
continue; // read is non-informative
|
||||
|
||||
final GATKSAMRecord read = el.getKey();
|
||||
if ( isUsableRead(read) ) {
|
||||
final Double value = getBaseQualityForRead(read, refLoc);
|
||||
if ( value == null )
|
||||
continue;
|
||||
|
||||
if ( a.getMostLikelyAllele().isReference() )
|
||||
refQuals.add(value);
|
||||
else if ( alleles.contains(a.getMostLikelyAllele()) )
|
||||
altQuals.add(value);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
protected boolean isUsableRead(final GATKSAMRecord read) {
|
||||
return !( read.getMappingQuality() == 0 ||
|
||||
read.getMappingQuality() == QualityUtils.MAPPING_QUALITY_UNAVAILABLE );
|
||||
}
|
||||
|
||||
|
||||
protected Double getBaseQualityForRead(final GATKSAMRecord read, final int refLoc) {
|
||||
return (double)read.getBaseQualities()[ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, refLoc, ReadUtils.ClippingTail.RIGHT_TAIL)];
|
||||
}
|
||||
|
||||
}
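The filtering and summing done by `fillQualsFromLikelihoodMap` and `annotateWithLikelihoods` above can be illustrated without any GATK types. This is a simplified sketch under stated assumptions: `QualitySumSketch` and `ReadEvidence` are hypothetical names, each read is reduced to four precomputed fields, the value 255 for an unavailable mapping quality is assumed to match `QualityUtils.MAPPING_QUALITY_UNAVAILABLE`, and the check that an alt-supporting read's allele is present in the variant's allele list is omitted.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, self-contained illustration of the per-allele quality sum.
final class QualitySumSketch {

    // Assumed sentinel for "mapping quality unavailable".
    static final int MAPPING_QUALITY_UNAVAILABLE = 255;

    // Minimal stand-in for a read plus its most likely allele.
    static final class ReadEvidence {
        final int mappingQuality;
        final int baseQuality;      // quality of the base overlapping the variant start
        final boolean supportsRef;  // true if the most likely allele is the reference
        final boolean informative;  // false if no allele is clearly favored

        ReadEvidence(final int mappingQuality, final int baseQuality,
                     final boolean supportsRef, final boolean informative) {
            this.mappingQuality = mappingQuality;
            this.baseQuality = baseQuality;
            this.supportsRef = supportsRef;
            this.informative = informative;
        }
    }

    // Mirrors isUsableRead: drop reads with mapping quality 0 or unavailable.
    static boolean isUsable(final ReadEvidence read) {
        return read.mappingQuality != 0 && read.mappingQuality != MAPPING_QUALITY_UNAVAILABLE;
    }

    // Returns {refQualSum, altQualSum} over informative, usable reads.
    static int[] sumQualities(final List<ReadEvidence> reads) {
        int refSum = 0;
        int altSum = 0;
        for (final ReadEvidence read : reads) {
            if (!read.informative || !isUsable(read)) {
                continue;
            }
            if (read.supportsRef) {
                refSum += read.baseQuality;
            } else {
                altSum += read.baseQuality;
            }
        }
        return new int[]{refSum, altSum};
    }

    public static void main(final String[] args) {
        final List<ReadEvidence> reads = Arrays.asList(
                new ReadEvidence(60, 30, true, true),
                new ReadEvidence(60, 20, false, true),
                new ReadEvidence(0, 40, false, true));   // skipped: mapping quality 0
        System.out.println(Arrays.toString(sumQualities(reads)));
    }
}
```

The two-element result corresponds to the `Integer[]{refQualSum, altQualSum}` array stored under `QUALITY_SCORE_SUM_KEY`; a variant backed by a handful of low-quality alt reads ends up with a small second element, which is what downstream filtering can key on.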
|
||||
|
|
@ -0,0 +1,197 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer;
|
||||
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.GenotypeBuilder;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.vcf.VCFFormatHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFHeaderLineCount;
|
||||
import htsjdk.variant.vcf.VCFHeaderLineType;
|
||||
import org.apache.log4j.Logger;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.GenotypeAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation;
|
||||
import org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2;
|
||||
import org.broadinstitute.gatk.utils.QualityUtils;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.exceptions.GATKException;
|
||||
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
|
||||
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
|
||||
import org.broadinstitute.gatk.utils.sam.ReadUtils;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.ArrayList;
|
||||
import java.util.Arrays;
|
||||
import java.util.List;
|
||||
import java.util.Map;
|
||||
|
||||
|
||||
/**
|
||||
* Count of read pairs in the F1R2 and F2R1 configurations supporting the reference and alternate alleles
|
||||
*
|
||||
* <p>This is an annotation that gathers information about the read pair configuration for the reads supporting each
|
||||
* allele. It can be used along with downstream filtering steps to identify and filter out erroneous variants that occur
|
||||
* with higher frequency in one read pair orientation.</p>
|
||||
*
|
||||
* <h3>References</h3>
|
||||
* <p>For more details about the mechanism of oxoG artifact generation, see <a href='http://www.ncbi.nlm.nih.gov/pubmed/23303777' target='_blank'>
|
||||
* "Discovery and characterization of artefactual mutations in deep coverage targeted capture sequencing data due to oxidative DNA damage during sample preparation."
|
||||
* by Costello et al.</a></p>
|
||||
*
|
||||
* <h3>Caveats</h3>
|
||||
* <ul>
|
||||
* <li>At present, this annotation can only be called from MuTect2</li>
|
||||
* <li>The FOXOG annotation is only calculated for SNPs</li>
|
||||
* </ul>
|
||||
*/
|
||||
public class OxoGReadCounts extends GenotypeAnnotation {
|
||||
private final static Logger logger = Logger.getLogger(OxoGReadCounts.class);
|
||||
private boolean walkerIdentityCheckWarningLogged = false;
|
||||
Allele refAllele;
|
||||
Allele altAllele;
|
||||
|
||||
public List<String> getKeyNames() {
|
||||
return Arrays.asList(GATKVCFConstants.OXOG_ALT_F1R2_KEY, GATKVCFConstants.OXOG_ALT_F2R1_KEY, GATKVCFConstants.OXOG_REF_F1R2_KEY, GATKVCFConstants.OXOG_REF_F2R1_KEY, GATKVCFConstants.OXOG_FRACTION_KEY);
|
||||
}
|
||||
|
||||
|
||||
public void annotate(final RefMetaDataTracker tracker,
|
||||
final AnnotatorCompatible walker,
|
||||
final ReferenceContext ref,
|
||||
final AlignmentContext stratifiedContext,
|
||||
final VariantContext vc,
|
||||
final Genotype g,
|
||||
final GenotypeBuilder gb,
|
||||
final PerReadAlleleLikelihoodMap alleleLikelihoodMap) {
|
||||
|
||||
// Can only call from MuTect2
|
||||
if ( !(walker instanceof MuTect2) ) {
|
||||
if ( !walkerIdentityCheckWarningLogged ) {
|
||||
if ( walker != null )
|
||||
logger.warn("Annotation will not be calculated, can only be called from MuTect2, not " + walker.getClass().getName());
|
||||
else
|
||||
logger.warn("Annotation will not be calculated, can only be called from MuTect2");
|
||||
walkerIdentityCheckWarningLogged = true;
|
||||
}
|
||||
return;
|
||||
}
|
||||
|
||||
if (g == null || !g.isCalled() || (stratifiedContext == null && alleleLikelihoodMap == null))
|
||||
return;
|
||||
|
||||
refAllele = vc.getReference();
|
||||
altAllele = vc.getAlternateAllele(0);
|
||||
|
||||
if (alleleLikelihoodMap != null) {
|
||||
annotateWithLikelihoods(alleleLikelihoodMap, vc, gb);
|
||||
}
|
||||
}
|
||||
|
||||
    protected void annotateWithLikelihoods(final PerReadAlleleLikelihoodMap perReadAlleleLikelihoodMap, final VariantContext vc, final GenotypeBuilder gb) {
        int ALT_F1R2, ALT_F2R1, REF_F1R2, REF_F2R1;
        ALT_F1R2 = ALT_F2R1 = REF_F1R2 = REF_F2R1 = 0;
        double numerator, denominator;

        for ( final Map.Entry<GATKSAMRecord, Map<Allele,Double>> el : perReadAlleleLikelihoodMap.getLikelihoodReadMap().entrySet() ) {
            final MostLikelyAllele a = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue());
            if ( ! a.isInformative() || ! isUsableRead(el.getKey()))
                continue; // read is non-informative, MQ0, or has no mapping quality

            // A paired read belongs to an F2R1 pair when it is the first of pair on the negative strand
            // or the second of pair on the positive strand; otherwise the pair is F1R2.
            if (a.getAlleleIfInformative().equals(refAllele, true) && el.getKey().getReadPairedFlag()) {
                if (el.getKey().getReadNegativeStrandFlag() == el.getKey().getFirstOfPairFlag())
                    REF_F2R1++;
                else
                    REF_F1R2++;
            }
            else if (a.getAlleleIfInformative().equals(altAllele,true) && el.getKey().getReadPairedFlag()){
                if (el.getKey().getReadNegativeStrandFlag() == el.getKey().getFirstOfPairFlag())
                    ALT_F2R1++;
                else
                    ALT_F1R2++;
            }
        }

        // FOXOG is the fraction of alt-supporting reads in the orientation selected by the reference base;
        // it stays null for non-SNPs and when no alt-supporting paired reads were counted.
        denominator = ALT_F1R2 + ALT_F2R1;
        Double fOxoG = null;
        if (vc.isSNP() && denominator > 0) {
            if (refAllele.equals(Allele.create((byte) 'C', true)) || refAllele.equals(Allele.create((byte) 'A', true)))
                numerator = ALT_F2R1;
            else
                numerator = ALT_F1R2;
            // divide as doubles; the previous (float) cast only discarded precision
            fOxoG = numerator / denominator;
        }

        gb.attribute(GATKVCFConstants.OXOG_ALT_F1R2_KEY, new Integer(ALT_F1R2));
        gb.attribute(GATKVCFConstants.OXOG_ALT_F2R1_KEY, new Integer(ALT_F2R1));
        gb.attribute(GATKVCFConstants.OXOG_REF_F1R2_KEY, new Integer(REF_F1R2));
        gb.attribute(GATKVCFConstants.OXOG_REF_F2R1_KEY, new Integer(REF_F2R1));
        gb.attribute(GATKVCFConstants.OXOG_FRACTION_KEY, fOxoG);
    }
|
||||
|
||||
public List<VCFFormatHeaderLine> getDescriptions() {
|
||||
return Arrays.asList(GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.OXOG_ALT_F1R2_KEY),
|
||||
GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.OXOG_ALT_F2R1_KEY),
|
||||
GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.OXOG_REF_F1R2_KEY),
|
||||
GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.OXOG_REF_F2R1_KEY),
|
||||
GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.OXOG_FRACTION_KEY));
|
||||
}
|
||||
|
||||
protected boolean isUsableRead(final GATKSAMRecord read) {
|
||||
return !( read.getMappingQuality() == 0 ||
|
||||
read.getMappingQuality() == QualityUtils.MAPPING_QUALITY_UNAVAILABLE );
|
||||
}
|
||||
}
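The orientation bookkeeping in `annotateWithLikelihoods` can be illustrated outside of GATK. The following is a minimal sketch, not the GATK implementation: `OxoGFractionSketch`, `Orientation`, `orientation`, and `foxog` are hypothetical names, and plain booleans and a `char` stand in for the `GATKSAMRecord` flags and the reference `Allele`. It assumes the site is a SNP and that only paired reads were counted.

```java
import java.util.OptionalDouble;

// Hypothetical, GATK-free illustration of the F1R2/F2R1 rule and the FOXOG ratio.
final class OxoGFractionSketch {

    enum Orientation { F1R2, F2R1 }

    // Mirrors the comparison in the diff: the pair is F2R1 when the
    // negative-strand flag equals the first-of-pair flag, otherwise F1R2.
    static Orientation orientation(boolean negativeStrand, boolean firstOfPair) {
        return negativeStrand == firstOfPair ? Orientation.F2R1 : Orientation.F1R2;
    }

    // Fraction of alt-supporting reads in the orientation selected by the
    // reference base: F2R1 for a C or A reference, F1R2 otherwise.
    // Empty when there are no alt-supporting reads (the diff leaves the value null).
    static OptionalDouble foxog(char refBase, int altF1R2, int altF2R1) {
        final int denominator = altF1R2 + altF2R1;
        if (denominator == 0) {
            return OptionalDouble.empty();
        }
        final int numerator = (refBase == 'C' || refBase == 'A') ? altF2R1 : altF1R2;
        return OptionalDouble.of((double) numerator / denominator);
    }

    public static void main(String[] args) {
        System.out.println(orientation(false, true)); // first of pair, forward strand
        System.out.println(foxog('C', 2, 6).getAsDouble());
    }
}
```

Returning an `OptionalDouble` here plays the role of the nullable `Double` in the diff: downstream code has to handle the "not computed" case explicitly.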
|
||||
|
|
@ -0,0 +1,186 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer.contamination;
|
||||
|
||||
|
||||
import org.broadinstitute.gatk.utils.commandline.Argument;
|
||||
import org.broadinstitute.gatk.utils.commandline.Input;
|
||||
import org.broadinstitute.gatk.utils.commandline.Output;
|
||||
import org.broadinstitute.gatk.utils.commandline.RodBinding;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.engine.samples.Sample;
|
||||
import org.broadinstitute.gatk.engine.walkers.DataSource;
|
||||
import org.broadinstitute.gatk.engine.walkers.Requires;
|
||||
import org.broadinstitute.gatk.engine.walkers.RodWalker;
|
||||
import org.broadinstitute.gatk.engine.walkers.TreeReducible;
|
||||
import htsjdk.variant.vcf.VCFHeader;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import org.broadinstitute.gatk.utils.exceptions.UserException;
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.variantcontext.VariantContextBuilder;
|
||||
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Given an input VCF representing a collection of populations, split the samples by population and annotate each record with per-population allele frequencies
|
||||
*/
|
||||
// @Requires(DataSource.SAMPLE) <- require the sample data when this works
|
||||
public class AnnotatePopulationAFWalker extends RodWalker<Integer, Integer> implements TreeReducible<Integer> {
|
||||
// control the output
|
||||
@Output(doc="File to which variants should be written",required=true)
|
||||
protected VariantContextWriter writer = null;
|
||||
|
||||
// our mapping of population to sample list
|
||||
private final Map<String, List<Sample>> popMapping = new LinkedHashMap<String, List<Sample>>();
|
||||
|
||||
@Input(fullName="population", shortName = "pop", doc="the VCF containing large populations of samples", required=true)
|
||||
public RodBinding<VariantContext> pop;
|
||||
|
||||
// build the population-to-samples mapping from the sample database, then write the VCF header
|
||||
public void initialize() {
|
||||
// get the sample information
|
||||
for (Sample sp: getToolkit().getSampleDB().getSamples())
|
||||
if (sp.getOtherPhenotype() != null) {
|
||||
if (!popMapping.containsKey(sp.getOtherPhenotype()))
|
||||
popMapping.put(sp.getOtherPhenotype(),new ArrayList<Sample>());
|
||||
popMapping.get(sp.getOtherPhenotype()).add(sp);
|
||||
}
|
||||
|
||||
// this is a stop-gap until the @Requires tag is working with sample information
|
||||
if (popMapping.size() == 0)
|
||||
throw new UserException.BadInput("we require a sample file that contains population information. Please see the wiki about how to supply one");
|
||||
|
||||
// setup our VCF
|
||||
// TODO: add code to get the samples from the input VCF, if they set 'preserveGenotypes' above
|
||||
Set<VCFHeaderLine> hInfo = new HashSet<VCFHeaderLine>();
|
||||
|
||||
VCFHeader vcfHeader = new VCFHeader(hInfo);
|
||||
writer.writeHeader(vcfHeader);
|
||||
}
|
||||
|
||||
|
||||
// boilerplate code - the standard reduce function for integers
|
||||
@Override public Integer reduceInit() { return 0; }
|
||||
@Override public Integer reduce(Integer value, Integer sum) { return(value + sum); }
|
||||
public Integer treeReduce(Integer lhs, Integer rhs) { return lhs + rhs; }
|
||||
|
||||
    @Override
    public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
        if (tracker == null) return 0;

        // get the variant contexts, and return if we have anything other than one record at this site
        Collection<VariantContext> vcs = tracker.getValues(pop);
        if (vcs.size() != 1) return 0;
        VariantContext originalVC = vcs.iterator().next();

        if (!originalVC.isSNP()) return 0;
        VariantContext vc = originalVC;

        // get the list of alleles
        List<Allele> vcAlleles = vc.getAlleles();
        // setup the mapping of population to per-allele frequency
        Map<String,Map<Allele,String>> popToAlleleFreq = new LinkedHashMap<String,Map<Allele,String>>();

        // initialize all pops
        Map<Allele,Integer> allPopAC = new LinkedHashMap<Allele,Integer>();
        int allPopTotal = 0;
        for (Allele a : vcAlleles) allPopAC.put(a,0);

        // find the sub-population allele frequencies, and annotate them
        for (Map.Entry<String, List<Sample>> pop : popMapping.entrySet()) {
            Map<Allele,Integer> thisPopAC = new LinkedHashMap<Allele,Integer>();
            int total = 0;
            for (Allele a : vcAlleles) thisPopAC.put(a,0);
            for (Sample s : pop.getValue()) {
                Genotype g = vc.getGenotype(s.getID());
                if (g == null) continue;
                for (Allele a : vcAlleles) {
                    // number of copies of this allele in the sample's genotype
                    // (a.length() would count bases in the allele, not allele copies)
                    int count = g.countAllele(a);

                    total += count;
                    thisPopAC.put(a,thisPopAC.get(a) + count);

                    allPopTotal += count;
                    allPopAC.put(a, allPopAC.get(a) + count);
                }
            }
            Map<Allele,String> thisPopAF = new LinkedHashMap<Allele,String>();
            for (Map.Entry<Allele,Integer> entry : thisPopAC.entrySet())
                thisPopAF.put(entry.getKey(),String.format("%1.5f", (total == 0) ? 0 : (double)entry.getValue()/(double)total));
            popToAlleleFreq.put(pop.getKey(),thisPopAF);
        }

        // add the all pops value as well
        Map<Allele, String> allPopAF = new LinkedHashMap<Allele, String>();
        for (Map.Entry<Allele,Integer> entry : allPopAC.entrySet())
            allPopAF.put(entry.getKey(), String.format("%1.5f", (allPopTotal == 0) ? 0 : (double)entry.getValue()/(double)allPopTotal));

        popToAlleleFreq.put("ALL", allPopAF);

        // add the population af annotations
        VariantContextBuilder vcb = new VariantContextBuilder(vc);
        Map<String,Object> popToAlleleFreqAsObject = new LinkedHashMap<String,Object>();
        for (Map.Entry<String,Map<Allele,String>> mp : popToAlleleFreq.entrySet()) {
            popToAlleleFreqAsObject.put(mp.getKey(),(Object)mp.getValue());
        }
        vcb.attributes(popToAlleleFreqAsObject);
        // write the annotated record built above, not the unannotated input
        writer.add(vcb.make());
        return 1;
    }
|
||||
|
||||
|
||||
}
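The per-population counting in `map` reduces to tallying allele copies across genotypes and dividing by the total number of copies seen. The following is a rough, GATK-free sketch of that calculation under stated assumptions: `PopulationAfSketch` and `alleleFrequencies` are hypothetical names, alleles are plain strings, and each genotype is simply the list of alleles a sample carries. It is not the walker's actual code.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of per-population allele-frequency counting.
final class PopulationAfSketch {

    // Counts how many copies of each site allele appear across the genotypes,
    // then divides by the total number of counted copies. Frequencies are 0.0
    // when no copies were counted, matching the walker's zero-total guard.
    static Map<String, Double> alleleFrequencies(List<String> siteAlleles,
                                                 List<List<String>> genotypes) {
        final Map<String, Integer> counts = new LinkedHashMap<>();
        for (String allele : siteAlleles) {
            counts.put(allele, 0);
        }
        int total = 0;
        for (List<String> genotype : genotypes) {
            for (String called : genotype) {
                if (counts.containsKey(called)) {
                    counts.merge(called, 1, Integer::sum);
                    total++;
                }
            }
        }
        final Map<String, Double> frequencies = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            frequencies.put(entry.getKey(),
                    total == 0 ? 0.0 : (double) entry.getValue() / total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        // Two diploid samples: A/A and A/G.
        final Map<String, Double> af = alleleFrequencies(
                Arrays.asList("A", "G"),
                Arrays.asList(Arrays.asList("A", "A"), Arrays.asList("A", "G")));
        for (Map.Entry<String, Double> entry : af.entrySet()) {
            System.out.println(entry.getKey() + "=" + String.format(Locale.ROOT, "%1.5f", entry.getValue()));
        }
    }
}
```

Counting copies per genotype is what distinguishes an allele count from an allele length: a homozygous sample contributes two copies of the same allele regardless of how many bases that allele spans.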
|
||||
|
|
@ -0,0 +1,729 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer.contamination;
|
||||
|
||||
import htsjdk.samtools.SAMReadGroupRecord;
|
||||
import htsjdk.samtools.util.StringUtil;
|
||||
import org.broadinstitute.gatk.engine.CommandLineGATK;
|
||||
import org.broadinstitute.gatk.engine.walkers.*;
|
||||
import org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.AFCalculatorProvider;
|
||||
import org.broadinstitute.gatk.tools.walkers.genotyper.afcalc.FixedAFCalculatorProvider;
|
||||
import org.broadinstitute.gatk.utils.commandline.*;
|
||||
import org.broadinstitute.gatk.engine.GenomeAnalysisEngine;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
|
||||
import org.broadinstitute.gatk.utils.help.HelpConstants;
|
||||
import org.broadinstitute.gatk.utils.sam.SAMReaderID;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedArgumentCollection;
|
||||
import org.broadinstitute.gatk.tools.walkers.genotyper.UnifiedGenotypingEngine;
|
||||
import org.broadinstitute.gatk.tools.walkers.genotyper.VariantCallContext;
|
||||
import org.broadinstitute.gatk.utils.GenomeLoc;
|
||||
import org.broadinstitute.gatk.utils.exceptions.GATKException;
|
||||
import org.broadinstitute.gatk.utils.exceptions.UserException;
|
||||
import org.broadinstitute.gatk.utils.pileup.ReadBackedPileup;
|
||||
import htsjdk.variant.variantcontext.*;
|
||||
|
||||
import java.io.*;
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Estimate cross-sample contamination
|
||||
*
|
||||
* This tool determines the percent contamination of an input BAM by sample, by lane, or in aggregate across all the input reads.
|
||||
*
|
||||
* <h3>Usage examples</h3>
|
||||
* <p>These are example commands that show how to run ContEst for typical use cases. Square brackets ("[ ]")
|
||||
* indicate optional arguments. Note that parameter values and/or resources shown here may not be the latest recommended; see the Best Practices documentation for detailed recommendations. </p>
|
||||
*
|
||||
* <br />
|
||||
* <h4>Contamination estimation using a VCF containing the normal sample's genotypes (as might be derived from a genotyping array)</h4>
|
||||
* <pre>
|
||||
* java \
|
||||
* -jar GenomeAnalysisTK.jar \
|
||||
* -T ContEst \
|
||||
* -R reference.fasta \
|
||||
* -I tumor.bam \
|
||||
* --genotypes normalGenotypes.vcf \
|
||||
* --popfile populationAlleleFrequencies.vcf \
|
||||
* -L populationSites.interval_list \
|
||||
* [-L targets.interval_list] \
|
||||
* -isr INTERSECTION \
|
||||
* -o output.txt
|
||||
* </pre>
|
||||
*
|
||||
* <br />
|
||||
* <h4>Contamination estimation using the normal BAM for genotyping on-the-fly</h4>
|
||||
* <pre>
|
||||
* java \
|
||||
* -jar GenomeAnalysisTK.jar \
|
||||
* -T ContEst \
|
||||
* -R reference.fasta \
|
||||
* -I:eval tumor.bam \
|
||||
* -I:genotype normal.bam \
|
||||
* --popfile populationAlleleFrequencies.vcf \
|
||||
* -L populationSites.interval_list \
|
||||
* [-L targets.interval_list] \
|
||||
* -isr INTERSECTION \
|
||||
* -o output.txt
|
||||
* </pre>
|
||||
*
|
||||
* <h3>Output</h3>
|
||||
* A text file containing estimated percent contamination, as well as error bars on this estimate.
|
||||
*
|
||||
* <h3>Notes</h3>
|
||||
* Multiple modes are supported simultaneously, e.g. contamination by sample and readgroup can be computed in the same run.
|
||||
*/
|
||||
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
|
||||
@Allows(value = {DataSource.READS, DataSource.REFERENCE})
|
||||
@Requires(value = {DataSource.READS, DataSource.REFERENCE}, referenceMetaData = @RMD(name = "genotypes", type = VariantContext.class))
|
||||
@By(DataSource.READS)
|
||||
public class ContEst extends RodWalker<Map<String, Map<String, ContaminationStats>>, ContaminationResults> {
|
||||
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
// Some constants we use
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
/** what type of run stats would we like: */
|
||||
public enum ContaminationRunType {
|
||||
SAMPLE, // calculate contamination for each sample
|
||||
READGROUP, // for each read group
|
||||
META // for all inputs as a single source
|
||||
}
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
// inputs
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
|
||||
// the genotypes ROD; this contains information about the genotypes from our sample
|
||||
@Input(fullName="genotypes", shortName = "genotypes", doc="the genotype information for our sample", required=false)
|
||||
public RodBinding<VariantContext> genotypes;
|
||||
|
||||
// the population information; the allele frequencies for each position in known populations
|
||||
@Input(fullName="popfile", shortName = "pf", doc="the variant file containing information about the population allele frequencies", required=true)
|
||||
public RodBinding<VariantContext> pop;
|
||||
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
// outputs and args
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
@Output
|
||||
PrintStream out; // the general output of the tool
|
||||
|
||||
@Argument(fullName = "min_qscore", required = false, doc = "threshold for minimum base quality score")
|
||||
public int MIN_QSCORE = 20;
|
||||
|
||||
@Argument(fullName = "min_mapq", required = false, doc = "threshold for minimum mapping quality score")
|
||||
public int MIN_MAPQ = 20;
|
||||
|
||||
@Argument(fullName = "trim_fraction", doc = "at most, what fraction of sites should be trimmed based on BETA_THRESHOLD", required = false)
|
||||
public double TRIM_FRACTION = 0.01;
|
||||
|
||||
@Argument(fullName = "beta_threshold", doc = "threshold for p(f>=0.5) to trim", required = false)
|
||||
public double BETA_THRESHOLD = 0.95;
|
||||
|
||||
@Argument(shortName = "llc", fullName = "lane_level_contamination", doc = "set to META (default), SAMPLE or READGROUP to produce per-bam, per-sample or per-lane estimates", required = false)
|
||||
private Set<ContaminationRunType> laneStats = null;
|
||||
|
||||
@Argument(shortName = "sn", fullName = "sample_name", doc = "The sample name; used to extract the correct genotypes from multi-sample truth VCFs", required = false)
|
||||
private String sampleName = "unknown";
|
||||
|
||||
@Argument(shortName = "pc", fullName = "precision", doc = "the degree of precision to which the contamination tool should estimate (e.g. the bin size)", required = false)
|
||||
private double precision = 0.1;
|
||||
|
||||
@Argument(shortName = "br", fullName = "base_report", doc = "Where to write a full report about the loci we processed", required = false)
|
||||
public PrintStream baseReport = null;
|
||||
|
||||
@Argument(shortName = "lf", fullName = "likelihood_file", doc = "write the likelihood values to the specified location", required = false)
|
||||
public PrintStream likelihoodFile = null;
|
||||
|
||||
@Argument(shortName = "vs", fullName = "verify_sample", doc = "should we verify that the sample name is in the genotypes file?", required = false)
|
||||
public boolean verifySample = false;
|
||||
|
||||
@Argument(shortName = "mbc", fullName = "minimum_base_count", doc = "what minimum number of bases do we need to see to call contamination in a lane / sample?", required = false)
|
||||
public Integer minBaseCount = 500;
|
||||
|
||||
@Argument(shortName = "population", fullName = "population", doc = "evaluate contamination for just a single contamination population", required = false)
|
||||
public String population = "CEU";
|
||||
|
||||
@Argument(shortName = "gm", fullName = "genotype_mode", doc = "which approach should we take to getting the genotypes (only in array-free mode)", required = false)
|
||||
public SeqGenotypeMode genotypeMode = SeqGenotypeMode.HARD_THRESHOLD;
|
||||
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
// hidden arguments
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
@Hidden
|
||||
@Argument(fullName = "trim_interval", doc = "progressively trim from 0 to TRIM_FRACTION by this interval", required = false)
|
||||
public double TRIM_INTERVAL = 0;
|
||||
|
||||
@Hidden
|
||||
@Argument(fullName = "min_site_depth", required = false, doc = "minimum depth at a site to consider in calculation")
|
||||
public int MIN_SITE_DEPTH = 0;
|
||||
|
||||
@Hidden
|
||||
@Argument(fullName = "fixed_epsilon_qscore", required = false, doc = "use a constant epsilon (phred scale) for calculation")
|
||||
public Byte FIXED_EPSILON = null;
|
||||
|
||||
@Hidden
|
||||
@Argument(fullName = "min_genotype_depth", required = false, doc = "what minimum depth is required to call a site in seq genotype mode")
|
||||
public int MIN_GENOTYPE_DEPTH_FOR_SEQ = 50;
|
||||
|
||||
@Hidden
|
||||
@Argument(fullName = "min_genotype_ratio", required = false, doc = "the ratio of alt to other bases to call a site a hom non-ref variant")
|
||||
public double MIN_GENOTYPE_RATIO = 0.80;
|
||||
|
||||
@Hidden
|
||||
@Argument(fullName = "min_genotype_llh", required = false, doc = "the min log likelihood for UG to call a genotype")
|
||||
public double MIN_UG_LOG_LIKELIHOOD = 5;
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
// global variables to the walker
|
||||
// ------------------------------------------------------------------------------------------------------------------------------------------------------
|
||||
private static final Map<Integer,Allele> alleles = new HashMap<Integer,Allele>(); // the set of alleles we work with
|
||||
private boolean verifiedSampleName = false; // have we yet verified the sample name?
|
||||
private final Map<String, ContaminationRunType> contaminationNames = new LinkedHashMap<String, ContaminationRunType>(); // a map from each contamination name (read group ID, sample name, or "META") to its run type
|
||||
private static String[] ALL_POPULATIONS = new String[]{"ALL", "CHD", "LWK", "CHB", "CEU", "MXL", "GIH", "MKK", "TSI", "CLM", "GBR", "ASW", "YRI", "IBS", "FIN", "PUR", "JPT", "CHS"};
|
||||
private String[] populationsToEvaluate;
|
||||
|
||||
// variables involved in the array-free mode
|
||||
private boolean useSequencingGenotypes = false; // if true we're using the sequencing genotypes; otherwise we require array genotypes
|
||||
public static final String EVAL_BAM_TAG = "eval";
|
||||
public static final String GENOTYPE_BAM_TAG = "genotype";
|
||||
String evalSample = null;
|
||||
String genotypeSample = null;
|
||||
|
||||
|
||||
// counts for each of the possible combinations
|
||||
int totalSites = 0;
|
||||
int countPopulationSites = 0;
|
||||
int countGenotypeNonHomVar = 0;
|
||||
int countGenotypeHomVar = 0;
|
||||
int countPassCoverage = 0;
|
||||
int countResults = 0;
|
||||
|
||||
public enum SeqGenotypeMode { HARD_THRESHOLD, UNIFIED_GENOTYPER }
|
||||
// create our list of allele characters for conversion
|
||||
static {
|
||||
alleles.put(0,Allele.create((byte) 'A'));
|
||||
alleles.put(1,Allele.create((byte) 'C'));
|
||||
alleles.put(2,Allele.create((byte) 'G'));
|
||||
alleles.put(3,Allele.create((byte) 'T'));
|
||||
}
|
||||
|
||||
// a bunch of setup to initialize the walker
|
||||
public void initialize() {
|
||||
// set the genotypes source - figure out what to do if we're not using arrays
|
||||
if (genotypes == null || !genotypes.isBound()) {
|
||||
logger.info("Running in sequencing mode");
|
||||
useSequencingGenotypes = true;
|
||||
// if we're not using arrays, we need to figure out what samples are what
|
||||
for(SAMReaderID id : getToolkit().getReadsDataSource().getReaderIDs()) {
|
||||
if (id.getTags().getPositionalTags().size() == 0)
|
||||
throw new UserException.BadInput("BAMs must be tagged with " + GENOTYPE_BAM_TAG + " and " + EVAL_BAM_TAG + " when running in array-free mode. Please see the ContEst documentation for more details");
|
||||
|
||||
// now sort out what tags go with what bam
|
||||
for (String tag : id.getTags().getPositionalTags()) {
|
||||
if (GENOTYPE_BAM_TAG.equalsIgnoreCase(tag)) {
|
||||
try {
|
||||
if (getToolkit().getReadsDataSource().getHeader(id).getReadGroups().size() == 0)
|
||||
throw new RuntimeException("No Read Groups found for Genotyping BAM -- Read Groups are Required in sequencing genotype mode!");
|
||||
genotypeSample = getToolkit().getReadsDataSource().getHeader(id).getReadGroups().get(0).getSample();
|
||||
} catch (NullPointerException npe) {
|
||||
throw new UserException.BadInput("Unable to fetch read group from the bam files tagged with " + GENOTYPE_BAM_TAG);
|
||||
}
|
||||
} else if (EVAL_BAM_TAG.equalsIgnoreCase(tag)) {
|
||||
try {
|
||||
if (getToolkit().getReadsDataSource().getHeader(id).getReadGroups().size() == 0)
|
||||
throw new RuntimeException("No Read Groups found for eval BAM -- Read Groups are Required in sequencing genotype mode!");
|
||||
evalSample = getToolkit().getReadsDataSource().getHeader(id).getReadGroups().get(0).getSample();
|
||||
} catch (NullPointerException npe) {
|
||||
throw new UserException.BadInput("Unable to fetch read group from the bam files tagged with " + EVAL_BAM_TAG);
|
||||
}
|
||||
} else {
|
||||
throw new UserException.BadInput("Unable to process " + tag + " tag, it's not either of the two accepted values: " + GENOTYPE_BAM_TAG + " or " + EVAL_BAM_TAG);
|
||||
}
|
||||
}
|
||||
}
|
||||
if (evalSample == null || genotypeSample == null)
|
||||
throw new UserException.BadInput("You must provide both a " + GENOTYPE_BAM_TAG + " tagged bam and a " + EVAL_BAM_TAG + " tagged bam file. Please see the ContEst documentation");
|
||||
|
||||
} else {
|
||||
logger.info("Running in array mode");
|
||||
}
|
||||
if (laneStats == null) {
|
||||
laneStats = new HashSet<ContaminationRunType>();
|
||||
laneStats.add(ContaminationRunType.META);
|
||||
}
|
||||
|
||||
for (ContaminationRunType type : laneStats) {
|
||||
if (type == ContaminationRunType.READGROUP) {
|
||||
for (SAMReadGroupRecord name : getToolkit().getSAMFileHeader().getReadGroups())
|
||||
this.contaminationNames.put(name.getId(),ContaminationRunType.READGROUP);
|
||||
} else if (type == ContaminationRunType.SAMPLE) {
|
||||
for (SAMReadGroupRecord name : getToolkit().getSAMFileHeader().getReadGroups())
|
||||
this.contaminationNames.put(name.getSample(),ContaminationRunType.SAMPLE);
|
||||
} else if (type == ContaminationRunType.META)
|
||||
this.contaminationNames.put("META",ContaminationRunType.META);
|
||||
else
|
||||
throw new IllegalArgumentException("Unknown type name " + type);
|
||||
}
|
||||
if (baseReport != null)
|
||||
baseReport.println("lane\tchrom\tposition\trs_id\tref\tfreq_major_allele\tfreq_minor_allele\tgeli_gt\tmaf\tmajor_allele_counts\tminor_allele_counts\ta_counts\tc_counts\tg_counts\tt_counts");
|
||||
|
||||
this.populationsToEvaluate = (population == null || "EVERY".equals(population)) ? ALL_POPULATIONS : new String[]{population};
|
||||
|
||||
}
|
||||
/**
|
||||
* our map function, which emits contamination stats for each of the subgroups (lanes, samples, etc) that we encounter
|
||||
*
|
||||
* @param tracker the reference meta data tracker, from which we get the array truth data
|
||||
* @param ref the reference information at this position
|
||||
* @param context the read context, where we get the alignment data
|
||||
* @return a mapping of our subgroup name to contamination estimate
|
||||
*/
|
||||
@Override
|
||||
public Map<String, Map<String, ContaminationStats>> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
|
||||
totalSites++;
|
||||
if (tracker == null) return null;
|
||||
if (context == null) return null;
|
||||
|
||||
VariantContext popVC = tracker.getFirstValue(pop);
|
||||
byte referenceBase = ref.getBase();
|
||||
if (popVC == null) return null;
|
||||
countPopulationSites++;
|
||||
Genotype genotype = getGenotype(tracker,context,ref,useSequencingGenotypes);
|
||||
|
||||
// only use homozygous variant (hom-var) sites
|
||||
if (genotype == null || !genotype.isHomVar()) {
|
||||
countGenotypeNonHomVar++;
|
||||
return null;
|
||||
} else {
|
||||
countGenotypeHomVar++;
|
||||
}
|
||||
|
||||
|
||||
// the base of our hom-var allele (non-reference, given the hom-var check above)
|
||||
byte myBase = genotype.getAllele(0).getBases()[0];
|
||||
|
||||
String rsNumber = "";
|
||||
|
||||
// our map of contamination results
|
||||
Map<String, Map<String, ContaminationStats>> stats = new HashMap<String, Map<String, ContaminationStats>>();
|
||||
|
||||
// get the base pileup. Restricting it by sample is only really required when we have both GENOTYPE_BAM_TAG and EVAL_BAM_TAG tagged bams,
|
||||
// because we only want contamination estimates drawn from the eval tagged bam
|
||||
ReadBackedPileup defaultPile;
|
||||
if (this.useSequencingGenotypes)
|
||||
defaultPile = context.getBasePileup().getPileupForSample(evalSample);
|
||||
else
|
||||
defaultPile = context.getBasePileup();
|
||||
|
||||
// if we're by-lane, get those stats
|
||||
for (Map.Entry<String, ContaminationRunType> namePair : contaminationNames.entrySet()) {
|
||||
ReadBackedPileup pile;
|
||||
if (namePair.getValue() == ContaminationRunType.READGROUP)
|
||||
pile = defaultPile.getPileupForReadGroup(namePair.getKey());
|
||||
else if (namePair.getValue() == ContaminationRunType.META)
|
||||
pile = defaultPile;
|
||||
else if (namePair.getValue() == ContaminationRunType.SAMPLE)
|
||||
pile = defaultPile.getPileupForSample(namePair.getKey());
|
||||
else
|
||||
throw new IllegalStateException("Unknown state, contamination type = " + namePair.getValue() + " is unsupported");
|
||||
if (pile != null) {
|
||||
|
||||
ReadBackedPileup filteredPile =
|
||||
pile.getBaseAndMappingFilteredPileup(MIN_QSCORE, MIN_MAPQ);
|
||||
|
||||
byte[] bases = filteredPile.getBases();
|
||||
|
||||
// restrict to sites that have at least our required total depth
|
||||
if (bases.length < MIN_SITE_DEPTH) {
|
||||
continue;
|
||||
} else {
|
||||
countPassCoverage++;
|
||||
}
|
||||
|
||||
byte[] quals;
|
||||
if (FIXED_EPSILON == null) {
|
||||
quals = filteredPile.getQuals();
|
||||
} else {
|
||||
quals = new byte[bases.length];
|
||||
Arrays.fill(quals, FIXED_EPSILON);
|
||||
}
|
||||
|
||||
Map<String, ContaminationStats> results =
|
||||
calcStats(referenceBase,
|
||||
bases,
|
||||
quals,
|
||||
myBase,
|
||||
rsNumber,
|
||||
popVC,
|
||||
baseReport,
|
||||
context.getLocation(),
|
||||
precision,
|
||||
namePair.getKey(),
|
||||
populationsToEvaluate);
|
||||
|
||||
if (results.size() > 0) {
|
||||
countResults++;
|
||||
stats.put(namePair.getKey(), results);
|
||||
}
|
||||
}
|
||||
}
|
||||
// return our collected stats
|
||||
return stats;
|
||||
}
|
||||
|
||||
/**
|
||||
* get the genotype for the sample at the current position
|
||||
* @param tracker the reference meta data (RODs)
|
||||
* @param context the reads
|
||||
* @param referenceContext the reference information
|
||||
* @param useSeq are we using sequencing to get our genotypes
|
||||
* @return a genotype call, which could be null
|
||||
*/
|
||||
private Genotype getGenotype(RefMetaDataTracker tracker, AlignmentContext context, ReferenceContext referenceContext, boolean useSeq) {
|
||||
if (!useSeq) {
|
||||
Genotype g = getGenotypeFromArray(tracker, this.genotypes,this.verifiedSampleName,this.verifySample,this.sampleName);
|
||||
if (g != null) this.verifiedSampleName = true;
|
||||
return g;
|
||||
} else {
|
||||
return getGenotypeFromSeq(
|
||||
context,
|
||||
referenceContext,
|
||||
this.alleles,
|
||||
this.genotypeMode,
|
||||
this.MIN_GENOTYPE_RATIO,
|
||||
this.MIN_GENOTYPE_DEPTH_FOR_SEQ,
|
||||
this.MIN_UG_LOG_LIKELIHOOD,
|
||||
this.genotypeSample,
|
||||
this.sampleName,
|
||||
this.getToolkit());
|
||||
}
|
||||
}
|
||||
|
||||
static Genotype getGenotypeFromSeq(AlignmentContext context,
|
||||
ReferenceContext referenceContext,
|
||||
Map<Integer, Allele> alleles,
|
||||
SeqGenotypeMode genotypeMode,
|
||||
double minGenotypeRatio,
|
||||
int minGenotypingDepth,
|
||||
double minGenotypingLOD,
|
||||
String genotypingSample,
|
||||
String sampleName,
|
||||
GenomeAnalysisEngine toolKit) {
|
||||
ReadBackedPileup pileup = context.getBasePileup().getPileupForSample(genotypingSample);
|
||||
if (pileup == null || pileup.isEmpty()) return null;
|
||||
|
||||
// which genotyping mode are we using
|
||||
if (genotypeMode == SeqGenotypeMode.HARD_THRESHOLD) {
|
||||
if (sum(pileup.getBaseCounts()) < minGenotypingDepth) return null;
|
||||
int[] bases = pileup.getBaseCounts();
|
||||
int mx = maxPos(bases);
|
||||
int allGenotypes = sum(bases);
|
||||
String refBase = String.valueOf((char)referenceContext.getBase());
|
||||
if (bases[mx] / (float)allGenotypes >= minGenotypeRatio && !refBase.equals(alleles.get(mx).getBaseString())) {
|
||||
List<Allele> al = new ArrayList<Allele>();
|
||||
al.add(alleles.get(mx));
|
||||
GenotypeBuilder builder = new GenotypeBuilder(sampleName, al);
|
||||
return builder.make();
|
||||
}
|
||||
} else if (genotypeMode == SeqGenotypeMode.UNIFIED_GENOTYPER) {
|
||||
UnifiedArgumentCollection basicUAC = new UnifiedArgumentCollection();
|
||||
UnifiedGenotypingEngine engine = new UnifiedGenotypingEngine(basicUAC, FixedAFCalculatorProvider.createThreadSafeProvider(toolKit, basicUAC, logger),toolKit);
|
||||
AlignmentContext contextSubset = new AlignmentContext(context.getLocation(),pileup,0,true);
|
||||
List<VariantCallContext> callContexts = engine.calculateLikelihoodsAndGenotypes(null, referenceContext, contextSubset);
|
||||
if (callContexts != null && callContexts.size() == 1)
|
||||
for (Genotype g : callContexts.get(0).getGenotypes()){
|
||||
if (g.isCalled() && g.isHomVar() && g.getLog10PError() > minGenotypingLOD)
|
||||
return g;
|
||||
}
|
||||
}
|
||||
else {
|
||||
throw new GATKException("Unknown genotyping mode, being an enum this really shouldn't be seen ever.");
|
||||
}
|
||||
return null;
|
||||
}
|
||||
|
||||
// utils
|
||||
private static int sum(int[] a) {int sm = 0; for (int i : a) {sm = sm + i;} return sm;}
|
||||
private static int maxPos(int[] a) {int mx = 0; for (int i = 0;i < a.length; i++) {if (a[i] > a[mx]) mx = i;} return mx;}
|
||||
|
||||
private static Genotype getGenotypeFromArray(RefMetaDataTracker tracker, RodBinding<VariantContext> genotypes, boolean verifiedSampleName, boolean verifySample, String sampleName) {
|
||||
// get the truth genotypes for this site; if there are none we can't move forward
|
||||
Collection<VariantContext> truths = tracker.getValues(genotypes);
|
||||
if (truths == null || truths.size() == 0) return null;
|
||||
|
||||
VariantContext truthForSample = truths.iterator().next();
|
||||
|
||||
// verify that the sample name exists in the input genotype file
|
||||
if (!verifiedSampleName && verifySample) {
|
||||
if (!truthForSample.getSampleNames().contains(sampleName))
|
||||
throw new UserException.BadInput("The sample name was set to " + sampleName + " but this sample isn't in your genotypes file. Please verify your sample name");
|
||||
verifiedSampleName = true;
|
||||
}
|
||||
|
||||
GenotypesContext gt = truthForSample.getGenotypes();
|
||||
|
||||
// if we are supposed to verify the sample name, AND the sample doesn't exist in the genotypes -- skip this site
|
||||
if (verifySample && !gt.containsSample(sampleName)) return null;
|
||||
|
||||
// if the sample doesn't exist in genotypes AND there is more than one sample in the genotypes file -- skip this site
|
||||
if (!gt.containsSample(sampleName) && gt.size() != 1) return null;
|
||||
|
||||
// if there is more than one sample in the genotypes file, get it by name. Otherwise just get the sole sample genotype
|
||||
return gt.size() != 1 ? gt.get(sampleName) : gt.get(0);
|
||||
}
|
||||
|
||||
|
||||
private static class PopulationFrequencyInfo {
|
||||
private byte majorAllele;
|
||||
private byte minorAllele;
|
||||
private double minorAlleleFrequency;
|
||||
|
||||
private PopulationFrequencyInfo(byte majorAllele, byte minorAllele, double minorAlleleFrequency) {
|
||||
this.majorAllele = majorAllele;
|
||||
this.minorAllele = minorAllele;
|
||||
this.minorAlleleFrequency = minorAlleleFrequency;
|
||||
}
|
||||
|
||||
public byte getMajorAllele() {
|
||||
return majorAllele;
|
||||
}
|
||||
|
||||
public byte getMinorAllele() {
|
||||
return minorAllele;
|
||||
}
|
||||
|
||||
public double getMinorAlleleFrequency() {
|
||||
return minorAlleleFrequency;
|
||||
}
|
||||
}
|
||||
|
||||
private static PopulationFrequencyInfo parsePopulationFrequencyInfo(VariantContext variantContext, String population) {
|
||||
PopulationFrequencyInfo info = null;
|
||||
|
||||
List<String> values = (List<String>) variantContext.getAttribute(population);
|
||||
|
||||
if (values != null) {
|
||||
byte majorAllele = 0;
|
||||
byte minorAllele = 0;
|
||||
double maf = -1;
|
||||
|
||||
for (String str : values) {
|
||||
// strip off the curly braces and trim whitespace
|
||||
if (str.startsWith("{")) str = str.substring(1, str.length());
|
||||
if (str.contains("}")) str = str.substring(0, str.indexOf("}"));
|
||||
str = str.trim();
|
||||
String spl[] = str.split("=");
|
||||
|
||||
byte allele = (byte) spl[0].trim().charAt(0);
|
||||
double af = Double.valueOf(spl[1].trim());
|
||||
|
||||
if (af <= 0.5 && minorAllele == 0) {
|
||||
minorAllele = allele;
|
||||
maf = af;
|
||||
} else {
|
||||
majorAllele = allele;
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
info = new PopulationFrequencyInfo(majorAllele, minorAllele, maf);
|
||||
}
|
||||
return info;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Calculate the contamination values per division, be it lane, meta, sample, etc
|
||||
* @param referenceBase the reference base
|
||||
* @param bases the bases seen
|
||||
* @param quals the quality scores of those bases
|
||||
* @param myAllele the allele we have (our hom var genotype allele)
|
||||
* @param rsNumber the dbsnp number if available
|
||||
* @param popVC the population variant context from hapmap
|
||||
* @param baseReport if we're writing a base report, write it here
|
||||
* @param loc our location
|
||||
* @param precision the precision we're aiming for
|
||||
* @param lane the lane name information
|
||||
* @param pops our pops to run over
|
||||
* @return a mapping of each target population to their estimated contamination
|
||||
*/
|
||||
private static Map<String, ContaminationStats> calcStats(byte referenceBase,
|
||||
byte[] bases,
|
||||
byte[] quals,
|
||||
byte myAllele,
|
||||
String rsNumber,
|
||||
VariantContext popVC,
|
||||
PrintStream baseReport,
|
||||
GenomeLoc loc,
|
||||
Double precision,
|
||||
String lane,
|
||||
String[] pops) {
|
||||
int[] alts = new int[4];
|
||||
int total = 0;
|
||||
// count the observed A, C, G and T bases and the total depth
|
||||
for (byte base : bases) {
|
||||
if (base == 'A' || base == 'a') alts[0]++;
|
||||
if (base == 'C' || base == 'c') alts[1]++;
|
||||
if (base == 'G' || base == 'g') alts[2]++;
|
||||
if (base == 'T' || base == 't') alts[3]++;
|
||||
total++;
|
||||
}
|
||||
|
||||
Map<String, ContaminationStats> ret = new HashMap<String, ContaminationStats>();
|
||||
|
||||
for (String pop : pops) {
|
||||
PopulationFrequencyInfo info = parsePopulationFrequencyInfo(popVC, pop);
|
||||
double alleleFreq = info.getMinorAlleleFrequency();
|
||||
if (alleleFreq > 0.5) {
|
||||
throw new RuntimeException("Minor allele frequency is greater than 0.5, this is an error; we saw AF of " + alleleFreq);
|
||||
}
|
||||
|
||||
int majorCounts = alts[getBaseIndex(info.getMajorAllele())];
|
||||
int minorCounts = alts[getBaseIndex(info.getMinorAllele())];
|
||||
int otherCounts = total - majorCounts - minorCounts;
|
||||
|
||||
|
||||
// only use sites where this is the minor allele
|
||||
if (myAllele == info.minorAllele) {
|
||||
|
||||
if (pops.length == 1) {
|
||||
if (baseReport != null) {
|
||||
baseReport.print(
|
||||
StringUtil.join("\t",
|
||||
lane,
|
||||
loc.getContig(),
|
||||
"" + loc.getStart(),
|
||||
rsNumber,
|
||||
"" + (char) referenceBase,
|
||||
"" + (char) info.getMajorAllele(),
|
||||
"" + (char) info.getMinorAllele(),
|
||||
"" + (char) info.getMinorAllele() + "" + (char) info.getMinorAllele(),
|
||||
String.format("%1.4f", alleleFreq),
|
||||
"" + majorCounts,
|
||||
"" + minorCounts));
|
||||
|
||||
for (long cnt : alts)
|
||||
baseReport.print("\t" + cnt);
|
||||
baseReport.println();
|
||||
}
|
||||
}
|
||||
|
||||
ContaminationEstimate est = new ContaminationEstimate(precision, alleleFreq, bases, quals, info.getMinorAllele(), info.getMajorAllele(), pop, loc);
|
||||
ret.put(pop, new ContaminationStats(loc, 1, alleleFreq, minorCounts, majorCounts, otherCounts, alts, est));
|
||||
|
||||
}
|
||||
|
||||
}
|
||||
return ret;
|
||||
}
|
||||
|
||||
private static int getBaseIndex(byte base) {
|
||||
if (base == 'A' || base == 'a') return 0;
|
||||
if (base == 'C' || base == 'c') return 1;
|
||||
if (base == 'G' || base == 'g') return 2;
|
||||
if (base == 'T' || base == 't') return 3;
|
||||
return -1;
|
||||
}
|
||||
|
||||
// create a ContaminationResults to store the run information
|
||||
@Override
|
||||
public ContaminationResults reduceInit() {
|
||||
return new ContaminationResults(precision);
|
||||
}
|
||||
|
||||
|
||||
@Override
|
||||
public ContaminationResults reduce(Map<String, Map<String, ContaminationStats>> value, ContaminationResults sum) {
|
||||
if (value != null)
|
||||
sum.add(value);
|
||||
return sum;
|
||||
}
|
||||
|
||||
/**
|
||||
* on traversal done, output all the stats to the appropriate files
|
||||
*
|
||||
* @param result the results of our contamination estimate
|
||||
*/
|
||||
public void onTraversalDone(ContaminationResults result) {
|
||||
|
||||
// filter out lanes / samples that don't have the minBaseCount
|
||||
Map<String, Map<String, ContaminationStats>> cleanedMap = new HashMap<String, Map<String, ContaminationStats>>();
|
||||
for (Map.Entry<String, Map<String, ContaminationStats>> entry : result.getStats().entrySet()) {
|
||||
|
||||
Map<String, ContaminationStats> newMap = new HashMap<String, ContaminationStats>();
|
||||
|
||||
Map<String, ContaminationStats> statMap = entry.getValue();
|
||||
for (String popKey : statMap.keySet()) {
|
||||
ContaminationStats stat = statMap.get(popKey);
|
||||
if (stat.getBasesMatching() + stat.getBasesMismatching() >= minBaseCount) newMap.put(popKey, stat);
|
||||
}
|
||||
|
||||
|
||||
if (newMap.size() > 0)
|
||||
cleanedMap.put(entry.getKey(), newMap);
|
||||
else
|
||||
out.println("Warning: We're throwing out lane " + entry.getKey() + " since it has fewer than " + minBaseCount +
|
||||
" read bases at genotyped positions");
|
||||
}
|
||||
|
||||
// output results at the end, based on the input parameters
|
||||
result.setStats(cleanedMap);
|
||||
result.outputReport(precision, out, TRIM_FRACTION, TRIM_INTERVAL, BETA_THRESHOLD);
|
||||
if (likelihoodFile != null) result.writeCurves(likelihoodFile);
|
||||
logger.info("Total sites: " + totalSites);
|
||||
logger.info("Population informed sites: " + countPopulationSites);
|
||||
logger.info("Non homozygous variant sites: " + countGenotypeNonHomVar);
|
||||
logger.info("Homozygous variant sites: " + countGenotypeHomVar);
|
||||
logger.info("Passed coverage: " + countPassCoverage);
|
||||
logger.info("Results: " + countResults);
|
||||
}
|
||||
}
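The filtering step in `onTraversalDone` above can be sketched in isolation. This is a minimal illustration only: `MinBaseCountFilterSketch`, the `long[]` stand-in for `ContaminationStats` (index 0 = matching bases, index 1 = mismatching bases), and the lane and population keys are all hypothetical and not part of the walker.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of the GATK sources.
class MinBaseCountFilterSketch {

    // Keep only populations whose matching + mismatching base count reaches
    // minBaseCount, and drop aggregation keys left with no populations.
    static Map<String, Map<String, long[]>> filter(Map<String, Map<String, long[]>> stats,
                                                   long minBaseCount) {
        Map<String, Map<String, long[]>> cleaned = new HashMap<>();
        for (Map.Entry<String, Map<String, long[]>> entry : stats.entrySet()) {
            Map<String, long[]> kept = new HashMap<>();
            for (Map.Entry<String, long[]> pop : entry.getValue().entrySet()) {
                long[] counts = pop.getValue(); // {matching, mismatching}
                if (counts[0] + counts[1] >= minBaseCount) {
                    kept.put(pop.getKey(), counts);
                }
            }
            if (!kept.isEmpty()) {
                cleaned.put(entry.getKey(), kept);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        Map<String, Map<String, long[]>> stats = new HashMap<>();
        Map<String, long[]> lane1 = new HashMap<>();
        lane1.put("CEU", new long[]{400, 150}); // 550 bases: kept
        lane1.put("YRI", new long[]{10, 5});    // 15 bases: dropped
        stats.put("lane1", lane1);
        Map<String, long[]> lane2 = new HashMap<>();
        lane2.put("CEU", new long[]{3, 1});     // 4 bases: dropped, so lane2 disappears
        stats.put("lane2", lane2);

        Map<String, Map<String, long[]>> cleaned = filter(stats, 500);
        System.out.println("kept aggregation keys: " + cleaned.keySet());
    }
}
```

The sketch iterates with `entrySet()` rather than `keySet()` plus `get()`; the result is the same as in the walker, with one lookup fewer per population.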
|
||||
|
|
@ -0,0 +1,234 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer.contamination;
|
||||
|
||||
|
||||
import org.broadinstitute.gatk.utils.GenomeLoc;
|
||||
import org.broadinstitute.gatk.utils.collections.Pair;
|
||||
|
||||
import java.util.Arrays;
|
||||
|
||||
/**
|
||||
* a class that estimates and stores the contamination values for a site.
|
||||
*/
|
||||
class ContaminationEstimate {
|
||||
private final double precision; // to what precision do we want to run; e.g. if set to 1, we run using 1% increments
|
||||
private final double[] bins; // the bins representing the discrete contamination levels we're evaluating
|
||||
private double populationFit = 0.0;
|
||||
private String popultationName = "";
|
||||
|
||||
private static double[] precalculatedEpsilon;
|
||||
|
||||
private int arrayAlleleObservations = 0;
|
||||
private int alternateAlleleObservations = 0;
|
||||
|
||||
// precalculate the 128 values of epsilon that are possible
|
||||
static {
|
||||
precalculatedEpsilon = new double[Byte.MAX_VALUE+1];
|
||||
|
||||
for(int i=0; i <= (int)Byte.MAX_VALUE; i++) {
|
||||
precalculatedEpsilon[i] = Math.pow(10.0,-1.0*(((double)i)/10.0));
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* create the contamination estimate, given:
|
||||
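* @param precision the precision value, to what level are we calculating the contamination
* @param maf the allele frequency for the population, between 0 and 1 (a value of 0 is replaced with 0.00001)
* @param bases the observed bases at the site
* @param quals the Phred-scaled base qualities, parallel to bases
* @param arrayAllele the array allele; bases equal to it are counted as array allele observations
* @param hapmapAlt the alternate allele; bases equal to it are counted as alternate allele observations
* @param popName the name of the population the allele frequency comes from
* @param locus the genomic location of the site (not used by this constructor)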
|
||||
*/
|
||||
public ContaminationEstimate(double precision,
|
||||
double maf,
|
||||
byte[] bases,
|
||||
byte[] quals,
|
||||
byte arrayAllele,
|
||||
byte hapmapAlt,
|
||||
String popName,
|
||||
GenomeLoc locus
|
||||
) {
|
||||
// set up the bins to the correct precision
|
||||
this.precision = precision;
|
||||
bins = new double[(int)Math.ceil(100/precision)+1];
|
||||
if (maf == 0) maf = 0.00001;
|
||||
|
||||
popultationName = popName;
|
||||
|
||||
Arrays.fill(bins,0.0); // just to make sure we don't have any residual values
|
||||
|
||||
// convert the quals
|
||||
double[] realQuals = new double[quals.length];
|
||||
int qIndex = 0;
|
||||
for (byte qual : quals) {realQuals[qIndex++] = Math.pow(10.0,-1.0*(qual/10.0));}
|
||||
|
||||
// check our inputs
|
||||
if (maf > 1.0 || maf < 0.0) throw new IllegalArgumentException("Invalid allele Freq: must be between 0 and 1 (inclusive), maf was " + maf + " for population " + popName);
|
||||
|
||||
// calculate the contamination for each bin
|
||||
int qualOffset = 0;
|
||||
for (byte base : bases) {
|
||||
|
||||
if (base == arrayAllele) { arrayAlleleObservations++; }
|
||||
if (base == hapmapAlt) { alternateAlleleObservations++; }
|
||||
double epsilon = precalculatedEpsilon[quals[qualOffset++]];
|
||||
|
||||
for (int index = 0; index < bins.length; index++) {
|
||||
|
||||
|
||||
double contaminationRate = (1.0 - (double) index / (double) bins.length);
|
||||
|
||||
if (base == arrayAllele) {
|
||||
bins[index] += Math.log((1.0 - contaminationRate) * (1.0 - epsilon) +
|
||||
contaminationRate * ((maf) * (1.0 - epsilon) + (1.0 - maf) * (epsilon/3.0)));
|
||||
populationFit += Math.log(epsilon);
|
||||
|
||||
} else if(hapmapAlt == base) {
|
||||
bins[index] += Math.log((1.0 - contaminationRate) * (epsilon / 3.0) +
|
||||
contaminationRate * ((maf) * (epsilon/3.0) + (1.0 - maf) * (1.0 - epsilon)));
|
||||
|
||||
populationFit += Math.log(maf + epsilon);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public double[] getBins() {
|
||||
return bins;
|
||||
}
|
||||
|
||||
public void setPopulationFit(double populationFit) {
|
||||
this.populationFit = populationFit;
|
||||
}
|
||||
|
||||
public double getPopulationFit() {
|
||||
return populationFit;
|
||||
}
|
||||
|
||||
public String getPopultationName() {
|
||||
return popultationName;
|
||||
}
|
||||
|
||||
public static class ConfidenceInterval {
|
||||
|
||||
private double start;
|
||||
private double stop;
|
||||
private double contamination;
|
||||
private double maxLikelihood;
|
||||
double[] newBins;
|
||||
|
||||
public ConfidenceInterval(double bins[], double intervalArea) {
|
||||
// make a copy of the bins in non-log space
|
||||
int maxIndex = 0;
|
||||
for (int x = 0; x < bins.length; x++) if (bins[x] > bins[maxIndex]) maxIndex = x;
|
||||
newBins = new double[bins.length];
|
||||
maxLikelihood = bins[maxIndex];
|
||||
|
||||
int index = 0;
|
||||
double total = 0.0;
|
||||
for (double d : bins) {
|
||||
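newBins[index] = Math.pow(10,(bins[index] - bins[maxIndex])); // rescale relative to the maximum; note that ContaminationEstimate accumulates its bins with Math.log (natural log), while this step exponentiates in base 10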
|
||||
total += newBins[index];
|
||||
index++;
|
||||
}
|
||||
|
||||
for (int x = 0; x < newBins.length; x++) {
|
||||
newBins[x] = newBins[x] / total;
|
||||
}
|
||||
double areaUnderCurve = 0;
|
||||
int leftIndex = maxIndex;
|
||||
int rightIndex = maxIndex;
|
||||
while (areaUnderCurve < intervalArea) {
|
||||
|
||||
// if the "left" bin is bigger, and can be moved, move it
|
||||
if (newBins[leftIndex] >= newBins[rightIndex] && leftIndex > 0) {
|
||||
leftIndex--;
|
||||
} else {
|
||||
// otherwise move the right bin if possible
|
||||
if (rightIndex < bins.length - 1) {
|
||||
rightIndex++;
|
||||
} else {
|
||||
// and if not move the left bin, or die
|
||||
if (leftIndex > 0) {
|
||||
leftIndex--;
|
||||
} else {
|
||||
throw new RuntimeException("Error trying to compute confidence interval");
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
areaUnderCurve = 0.0;
|
||||
for (int x = leftIndex; x <= rightIndex; x++)
|
||||
areaUnderCurve += newBins[x];
|
||||
}
|
||||
start = (bins.length - rightIndex) * (100.0/bins.length);
|
||||
stop = (bins.length - leftIndex) * (100.0/bins.length);
|
||||
contamination = (bins.length - maxIndex) * (100.0/bins.length);
|
||||
}
|
||||
|
||||
public double getStart() {
|
||||
return start;
|
||||
}
|
||||
|
||||
public double getStop() {
|
||||
return stop;
|
||||
}
|
||||
|
||||
public double getContamination() {
|
||||
return contamination;
|
||||
}
|
||||
|
||||
public double getMaxLikelihood() {
|
||||
return maxLikelihood;
|
||||
}
|
||||
|
||||
public String toString() {
|
||||
return contamination + "[" + start + " - " + stop + "] log likelihood = " + maxLikelihood;
|
||||
}
|
||||
}
|
||||
}
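The per-bin likelihood accumulation in the `ContaminationEstimate` constructor above can be sketched on its own. This is a hedged illustration, not the class itself: `BinLikelihoodSketch`, `phredToError`, `logLikelihoods`, and `argMax` are hypothetical names, and the sketch omits the population-fit bookkeeping. It keeps the same bin layout, in which bin `index` corresponds to a contamination rate of `1 - index / bins.length`, and the same two branches for array-allele and alternate-allele bases.

```java
import java.util.Arrays;

// Hypothetical stand-alone sketch; not part of the GATK sources.
class BinLikelihoodSketch {

    // Convert a Phred-scaled quality to an error probability: epsilon = 10^(-Q/10).
    static double phredToError(int qual) {
        return Math.pow(10.0, -qual / 10.0);
    }

    // Accumulate natural-log likelihoods per contamination bin.
    // Bases matching neither allele contribute nothing, as in the constructor.
    static double[] logLikelihoods(double precision, double maf, byte[] bases, byte[] quals,
                                   byte arrayAllele, byte altAllele) {
        double[] bins = new double[(int) Math.ceil(100 / precision) + 1];
        for (int i = 0; i < bases.length; i++) {
            double eps = phredToError(quals[i]);
            for (int index = 0; index < bins.length; index++) {
                double c = 1.0 - (double) index / (double) bins.length; // contamination rate
                if (bases[i] == arrayAllele) {
                    bins[index] += Math.log((1.0 - c) * (1.0 - eps)
                            + c * (maf * (1.0 - eps) + (1.0 - maf) * (eps / 3.0)));
                } else if (bases[i] == altAllele) {
                    bins[index] += Math.log((1.0 - c) * (eps / 3.0)
                            + c * (maf * (eps / 3.0) + (1.0 - maf) * (1.0 - eps)));
                }
            }
        }
        return bins;
    }

    // Index of the first maximum value.
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] bases = new byte[20];
        byte[] quals = new byte[20];
        Arrays.fill(bases, (byte) 'A'); // every read shows the array allele
        Arrays.fill(quals, (byte) 30);  // Q30, i.e. an error rate of about 0.001
        double[] bins = logLikelihoods(1.0, 0.1, bases, quals, (byte) 'A', (byte) 'C');
        int best = argMax(bins);
        double rate = 1.0 - (double) best / (double) bins.length;
        System.out.println("bins: " + bins.length + ", best index: " + best
                + ", contamination rate at best bin: " + rate);
    }
}
```

With every base matching the array allele, the likelihood grows as the contamination rate falls, so the best bin is the last one. Because the rate is computed as `1 - index / bins.length`, the lowest representable rate is `1 / bins.length` rather than exactly zero.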
|
||||
|
|
@ -0,0 +1,304 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer.contamination;
|
||||
|
||||
|
||||
import org.apache.commons.math.MathException;
|
||||
import org.apache.commons.math.distribution.BetaDistribution;
|
||||
import org.apache.commons.math.distribution.BetaDistributionImpl;
|
||||
import org.broadinstitute.gatk.utils.GenomeLoc;
|
||||
import org.broadinstitute.gatk.utils.Utils;
|
||||
|
||||
import java.io.*;
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* our contamination results object; this object aggregates the results of the contamination run over lanes, samples,
|
||||
* or whatever other divisor we've used on the read data
|
||||
*/
|
||||
public class ContaminationResults {
|
||||
|
||||
public static class ContaminationData implements Comparable<ContaminationData> {
|
||||
private GenomeLoc site;
|
||||
private long basesMatching = 0L;
|
||||
private long basesMismatching = 0L;
|
||||
private double mismatchFraction = -1d;
|
||||
private double[] bins;
|
||||
private double p;
|
||||
|
||||
public long getBasesMatching() {
|
||||
return basesMatching;
|
||||
}
|
||||
|
||||
public long getBasesMismatching() {
|
||||
return basesMismatching;
|
||||
}
|
||||
|
||||
public double getMismatchFraction() {
|
||||
return mismatchFraction;
|
||||
}
|
||||
|
||||
public double[] getBins() {
|
||||
return bins;
|
||||
}
|
||||
|
||||
public double getP() {
|
||||
return p;
|
||||
}
|
||||
|
||||
public ContaminationData(GenomeLoc site, long basesMatching, long basesMismatching, double[] bins) {
|
||||
this.site = site;
|
||||
this.basesMatching = basesMatching;
|
||||
this.basesMismatching = basesMismatching;
|
||||
this.bins = bins;
|
||||
long totalBases = this.basesMatching + this.basesMismatching;
|
||||
if (totalBases != 0) {
|
||||
this.mismatchFraction = (double)this.basesMismatching / (double) totalBases;
|
||||
}
|
||||
|
||||
int a = (int) this.getBasesMismatching() + 1;
|
||||
int b = (int) this.getBasesMatching() + 1;
|
||||
BetaDistribution dist = new BetaDistributionImpl(a,b);
|
||||
try {
|
||||
this.p = 1.0d - dist.cumulativeProbability(0.5d);
|
||||
} catch (MathException me) {
|
||||
throw new RuntimeException("Error! - " + me.toString(), me);
|
||||
}
|
||||
|
||||
}
|
||||
|
||||
public int compareTo(ContaminationData other) {
|
||||
return -Double.compare(this.getP(), other.getP());
|
||||
}
|
||||
|
||||
@Override
|
||||
public String toString() {
|
||||
return "ContaminationData{" +
|
||||
"site=" + site +
|
||||
", basesMatching=" + basesMatching +
|
||||
", basesMismatching=" + basesMismatching +
|
||||
", mismatchFraction=" + mismatchFraction +
|
||||
'}';
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
// what precision are we using in our calculations
|
||||
private final double precision;
|
||||
|
||||
// a map of our contamination targets and their stats
|
||||
// key: aggregation entity ("META", sample name, or lane name)
|
||||
// value: a map from population name to that population's ContaminationStats
|
||||
private Map<String,Map<String, ContaminationStats>> stats = new HashMap<String,Map<String, ContaminationStats>>();
|
||||
|
||||
public ContaminationResults(double precision) {
|
||||
this.precision = precision;
|
||||
}
|
||||
|
||||
|
||||
Map<String, Map<String, List<ContaminationData>>> storedData = new HashMap<String, Map<String, List<ContaminationData>>>();
|
||||
|
||||
/**
|
||||
* add to the stats
|
||||
*
|
||||
* @param newAggregationStats a mapping from aggregation key to the per-population statistics collected for it
|
||||
*/
|
||||
public void add(Map<String, Map<String, ContaminationStats>> newAggregationStats) {
|
||||
|
||||
// for each aggregation level
|
||||
for (String aggregationKey : newAggregationStats.keySet()) {
|
||||
Map<String, ContaminationStats> populationContaminationStats = newAggregationStats.get(aggregationKey);
|
||||
|
||||
|
||||
// store every site's data until the end, so that sites can be trimmed when the report is written
|
||||
if (!storedData.containsKey(aggregationKey)) { storedData.put(aggregationKey, new HashMap<String, List<ContaminationData>>()); }
|
||||
for (String pop : populationContaminationStats.keySet()) {
|
||||
ContaminationStats newStats = populationContaminationStats.get(pop);
|
||||
|
||||
// create the per-population list the first time we see this population
|
||||
if (!storedData.get(aggregationKey).containsKey(pop)) {
|
||||
storedData.get(aggregationKey).put(pop, new ArrayList<ContaminationData>());
|
||||
}
|
||||
|
||||
double[] newData = new double[newStats.getContamination().getBins().length];
|
||||
System.arraycopy(newStats.getContamination().getBins(),0,newData,0,newStats.getContamination().getBins().length);
|
||||
storedData.get(aggregationKey).get(pop).add(new ContaminationData(newStats.getSite(), newStats.getBasesMatching(), newStats.getBasesMismatching(), newData));
|
||||
}
|
||||
|
||||
|
||||
|
||||
// merge the sets
|
||||
if (stats.containsKey(aggregationKey)) {
|
||||
|
||||
// and for each population
|
||||
for (String pop : populationContaminationStats.keySet()) {
|
||||
ContaminationStats newStats = populationContaminationStats.get(pop);
|
||||
|
||||
// if it exists... just merge it
|
||||
if (stats.get(aggregationKey).containsKey(pop)) {
|
||||
stats.get(aggregationKey).get(pop).add(newStats);
|
||||
} else {
|
||||
stats.get(aggregationKey).put(pop, newStats);
|
||||
}
|
||||
}
|
||||
} else {
|
||||
stats.put(aggregationKey, populationContaminationStats);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* output the contamination report
|
||||
* @param out the output stream the report is written to
|
||||
*/
|
||||
public void outputReport(double precision, PrintStream out, double fractionToTrim, double trimInterval, double betaThreshold) {
|
||||
out.println("name\tpopulation\tpopulation_fit\tcontamination\tconfidence_interval_95_width\tconfidence_interval_95_low\tconfidence_interval_95_high\tsites");
|
||||
|
||||
for (Map.Entry<String,Map<String, ContaminationStats>> entry : stats.entrySet()) {
|
||||
for (ContaminationStats stats : entry.getValue().values()) {
|
||||
String aggregationLevel = entry.getKey();
|
||||
String population = stats.getContamination().getPopultationName();
|
||||
|
||||
List<ContaminationData> newStats = storedData.get(aggregationLevel).get(population);
|
||||
String pm = "%3." + Math.round(Math.log10(1/precision)) +"f";
|
||||
|
||||
int bins = newStats.iterator().next().getBins().length;
|
||||
int maxTrim = (int) Math.floor((double)(newStats.size()) * fractionToTrim);
|
||||
|
||||
// sort the collection
|
||||
Collections.sort(newStats);
|
||||
|
||||
List<ContaminationData> data = new ArrayList<ContaminationData>(newStats);
|
||||
|
||||
// trim up to maxTrim sites whose probability of f >= 0.5 (based on the beta distribution) is at least betaThreshold
|
||||
int trimmed = 0;
|
||||
for(Iterator<ContaminationData> i = data.iterator(); trimmed < maxTrim && i.hasNext();) {
|
||||
ContaminationData x = i.next();
|
||||
if (x.getP() >= betaThreshold) {
|
||||
System.out.println("Trimming " + x.toString() + " with p(f>=0.5) >= " + betaThreshold + " with a value of " + x.getP());
|
||||
i.remove();
|
||||
trimmed++;
|
||||
}
|
||||
}
|
||||
|
||||
double[][] matrix = new double[bins][data.size()];
|
||||
|
||||
for (int i = 0; i<bins; i++) {
|
||||
for (int j=0; j<data.size(); j++) {
|
||||
matrix[i][j] = data.get(j).getBins()[i];
|
||||
}
|
||||
}
|
||||
|
||||
// now perform the sum
|
||||
double[] output = new double[bins];
|
||||
for (int i = 0; i<bins; i++) {
|
||||
double[] binData = matrix[i];
|
||||
|
||||
// sum this bin across all remaining (untrimmed) sites
|
||||
output[i] = 0;
|
||||
for (int x = 0; x < binData.length; x++) {
|
||||
output[i] += binData[x];
|
||||
}
|
||||
}
|
||||
double[] newTrimmedStats = output;
|
||||
|
||||
// get the confidence interval, at the set width
|
||||
ContaminationEstimate.ConfidenceInterval newInterval = new ContaminationEstimate.ConfidenceInterval(newTrimmedStats, 0.95);
|
||||
|
||||
out.println(
|
||||
String.format("%s\t%s\t%s\t"+pm+"\t"+pm+"\t"+pm+"\t"+pm+"\t"+"%d",
|
||||
aggregationLevel,
|
||||
population,
|
||||
"n/a",
|
||||
newInterval.getContamination(),
|
||||
(newInterval.getStop() - newInterval.getStart()),
|
||||
newInterval.getStart(),
|
||||
newInterval.getStop(),
|
||||
data.size())
|
||||
);
|
||||
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
public void writeCurves(PrintStream out) {
|
||||
boolean outputBins = false;
|
||||
for (Map.Entry<String, Map<String, ContaminationStats>> entry : stats.entrySet()) {
|
||||
for (ContaminationStats stats : entry.getValue().values()) {
|
||||
if (!outputBins) {
|
||||
String[] bins = new String[stats.getContamination().getBins().length];
|
||||
for (int index = 0; index < stats.getContamination().getBins().length; index++)
|
||||
bins[index] = String.valueOf(100.0 * (1 - (double) index / stats.getContamination().getBins().length));
|
||||
outputBins = true;
|
||||
out.print("name,pop,");
|
||||
out.println(Utils.join(",",bins));
|
||||
}
|
||||
String[] bins = new String[stats.getContamination().getBins().length];
|
||||
int index = 0;
|
||||
for (double value : stats.getContamination().getBins())
|
||||
bins[index++] = String.valueOf(value);
|
||||
out.print(entry.getKey()+",\""+stats.getContamination().getPopultationName()+"\",");
|
||||
out.println(Utils.join(",", bins));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
public Map<String, Map<String, ContaminationStats>> getStats() {
|
||||
return Collections.unmodifiableMap(stats);
|
||||
}
|
||||
|
||||
public void setStats(Map<String, Map<String,ContaminationStats>> stats) {
|
||||
this.stats = stats;
|
||||
}
|
||||
}
|
||||
|
|
@ -0,0 +1,125 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2015 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.cancer.contamination;
|
||||
|
||||
|
||||
import org.broadinstitute.gatk.utils.GenomeLoc;
|
||||
|
||||
|
||||
/**
|
||||
* a class that tracks our contamination stats; both the estimate of contamination, as well as the number of sites and other
|
||||
* run-specific data
|
||||
*/
|
||||
public class ContaminationStats {
|
||||
final static int ALLELE_COUNT = 4;
|
||||
private GenomeLoc site;
|
||||
private int numberOfSites = 0;
|
||||
private double sumOfAlleleFrequency = 0.0;
|
||||
private long basesFor = 0L;
|
||||
private long basesAgainst = 0L;
|
||||
private long basesOther = 0L;
|
||||
private ContaminationEstimate contaminationEstimate;
|
||||
private final int[] alleleBreakdown;
|
||||
|
||||
public ContaminationStats(GenomeLoc site, int numberOfSites, double sumOfAlleleFrequency, long basesFor, long basesAgainst, long basesOther, int alleleBreakdown[], ContaminationEstimate estimate) {
|
||||
this.site = site;
|
||||
this.numberOfSites = numberOfSites;
|
||||
this.sumOfAlleleFrequency = sumOfAlleleFrequency;
|
||||
this.basesFor = basesFor;
|
||||
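this.basesAgainst = basesAgainst;
this.basesOther = basesOther; // previously never assigned, so getBasesOther() ignored the constructor argument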
|
||||
this.contaminationEstimate = estimate;
|
||||
if (alleleBreakdown.length != ALLELE_COUNT) throw new IllegalArgumentException("Allele breakdown should have length " + ALLELE_COUNT);
|
||||
this.alleleBreakdown = alleleBreakdown;
|
||||
}
|
||||
|
||||
public int getNumberOfSites() {
|
||||
return numberOfSites;
|
||||
}
|
||||
|
||||
public double getMinorAlleleFrequency() {
|
||||
return sumOfAlleleFrequency /(double)numberOfSites;
|
||||
}
|
||||
|
||||
public long getBasesMatching() {
|
||||
return basesFor;
|
||||
}
|
||||
|
||||
public long getBasesOther() {
|
||||
return basesOther;
|
||||
}
|
||||
|
||||
public long getBasesMismatching() {
|
||||
return basesAgainst;
|
||||
}
|
||||
|
||||
public ContaminationEstimate getContamination() {
|
||||
return this.contaminationEstimate;
|
||||
}
|
||||
|
||||
public GenomeLoc getSite() {
|
||||
return site;
|
||||
}
|
||||
|
||||
public void add(ContaminationStats other) {
|
||||
if (other == null) return;
|
||||
this.numberOfSites += other.numberOfSites;
|
||||
this.sumOfAlleleFrequency += other.sumOfAlleleFrequency;
|
||||
this.basesOther += other.basesOther;
|
||||
this.basesFor += other.basesFor;
|
||||
this.basesAgainst += other.basesAgainst;
|
||||
for (int x = 0; x < ALLELE_COUNT; x++) this.alleleBreakdown[x] += other.alleleBreakdown[x];
|
||||
for (int i = 0; i < this.contaminationEstimate.getBins().length; i++) {
|
||||
this.contaminationEstimate.getBins()[i] += other.contaminationEstimate.getBins()[i];
|
||||
}
|
||||
this.contaminationEstimate.setPopulationFit(this.contaminationEstimate.getPopulationFit() +other.contaminationEstimate.getPopulationFit());
|
||||
}
|
||||
}
|
||||
|
|
@ -0,0 +1,77 @@
|
|||
# Dream Challenge Evaluation
|
||||
|
||||
To evaluate the performance of M2, we use two data sets from the SMC DREAM Challenge, specifically challenges #3 and #4.
|
||||
|
||||
All scripts referenced here are relative to the current working directory of ```/dsde/working/mutect/dream_smc```
|
||||
|
||||
### Current Performance (Unmasked)
|
||||
From the output of the evaluation method
|
||||
|
||||
(gsa-unstable 7/13/15, commit:9e93a70)
|
||||
|
||||
|set | subset | type | sensitivity | specificity | accuracy |
|
||||
|----|--------|------|-------------|-------------|----------|
|
||||
|SMC 3|chr21|SNP|0.935897435897|0.935897435897|0.935897435897|
|
||||
|SMC 3|chr21|INDEL|0.904255319149|0.977011494253|0.940633406701|
|
||||
|SMC 3|wgs|SNP|0.930532709098|0.955188985583|0.94286084734|
|
||||
|SMC 3|wgs|INDEL|0.902139907396|0.970516962843|0.93632843512|
|
||||
|SMC 4|chr21|SNP|0.769607843137|0.969135802469|0.869371822803|
|
||||
|SMC 4|chr21|INDEL|0.771241830065|0.991596638655|0.88141923436|
|
||||
|SMC 4|wgs|SNP|0.764507007622|0.975374480433|0.869940744028|
|
||||
|SMC 4|wgs|INDEL|0.768634634353|0.989389679877|0.879012157115|
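
The accuracy column in the table above is numerically consistent with the arithmetic mean of the sensitivity and specificity columns. The following is a minimal sketch of that relationship, not the evaluation script itself; the class name `BalancedAccuracy` and the method `accuracy` are hypothetical names introduced for illustration, and the assumption that the evaluator computes accuracy this way is inferred only from the reported numbers.

```java
public class BalancedAccuracy {

    /**
     * Assumed definition of the "accuracy" column: the mean of
     * sensitivity and specificity. This is inferred from the table,
     * not taken from the evaluation script.
     */
    static double accuracy(double sensitivity, double specificity) {
        return (sensitivity + specificity) / 2.0;
    }

    public static void main(String[] args) {
        // Values copied from the "SMC 3 | chr21 | INDEL" row of the table.
        double sensitivity = 0.904255319149;
        double specificity = 0.977011494253;

        // Print with the same number of decimals the table uses.
        System.out.printf("accuracy = %.12f%n", accuracy(sensitivity, specificity));
    }
}
```

Running the same calculation on the other rows is a quick sanity check when the table is updated for a new commit.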
|
||||
|
||||
|
||||
|
||||
### How To Run
|
||||
The Scala script for running M2 can be found in the gsa-unstable repository under ```private/gatk-tools-private/src/main/java/org/broadinstitute/gatk/tools/walkers/cancer/m2```
|
||||
|
||||
First, choose the appropriate settings (shown here as environment variables; set only one NORMAL_BAM/TUMOR_BAM pair and one INTERVALS value)
|
||||
```
|
||||
QUEUE_JAR=<your-queue-jar>
|
||||
OUT_VCF=<your-output-vcf>
|
||||
GSA_UNSTABLE_HOME=<path-to-your-gsa-unstable-checkout>
|
||||
|
||||
# for Dream 3
|
||||
NORMAL_BAM=/dsde/working/mutect/dream_smc/bams/synthetic.challenge.set3.normal.bam
|
||||
TUMOR_BAM=/dsde/working/mutect/dream_smc/bams/synthetic.challenge.set3.tumor.bam
|
||||
|
||||
# for Dream 4
|
||||
NORMAL_BAM=/dsde/working/mutect/dream_smc/bams/synthetic.challenge.set4.normal.bam
|
||||
TUMOR_BAM=/dsde/working/mutect/dream_smc/bams/synthetic.challenge.set4.tumor.bam
|
||||
|
||||
# for WGS
|
||||
INTERVALS=/dsde/working/mutect/dream_smc/bams/wgs_calling_regions.v1.interval_list
|
||||
|
||||
# for chromosome 21 only
|
||||
INTERVALS=/dsde/working/mutect/ts/c21_wgs_calling_regions.v1.interval_list
|
||||
|
||||
TEMPDIR=/broad/hptmp/kcibul/mutect
|
||||
```
|
||||
|
||||
Then run the following Queue command
|
||||
```
|
||||
java \
|
||||
-Djava.io.tmpdir=$TEMPDIR \
|
||||
-jar $QUEUE_JAR \
|
||||
-S $GSA_UNSTABLE_HOME/private/gatk-tools-private/src/main/java/org/broadinstitute/gatk/tools/walkers/cancer/m2/run_M2_dream.scala \
|
||||
--job_queue gsa -qsub -jobResReq virtual_free=5G -startFromScratch \
|
||||
-sc 200 \
|
||||
-normal $NORMAL_BAM \
|
||||
-tumor $TUMOR_BAM \
|
||||
-L $INTERVALS \
|
||||
-o $OUT_VCF \
|
||||
-run
|
||||
```
|
||||
|
||||
### How To Evaluate
|
||||
|
||||
Run the following
|
||||
```
|
||||
/dsde/working/mutect/dream_smc/dream_eval.pl [3|4] [wgs|21] [SNV|INDEL] input.vcf
|
||||
```
|
||||
where
|
||||
- [3|4] the DREAM challenge round
|
||||
- [wgs|21] evaluate the whole genome, or just a subset (chromosome 21)
|
||||
- [SNV|INDEL] evaluate SNVs (SNPs) or INDELs
|
||||
|
||||