Build on Laura's code and finish porting MuTect1 clustered read position filter.

This commit is contained in:
Takuto Sato 2016-06-15 15:32:17 -04:00
parent 77d8da6bb2
commit d6d0678b50
11 changed files with 456 additions and 270 deletions

View File

@ -0,0 +1,350 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE
* SOFTWARE LICENSE AGREEMENT
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 ("BROAD") and the LICENSEE and is effective at the date the downloading is completed ("EFFECTIVE DATE").
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. PHONE-HOME FEATURE
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system ("PHONE-HOME") which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE'S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
*
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012-2016 Broad Institute, Inc.
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 5. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 6. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 7. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 8. MISCELLANEOUS
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.gatk.tools.walkers.cancer;
import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFInfoHeaderLine;
import org.apache.commons.math3.stat.descriptive.rank.Median;
import org.apache.log4j.Logger;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
import org.broadinstitute.gatk.tools.walkers.cancer.m2.MuTect2;
import org.broadinstitute.gatk.utils.QualityUtils;
import org.broadinstitute.gatk.utils.collections.Pair;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
import org.broadinstitute.gatk.utils.sam.ReadUtils;
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
import java.util.*;
import java.util.stream.Collectors;
/**
* Detect clustering of variants near the ends of reads
*
* <p> This annotation detects clustering of evidence for a somatic variant near the ends of reads. To turn on the annotation and the accompanying filter (clustered_read_position), add --enable_clustered_read_position_filter flag in the commandline.
*
*
* <h3>Statistical notes</h3>
* <p> ClusteredReadPosition produces four INFO field annotations. At a given somatic variant site, MEDIAN_LEFT_OFFSET is the median of the number of bases from the left end of the tumor read to the variant. MEDIAN_RIGHT_OFFSET is similar, but counts from the right end of the read. MAD_LEFT_OFFSET and MAD_RIGHT_OFFSET measure the median absolute deviations. The median gives us the offset of a representative read, while the median absolute deviation captures the spread. We filter a variant if MEDIAN_LEFT_OFFSET <= 10 and MAD_LEFT_OFFSET <= 3, or if MEDIAN_RIGHT_OFFSET <= 10 and MAD_RIGHT_OFFSET <= 3.
*
*
* <h3>Caveat</h3>
* <p> ClusteredReadPosition is available with MuTect2 only </p>
*
* <h3>RelatedAnnotation</h3>
* <li><b><a href="https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_ReadPosRankSumTest.php">ReadPosRankSum</a></b> is a similar annotation designed for germline variants.
*
*/
public class ClusteredReadPosition extends InfoFieldAnnotation implements ActiveRegionBasedAnnotation {
private final static Logger logger = Logger.getLogger(ClusteredReadPosition.class);
private String tumorSampleName = null;
@Override
public List<String> getKeyNames() { return Arrays.asList(
GATKVCFConstants.MEDIAN_LEFT_OFFSET_KEY,
GATKVCFConstants.MEDIAN_RIGHT_OFFSET_KEY,
GATKVCFConstants.MAD_MEDIAN_LEFT_OFFSET_KEY,
GATKVCFConstants.MAD_MEDIAN_RIGHT_OFFSET_KEY);
}
@Override
public List<VCFInfoHeaderLine> getDescriptions() {
List<VCFInfoHeaderLine> descriptions = new ArrayList<>();
for (final String infoFieldKey : getKeyNames()){
descriptions.add(GATKVCFHeaderLines.getInfoLine(infoFieldKey));
}
return descriptions;
// the following causes a cryptic class not found error, similar to the one in computeReadPositionStats
// return getKeyNames().stream().map(GATKVCFHeaderLines::getInfoLine).collect(Collectors.toList());
}
@Override
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
final AnnotatorCompatible walker,
final ReferenceContext ref,
final Map<String, AlignmentContext> stratifiedContexts,
final VariantContext vc,
final Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap) {
// TODO: might make sense to move this code to SomaticGenoypingEngine.
// FIXME: checking walker is mutect2 is not ideal...moving this annotation to SomaticGenoypingEngine will solve it
// populate tumorSampleName the first time we call this method. skip afterwards.
if (tumorSampleName == null){
if (walker instanceof MuTect2) {
tumorSampleName = ((MuTect2) walker).getTumorSampleName();
} else {
throw new IllegalStateException("ClusteredReadPosition: walker is not MuTect2");
}
}
// we skip multi-allelic sites
if (vc.getAlternateAlleles().size() > 1){
return null;
}
final Map<String, Object> result = new HashMap<>();
if ( stratifiedPerReadAlleleLikelihoodMap != null ) {
final PerReadAlleleLikelihoodMap likelihoodMap = stratifiedPerReadAlleleLikelihoodMap.get(tumorSampleName);
if ( likelihoodMap != null && !likelihoodMap.isEmpty() ) {
final Optional<MedianStatistics> readPositionStatsOption = computeReadPositionStats(vc, likelihoodMap);
if (readPositionStatsOption.isPresent()){
MedianStatistics readPositionStats = readPositionStatsOption.get();
result.put(GATKVCFConstants.MEDIAN_LEFT_OFFSET_KEY, readPositionStats.getLeftMedian());
result.put(GATKVCFConstants.MEDIAN_RIGHT_OFFSET_KEY, readPositionStats.getRightMedian());
result.put(GATKVCFConstants.MAD_MEDIAN_LEFT_OFFSET_KEY, readPositionStats.getLeftMAD());
result.put(GATKVCFConstants.MAD_MEDIAN_RIGHT_OFFSET_KEY, readPositionStats.getRightMAD());
} else {
return null;
}
}
}
return result;
}
/**
*
* @param vc
* @param pralm
* @return median of left and right offsets and their median absolute deviations. does not return null.
*/
private Optional<MedianStatistics> computeReadPositionStats(final VariantContext vc,
final PerReadAlleleLikelihoodMap pralm) {
final int variantStartPosition = vc.getStart();
final List<Integer> tumorLeftOffsets = new ArrayList<>();
final List<Integer> tumorRightOffsets = new ArrayList<>();
for ( final Map.Entry<GATKSAMRecord, Map<Allele,Double>> readAlleleLikelihood : pralm.getLikelihoodReadMap().entrySet() ) {
final MostLikelyAllele mostLikelyAllele = PerReadAlleleLikelihoodMap.getMostLikelyAllele(readAlleleLikelihood.getValue());
final GATKSAMRecord read = readAlleleLikelihood.getKey();
if ( mostLikelyAllele.getMostLikelyAllele().isReference() || ! mostLikelyAllele.isInformative() || ! isUsableRead(read)) {
continue;
}
final Pair<OptionalInt, OptionalInt> offsetPair = getVariantPositionInRead(read, variantStartPosition);
final OptionalInt variantPositionInReadFromLeft = offsetPair.getFirst();
final OptionalInt variantPositionInReadFromRight = offsetPair.getSecond();
// suffices to check only the left offset because the right offset depends on it
if ( variantPositionInReadFromLeft.isPresent() ) {
tumorLeftOffsets.add(variantPositionInReadFromLeft.getAsInt());
tumorRightOffsets.add(variantPositionInReadFromRight.getAsInt());
}
}
if (tumorLeftOffsets.isEmpty() || tumorRightOffsets.isEmpty()) {
// This condition seems to arise when the reads as aligned in the bam (as represented by PRALM) do not contain the alt read found by HaplotypeCaller
logger.warn("At Position " + vc.getContig() + ": " + vc.getStart() + " , the left or right offset list is empty");
return Optional.empty();
}
// The following (mapToDouble() in particular) causes ClusteredReadPosition to be not added to ClassMap
// leftMedian = median.evaluate(tumorLeftOffsets.stream().mapToDouble( x -> x ).toArray());
// rightMedian = median.evaluate(tumorRightOffsets.stream().mapToDouble( x -> x).toArray());
// until we understand why mapToDouble() causes the above error, have to compute medians in two steps
// first use a for loop to manually cast integer to doubles, then call median :: evaluate
double[] tumorLeftOffsetsDouble = new double[tumorLeftOffsets.size()];
double[] tumorRightOffsetsDouble = new double[tumorRightOffsets.size()];
for (int i = 0; i < tumorLeftOffsets.size(); i++){
tumorLeftOffsetsDouble[i] = (double) tumorLeftOffsets.get(i);
tumorRightOffsetsDouble[i] = (double) tumorRightOffsets.get(i);
}
Median median = new Median();
double leftMedian = median.evaluate(tumorLeftOffsetsDouble);
double rightMedian = median.evaluate(tumorRightOffsetsDouble);
double leftMAD = calculateMAD(tumorLeftOffsets, leftMedian);
double rightMAD = calculateMAD(tumorRightOffsets, rightMedian);
return( Optional.of(new MedianStatistics(leftMedian, rightMedian, leftMAD, rightMAD) ) );
}
private static class MedianStatistics {
private double leftMedian;
private double rightMedian;
private double leftMAD;
private double rightMAD;
public MedianStatistics(double leftMedian, double rightMedian, double leftMAD, double rightMAD) {
this.leftMedian = leftMedian;
this.rightMedian = rightMedian;
this.leftMAD = leftMAD;
this.rightMAD = rightMAD;
}
public double getLeftMedian() {
return leftMedian;
}
public double getRightMedian() {
return rightMedian;
}
public double getLeftMAD() {
return leftMAD;
}
public double getRightMAD() {
return rightMAD;
}
}
/**
Examples below show how we compute the position of the variant with respect to the left and right end of the reads.
Note that a variant may be SNP, deletion, or insertion, and we are counting the number of bases from the left/right end of the read to that variant.
We first compute the left offset. Then, right offset = read length - left offset.
This means that if there is an insertion between the either end of a read and the variant, we count the inserted bases. Conversely, we do not count the deleted bases between the end of a read and a variant.
We count soft-clipped bases.
example 1 : SNP
right offset: 9 8 7 6 5 4 3 2 1 0
ref: _ _ _ _ _ _ _ _ _ _
read: _ _ _ _ x _ _ _ _ _
left offset: 0 1 2 3 4 5 6 7 8 9
left-offset = 4. right offset = 5.
read.getReadLength() = 10. numReadBasesToVariant = 5.
example 2: deletion
We count from the left end of the read to the last non-deleted base i.e. the first deleted base is not counted.
From the right end, we count bases to the *end* of the deletion.
right offset: 9 8 7 6 5 4 3 2 1 0
ref: _ _ _ _ _ _ _ _ _ _
read: _ _ _ _|- - - -|_ _
left offset: 0 1 2 3 4 5 6 7 8 9
left-offset = 3. right-offset = 2.
read.getReadLength() = 6. numReadBasesToVariant = 4
example 3: insertion
For insertions, we count from the left to the first inserted base. From the right, we count all the way to the first inserted base.
In the future, we may modify this; it might be desirable to count from the right to the *last* inserted base.
right offset: 9 8 7 6 5 4 3 2 1 0
ref: _ _ _ _ _ _ _ _
read: _ _ _ I I I _ _ _ _
left offset: 0 1 2 3 4 5 6 7 8 9
left-offset = 3. right offset = 6
read.getReadLength() = 10. numReadBasesToVariant = 4.
*/
/**
* The function assumes that read contains the variant allele.
*
* @param read
* @param variantStartPosition the location of the variant in the reference
* @return
*/
protected Pair<OptionalInt, OptionalInt> getVariantPositionInRead(final GATKSAMRecord read, final int variantStartPosition) {
final Pair<Integer, Boolean> refPositionAndDeletionFlag = ReadUtils.getReadCoordinateForReferenceCoordinate(read.getSoftStart(), read.getCigar(), variantStartPosition, true);
// the +1 is needed there because getReadCoordinateForReferenceCoordinate() returns the number of read bases from the left end to the variant - 1
int numReadBasesFromLeftEndToVariant = refPositionAndDeletionFlag.getFirst() + 1;
// we don't take advantage of fallsInsideOrJustBeforeDeletionOrSkippedRegion flag now, but we might want to, so I will leave it here in comments.
// boolean fallsInsideOrJustBeforeDeletionOrSkippedRegion = refPositionAndDeletionFlag.getSecond();
if ( numReadBasesFromLeftEndToVariant == ReadUtils.CLIPPING_GOAL_NOT_REACHED ) {
return new Pair(OptionalInt.empty(), OptionalInt.empty());
} else {
int leftOffset = numReadBasesFromLeftEndToVariant - 1;
int rightOffset = read.getReadLength() - numReadBasesFromLeftEndToVariant;
return new Pair(OptionalInt.of(leftOffset), OptionalInt.of(rightOffset));
}
}
/**
* Can the read be used in comparative tests between ref / alt bases?
*
* @param read the read to consider
* @return false if MQ is either 0 or unavailable. true otherwise.
*/
private boolean isUsableRead(final GATKSAMRecord read) {
return( read.getMappingQuality() != 0 || read.getMappingQuality() != QualityUtils.MAPPING_QUALITY_UNAVAILABLE);
}
/**
*
* @param offsets a list of integers
* @param median median of the list offsets.
* @return median absolute deviation (median of the list of deviations from the median)
*/
private double calculateMAD(final List<Integer> offsets, final double median) {
// This code is concise but somehow leads to ClusteredReadPosition class being removed from ClassMap.
// mapToDouble() seems to be the trigger
// return new Median().evaluate(offsets.stream().mapToDouble(x -> Math.abs(x - median)).toArray());
double[] medianAbsoluteDeviations = new double[offsets.size()];
for (int i = 0; i < offsets.size(); i++){
medianAbsoluteDeviations[i] = Math.abs(offsets.get(i) - median);
}
return new Median().evaluate(medianAbsoluteDeviations);
}
}

View File

@ -49,60 +49,37 @@
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.gatk.tools.walkers.cancer.m2;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
package org.broadinstitute.gatk.tools.walkers.cancer;
/**
* Collection of Statistical methods and tests used by MuTect
* Created by tsato on 6/27/16.
*/
public class MuTectStats {
public static double calculateMAD(ArrayList<Double> xs, double median) {
ArrayList<Double> deviations = new ArrayList<>(xs.size());
for(double x : xs) {
deviations.add(Math.abs(x - median));
}
return getMedian(deviations);
public class MedianStatistics {
private double leftMedian;
private double rightMedian;
private double leftMAD;
private double rightMAD;
public MedianStatistics(double leftMedian, double rightMedian, double leftMAD, double rightMAD) {
this.leftMedian = leftMedian;
this.rightMedian = rightMedian;
this.leftMAD = leftMAD;
this.rightMAD = rightMAD;
}
public static double getMedian(ArrayList<Double> data) {
Collections.sort(data);
Double result;
if (data.size() % 2 == 1) {
// If the number of entries in the list is not even.
// Get the middle value.
// You must floor the result of the division to drop the
// remainder.
result = data.get((int) Math.floor(data.size()/2) );
} else {
// If the number of entries in the list are even.
// Get the middle two values and average them.
Double lowerMiddle = data.get(data.size()/2 );
Double upperMiddle = data.get(data.size()/2 - 1 );
result = (lowerMiddle + upperMiddle) / 2;
}
return result;
public double getLeftMedian() {
return leftMedian;
}
public static double[] convertIntegersToDoubles(List<Integer> integers)
{
double[] ret = new double[integers.size()];
for (int i=0; i < ret.length; i++)
{
ret[i] = integers.get(i);
}
return ret;
public double getRightMedian() {
return rightMedian;
}
public double getLeftMAD() {
return leftMAD;
}
public double getRightMAD() {
return rightMAD;
}
}

View File

@ -1,191 +0,0 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE
* SOFTWARE LICENSE AGREEMENT
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 ("BROAD") and the LICENSEE and is effective at the date the downloading is completed ("EFFECTIVE DATE").
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. PHONE-HOME FEATURE
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system ("PHONE-HOME") which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE'S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
*
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012-2016 Broad Institute, Inc.
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 5. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 6. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 7. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 8. MISCELLANEOUS
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.gatk.tools.walkers.cancer.m2;
import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.vcf.VCFHeaderLineType;
import htsjdk.variant.vcf.VCFInfoHeaderLine;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation;
import org.broadinstitute.gatk.utils.QualityUtils;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele;
import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.utils.sam.AlignmentUtils;
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
import org.broadinstitute.gatk.utils.sam.ReadUtils;
import java.util.*;
/**
* Created by gauthier on 7/27/15.
*/
public class ClusteredEventsAnnotator extends InfoFieldAnnotation implements ActiveRegionBasedAnnotation {
private String tumorSampleName = null;
@Override
public List<String> getKeyNames() { return Arrays.asList("tumorForwardOffsetMedian","tumorReverseOffsetMedian","tumorForwardOffsetMAD","tumorReverseOffsetMAD"); }
@Override
public List<VCFInfoHeaderLine> getDescriptions() {
//TODO: this needs a lot of re-phrasing
return Arrays.asList(new VCFInfoHeaderLine("TUMOR_FWD_POS_MEDIAN", 1, VCFHeaderLineType.Integer, "Median offset of tumor variant position from positive read end"),
new VCFInfoHeaderLine("TUMOR_FWD_POS_MAD", 1, VCFHeaderLineType.Integer, "Median absolute deviation from the median for tumor forward read positions"),
new VCFInfoHeaderLine("TUMOR_REV_POS_MEDIAN", 1, VCFHeaderLineType.Integer, "Median offset of tumor variant position from negative read end"),
new VCFInfoHeaderLine("TUMOR_REV_POS_MAD", 1, VCFHeaderLineType.Integer, "Median absolute deviation from the median for tumor reverse read positions"));
}
@Override
public Map<String, Object> annotate(final RefMetaDataTracker tracker,
final AnnotatorCompatible walker,
final ReferenceContext ref,
final Map<String, AlignmentContext> stratifiedContexts,
final VariantContext vc,
final Map<String, PerReadAlleleLikelihoodMap> stratifiedPerReadAlleleLikelihoodMap) {
if (tumorSampleName == null){
if (walker instanceof MuTect2 ) {
tumorSampleName = ((MuTect2) walker).tumorSampleName;
} else {
// ts: log error and exit
throw new IllegalStateException("ClusteredEventsAnnotator: walker is not MuTect2");
}
}
final Map<String, Object> map = new HashMap<>();
if ( stratifiedPerReadAlleleLikelihoodMap != null ) {
final PerReadAlleleLikelihoodMap likelihoodMap = stratifiedPerReadAlleleLikelihoodMap.get(tumorSampleName);
MuTect2.logReadInfo("HAVCYADXX150109:2:2209:19034:53394", likelihoodMap.getLikelihoodReadMap().keySet(), "Present inside ClusteredEventsAnnotator:annotate");
if ( likelihoodMap != null && !likelihoodMap.isEmpty() ) {
double[] list = fillQualsFromLikelihoodMap(vc.getStart(), likelihoodMap); // [fwdMedian, revMedian, fwdMAD, revMAD]
final int FWDMEDIAN = 0, REVMEDIAN = 1, FWDMAD = 2, REVMAD = 3; // ts: make a class to contain these values
map.put("TUMOR_FWD_POS_MEDIAN", list[FWDMEDIAN]);
map.put("TUMOR_REV_POS_MEDIAN", list[REVMEDIAN]);
map.put("TUMOR_FWD_POS_MAD", list[FWDMAD]);
map.put("TUMOR_REV_POS_MAD", list[REVMAD]);
}
}
return map;
}
private double[] fillQualsFromLikelihoodMap(final int refLoc,
final PerReadAlleleLikelihoodMap likelihoodMap) {
final ArrayList<Double> tumorFwdOffset = new ArrayList<>();
final ArrayList<Double> tumorRevOffset = new ArrayList<>();
for ( final Map.Entry<GATKSAMRecord, Map<Allele,Double>> el : likelihoodMap.getLikelihoodReadMap().entrySet() ) {
final MostLikelyAllele a = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue());
if ( ! a.isInformative() )
continue; // read is non-informative
final GATKSAMRecord read = el.getKey();
if ( isUsableRead(read, refLoc) ) {
if ( a.getMostLikelyAllele().isReference() )
continue;
final Double valueRight = getElementForRead(read, refLoc, ReadUtils.ClippingTail.RIGHT_TAIL);
if ( valueRight == null )
continue;
tumorFwdOffset.add(valueRight);
final Double valueLeft = getElementForRead(read, refLoc, ReadUtils.ClippingTail.LEFT_TAIL);
if ( valueLeft == null )
continue;
tumorRevOffset.add(valueLeft);
}
}
double fwdMedian = 0.0;
double revMedian = 0.0;
double fwdMAD = 0.0;
double revMAD = 0.0;
if (!tumorFwdOffset.isEmpty() && !tumorRevOffset.isEmpty()) {
fwdMedian = MuTectStats.getMedian(tumorFwdOffset);
revMedian = MuTectStats.getMedian(tumorRevOffset);
fwdMAD = MuTectStats.calculateMAD(tumorFwdOffset, fwdMedian);
revMAD = MuTectStats.calculateMAD(tumorRevOffset, revMedian);
}
return( new double[] {fwdMedian, revMedian, fwdMAD, revMAD} ); // TODO: make an object container instead of array
}
protected Double getElementForRead(final GATKSAMRecord read, final int refLoc, final ReadUtils.ClippingTail tail) {
final int offset = ReadUtils.getReadCoordinateForReferenceCoordinate(read.getSoftStart(), read.getCigar(), refLoc, tail, true);
if ( offset == ReadUtils.CLIPPING_GOAL_NOT_REACHED ) // offset is the number of bases in the read, including inserted bases, from start of read to the variant
return null;
int readPos = AlignmentUtils.calcAlignmentByteArrayOffset(read.getCigar(), offset, false, 0, 0); // readpos is the number of REF bases from start to variant. I would name it as such...
final int numAlignedBases = AlignmentUtils.getNumAlignedBasesCountingSoftClips( read );
if (readPos > numAlignedBases / 2)
readPos = numAlignedBases - (readPos + 1);
return (double)readPos;
}
/**
* Can the read be used in comparative tests between ref / alt bases?
*
* @param read the read to consider
* @param refLoc the reference location
* @return true if this read is meaningful for comparison, false otherwise
*/
protected boolean isUsableRead(final GATKSAMRecord read, final int refLoc) {
return !( read.getMappingQuality() == 0 ||
read.getMappingQuality() == QualityUtils.MAPPING_QUALITY_UNAVAILABLE );
}
}

View File

@ -138,6 +138,9 @@ public class M2ArgumentCollection extends AssemblyBasedCallerArgumentCollection
@Argument(fullName = "enable_strand_artifact_filter", required = false, doc = "turn on strand artifact filter")
public boolean ENABLE_STRAND_ARTIFACT_FILTER = false;
@Argument(fullName = "enable_clustered_read_position_filter", required = false, doc = "turn on clustered read position filter")
public boolean ENABLE_CLUSTERED_READ_POSITION_FILTER = false;
/**
* This argument is used for the M1-style read position filter
*/

View File

@ -205,8 +205,6 @@ public class MuTect2 extends ActiveRegionWalker<List<VariantContext>, Integer> i
private byte MIN_TAIL_QUALITY;
private double log10GlobalReadMismappingRate;
@ArgumentCollection
protected M2ArgumentCollection MTAC = new M2ArgumentCollection();
@ -364,6 +362,9 @@ public class MuTect2 extends ActiveRegionWalker<List<VariantContext>, Integer> i
private VariantAnnotatorEngine initializeVCFOutput() {
// initialize the output VCF header
if (MTAC.ENABLE_CLUSTERED_READ_POSITION_FILTER) {
annotationsToUse.add("ClusteredReadPosition");
}
final VariantAnnotatorEngine annotationEngine = new VariantAnnotatorEngine(Arrays.asList(annotationClassesToUse), annotationsToUse, annotationsToExclude, this, getToolkit());
Set<VCFHeaderLine> headerInfo = new HashSet<>();
@ -418,6 +419,7 @@ public class MuTect2 extends ActiveRegionWalker<List<VariantContext>, Integer> i
headerInfo.add(GATKVCFHeaderLines.getFilterLine(GATKVCFConstants.GERMLINE_RISK_FILTER_NAME));
headerInfo.add(GATKVCFHeaderLines.getFilterLine(GATKVCFConstants.TRIALLELIC_SITE_FILTER_NAME));
headerInfo.add(GATKVCFHeaderLines.getFilterLine(GATKVCFConstants.STRAND_ARTIFACT_FILTER_NAME));
headerInfo.add(GATKVCFHeaderLines.getFilterLine(GATKVCFConstants.CLUSTERED_READ_POSITION_FILTER_NAME));
if ( ! doNotRunPhysicalPhasing ) {
@ -835,10 +837,21 @@ public class MuTect2 extends ActiveRegionWalker<List<VariantContext>, Integer> i
filters.add(GATKVCFConstants.CLUSTERED_EVENTS_FILTER_NAME);
}
// TODO: Add clustered read position filter here
// TODO: Move strand bias filter here
return filters;
// clustered read position filter
if (MTAC.ENABLE_CLUSTERED_READ_POSITION_FILTER){
Double tumorFwdPosMedian = (Double) vc.getAttribute(GATKVCFConstants.MEDIAN_LEFT_OFFSET_KEY);
Double tumorRevPosMedian = (Double) vc.getAttribute(GATKVCFConstants.MEDIAN_RIGHT_OFFSET_KEY);
Double tumorFwdPosMAD = (Double) vc.getAttribute(GATKVCFConstants.MAD_MEDIAN_LEFT_OFFSET_KEY);
Double tumorRevPosMAD = (Double) vc.getAttribute(GATKVCFConstants.MAD_MEDIAN_RIGHT_OFFSET_KEY);
//If the variant is near the read end (median threshold) and the positions are very similar (MAD threshold) then filter
if ( (tumorFwdPosMedian != null && tumorFwdPosMedian <= MTAC.PIR_MEDIAN_THRESHOLD && tumorFwdPosMAD != null && tumorFwdPosMAD <= MTAC.PIR_MAD_THRESHOLD) ||
(tumorRevPosMedian != null && tumorRevPosMedian <= MTAC.PIR_MEDIAN_THRESHOLD && tumorRevPosMAD != null && tumorRevPosMAD <= MTAC.PIR_MAD_THRESHOLD))
filters.add(GATKVCFConstants.CLUSTERED_READ_POSITION_FILTER_NAME);
}
// TODO: Move strand bias filter here
return filters;
}
@ -1313,6 +1326,11 @@ public class MuTect2 extends ActiveRegionWalker<List<VariantContext>, Integer> i
return normalSampleName != null && normalSampleName.equals(rec.getReadGroup().getSample());
}
public String getTumorSampleName(){
return tumorSampleName;
}
// KCIBUL: new stuff -- read up on this!!
/**
* As of GATK 3.3, HaplotypeCaller outputs physical (read-based) information (see version 3.3 release notes and documentation for details). This argument disables that behavior.

View File

@ -209,7 +209,7 @@ public class SomaticGenotypingEngine extends HaplotypeCallerGenotypingEngine {
// TODO: CONFIRM WITH GSA IF IT IS OK TO REMOVE READS FROM THE PRALM (should be... they do it in filterPoorlyModeledReads!)
PerReadAlleleLikelihoodMap tumorPRALM = readAlleleLikelihoods.toPerReadAlleleLikelihoodMap(readAlleleLikelihoods.sampleIndex(tumorSampleName));
filterPRALMForOverlappingReads(tumorPRALM, mergedVC.getReference(), loc, false);
MuTect2.logReadInfo(DEBUG_READ_NAME, tumorPRALM.getLikelihoodReadMap().keySet(), "Present after filtering for overlapping reads");
MuTect2.logReadInfo(DEBUG_READ_NAME, tumorPRALM.getLikelihoodReadMap().keySet(), "Present in Tumor PRALM after filtering for overlapping reads");
// extend to multiple samples
// compute tumor LOD for each alternate allele
@ -249,12 +249,11 @@ public class SomaticGenotypingEngine extends HaplotypeCallerGenotypingEngine {
if (hasNormal) {
normalPRALM = readAlleleLikelihoods.toPerReadAlleleLikelihoodMap(readAlleleLikelihoods.sampleIndex(matchedNormalSampleName));
filterPRALMForOverlappingReads(normalPRALM, mergedVC.getReference(), loc, true);
MuTect2.logReadInfo(DEBUG_READ_NAME, normalPRALM.getLikelihoodReadMap().keySet(), "Present after filtering for overlapping reads");
MuTect2.logReadInfo(DEBUG_READ_NAME, normalPRALM.getLikelihoodReadMap().keySet(), "Present after in Nomral PRALM filtering for overlapping reads");
GenomeLoc eventGenomeLoc = genomeLocParser.createGenomeLoc(activeRegionWindow.getContig(), loc);
Collection<VariantContext> cosmicVC = tracker.getValues(cosmicRod, eventGenomeLoc);
Collection<VariantContext> dbsnpVC = tracker.getValues(dbsnpRod, eventGenomeLoc);
// remove the effect of cosmic from dbSNP
final boolean germlineAtRisk = (!dbsnpVC.isEmpty() && cosmicVC.isEmpty());
@ -320,6 +319,10 @@ public class SomaticGenotypingEngine extends HaplotypeCallerGenotypingEngine {
final PerReadAlleleLikelihoodMap reversePRALM = new PerReadAlleleLikelihoodMap();
splitPRALMintoForwardAndReverseReads(tumorPRALM, forwardPRALM, reversePRALM);
MuTect2.logReadInfo(DEBUG_READ_NAME, tumorPRALM.getLikelihoodReadMap().keySet(), "Present in tumor PRALM after PRALM is split");
MuTect2.logReadInfo(DEBUG_READ_NAME, forwardPRALM.getLikelihoodReadMap().keySet(), "Present in forward PRALM after PRALM is split");
MuTect2.logReadInfo(DEBUG_READ_NAME, reversePRALM.getLikelihoodReadMap().keySet(), "Present in reverse PRALM after PRALM is split");
// TODO: build a new type for probability, likelihood, and log_likelihood. e.g. f_fwd :: probability[], tumorGLs_fwd :: likelihood[]
// TODO: don't want to call getHetGenotypeLogLikelihoods on more than one alternate alelle. May need to overload it to take a scalar f_fwd.
final PerAlleleCollection<Double> alleleFractionsForward = estimateAlleleFraction(mergedVC, forwardPRALM, true);
@ -328,6 +331,22 @@ public class SomaticGenotypingEngine extends HaplotypeCallerGenotypingEngine {
final PerAlleleCollection<Double> alleleFractionsReverse = estimateAlleleFraction(mergedVC, reversePRALM, true);
final PerAlleleCollection<Double> tumorGenotypeLLReverse = getHetGenotypeLogLikelihoods(mergedVC, reversePRALM, originalNormalReadQualities, alleleFractionsReverse);
if( configuration.DEBUG && logger != null ) {
StringBuilder forwardMessage = new StringBuilder("Calculated forward allelic fraction at " + loc + " = [");
StringBuilder reverseMessage = new StringBuilder("Calculated reverse allelic fraction at " + loc + " = [");
for (Allele altAllele : altAlleleFractions.getAltAlleles()){
forwardMessage.append( altAllele + ": " + alleleFractionsForward.getAlt(altAllele) + ", ");
reverseMessage.append( altAllele + ": " + alleleFractionsReverse.getAlt(altAllele) + ", ");
}
forwardMessage.append("]");
reverseMessage.append("]");
logger.info(forwardMessage.toString());
logger.info(reverseMessage.toString());
}
double tumorLod_fwd = tumorGenotypeLLForward.getAlt(alleleWithHighestTumorLOD) - tumorGenotypeLLForward.getRef();
double tumorLod_rev = tumorGenotypeLLReverse.getAlt(alleleWithHighestTumorLOD) - tumorGenotypeLLReverse.getRef();
@ -500,6 +519,10 @@ public class SomaticGenotypingEngine extends HaplotypeCallerGenotypingEngine {
for ( final Allele altAllele : vc.getAlternateAlleles() ) {
int altCount = alleleCounts.getAlt(altAllele);
double alleleFraction = (double) altCount / (refCount + altCount);
// weird case, but I've seen it happen in one strand cases
if (refCount == 0 && altCount == refCount ) {
alleleFraction = 0;
}
alleleFractions.setAlt(altAllele, alleleFraction);
// logger.info("Counted " + refCount + " ref and " + altCount + " alt " );
}

View File

@ -57,8 +57,6 @@ import org.testng.annotations.Test;
import java.util.*;
public class MuTect2IntegrationTest extends WalkerTest {
final static String REF = hg19Reference;
final static String CCLE_MICRO_TUMOR_BAM = privateTestDir + "HCC1143.cghub.ccle.micro.bam";
final static String CCLE_MICRO_NORMAL_BAM = privateTestDir + "HCC1143_BL.cghub.ccle.micro.bam";
final static String CCLE_MICRO_INTERVALS_FILE = privateTestDir + "HCC1143.cghub.ccle.micro.intervals";
@ -72,8 +70,6 @@ public class MuTect2IntegrationTest extends WalkerTest {
final static String DREAM3_TP_INTERVALS_FILE = privateTestDir + "m2_dream3.tp.intervals";
final static String DREAM3_FP_INTERVALS_FILE = privateTestDir + "m2_dream3.fp.intervals";
final static String MULTIALLELIC_TUMOR_BAM = privateTestDir + "m2-multiallelic-tumor.bam";
final String commandLine =
@ -82,7 +78,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
private void M2Test(String tumorBam, String normalBam, String intervals, String args, String md5) {
final String base = String.format(
commandLine,
REF, DBSNP, COSMIC, PON, tumorBam, normalBam, intervals) +
hg19Reference, DBSNP, COSMIC, PON, tumorBam, normalBam, intervals) +
" -o %s ";
final WalkerTestSpec spec = new WalkerTestSpec(base + " " + args, Arrays.asList(md5));
@ -97,7 +93,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
private void m2TumorOnlyTest(String tumorBam, String intervals, String args, String md5) {
final String base = String.format(
"-T MuTect2 --no_cmdline_in_header -dt NONE --disableDithering -alwaysloadVectorHMM -pairHMM LOGLESS_CACHING -ip 50 -R %s --dbsnp %s --cosmic %s --normal_panel %s -I:tumor %s -L %s",
REF, DBSNP, COSMIC, PON, tumorBam, intervals) +
hg19Reference, DBSNP, COSMIC, PON, tumorBam, intervals) +
" -o %s ";
final WalkerTestSpec spec = new WalkerTestSpec(base + " " + args, Arrays.asList(md5));
@ -109,7 +105,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
private void M2TestWithDroppedReads(String tumorBam, String normalBam, String intervals, String args, String md5Variants, String md5Bamout) {
final String base = String.format(
commandLine,
REF, DBSNP, COSMIC, PON, tumorBam, normalBam, intervals) +
hg19Reference, DBSNP, COSMIC, PON, tumorBam, normalBam, intervals) +
" -o %s " +
"-bamout %s --emitDroppedReads";
@ -124,7 +120,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
@Test
public void testMicroRegression() {
M2Test(CCLE_MICRO_TUMOR_BAM, CCLE_MICRO_NORMAL_BAM, CCLE_MICRO_INTERVALS_FILE, "", "dc6d742e85a59b237f5541109a6d343e");
M2Test(CCLE_MICRO_TUMOR_BAM, CCLE_MICRO_NORMAL_BAM, CCLE_MICRO_INTERVALS_FILE, "", "dd3bb9526c85c0aed39545c4639ff138");
}
/**
@ -134,7 +130,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
*/
@Test
public void testTruePositivesDream3() {
M2Test(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, DREAM3_TP_INTERVALS_FILE, "", "7faeb329798cca63a42867404111847c");
M2Test(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, DREAM3_TP_INTERVALS_FILE, "", "5bd540d238916a2b91e827aed3592e59");
}
/**
@ -143,7 +139,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
@Test
public void testTruePositivesDream3TrackedDropped() {
M2TestWithDroppedReads(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, "21:10935369", "",
"a2e6cc12a21219d510b6719ee86c676e",
"48a446d47bb10434cb7f0ee726d15721",
"b536e76870326b4be01b8d6b83c1cf1c");
}
@ -153,7 +149,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
*/
@Test
public void testFalsePositivesDream3() {
M2Test(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, DREAM3_FP_INTERVALS_FILE, "", "fe3adcf8ac45e8ec9a9feb26908f67a9"); // e2413f4166b6ed20be6cdee6616ba43d
M2Test(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, DREAM3_FP_INTERVALS_FILE, "", "c23f794866797f9bbcb3ed04451758be"); // e2413f4166b6ed20be6cdee6616ba43d
}
/**
@ -161,7 +157,7 @@ public class MuTect2IntegrationTest extends WalkerTest {
*/
@Test
public void testContaminationCorrection() {
M2Test(CCLE_MICRO_TUMOR_BAM, CCLE_MICRO_NORMAL_BAM, CCLE_MICRO_INTERVALS_FILE, "-contamination 0.1", "4ffcef4c72ac72b9b8738efdcf3e04e9");
M2Test(CCLE_MICRO_TUMOR_BAM, CCLE_MICRO_NORMAL_BAM, CCLE_MICRO_INTERVALS_FILE, "-contamination 0.1", "c25e48edd704bbb436cd6456d9f47d8b");
}
/**
@ -169,19 +165,18 @@ public class MuTect2IntegrationTest extends WalkerTest {
*/
@Test
public void testTumorOnly(){
m2TumorOnlyTest(CCLE_MICRO_TUMOR_BAM, "2:166000000-167000000", "", "6044780242414820090c5b4b1d4b8ac0");
m2TumorOnlyTest(CCLE_MICRO_TUMOR_BAM, "2:166000000-167000000", "", "2af2253b1f09ea8fd354e1bf2c4612f0");
}
@Test
public void testStrandArtifactFilter(){
M2Test(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, DREAM3_FP_INTERVALS_FILE, "--enable_strand_artifact_filter", "b988ba4b5f3af4674e28b3501bd3b124");
M2Test(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, DREAM3_FP_INTERVALS_FILE, "--enable_strand_artifact_filter", "75c9349ff9f8dc84291396ac50871f64");
}
// @Test
// public void testMultiAllelicSite(){
// // TODO need b38 reference
// m2TumorOnlyTest(MULTIALLELIC_TUMOR_BAM, "1:23558000-23560000", "", "5c7182623391c1faec3f7c05c0506781")
// }
@Test
public void testClusteredReadPositionFilter() {
M2Test(DREAM3_TUMOR_BAM, DREAM3_NORMAL_BAM, DREAM3_FP_INTERVALS_FILE, "--enable_clustered_read_position_filter", "c333f7dc11e39e0713147ad9af2bf4db");
}
}

View File

@ -501,7 +501,7 @@ public class ReadUtils {
if (refBases + cigarElement.getLength() < goal)
shift = cigarElement.getLength();
else
shift = goal - refBases;
shift = goal - refBases; // get to the goal
refBases += shift;
}
@ -515,7 +515,7 @@ public class ReadUtils {
final boolean endsWithinCigar = shift < cigarElement.getLength();
// If it isn't, we need to check the next one. There should *ALWAYS* be a next one
// since we checked if the goal coordinate is within the read length, so this is just a sanity check.
// since we checked if the goal coordinate is within the read length, this is just a sanity check.
if (!endsWithinCigar && !cigarElementIterator.hasNext()) {
if (allowGoalNotReached) {
return new Pair<Integer, Boolean>(CLIPPING_GOAL_NOT_REACHED, false);

View File

@ -138,6 +138,10 @@ public final class GATKVCFConstants {
public static final String TLOD_REV_KEY = "TLOD_REV";
public static final String TUMOR_SB_POWER_FWD_KEY = "TUMOR_SB_POWER_FWD";
public static final String TUMOR_SB_POWER_REV_KEY = "TUMOR_SB_POWER_REV";
public static final String MEDIAN_LEFT_OFFSET_KEY = "MEDIAN_LEFT_OFFSET";
public static final String MEDIAN_RIGHT_OFFSET_KEY = "MEDIAN_RIGHT_OFFSET";
public static final String MAD_MEDIAN_LEFT_OFFSET_KEY = "MAD_LEFT_OFFSET";
public static final String MAD_MEDIAN_RIGHT_OFFSET_KEY = "MAD_RIGHT_OFFSET";
//FORMAT keys
@ -179,6 +183,8 @@ public final class GATKVCFConstants {
public static final String TUMOR_LOD_FILTER_NAME = "t_lod_fstar"; //M2
public static final String TRIALLELIC_SITE_FILTER_NAME = "triallelic_site"; //M2
public static final String STRAND_ARTIFACT_FILTER_NAME = "strand_artifact"; // M2
public static final String CLUSTERED_READ_POSITION_FILTER_NAME = "clustered_read_position"; // M2
// Symbolic alleles
public final static String SYMBOLIC_ALLELE_DEFINITION_HEADER_TAG = "ALT";

View File

@ -72,8 +72,8 @@ public class GATKVCFHeaderLines {
addFilterLine(new VCFFilterHeaderLine(GATKVCFConstants.TUMOR_LOD_FILTER_NAME, "Tumor does not meet likelihood threshold"));
addFilterLine(new VCFFilterHeaderLine(GATKVCFConstants.STR_CONTRACTION_FILTER_NAME, "Site filtered due to contraction of short tandem repeat region"));
addFilterLine(new VCFFilterHeaderLine(GATKVCFConstants.TRIALLELIC_SITE_FILTER_NAME, "Site filtered because more than two alt alleles pass tumor LOD"));
addFilterLine(new VCFFilterHeaderLine(GATKVCFConstants.STRAND_ARTIFACT_FILTER_NAME, "Strand bias detected: evidence for alt allele comes from one read direction only"));
// addFilterLine(new VCFFilterHeaderLine(GATKVCFConstants.CLUSTERED_READ_POSITION_FILTER_NAME, "Variant appears in similar read positions"));
addFilterLine(new VCFFilterHeaderLine(GATKVCFConstants.STRAND_ARTIFACT_FILTER_NAME, "Evidence for alt allele comes from one read direction only"));
addFilterLine(new VCFFilterHeaderLine(GATKVCFConstants.CLUSTERED_READ_POSITION_FILTER_NAME, "Evidence for somatic variant clusters near the ends of reads"));
@ -197,7 +197,7 @@ public class GATKVCFHeaderLines {
addInfoLine(new VCFInfoHeaderLine(BEAGLE_AF_COMP_KEY, 1, VCFHeaderLineType.Integer, "Allele Frequency from Comparison ROD at this site"));
addInfoLine(new VCFInfoHeaderLine(BEAGLE_AN_COMP_KEY, 1, VCFHeaderLineType.Float, "Allele Number from Comparison ROD at this site"));
// M2-related info lines
// More M2-related info lines
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.EVENT_COUNT_IN_HAPLOTYPE_KEY, 1, VCFHeaderLineType.String, "Number of events in this haplotype"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.EVENT_DISTANCE_MAX_KEY, 1, VCFHeaderLineType.Integer, "Maximum distance between events in this active region"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.EVENT_DISTANCE_MIN_KEY, 1, VCFHeaderLineType.Integer, "Minimum distance between events in this active region"));
@ -209,6 +209,10 @@ public class GATKVCFHeaderLines {
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.TLOD_REV_KEY,1,VCFHeaderLineType.Float,"TLOD from reverse reads only"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.TUMOR_SB_POWER_FWD_KEY,1,VCFHeaderLineType.Float,"Strand bias power for forward reads"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.TUMOR_SB_POWER_REV_KEY,1,VCFHeaderLineType.Float,"Stand bias power for reverse reads"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.MEDIAN_LEFT_OFFSET_KEY, 1, VCFHeaderLineType.Float, "Median of the number of bases between the left end of the tumor read and the variant"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.MEDIAN_RIGHT_OFFSET_KEY, 1, VCFHeaderLineType.Float, "Median of the number of bases between the variant and the right end of the tumor read"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.MAD_MEDIAN_LEFT_OFFSET_KEY, 1, VCFHeaderLineType.Float, "Median absolute deviation of medians of the number of bases between the left end of the tumor read and the variant"));
addInfoLine(new VCFInfoHeaderLine(GATKVCFConstants.MAD_MEDIAN_RIGHT_OFFSET_KEY, 1, VCFHeaderLineType.Float, "Median absolute deviation of medians of the number of bases between the variant and the right end of the tumor read"));
}
}

View File

@ -97,6 +97,7 @@ public abstract class BaseTest {
public static final String hg19Reference = "/seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta";
public static final String b36KGReference = "/humgen/1kg/reference/human_b36_both.fasta";
public static final String b37KGReference = "/humgen/1kg/reference/human_g1k_v37.fasta";
public static final String b38Reference = "/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta";
public static final String b37KGReferenceWithDecoy = "/humgen/gsa-hpprojects/GATK/bundle/current/b37/human_g1k_v37_decoy.fasta";
public static final String hg19ReferenceWithChrPrefixInChromosomeNames = "/humgen/gsa-hpprojects/GATK/bundle/current/hg19/ucsc.hg19.fasta";
public static final String GATKDataLocation = "/humgen/gsa-hpprojects/GATK/data/";