diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_BaseQualityRankSumTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_BaseQualityRankSumTest.java new file mode 100644 index 000000000..10439d918 --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_BaseQualityRankSumTest.java @@ -0,0 +1,99 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.vcf.VCFInfoHeaderLine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.broadinstitute.gatk.utils.sam.ReadUtils; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; +import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines; + +import java.util.Arrays; +import java.util.List; + + +/** + * Allele-specific rank Sum Test of REF versus each ALT base quality scores + * + *

This variant-level annotation tests compares the base qualities of the data supporting the reference allele with those supporting each alternate allele. The ideal result is a value close to zero, which indicates there is little to no difference. A negative value indicates that the bases supporting the alternate allele have lower quality scores than those supporting the reference allele. Conversely, a positive value indicates that the bases supporting the alternate allele have higher quality scores than those supporting the reference allele. Finding a statistically significant difference either way suggests that the sequencing process may have been biased or affected by an artifact.

+ * + *

Statistical notes

+ *

The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for base qualities (bases supporting REF vs. bases supporting ALT). See the method document on statistical tests for a more detailed explanation of the ranksum test.

+ * + *

Caveat

+ *

Uninformative reads are not used in these calculations.

+ * + */ +public class AS_BaseQualityRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation { + @Override + public List getKeyNames() { + return Arrays.asList(GATKVCFConstants.AS_BASE_QUAL_RANK_SUM_KEY); + } + + @Override + public String getRawKeyName() { return GATKVCFConstants.AS_RAW_BASE_QUAL_RANK_SUM_KEY;} + + /** + * Get the element for the given read at the given reference position + * + * @param read the read + * @param refLoc the reference position + * @return a Double representing the element to be used in the rank sum test, or null if it should not be used + */ + @Override + protected Double getElementForRead(final GATKSAMRecord read, final int refLoc) { + return (double) read.getBaseQualities()[ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, refLoc, ReadUtils.ClippingTail.RIGHT_TAIL)]; + } + +} \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_FisherStrand.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_FisherStrand.java new file mode 100644 index 000000000..7a8479594 --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_FisherStrand.java @@ -0,0 +1,134 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.VariantContext; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*; +import org.broadinstitute.gatk.utils.QualityUtils; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; + +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + + +/** + * Allele specific strand bias estimated using Fisher's Exact Test + * + */ +public class AS_FisherStrand extends AS_StrandBiasTest implements AS_StandardAnnotation { + + @Override + public List getKeyNames() { + return Collections.singletonList(GATKVCFConstants.AS_FISHER_STRAND_KEY); + } + + @Override + protected Map calculateAnnotationFromLikelihoodMap(final Map stratifiedPerReadAlleleLikelihoodMap, + final VariantContext vc) { + // either SNP with no alignment context, or indels: per-read likelihood map needed + final int[][] table = getContingencyTable(stratifiedPerReadAlleleLikelihoodMap, vc, MIN_COUNT); + //logger.info("VC " + vc); + //printTable(table, 0.0); + return pValueAnnotationForBestTable(table, null); + } + + /** + * Create an annotation for the highest (i.e., least significant) p-value of table1 and table2 + * + * @param table1 a contingency table, may be null + * @param table2 a contingency table, may be null + * @return annotation result for FS given tables + */ + private Map pValueAnnotationForBestTable(final int[][] table1, final int[][] table2) { + if ( table2 == null ) + return table1 == null ? null : annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1)); + else if (table1 == null) + return annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2)); + else { // take the one with the best (i.e., least significant pvalue) + double pvalue1 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1); + double pvalue2 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2); + return annotationForOneTable(Math.max(pvalue1, pvalue2)); + } + } + + /** + * Returns an annotation result given a pValue + * + * @param pValue + * @return a hash map from FS -> phred-scaled pValue + */ + protected Map annotationForOneTable(final double pValue) { + final Object value = String.format("%.3f", QualityUtils.phredScaleErrorRate(Math.max(pValue, MIN_PVALUE))); // prevent INFINITYs + return Collections.singletonMap(getKeyNames().get(0), value); + } + + @Override + protected Map calculateReducedData(AlleleSpecificAnnotationData> combinedData) { + final Map annotationMap = new HashMap<>(); + final Map> perAlleleData = combinedData.getAttributeMap(); + final List refStrandCounts = perAlleleData.get(combinedData.getRefAllele()); + for (final Allele a : perAlleleData.keySet()) { + if(a.equals(combinedData.getRefAllele(),true)) + continue; + final List altStrandCounts = combinedData.getAttribute(a); + final int[][] refAltTable = new int[][] {new int[]{refStrandCounts.get(0),refStrandCounts.get(1)},new int[]{altStrandCounts.get(0),altStrandCounts.get(1)}}; + annotationMap.put(a,QualityUtils.phredScaleErrorRate(Math.max(StrandBiasTableUtils.FisherExactPValueForContingencyTable(refAltTable), MIN_PVALUE))); + } + return annotationMap; + } + + +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_MappingQualityRankSumTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_MappingQualityRankSumTest.java new file mode 100644 index 000000000..b3255fdf1 --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_MappingQualityRankSumTest.java @@ -0,0 +1,87 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.VariantContext; +import htsjdk.variant.vcf.VCFInfoHeaderLine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.pileup.PileupElement; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; +import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines; + +import java.util.Arrays; +import java.util.List; +import java.util.Map; + + +/** + * Allele specific Rank Sum Test for mapping qualities of REF versus each ALT reads + * + * Currently this annotation duplicate the MappingQualityRankSumTest annotation + * + */ +public class AS_MappingQualityRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation { + @Override + public List getKeyNames() { return Arrays.asList(GATKVCFConstants.AS_MAP_QUAL_RANK_SUM_KEY); } + + @Override + public String getRawKeyName() { return GATKVCFConstants.AS_RAW_MAP_QUAL_RANK_SUM_KEY;} + + @Override + protected Double getElementForRead(final GATKSAMRecord read, final int refLoc) { + return (double)read.getMappingQuality(); + } +} \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSAnnotation.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSAnnotation.java new file mode 100644 index 000000000..33930d2a7 --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSAnnotation.java @@ -0,0 +1,192 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.Genotype; +import htsjdk.variant.variantcontext.GenotypesContext; +import htsjdk.variant.variantcontext.VariantContext; +import htsjdk.variant.vcf.VCFHeaderLine; +import htsjdk.variant.vcf.VCFInfoHeaderLine; +import org.apache.log4j.Logger; +import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines; + +import java.util.*; + +/** + * Allele-specific implementation of root-mean-squared annotations + */ +public abstract class AS_RMSAnnotation extends RMSAnnotation { + protected final static Logger logger = Logger.getLogger(AS_RMSAnnotation.class); + protected final String splitDelim = "\\|"; //String.split takes a regex, so we need to escape the pipe + protected final String printDelim = "|"; + protected AnnotatorCompatible callingWalker; + + + @Override + public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set headerLines) { + if (!AnnotationUtils.walkerSupportsAlleleSpecificAnnotations(walker)) + logger.warn("Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output"); + callingWalker = walker; + } + + public List getDescriptions() { + if (AnnotationUtils.walkerRequiresRawData(callingWalker)) + return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getRawKeyName())); + else + return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0))); + } + + //For the raw data here, we're only keeping track of the sum of the squares of our values + //When we go to reduce, we'll use the AD info to get the number of reads + public void calculateRawData(final VariantContext vc, + final Map perReadAlleleLikelihoodMap, + final ReducibleAnnotationData myData) { + + //must use perReadAlleleLikelihoodMap for allele-specific annotations + if (perReadAlleleLikelihoodMap != null) { + if ( perReadAlleleLikelihoodMap.size() == 0 ) + return; + getRMSDataFromPRALM(perReadAlleleLikelihoodMap, myData); + } + else + return; + } + + abstract void getRMSDataFromPRALM(final Map perReadAlleleLikelihoodMap, final ReducibleAnnotationData myData); + + @Override + public Map finalizeRawData(final VariantContext vc, final VariantContext originalVC) { + if (!vc.hasAttribute(getRawKeyName())) + return new HashMap<>(); + final String rawMQdata = vc.getAttributeAsString(getRawKeyName(),null); + if (rawMQdata == null) + return new HashMap<>(); + + final Map annotations = new HashMap<>(); + final ReducibleAnnotationData myData = new AlleleSpecificAnnotationData(originalVC.getAlleles(), rawMQdata); + parseRawDataString(myData); + + final String annotationString = makeFinalizedAnnotationString(vc, myData.getAttributeMap()); + annotations.put(getKeyNames().get(0), annotationString); + return annotations; + } + + @Override + protected void parseRawDataString(final ReducibleAnnotationData myData) { + final String rawDataString = myData.getRawData(); + //get per-allele data by splitting on allele delimiter + final String[] rawDataPerAllele = rawDataString.split(splitDelim); + for (int i=0; i combineRawData(final List vcAlleles, final List annotationList) { + //VC already contains merged alleles from ReferenceConfidenceVariantContextMerger + ReducibleAnnotationData combinedData = new AlleleSpecificAnnotationData(vcAlleles, null); + + for (final ReducibleAnnotationData currentValue : annotationList) { + parseRawDataString(currentValue); + combineAttributeMap(currentValue, combinedData); + + } + final Map annotations = new HashMap<>(); + String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap()); + annotations.put(getRawKeyName(), annotationString); + return annotations; + } + + @Override + public void combineAttributeMap(final ReducibleAnnotationData toAdd, final ReducibleAnnotationData combined) { + //check that alleles match + for (final Allele currentAllele : combined.getAlleles()){ + //combined is initialized with all alleles, but toAdd might have only a subset + if(toAdd.getAttribute(currentAllele) == null) + continue; + if (toAdd.getAttribute(currentAllele) != null && combined.getAttribute(currentAllele) != null) { + combined.putAttribute(currentAllele, (double) combined.getAttribute(currentAllele) + (double) toAdd.getAttribute(currentAllele)); + } + else + combined.putAttribute(currentAllele, toAdd.getAttribute(currentAllele)); + } + } + + protected Map getADcounts(final VariantContext vc) { + final GenotypesContext genotypes = vc.getGenotypes(); + if ( genotypes == null || genotypes.size() == 0 ) { + logger.warn("VC does not have genotypes -- annotations were calculated in wrong order"); + return null; + } + + final Map variantADs = new HashMap<>(); + for(final Allele a : vc.getAlleles()) + variantADs.put(a,0); + + for (final Genotype gt : vc.getGenotypes()) { + if(!gt.hasAD()) { + continue; + } + final int[] ADs = gt.getAD(); + for(int i = 1; i < vc.getNAlleles(); i++) { + variantADs.put(vc.getAlternateAllele(i-1), variantADs.get(vc.getAlternateAllele(i-1))+ADs[i]); //here -1 is to reconcile allele index with alt allele index + } + } + return variantADs; + } +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSMappingQuality.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSMappingQuality.java new file mode 100644 index 000000000..1b44d66bc --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RMSMappingQuality.java @@ -0,0 +1,147 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.VariantContext; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*; +import org.broadinstitute.gatk.utils.QualityUtils; +import org.broadinstitute.gatk.utils.contexts.AlignmentContext; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; + +import java.util.*; + + +/** + * Allele-specific Root Mean Square of the mapping quality of reads across all samples. + * + *

This annotation provides an estimation of the mapping quality of reads supporting each alternate allele in a variant call. Depending on the tool it is called from, it produces either raw data (sum of squared MQs) or the calculated root mean square.

+ * + * The raw data is used to accurately calculate the root mean square when combining more than one sample. + * + *

Statistical notes

+ *

The root mean square is equivalent to the mean of the mapping qualities plus the standard deviation of the mapping qualities.

+ * + *

Related annotations

+ * + * + *

Caveat

+ *

Uninformative reads are not used in these annotations.

+ * + */ +public class AS_RMSMappingQuality extends AS_RMSAnnotation implements AS_StandardAnnotation, ActiveRegionBasedAnnotation { + + protected final String printFormat = "%.2f"; + + public List getKeyNames() { return Arrays.asList(GATKVCFConstants.AS_RMS_MAPPING_QUALITY_KEY); } + + public String getRawKeyName() { return GATKVCFConstants.AS_RAW_RMS_MAPPING_QUALITY_KEY; } + + public void getRMSDataFromPRALM(Map perReadAlleleLikelihoodMap, ReducibleAnnotationData myData) { + //over all the samples in the Map... + for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : perReadAlleleLikelihoodMap.values() ) { + //for each read... + for ( final Map.Entry> readLikelihoods : perReadLikelihoods.getLikelihoodReadMap().entrySet() ) { + final int mq = readLikelihoods.getKey().getMappingQuality(); + if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) { + if (!PerReadAlleleLikelihoodMap.getMostLikelyAllele(readLikelihoods.getValue()).isInformative()) + continue; + final Allele bestAllele =PerReadAlleleLikelihoodMap.getMostLikelyAllele(readLikelihoods.getValue()).getMostLikelyAllele(); + double currSquareSum = 0; + if (myData.hasAttribute(bestAllele)) + currSquareSum += (double)myData.getAttribute(bestAllele); + myData.putAttribute(bestAllele, currSquareSum + mq * mq); + } + } + } + } + + @Override + public String makeRawAnnotationString(final List vcAlleles, final Map perAlleleValues) { + String annotationString = ""; + for (final Allele current : vcAlleles) { + if (!annotationString.isEmpty()) + annotationString += printDelim; + if(perAlleleValues.get(current) != null) + annotationString += String.format(printFormat,perAlleleValues.get(current)); + else + annotationString += String.format(printFormat, 0.0); + } + return annotationString; + } + + //this just overrides the RMSAnnotation function that's used for UG -- we don't do allele-specific annotations for UG + @Override + public String makeFinalizedAnnotationString(final VariantContext vc, final Map perAlleleData, final Map stratifiedContexts, final Map perReadAlleleLikelihoodMap) { + return makeFinalizedAnnotationString(vc, perAlleleData); + } + + @Override + public String makeFinalizedAnnotationString(final VariantContext vc, final Map perAlleleValues) { + final Map variantADs = getADcounts(vc); + String annotationString = ""; + for (final Allele current : vc.getAlternateAlleles()) { + if (!annotationString.isEmpty()) + annotationString += ","; + if (perAlleleValues.containsKey(current)) + annotationString += String.format(printFormat, Math.sqrt((double) perAlleleValues.get(current) / variantADs.get(current))); + else { + logger.warn("ERROR: VC allele is not found in annotation alleles -- maybe there was trimming?"); + } + } + return annotationString; + } +} \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RankSumTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RankSumTest.java new file mode 100644 index 000000000..637f4574f --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_RankSumTest.java @@ -0,0 +1,329 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.VariantContext; +import htsjdk.variant.vcf.VCFHeaderLine; +import htsjdk.variant.vcf.VCFInfoHeaderLine; +import org.apache.log4j.Logger; +import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ReducibleAnnotation; +import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller; +import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs; +import org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs; +import org.broadinstitute.gatk.utils.MannWhitneyU; +import org.broadinstitute.gatk.utils.collections.Pair; +import org.broadinstitute.gatk.utils.contexts.AlignmentContext; +import org.broadinstitute.gatk.utils.contexts.ReferenceContext; +import org.broadinstitute.gatk.utils.exceptions.GATKException; +import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines; + +import java.util.*; + +/** + * Allele-specific implementation of rank sum test annotations + */ +public abstract class AS_RankSumTest extends RankSumTest implements ReducibleAnnotation { + private final static Logger logger = Logger.getLogger(AS_RMSAnnotation.class); + protected final String splitDelim = "\\|"; //String.split takes a regex, so we need to escape the pipe + protected final String printDelim = "|"; + protected final String reducedDelim = ","; + protected AnnotatorCompatible callingWalker; + + @Override + public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set headerLines) { + if (!AnnotationUtils.walkerSupportsAlleleSpecificAnnotations(walker)) + logger.warn("Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output"); + callingWalker = walker; + super.initialize(walker, toolkit, headerLines); + } + + public List getDescriptions() { + if (AnnotationUtils.walkerRequiresRawData(callingWalker)) + return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getRawKeyName())); + else + return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0))); + } + + public Map annotateRawData(final RefMetaDataTracker tracker, + final AnnotatorCompatible walker, + final ReferenceContext ref, + final Map stratifiedContexts, + final VariantContext vc, + final Map perReadAlleleLikelihoodMap ) { + + if ( perReadAlleleLikelihoodMap == null) + return new HashMap<>(); + + final Map annotations = new HashMap<>(); + final AlleleSpecificAnnotationData> myData = initializeNewAnnotationData(vc.getAlleles()); + calculateRawData(vc, perReadAlleleLikelihoodMap, myData); + final String annotationString = makeRawAnnotationString(vc.getAlleles(), myData.getAttributeMap()); + annotations.put(getRawKeyName(), annotationString); + return annotations; + } + + protected void parseRawDataString(final ReducibleAnnotationData> myData) { + final String rawDataString = myData.getRawData(); + String rawDataNoBrackets; + final Map> perAlleleValues = new HashMap<>(); + //Initialize maps + for (final Allele current : myData.getAlleles()) { + perAlleleValues.put(current, new CompressedDataList()); + } + //Map gives back list with [] + if (rawDataString.charAt(0) == '[') { + rawDataNoBrackets = rawDataString.substring(1, rawDataString.length() - 1); + } + else { + rawDataNoBrackets = rawDataString; + } + //rawDataPerAllele is the list of values for each allele (each of variable length) + final String[] rawDataPerAllele = rawDataNoBrackets.split(splitDelim); + for (int i=0; i alleleList = perAlleleValues.get(myData.getAlleles().get(i)); + final String[] rawListEntriesAsStringVector = alleleData.split(","); + if (rawListEntriesAsStringVector.length %2 != 0) + throw new GATKException("ERROR: rank sum test raw annotation data must occur in pairs"); + for (int j=0; j combineRawData(final List vcAlleles, final List annotationList) { + //VC already contains merged alleles from ReferenceConfidenceVariantContextMerger + final ReducibleAnnotationData combinedData = initializeNewAnnotationData(vcAlleles); + + for (final ReducibleAnnotationData currentValue : annotationList) { + parseRawDataString(currentValue); + combineAttributeMap(currentValue, combinedData); + + } + final Map annotations = new HashMap<>(); + final String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap()); + annotations.put(getRawKeyName(), annotationString); + return annotations; + } + + protected AlleleSpecificAnnotationData initializeNewAnnotationData(final List vcAlleles) { + Map> perAlleleValues = new HashMap<>(); + for (Allele a : vcAlleles) { + perAlleleValues.put(a, new CompressedDataList()); + } + AlleleSpecificAnnotationData ret = new AlleleSpecificAnnotationData(vcAlleles, perAlleleValues.toString()); + ret.setAttributeMap(perAlleleValues); + return ret; + } + + protected void combineAttributeMap(final ReducibleAnnotationData> toAdd, final ReducibleAnnotationData> combined) { + for (final Allele a : combined.getAlleles()) { + if (toAdd.hasAttribute(a)) { + final CompressedDataList alleleData = combined.getAttribute(a); + alleleData.add(toAdd.getAttribute(a)); + combined.putAttribute(a, alleleData); + } + } + } + + protected String makeRawAnnotationString(final List vcAlleles, final Map> perAlleleValues) { + String annotationString = ""; + for (int i =0; i< vcAlleles.size(); i++) { + if (i!=0) + annotationString += printDelim; + CompressedDataList alleleValues = perAlleleValues.get(vcAlleles.get(i)); + annotationString += alleleValues.toString(); + } + return annotationString; + } + + protected String makeReducedAnnotationString(VariantContext vc, Map perAltRankSumResults) { + String annotationString = ""; + for (final Allele a : vc.getAlternateAlleles()) { + if (!annotationString.isEmpty()) + annotationString += reducedDelim; + if (!perAltRankSumResults.containsKey(a)) + logger.warn("ERROR: VC allele not found in annotation alleles -- maybe there was trimming?"); + else + annotationString += String.format("%.3f", perAltRankSumResults.get(a)); + } + return annotationString; + } + + /** + * + * @param vc -- contains the final set of alleles, possibly subset by GenotypeGVCFs + * @param originalVC -- used to get all the alleles for all gVCFs + * @return + */ + public Map finalizeRawData(final VariantContext vc, final VariantContext originalVC) { + if (!vc.hasAttribute(getRawKeyName())) + return new HashMap<>(); + + final String rawRankSumData = vc.getAttributeAsString(getRawKeyName(),null); + if (rawRankSumData == null) + return new HashMap<>(); + + final Map annotations = new HashMap<>(); + final AlleleSpecificAnnotationData> myData = new AlleleSpecificAnnotationData(originalVC.getAlleles(), rawRankSumData); + parseRawDataString(myData); + + final Map perAltRankSumResults = calculateReducedData(myData.getAttributeMap(), myData.getRefAllele()); + //shortcut for no ref values + if (perAltRankSumResults.isEmpty()) + return annotations; + final String annotationString = makeReducedAnnotationString(vc, perAltRankSumResults); + annotations.put(getKeyNames().get(0), annotationString); + return annotations; + } + + public void calculateRawData(VariantContext vc, Map pralm, ReducibleAnnotationData myData) { + if(pralm == null) + return; + + final Map> perAlleleValues = myData.getAttributeMap(); + for ( final PerReadAlleleLikelihoodMap likelihoodMap : pralm.values() ) { + if ( likelihoodMap != null && !likelihoodMap.isEmpty() ) { + fillQualsFromLikelihoodMap(vc.getAlleles(), vc.getStart(), likelihoodMap, perAlleleValues); + } + } + + } + + private void fillQualsFromLikelihoodMap(final List alleles, + final int refLoc, + final PerReadAlleleLikelihoodMap likelihoodMap, + final Map> perAlleleValues) { + for ( final Map.Entry> el : likelihoodMap.getLikelihoodReadMap().entrySet() ) { + final MostLikelyAllele a = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue()); + if ( ! a.isInformative() ) + continue; // read is non-informative + + final GATKSAMRecord read = el.getKey(); + if ( isUsableRead(read, refLoc) ) { + final Double value = getElementForRead(read, refLoc, a); + if ( value == null ) + continue; + + if(perAlleleValues.containsKey(a.getMostLikelyAllele())) + perAlleleValues.get(a.getMostLikelyAllele()).add(value.intValue()); + } + } + } + + public Map calculateReducedData(final Map> perAlleleValues, final Allele ref) { + final Map perAltRankSumResults = new HashMap<>(); + //shortcut to not try to calculate rank sum if there are no reads that unambiguously support the ref + if (perAlleleValues.get(ref).isEmpty()) + return perAltRankSumResults; + for (final Allele alt : perAlleleValues.keySet()) { + if (alt.equals(ref, false)) + continue; + final MannWhitneyU mannWhitneyU = new MannWhitneyU(useDithering); + //load alts + for (final Number qual : perAlleleValues.get(alt)) { + mannWhitneyU.add(qual, MannWhitneyU.USet.SET1); + } + //load refs + for (final Number qual : perAlleleValues.get(ref)) { + mannWhitneyU.add(qual, MannWhitneyU.USet.SET2); + } + + if (DEBUG) { + System.out.format("%s, REF QUALS:", this.getClass().getName()); + for (final Number qual : perAlleleValues.get(ref)) + System.out.format("%d ", qual); + System.out.println(); + System.out.format("%s, ALT QUALS:", this.getClass().getName()); + for (final Number qual : perAlleleValues.get(alt)) + System.out.format("%d ", qual); + System.out.println(); + + } + // we are testing that set1 (the alt bases) have lower quality scores than set2 (the ref bases) + final Pair testResults = mannWhitneyU.runOneSidedTest(MannWhitneyU.USet.SET1); + perAltRankSumResults.put(alt, testResults.first); + } + return perAltRankSumResults; + } + +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_ReadPosRankSumTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_ReadPosRankSumTest.java new file mode 100644 index 000000000..5cd4e060b --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_ReadPosRankSumTest.java @@ -0,0 +1,107 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + + +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AS_StandardAnnotation; +import org.broadinstitute.gatk.utils.pileup.PileupElement; +import org.broadinstitute.gatk.utils.sam.AlignmentUtils; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.broadinstitute.gatk.utils.sam.ReadUtils; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; + +import java.util.Arrays; +import java.util.List; + +/** + * Allele-specific Rank Sum Test for relative positioning of REF versus each ALT allele within reads + * + *

This variant-level annotation tests whether there is evidence of bias in the position of alleles within the reads that support them, between the reference and each alternate allele. Seeing an allele only near the ends of reads is indicative of error, because that is where sequencers tend to make the most errors. However, some variants located near the edges of sequenced regions will necessarily be covered by the ends of reads, so we can't just set an absolute "minimum distance from end of read" threshold. That is why we use a rank sum test to evaluate whether there is a difference in how well the reference allele and the alternate allele are supported.

+ * + *

The ideal result is a value close to zero, which indicates there is little to no difference in where the alleles are found relative to the ends of reads. A negative value indicates that the alternate allele is found at the ends of reads more often than the reference allele. Conversely, a positive value indicates that the reference allele is found at the ends of reads more often than the alternate allele.

+ * + *

This annotation can be used to evaluate confidence in a variant call and is a recommended covariate for variant recalibration (VQSR). Finding a statistically significant difference in relative position either way suggests that the sequencing process may have been biased or affected by an artifact. In practice, we only filter out low negative values when evaluating variant quality because the idea is to filter out variants for which the quality of the data supporting the alternate allele is comparatively low. The reverse case, where it is the quality of data supporting the reference allele that is lower (resulting in positive ranksum scores), is not really informative for filtering variants.

+ * + *

Statistical notes

+ *

The value output for this annotation is the u-based z-approximation from the Mann-Whitney-Wilcoxon Rank Sum Test for site position within reads (position within reads supporting REF vs. position within reads supporting ALT). See the method document on statistical tests for a more detailed explanation of the ranksum test.

+ * + *

Caveat

+ *

Uninformative reads are not used in these annotations.

+ * + * + */ +public class AS_ReadPosRankSumTest extends AS_RankSumTest implements AS_StandardAnnotation { + + @Override + public List getKeyNames() { return Arrays.asList(GATKVCFConstants.AS_READ_POS_RANK_SUM_KEY); } + + @Override + public String getRawKeyName() { return GATKVCFConstants.AS_RAW_READ_POS_RANK_SUM_KEY;} + + @Override + protected Double getElementForRead(final GATKSAMRecord read, final int refLoc) { + final int offset = ReadUtils.getReadCoordinateForReferenceCoordinate(read.getSoftStart(), read.getCigar(), refLoc, ReadUtils.ClippingTail.RIGHT_TAIL, true); + if ( offset == ReadUtils.CLIPPING_GOAL_NOT_REACHED ) + return null; + + int readPos = AlignmentUtils.calcAlignmentByteArrayOffset(read.getCigar(), offset, false, 0, 0); + final int numAlignedBases = AlignmentUtils.getNumAlignedBasesCountingSoftClips( read ); + if (readPos > numAlignedBases / 2) + readPos = numAlignedBases - (readPos + 1); + return (double)readPos; + } + + @Override + protected boolean isUsableRead(final GATKSAMRecord read, final int refLoc) { + return super.isUsableRead(read, refLoc) && read.getSoftStart() + read.getCigar().getReadLength() > refLoc; + } +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandBiasTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandBiasTest.java new file mode 100644 index 000000000..acda6c7ad --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandBiasTest.java @@ -0,0 +1,379 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.GenotypesContext; +import htsjdk.variant.variantcontext.VariantContext; +import htsjdk.variant.vcf.VCFHeaderLine; +import htsjdk.variant.vcf.VCFInfoHeaderLine; +import org.apache.log4j.Logger; +import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ReducibleAnnotation; +import org.broadinstitute.gatk.utils.contexts.AlignmentContext; +import org.broadinstitute.gatk.utils.contexts.ReferenceContext; +import org.broadinstitute.gatk.utils.exceptions.GATKException; +import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; +import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines; + +import java.util.*; + +/** + * Allele-specific implementation of strand bias annotations + */ +public abstract class AS_StrandBiasTest extends StrandBiasTest implements ReducibleAnnotation { + private final static Logger logger = Logger.getLogger(StrandBiasTest.class); + protected final String splitDelim = "\\|"; //String.split takes a regex, so we need to escape the pipe + protected final String printDelim = "|"; + protected final String reducedDelim = ","; + protected AnnotatorCompatible callingWalker; + protected final int MIN_COUNT = 2; + protected static final double MIN_PVALUE = 1E-320; + protected final int FORWARD = 0; + protected final int REVERSE = 1; + protected final ArrayList ZERO_LIST = new ArrayList<>(); + + @Override + public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set headerLines) { + if (!AnnotationUtils.walkerSupportsAlleleSpecificAnnotations(walker)) + logger.warn("Allele-specific annotations can only be used with HaplotypeCaller, CombineGVCFs and GenotypeGVCFs -- no data will be output"); + callingWalker = walker; + ZERO_LIST.add(0,0); + ZERO_LIST.add(1,0); + } + + @Override + public List getDescriptions() { + if (AnnotationUtils.walkerRequiresRawData(callingWalker)) + return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getRawKeyName())); + else + return Arrays.asList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0))); + } + + @Override + public String getRawKeyName() { return GATKVCFConstants.AS_SB_TABLE_KEY; } + + public Map annotateRawData(final RefMetaDataTracker tracker, + final AnnotatorCompatible walker, + final ReferenceContext ref, + final Map stratifiedContexts, + final VariantContext vc, + final Map perReadAlleleLikelihoodMap ) { + + //for allele-specific annotations we only call from HC and we only use perReadAlleleLikelihoodMap + if ( perReadAlleleLikelihoodMap == null) + return new HashMap<>(); + + // calculate the annotation from the stratified per read likelihood map + // stratifiedPerReadAllelelikelihoodMap can come from HaplotypeCaller call to VariantAnnotatorEngine + else if (perReadAlleleLikelihoodMap != null) { + final HashMap annotations = new HashMap<>(); + final ReducibleAnnotationData> myData = new AlleleSpecificAnnotationData<>(vc.getAlleles(),null); + calculateRawData(vc, perReadAlleleLikelihoodMap, myData); + final Map> perAlleleValues = myData.getAttributeMap(); + final String annotationString = makeRawAnnotationString(vc.getAlleles(), perAlleleValues); + annotations.put(getRawKeyName(), annotationString); + return annotations; + } + else { + // for non-snp variants, we need per-read likelihoods. + // for snps, we can get same result from simple pileup + // for indels that do not have a computed strand bias (SB) or strand bias by sample (SBBS) + return null; + } + } + + protected void parseRawDataString(ReducibleAnnotationData> myData) { + final String rawDataString = myData.getRawData(); + String[] rawDataPerAllele; + String[] rawListEntriesAsStringVector; + Map> perAlleleValues = new HashMap<>(); + //Initialize maps + for (Allele current : myData.getAlleles()) { + perAlleleValues.put(current, new LinkedList()); + } + //rawDataPerAllele is the list of values for each allele (each of variable length) + rawDataPerAllele = rawDataString.split(splitDelim); + for (int i=0; i alleleList = perAlleleValues.get(myData.getAlleles().get(i)); + rawListEntriesAsStringVector = alleleData.split(","); + //Read counts will only ever be integers + for (String s : rawListEntriesAsStringVector) { + if (!s.isEmpty()) + alleleList.add(Integer.parseInt(s.trim())); + } + } + myData.setAttributeMap(perAlleleValues); + } + + @Override + public Map combineRawData(final List vcAlleles, final List annotationList) { + //VC already contains merged alleles from ReferenceConfidenceVariantContextMerger + ReducibleAnnotationData combinedData = new AlleleSpecificAnnotationData(vcAlleles, null); + + for (final ReducibleAnnotationData currentValue : annotationList) { + parseRawDataString(currentValue); + combineAttributeMap(currentValue, combinedData); + } + final Map annotations = new HashMap<>(); + final String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap()); + annotations.put(getRawKeyName(), annotationString); + return annotations; + } + + protected void combineAttributeMap(final ReducibleAnnotationData> toAdd, final ReducibleAnnotationData> combined) { + for (final Allele a : combined.getAlleles()) { + if (toAdd.hasAttribute(a) && toAdd.getAttribute(a) != null) { + if (combined.getAttribute(a) != null) { + combined.getAttribute(a).set(0, (int) combined.getAttribute(a).get(0) + (int) toAdd.getAttribute(a).get(0)); + combined.getAttribute(a).set(1, (int) combined.getAttribute(a).get(1) + (int) toAdd.getAttribute(a).get(1)); + } + else { + List alleleData = new ArrayList<>(); + alleleData.add(0, toAdd.getAttribute(a).get(0)); + alleleData.add(1, toAdd.getAttribute(a).get(1)); + combined.putAttribute(a,alleleData); + } + } + } + } + + protected String makeRawAnnotationString(final List vcAlleles, final Map> perAlleleValues) { + String annotationString = ""; + for (final Allele a : vcAlleles) { + if (!annotationString.isEmpty()) + annotationString += printDelim; + List alleleValues = perAlleleValues.get(a); + if (alleleValues == null) + alleleValues = ZERO_LIST; + annotationString += encode(alleleValues); + } + return annotationString; + } + + protected String encode(List alleleValues) { + String annotationString = ""; + for (int j =0; j < alleleValues.size(); j++) { + annotationString += alleleValues.get(j); + if (j < alleleValues.size()-1) + annotationString += ","; + } + return annotationString; + } + + + + protected String makeReducedAnnotationString(VariantContext vc, Map perAltsStrandCounts) { + String annotationString = ""; + for (Allele a : vc.getAlternateAlleles()) { + if (!annotationString.isEmpty()) + annotationString += reducedDelim; + if (!perAltsStrandCounts.containsKey(a)) + logger.warn("ERROR: VC allele not found in annotation alleles -- maybe there was trimming?"); + else + annotationString += String.format("%.3f", perAltsStrandCounts.get(a)); + } + return annotationString; + } + + /** + * + * @param vc -- contains the final set of alleles, possibly subset by GenotypeGVCFs + * @param originalVC -- used to get all the alleles for all gVCFs + * @return + */ + @Override + public Map finalizeRawData(final VariantContext vc, final VariantContext originalVC) { + Map annotations = new HashMap<>(); + if (!vc.hasAttribute(getRawKeyName())) + return new HashMap<>(); + String rawRankSumData = vc.getAttributeAsString(getRawKeyName(),null); + if (rawRankSumData == null) + return new HashMap<>(); + + AlleleSpecificAnnotationData> myData = new AlleleSpecificAnnotationData<>(originalVC.getAlleles(), rawRankSumData); + parseRawDataString(myData); + + Map perAltRankSumResults = calculateReducedData(myData); + + String annotationString = makeReducedAnnotationString(vc, perAltRankSumResults); + annotations.put(getKeyNames().get(0), annotationString); + return annotations; + } + + @Override + public void calculateRawData(final VariantContext vc, Map pralm, final ReducibleAnnotationData rawAnnotations) { + if(pralm == null) + return; + + getStrandCountsFromLikelihoodMap(vc, pralm, rawAnnotations, MIN_COUNT); + } + + protected abstract Map calculateReducedData(final AlleleSpecificAnnotationData> combinedData ); + + /** + Allocate and fill a 2x2 strand contingency table. In the end, it'll look something like this: + * fw rc + * allele1 # # + * allele2 # # + * @return a 2x2 contingency table + */ + public void getStrandCountsFromLikelihoodMap( final VariantContext vc, + final Map stratifiedPerReadAlleleLikelihoodMap, + final ReducibleAnnotationData perAlleleValues, + final int minCount) { + if( stratifiedPerReadAlleleLikelihoodMap == null ) + return; + if( vc == null ) + return; + + final Allele ref = vc.getReference(); + final List allAlts = vc.getAlternateAlleles(); + + for (final PerReadAlleleLikelihoodMap maps : stratifiedPerReadAlleleLikelihoodMap.values() ) { + final ReducibleAnnotationData> sampleTable = new AlleleSpecificAnnotationData<>(vc.getAlleles(),null); + for (final Map.Entry> el : maps.getLikelihoodReadMap().entrySet()) { + final MostLikelyAllele mostLikelyAllele = PerReadAlleleLikelihoodMap.getMostLikelyAllele(el.getValue()); + final GATKSAMRecord read = el.getKey(); + updateTable(mostLikelyAllele.getAlleleIfInformative(), read, ref, allAlts, sampleTable); + } + //for each sample (value in stratified PRALM), only include it if there are >minCount informative reads + if ( passesMinimumThreshold(sampleTable, minCount) ) + combineAttributeMap(sampleTable, perAlleleValues); + } + } + + private void updateTable(final Allele bestAllele, final GATKSAMRecord read, final Allele ref, final List allAlts, final ReducibleAnnotationData> perAlleleValues) { + + final boolean matchesRef = bestAllele.equals(ref, true); + final boolean matchesAnyAlt = allAlts.contains(bestAllele); + + //for uninformative reads + if(bestAllele.isNoCall()) + return; + + //can happen if a read's most likely allele has been removed when --max_alternate_alleles is exceeded + if (!( matchesRef || matchesAnyAlt )) + return; + + final List alleleStrandCounts; + if (perAlleleValues.hasAttribute(bestAllele) && perAlleleValues.getAttribute(bestAllele) != null) + alleleStrandCounts = perAlleleValues.getAttribute(bestAllele); + else { + alleleStrandCounts = new ArrayList<>(); + alleleStrandCounts.add(0,0); + alleleStrandCounts.add(1,0); + } + if (read.isStrandless()) { + // a strandless read counts as observations on both strand, at 50% weight, with a minimum of 1 + // (the 1 is to ensure that a strandless read always counts as an observation on both strands, even + // if the read is only seen once, because it's a merged read or other) + alleleStrandCounts.set(FORWARD, alleleStrandCounts.get(FORWARD)+1); + alleleStrandCounts.set(REVERSE, alleleStrandCounts.get(REVERSE)+1); + } else { + // a normal read with an actual strand + final boolean isFW = !read.getReadNegativeStrandFlag(); + if (isFW) + alleleStrandCounts.set(FORWARD, alleleStrandCounts.get(FORWARD)+1); + else + alleleStrandCounts.set(REVERSE, alleleStrandCounts.get(REVERSE)+1); + } + perAlleleValues.putAttribute(bestAllele, alleleStrandCounts); + } + + /** + * Does this strand data array pass the minimum threshold for inclusion? + * + * @param sampleTable the per-allele fwd/rev read counts for a single sample + * @param minCount The minimum threshold of counts in the array + * @return true if it passes the minimum threshold, false otherwise + */ + protected boolean passesMinimumThreshold(final ReducibleAnnotationData> sampleTable, final int minCount) { + // the read total must be greater than MIN_COUNT + int readTotal = 0; + for (final List alleleValues : sampleTable.getAttributeMap().values()) { + if (alleleValues != null) { + readTotal += alleleValues.get(FORWARD); + readTotal += alleleValues.get(REVERSE); + } + } + return readTotal > minCount; + } + + + @Override + //Allele-specific annotations cannot be called from walkers other than HaplotypeCaller + protected Map calculateAnnotationFromGTfield(final GenotypesContext genotypes){ + return new HashMap<>(); + } + + @Override + //Allele-specific annotations cannot be called from walkers other than HaplotypeCaller + protected Map calculateAnnotationFromStratifiedContexts(final Map stratifiedContexts, + final VariantContext vc){ + return new HashMap<>(); + } + + @Override + //This just calls the non-allele-specific code in StrandBiasTest.java + protected abstract Map calculateAnnotationFromLikelihoodMap(final Map stratifiedPerReadAlleleLikelihoodMap, + final VariantContext vc); + +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandOddsRatio.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandOddsRatio.java new file mode 100644 index 000000000..40cfe39b2 --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AS_StrandOddsRatio.java @@ -0,0 +1,123 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.VariantContext; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; + +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * Allele-specific strand bias estimated by the Symmetric Odds Ratio test + * + */ +public class AS_StrandOddsRatio extends AS_StrandBiasTest implements AS_StandardAnnotation { + + @Override + public List getKeyNames() { + return Collections.singletonList(GATKVCFConstants.AS_STRAND_ODDS_RATIO_KEY); + } + + @Override + protected Map calculateAnnotationFromLikelihoodMap(Map stratifiedPerReadAlleleLikelihoodMap, + final VariantContext vc){ + // either SNP with no alignment context, or indels: per-read likelihood map needed + final int[][] table = getContingencyTable(stratifiedPerReadAlleleLikelihoodMap, vc, MIN_COUNT); + final double ratio = calculateSOR(table); + return Collections.singletonMap(getKeyNames().get(0), (Object)String.format("%.3f",ratio)); + } + + @Override + protected Map calculateReducedData(AlleleSpecificAnnotationData> combinedData) { + final Map annotationMap = new HashMap<>(); + final Map> perAlleleData = combinedData.getAttributeMap(); + final List refStrandCounts = perAlleleData.get(combinedData.getRefAllele()); + for (final Allele a : perAlleleData.keySet()) { + List altStrandCounts = perAlleleData.get(a); + int[][] refAltTable = new int[][] {new int[]{refStrandCounts.get(0),refStrandCounts.get(1)},new int[]{altStrandCounts.get(0),altStrandCounts.get(1)}}; + annotationMap.put(a,calculateSOR(refAltTable)); + } + return annotationMap; + } + + /** + * Computes the SOR value of a table after augmentation (adding pseudocounts). Based on the symmetric odds ratio but modified to take on + * low values when the reference +/- read count ratio is skewed but the alt count ratio is not. Natural log is taken + * to keep values within roughly the same range as other annotations. + * + * Adding pseudocounts prevent divide-by-zero. + * + * @param originalTable The table before augmentation + * @return the SOR annotation value + */ + final protected double calculateSOR(final int[][] originalTable) { + final double[][] augmentedTable = StrandBiasTableUtils.augmentContingencyTable(originalTable); + + double ratio = 0; + + ratio += (augmentedTable[0][0] / augmentedTable[0][1]) * (augmentedTable[1][1] / augmentedTable[1][0]); + ratio += (augmentedTable[0][1] / augmentedTable[0][0]) * (augmentedTable[1][0] / augmentedTable[1][1]); + + final double refRatio = (Math.min(augmentedTable[0][0], augmentedTable[0][1])/Math.max(augmentedTable[0][0], augmentedTable[0][1])); + final double altRatio = (Math.min(augmentedTable[1][0], augmentedTable[1][1])/Math.max(augmentedTable[1][0], augmentedTable[1][1])); + + ratio = ratio*refRatio/altRatio; + + return Math.log(ratio); + } +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AnnotationUtils.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AnnotationUtils.java index 48119e9a2..843f95108 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AnnotationUtils.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AnnotationUtils.java @@ -51,18 +51,40 @@ package org.broadinstitute.gatk.tools.walkers.annotator; +import htsjdk.samtools.Cigar; +import htsjdk.samtools.CigarElement; +import htsjdk.samtools.CigarOperator; +import htsjdk.samtools.SAMRecord; import htsjdk.variant.variantcontext.Genotype; import org.apache.log4j.Logger; import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; +import org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel; +import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs; +import org.broadinstitute.gatk.tools.walkers.variantutils.GenotypeGVCFs; import org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException; import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; public class AnnotationUtils { - public static final String ANNOTATION_HC_WARN_MSG = " annotation will not be calculated, must be called from HaplotyepCaller"; + public static final String ANNOTATION_HC_WARN_MSG = " annotation will not be calculated, must be called from HaplotypeCaller"; public static final int WARNINGS_LOGGED_SIZE = 3; + /** + * Checks if the walker is compatible with allele-specific annotations + */ + public static boolean walkerSupportsAlleleSpecificAnnotations(final AnnotatorCompatible walker) { + return ((walker instanceof HaplotypeCaller) || (walker instanceof CombineGVCFs) || (walker instanceof GenotypeGVCFs)); + } + + /** + * Checks if the walker should get raw annotation data + */ + public static boolean walkerRequiresRawData(final AnnotatorCompatible walker) { + return ((walker instanceof HaplotypeCaller && ((HaplotypeCaller) walker).emitReferenceConfidence()) || walker instanceof CombineGVCFs); + } + /** * Checks if the input data is appropriate * @@ -129,4 +151,98 @@ public class AnnotationUtils { return true; } + + + /** + * Get the position of a variant within a read with respect to the closer end, accounting for hard clipped bases and low quality ends + * Used by ReadPosRankSum annotations + * + * @param read a read containing the variant + * @param initialReadPosition the position based on the modified, post-hard-clipped CIGAR + * @return read position + */ + public static int getFinalVariantReadPosition(final GATKSAMRecord read, final int initialReadPosition) { + final int numAlignedBases = getNumAlignedBases(read); + + int readPos = initialReadPosition; + //TODO: this doesn't work for the middle-right position if we index from zero + if (initialReadPosition > numAlignedBases / 2) { + readPos = numAlignedBases - (initialReadPosition + 1); + } + return readPos; + + } + + /** + * + * @param read a read containing the variant + * @return the number of hard clipped and low qual bases at the read start (where start is the leftmost end w.r.t. the reference) + */ + public static int getNumClippedBasesAtStart(final SAMRecord read) { + // check for hard clips (never consider these bases): + final Cigar c = read.getCigar(); + final CigarElement first = c.getCigarElement(0); + + int numStartClippedBases = 0; + if (first.getOperator() == CigarOperator.H) { + numStartClippedBases = first.getLength(); + } + final byte[] unclippedReadBases = read.getReadBases(); + final byte[] unclippedReadQuals = read.getBaseQualities(); + + // Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative, + // and may leave a string of Q2 bases still hanging off the reads. + //TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails + for (int i = numStartClippedBases; i < unclippedReadBases.length; i++) { + if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD) + numStartClippedBases++; + else + break; + + } + + return numStartClippedBases; + } + + + /** + * + * @param read a read containing the variant + * @return number of non-hard clipped, aligned bases (excluding low quality bases at either end) + */ + //TODO: this is bizarre -- this code counts hard clips, but then subtracts them from the read length, which already doesn't count hard clips + public static int getNumAlignedBases(final GATKSAMRecord read) { + return read.getReadLength() - getNumClippedBasesAtStart(read) - getNumClippedBasesAtEnd(read); + } + + + /** + * + * @param read a read containing the variant + * @return number of hard clipped and low qual bases at the read end (where end is right end w.r.t. the reference) + */ + public static int getNumClippedBasesAtEnd(final GATKSAMRecord read) { + // check for hard clips (never consider these bases): + final Cigar c = read.getCigar(); + CigarElement last = c.getCigarElement(c.numCigarElements() - 1); + + int numEndClippedBases = 0; + if (last.getOperator() == CigarOperator.H) { + numEndClippedBases = last.getLength(); + } + final byte[] unclippedReadBases = read.getReadBases(); + final byte[] unclippedReadQuals = read.getBaseQualities(); + + // Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative, + // and may leave a string of Q2 bases still hanging off the reads. + //TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails + for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) { + if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD) + numEndClippedBases++; + else + break; + } + + return numEndClippedBases; + } } diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ChromosomeCounts.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ChromosomeCounts.java index c990ffd20..94a04ddfc 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ChromosomeCounts.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ChromosomeCounts.java @@ -90,6 +90,7 @@ import java.util.*; public class ChromosomeCounts extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation { private Set founderIds = new HashSet(); + private boolean didUniquifiedSampleNameCheck = false; public Map annotate(final RefMetaDataTracker tracker, final AnnotatorCompatible walker, @@ -99,6 +100,11 @@ public class ChromosomeCounts extends InfoFieldAnnotation implements StandardAnn final Map perReadAlleleLikelihoodMap ) { if ( ! vc.hasGenotypes() ) return null; + //if none of the "founders" are in the vc samples, assume we uniquified the samples upstream and they are all founders + if (!didUniquifiedSampleNameCheck) { + checkSampleNames(vc); + didUniquifiedSampleNameCheck = true; + } return VariantContextUtils.calculateChromosomeCounts(vc, new HashMap(), true,founderIds); } @@ -113,4 +119,21 @@ public class ChromosomeCounts extends InfoFieldAnnotation implements StandardAnn } public List getDescriptions() { return Arrays.asList(ChromosomeCountConstants.descriptions); } + + //this method is intended to reconcile uniquified sample names + // it comes into play when calling this annotation from GenotypeGVCFs with --uniquifySamples because founderIds + // is derived from the sampleDB, which comes from the input sample names, but vc will have uniquified (i.e. different) + // sample names. Without this check, the founderIds won't be found in the vc and the annotation won't be calculated. + protected void checkSampleNames(final VariantContext vc) { + Set vcSamples = new HashSet<>(); + vcSamples.addAll(vc.getSampleNames()); + if (!vcSamples.isEmpty()) { + if (founderIds!=null) { + vcSamples.retainAll(founderIds); + if (vcSamples.isEmpty()) + founderIds = vc.getSampleNames(); + } + } + } + } \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/DepthPerSampleHC.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/DepthPerSampleHC.java index 6c769eed7..adb4dbbca 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/DepthPerSampleHC.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/DepthPerSampleHC.java @@ -55,6 +55,7 @@ import org.apache.commons.lang.StringUtils; import org.apache.log4j.Logger; import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.GenotypeAnnotation; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation; import org.broadinstitute.gatk.utils.contexts.AlignmentContext; import org.broadinstitute.gatk.utils.contexts.ReferenceContext; import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele; diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/FisherStrand.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/FisherStrand.java index 61cba0875..0f83e6444 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/FisherStrand.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/FisherStrand.java @@ -51,9 +51,7 @@ package org.broadinstitute.gatk.tools.walkers.annotator; -import cern.jet.math.Arithmetic; import htsjdk.variant.variantcontext.GenotypesContext; -import org.apache.log4j.Logger; import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation; import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation; import org.broadinstitute.gatk.utils.contexts.AlignmentContext; @@ -90,16 +88,25 @@ import java.util.*; */ public class FisherStrand extends StrandBiasTest implements StandardAnnotation, ActiveRegionBasedAnnotation { private final static boolean ENABLE_DEBUGGING = false; - private final static Logger logger = Logger.getLogger(FisherStrand.class); private static final double MIN_PVALUE = 1E-320; private static final int MIN_QUAL_FOR_FILTERED_TEST = 17; private static final int MIN_COUNT = ARRAY_DIM; + @Override + public List getKeyNames() { + return Collections.singletonList(GATKVCFConstants.FISHER_STRAND_KEY); + } + + @Override + public List getDescriptions() { + return Collections.singletonList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0))); + } + @Override protected Map calculateAnnotationFromGTfield(final GenotypesContext genotypes){ final int[][] tableFromPerSampleAnnotations = getTableFromSamples( genotypes, MIN_COUNT ); - return ( tableFromPerSampleAnnotations != null )? pValueForBestTable(tableFromPerSampleAnnotations, null) : null; + return ( tableFromPerSampleAnnotations != null )? pValueAnnotationForBestTable(tableFromPerSampleAnnotations, null) : null; } @Override @@ -107,9 +114,11 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation, final VariantContext vc){ final int[][] tableNoFiltering = getSNPContingencyTable(stratifiedContexts, vc.getReference(), vc.getAlternateAlleles(), -1, MIN_COUNT); final int[][] tableFiltering = getSNPContingencyTable(stratifiedContexts, vc.getReference(), vc.getAlternateAlleles(), MIN_QUAL_FOR_FILTERED_TEST, MIN_COUNT); - printTable("unfiltered", tableNoFiltering); - printTable("filtered", tableFiltering); - return pValueForBestTable(tableFiltering, tableNoFiltering); + if (ENABLE_DEBUGGING) { + StrandBiasTableUtils.printTable("unfiltered", tableNoFiltering); + StrandBiasTableUtils.printTable("filtered", tableFiltering); + } + return pValueAnnotationForBestTable(tableFiltering, tableNoFiltering); } @Override @@ -119,11 +128,9 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation, final int[][] table = getContingencyTable(stratifiedPerReadAlleleLikelihoodMap, vc, MIN_COUNT); //logger.info("VC " + vc); //printTable(table, 0.0); - return pValueForBestTable(table, null); + return pValueAnnotationForBestTable(table, null); } - - /** * Create an annotation for the highest (i.e., least significant) p-value of table1 and table2 * @@ -131,14 +138,14 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation, * @param table2 a contingency table, may be null * @return annotation result for FS given tables */ - private Map pValueForBestTable(final int[][] table1, final int[][] table2) { + private Map pValueAnnotationForBestTable(final int[][] table1, final int[][] table2) { if ( table2 == null ) - return table1 == null ? null : annotationForOneTable(pValueForContingencyTable(table1)); + return table1 == null ? null : annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1)); else if (table1 == null) - return annotationForOneTable(pValueForContingencyTable(table2)); + return annotationForOneTable(StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2)); else { // take the one with the best (i.e., least significant pvalue) - double pvalue1 = pValueForContingencyTable(table1); - double pvalue2 = pValueForContingencyTable(table2); + double pvalue1 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table1); + double pvalue2 = StrandBiasTableUtils.FisherExactPValueForContingencyTable(table2); return annotationForOneTable(Math.max(pvalue1, pvalue2)); } } @@ -153,185 +160,4 @@ public class FisherStrand extends StrandBiasTest implements StandardAnnotation, final Object value = String.format("%.3f", QualityUtils.phredScaleErrorRate(Math.max(pValue, MIN_PVALUE))); // prevent INFINITYs return Collections.singletonMap(getKeyNames().get(0), value); } - - @Override - public List getKeyNames() { - return Collections.singletonList(GATKVCFConstants.FISHER_STRAND_KEY); - } - - @Override - public List getDescriptions() { - return Collections.singletonList(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0))); - } - - /** - * Helper function to turn the FisherStrand table into the SB annotation array - * @param table the table used by the FisherStrand annotation - * @return the array used by the per-sample Strand Bias annotation - */ - public static List getContingencyArray( final int[][] table ) { - if(table.length != ARRAY_DIM || table[0].length != ARRAY_DIM) { - logger.warn("Expecting a " + ARRAY_DIM + "x" + ARRAY_DIM + " strand bias table."); - return null; - } - - final List list = new ArrayList<>(ARRAY_SIZE); // TODO - if we ever want to do something clever with multi-allelic sites this will need to change - list.add(table[0][0]); - list.add(table[0][1]); - list.add(table[1][0]); - list.add(table[1][1]); - return list; - } - - public static Double pValueForContingencyTable(int[][] originalTable) { - final int[][] normalizedTable = normalizeContingencyTable(originalTable); - - int[][] table = copyContingencyTable(normalizedTable); - - double pCutoff = computePValue(table); - //printTable(table, pCutoff); - - double pValue = pCutoff; - while (rotateTable(table)) { - double pValuePiece = computePValue(table); - - //printTable(table, pValuePiece); - - if (pValuePiece <= pCutoff) { - pValue += pValuePiece; - } - } - - table = copyContingencyTable(normalizedTable); - while (unrotateTable(table)) { - double pValuePiece = computePValue(table); - - //printTable(table, pValuePiece); - - if (pValuePiece <= pCutoff) { - pValue += pValuePiece; - } - } - - //System.out.printf("P-cutoff: %f\n", pCutoff); - //System.out.printf("P-value: %f\n\n", pValue); - - // min is necessary as numerical precision can result in pValue being slightly greater than 1.0 - return Math.min(pValue, 1.0); - } - - // how large do we want the normalized table to be? - private static final double TARGET_TABLE_SIZE = 200.0; - - /** - * Normalize the table so that the entries are not too large. - * Note that this method does NOT necessarily make a copy of the table being passed in! - * - * @param table the original table - * @return a normalized version of the table or the original table if it is already normalized - */ - private static int[][] normalizeContingencyTable(final int[][] table) { - final int sum = table[0][0] + table[0][1] + table[1][0] + table[1][1]; - if ( sum <= TARGET_TABLE_SIZE * 2 ) - return table; - - final double normalizationFactor = (double)sum / TARGET_TABLE_SIZE; - - final int[][] normalized = new int[ARRAY_DIM][ARRAY_DIM]; - for ( int i = 0; i < ARRAY_DIM; i++ ) { - for ( int j = 0; j < ARRAY_DIM; j++ ) - normalized[i][j] = (int)(table[i][j] / normalizationFactor); - } - - return normalized; - } - - private static int [][] copyContingencyTable(int [][] t) { - int[][] c = new int[ARRAY_DIM][ARRAY_DIM]; - - for ( int i = 0; i < ARRAY_DIM; i++ ) - for ( int j = 0; j < ARRAY_DIM; j++ ) - c[i][j] = t[i][j]; - - return c; - } - - - private static void printTable(int[][] table, double pValue) { - logger.info(String.format("%d %d; %d %d : %f", table[0][0], table[0][1], table[1][0], table[1][1], pValue)); - } - - /** - * Printing information to logger.info for debugging purposes - * - * @param name the name of the table - * @param table the table itself - */ - private void printTable(final String name, final int[][] table) { - if ( ENABLE_DEBUGGING ) { - final String pValue = (String)annotationForOneTable(pValueForContingencyTable(table)).get(getKeyNames().get(0)); - logger.info(String.format("FS %s (REF+, REF-, ALT+, ALT-) = (%d, %d, %d, %d) = %s", - name, table[0][0], table[0][1], table[1][0], table[1][1], pValue)); - } - } - - private static boolean rotateTable(int[][] table) { - table[0][0]--; - table[1][0]++; - - table[0][1]++; - table[1][1]--; - - return (table[0][0] >= 0 && table[1][1] >= 0); - } - - private static boolean unrotateTable(int[][] table) { - table[0][0]++; - table[1][0]--; - - table[0][1]--; - table[1][1]++; - - return (table[0][1] >= 0 && table[1][0] >= 0); - } - - private static double computePValue(int[][] table) { - - int[] rowSums = { sumRow(table, 0), sumRow(table, 1) }; - int[] colSums = { sumColumn(table, 0), sumColumn(table, 1) }; - int N = rowSums[0] + rowSums[1]; - - // calculate in log space so we don't die with high numbers - double pCutoff = Arithmetic.logFactorial(rowSums[0]) - + Arithmetic.logFactorial(rowSums[1]) - + Arithmetic.logFactorial(colSums[0]) - + Arithmetic.logFactorial(colSums[1]) - - Arithmetic.logFactorial(table[0][0]) - - Arithmetic.logFactorial(table[0][1]) - - Arithmetic.logFactorial(table[1][0]) - - Arithmetic.logFactorial(table[1][1]) - - Arithmetic.logFactorial(N); - return Math.exp(pCutoff); - } - - private static int sumRow(int[][] table, int column) { - int sum = 0; - for (int r = 0; r < table.length; r++) { - sum += table[r][column]; - } - - return sum; - } - - private static int sumColumn(int[][] table, int row) { - int sum = 0; - for (int c = 0; c < table[row].length; c++) { - sum += table[row][c]; - } - - return sum; - } - - - } diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RMSAnnotation.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RMSAnnotation.java new file mode 100644 index 000000000..6e50d7192 --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RMSAnnotation.java @@ -0,0 +1,248 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.Genotype; +import htsjdk.variant.variantcontext.VariantContext; +import htsjdk.variant.vcf.VCFConstants; +import htsjdk.variant.vcf.VCFHeaderLine; +import htsjdk.variant.vcf.VCFInfoHeaderLine; +import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ReducibleAnnotation; +import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller; +import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs; +import org.broadinstitute.gatk.utils.QualityUtils; +import org.broadinstitute.gatk.utils.contexts.AlignmentContext; +import org.broadinstitute.gatk.utils.contexts.ReferenceContext; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.pileup.PileupElement; +import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines; + +import java.util.*; + +/** + * Abstract root for all RankSum-based annotations + */ +public abstract class RMSAnnotation extends InfoFieldAnnotation implements ReducibleAnnotation { + protected AnnotatorCompatible callingWalker; + + @Override + public void initialize(final AnnotatorCompatible walker, final GenomeAnalysisEngine toolkit, final Set headerLines) { + callingWalker = walker; + } + + @Override + public List getDescriptions() { + final List headerLines = new ArrayList<>(); + //ideally only HC in GVCF mode would get the raw header line, but that's a little more complicated + if (callingWalker instanceof HaplotypeCaller || callingWalker instanceof CombineGVCFs) + headerLines.add(GATKVCFHeaderLines.getInfoLine(getRawKeyName())); + headerLines.add(GATKVCFHeaderLines.getInfoLine(getKeyNames().get(0))); + return headerLines; + } + + @Override + public Map annotate(final RefMetaDataTracker tracker, + final AnnotatorCompatible walker, + final ReferenceContext ref, + final Map stratifiedContexts, + final VariantContext vc, + final Map perReadAlleleLikelihoodMap ) { + + if ( (stratifiedContexts == null || stratifiedContexts.isEmpty()) && perReadAlleleLikelihoodMap == null) + return null; + + final Map annotations = new HashMap<>(); + final ReducibleAnnotationData myData = new ReducibleAnnotationData<>(null); + calculateRawData(stratifiedContexts, perReadAlleleLikelihoodMap, myData); + final String annotationString = makeFinalizedAnnotationString(vc, myData.getAttributeMap(), stratifiedContexts, perReadAlleleLikelihoodMap); + annotations.put(getKeyNames().get(0), annotationString); + return annotations; + } + + public Map annotateRawData(final RefMetaDataTracker tracker, + final AnnotatorCompatible walker, + final ReferenceContext ref, + final Map stratifiedContexts, + final VariantContext vc, + final Map perReadAlleleLikelihoodMap ) { + + if ( perReadAlleleLikelihoodMap == null) + return new HashMap<>(); + + final Map annotations = new HashMap<>(); + ReducibleAnnotationData myData = new ReducibleAnnotationData<>(null); + calculateRawData(vc, perReadAlleleLikelihoodMap, myData); + String annotationString = makeRawAnnotationString(vc.getAlleles(), myData.getAttributeMap()); + annotations.put(getRawKeyName(), annotationString); + return annotations; + } + + @Override + public Map combineRawData(final List vcAlleles, final List annotationList) { + //VC already contains merged alleles from ReferenceConfidenceVariantContextMerger + ReducibleAnnotationData combinedData = new ReducibleAnnotationData(null); + + for (final ReducibleAnnotationData currentValue : annotationList) { + parseRawDataString(currentValue); + combineAttributeMap(currentValue, combinedData); + + } + final Map annotations = new HashMap<>(); + String annotationString = makeRawAnnotationString(vcAlleles, combinedData.getAttributeMap()); + annotations.put(getRawKeyName(), annotationString); + return annotations; + } + + @Override + public Map finalizeRawData(final VariantContext vc, final VariantContext originalVC) { + if (!vc.hasAttribute(getRawKeyName())) + return new HashMap<>(); + String rawMQdata = vc.getAttributeAsString(getRawKeyName(),null); + if (rawMQdata == null) + return new HashMap<>(); + + ReducibleAnnotationData myData = new ReducibleAnnotationData(rawMQdata); + parseRawDataString(myData); + + String annotationString = makeFinalizedAnnotationString(vc, myData.getAttributeMap()); + return Collections.singletonMap(getKeyNames().get(0), (Object)annotationString); + } + + protected void parseRawDataString(ReducibleAnnotationData myData) { + final String rawDataString = myData.getRawData(); + String[] rawMQdataAsStringVector; + rawMQdataAsStringVector = rawDataString.split(","); + double squareSum = Double.parseDouble(rawMQdataAsStringVector[0]); + myData.putAttribute(Allele.NO_CALL, squareSum); + } + + public void combineAttributeMap(ReducibleAnnotationData toAdd, ReducibleAnnotationData combined) { + if (combined.getAttribute(Allele.NO_CALL) != null) + combined.putAttribute(Allele.NO_CALL, (Double) combined.getAttribute(Allele.NO_CALL) + (Double) toAdd.getAttribute(Allele.NO_CALL)); + else + combined.putAttribute(Allele.NO_CALL, toAdd.getAttribute(Allele.NO_CALL)); + + } + + //Implementations of this method should return a string consisting of the sum of the squared values for the attribute being annotated (or a delimited list of those if allele-specific) + abstract protected String makeRawAnnotationString(List vcAlleles, Map sumOfSquares); + + //Implementations of this method should return a string with the finalized annotation value as will appear in the INFO field + abstract protected String makeFinalizedAnnotationString(VariantContext vc, Map sumOfSquares); + + //Implementations of this method should return a string with the finalized annotation value as will appear in the INFO field + abstract protected String makeFinalizedAnnotationString(VariantContext vc, Map sumOfSquares, Map stratifiedContexts, final Map perReadAlleleLikelihoodMap); + + protected void calculateRawData(final Map stratifiedContexts, + final Map perReadAlleleLikelihoodMap, + final ReducibleAnnotationData myData) { + if (perReadAlleleLikelihoodMap != null) { + calculateRawData((VariantContext) null, perReadAlleleLikelihoodMap, myData); + } + } + + /** + * + * @param vc + * @param perReadAlleleLikelihoodMap + * @param stratifiedContexts + * @return the number of reads at the vc position (-1 if all read data is null) + */ + public int getNumOfReads(final VariantContext vc, + final Map perReadAlleleLikelihoodMap, + final Map stratifiedContexts) { + //don't use the full depth because we don't calculate MQ for reference blocks + int numOfReads = 0; + if(vc.hasAttribute(VCFConstants.DEPTH_KEY)) { + numOfReads += Integer.parseInt(vc.getAttributeAsString(VCFConstants.DEPTH_KEY, "-1")); + if(vc.hasGenotypes()) { + for(Genotype gt : vc.getGenotypes()) { + if(gt.isHomRef() && gt.hasExtendedAttribute("MIN_DP")) //site-level DP contribution will come from MIN_DP for gVCF-called reference variants + numOfReads -= Integer.parseInt(gt.getExtendedAttribute("MIN_DP").toString()); + } + } + return numOfReads; + } + else if (stratifiedContexts != null && !stratifiedContexts.isEmpty()) { + for ( final Map.Entry sample : stratifiedContexts.entrySet() ) { + final AlignmentContext context = sample.getValue(); + for ( final PileupElement p : context.getBasePileup() ) { + int mq = p.getRead().getMappingQuality(); + if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) { + numOfReads++; + } + } + } + return numOfReads; + } + else if (perReadAlleleLikelihoodMap != null && !perReadAlleleLikelihoodMap.isEmpty()) + { + for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : perReadAlleleLikelihoodMap.values() ) { + for ( final GATKSAMRecord read : perReadLikelihoods.getStoredElements() ) { + int mq = read.getMappingQuality(); + if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) { + numOfReads++; + } + } + } + return numOfReads; + } + return -1; + } + +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RMSMappingQuality.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RMSMappingQuality.java index 534357d43..7d0332e7d 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RMSMappingQuality.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RMSMappingQuality.java @@ -51,22 +51,21 @@ package org.broadinstitute.gatk.tools.walkers.annotator; -import org.broadinstitute.gatk.utils.contexts.AlignmentContext; -import org.broadinstitute.gatk.utils.contexts.ReferenceContext; -import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; -import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation; -import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; -import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation; -import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation; -import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; -import org.broadinstitute.gatk.utils.MathUtils; -import org.broadinstitute.gatk.utils.QualityUtils; +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.VariantContext; import htsjdk.variant.vcf.VCFConstants; import htsjdk.variant.vcf.VCFInfoHeaderLine; import htsjdk.variant.vcf.VCFStandardHeaderLines; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*; +import org.broadinstitute.gatk.tools.walkers.haplotypecaller.HaplotypeCaller; +import org.broadinstitute.gatk.tools.walkers.variantutils.CombineGVCFs; +import org.broadinstitute.gatk.utils.QualityUtils; +import org.broadinstitute.gatk.utils.contexts.AlignmentContext; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; import org.broadinstitute.gatk.utils.pileup.PileupElement; import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; -import htsjdk.variant.variantcontext.VariantContext; +import org.broadinstitute.gatk.utils.variant.GATKVCFConstants; +import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines; import java.util.*; @@ -74,7 +73,9 @@ import java.util.*; /** * Root Mean Square of the mapping quality of reads across all samples. * - *

This annotation provides an estimation of the overall mapping quality of reads supporting a variant call, averaged over all samples in a cohort.

+ *

This annotation provides an estimation of the overall mapping quality of reads supporting a variant call. It produce both raw data (sum of square and num of total reads) and the calculated root mean square.

+ * + * The raw data is used to accurately calculate the root mean square when combining more than one sample. * *

Statistical notes

*

The root mean square is equivalent to the mean of the mapping qualities plus the standard deviation of the mapping qualities.

@@ -85,51 +86,89 @@ import java.util.*; * * */ -public class RMSMappingQuality extends InfoFieldAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation { +public class RMSMappingQuality extends RMSAnnotation implements StandardAnnotation, ActiveRegionBasedAnnotation, ReducibleAnnotation { - public Map annotate(final RefMetaDataTracker tracker, - final AnnotatorCompatible walker, - final ReferenceContext ref, - final Map stratifiedContexts, - final VariantContext vc, - final Map perReadAlleleLikelihoodMap ) { + @Override //this needs an override because MQ is a VCF standard so it's headerline is in a different place + public List getDescriptions() { + final List headerLines = new ArrayList<>(); + //only HC in GVCF mode should get the raw header line + if ((callingWalker instanceof HaplotypeCaller && ((HaplotypeCaller) callingWalker).emitReferenceConfidence()) || callingWalker instanceof CombineGVCFs) + headerLines.add(GATKVCFHeaderLines.getInfoLine(getRawKeyName())); + headerLines.add(VCFStandardHeaderLines.getInfoLine(getKeyNames().get(0))); + return headerLines; + } - final List qualities = new ArrayList<>(); + public List getKeyNames() { return Arrays.asList( + VCFConstants.RMS_MAPPING_QUALITY_KEY); + } + + public String getRawKeyName() { return GATKVCFConstants.RAW_RMS_MAPPING_QUALITY_KEY;} + + @Override + public void calculateRawData(final VariantContext vc, final Map pralm, final ReducibleAnnotationData rawAnnotations) { + Double squareSum = 0.0; + if ( pralm.size() == 0 ) + return; + + for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : pralm.values() ) { + for ( final GATKSAMRecord read : perReadLikelihoods.getStoredElements() ) { + int mq = read.getMappingQuality(); + if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) { + squareSum += mq * mq; + } + } + } + rawAnnotations.putAttribute(Allele.NO_CALL,squareSum); + } + + //this version applies to non-HaplotypeCaller annotators + @Override + protected void calculateRawData(final Map stratifiedContexts, + final Map perReadAlleleLikelihoodMap, + final ReducibleAnnotationData myData) { + + Double squareSum = 0.0; if ( stratifiedContexts != null ) { if ( stratifiedContexts.size() == 0 ) - return null; + return; for ( final Map.Entry sample : stratifiedContexts.entrySet() ) { final AlignmentContext context = sample.getValue(); - for ( final PileupElement p : context.getBasePileup() ) - fillMappingQualitiesFromPileup(p.getRead().getMappingQuality(), qualities); + for ( final PileupElement p : context.getBasePileup() ) { + int mq = p.getRead().getMappingQuality(); + if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) { + squareSum += mq * mq; + } + } } + myData.putAttribute(Allele.NO_CALL,squareSum); } else if (perReadAlleleLikelihoodMap != null) { - if ( perReadAlleleLikelihoodMap.size() == 0 ) - return null; - - for ( final PerReadAlleleLikelihoodMap perReadLikelihoods : perReadAlleleLikelihoodMap.values() ) { - for ( final GATKSAMRecord read : perReadLikelihoods.getStoredElements() ) - fillMappingQualitiesFromPileup(read.getMappingQuality(), qualities); - } - } - else - return null; - - final double rms = MathUtils.rms(qualities); - return Collections.singletonMap(getKeyNames().get(0), (Object)String.format("%.2f", rms)); - } - - private static void fillMappingQualitiesFromPileup(final int mq, final List qualities) { - if ( mq != QualityUtils.MAPPING_QUALITY_UNAVAILABLE ) { - qualities.add(mq); + calculateRawData((VariantContext) null, perReadAlleleLikelihoodMap, myData); } } - public List getKeyNames() { return Arrays.asList(VCFConstants.RMS_MAPPING_QUALITY_KEY); } - - public List getDescriptions() { - return Arrays.asList(VCFStandardHeaderLines.getInfoLine(getKeyNames().get(0))); + @Override + public String makeRawAnnotationString(final List vcAlleles, final Map perAlleleData) { + return String.format("%.2f", perAlleleData.get(Allele.NO_CALL)); } + + @Override + public String makeFinalizedAnnotationString(final VariantContext vc, final Map perAlleleData, final Map stratifiedContexts, final Map perReadAlleleLikelihoodMap) { + if ((stratifiedContexts != null && !stratifiedContexts.isEmpty()) || perReadAlleleLikelihoodMap != null) { + int numOfReads = getNumOfReads(vc, perReadAlleleLikelihoodMap, stratifiedContexts); + return String.format("%.2f", Math.sqrt((double) perAlleleData.get(Allele.NO_CALL) / numOfReads)); + } + else { + return makeFinalizedAnnotationString(vc, perAlleleData); + } + } + + @Override + public String makeFinalizedAnnotationString(final VariantContext vc, final Map perAlleleData) { + int numOfReads = getNumOfReads(vc, null, null); + return String.format("%.2f", Math.sqrt((double)perAlleleData.get(Allele.NO_CALL)/numOfReads)); + } + + } \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RankSumTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RankSumTest.java index 999ebb8cd..4ed64464b 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RankSumTest.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/RankSumTest.java @@ -52,12 +52,10 @@ package org.broadinstitute.gatk.tools.walkers.annotator; import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.*; import org.broadinstitute.gatk.utils.contexts.AlignmentContext; import org.broadinstitute.gatk.utils.contexts.ReferenceContext; import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; -import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation; -import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotatorCompatible; -import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.InfoFieldAnnotation; import org.broadinstitute.gatk.utils.genotyper.MostLikelyAllele; import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; import org.broadinstitute.gatk.utils.MannWhitneyU; @@ -76,11 +74,12 @@ import java.util.*; /** - * Abstract root for all RankSum based annotations + * Abstract root for all RankSum-based annotations */ +//TODO: will eventually implement ReducibleAnnotation in order to preserve accuracy for CombineGVCFs and GenotypeGVCFs -- see RMSAnnotation.java for an example of an abstract ReducibleAnnotation public abstract class RankSumTest extends InfoFieldAnnotation implements ActiveRegionBasedAnnotation { static final boolean DEBUG = false; - private boolean useDithering = true; + protected boolean useDithering = true; public Map annotate(final RefMetaDataTracker tracker, final AnnotatorCompatible walker, diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ReadPosRankSumTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ReadPosRankSumTest.java index c39f872f9..3b33d80c6 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ReadPosRankSumTest.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ReadPosRankSumTest.java @@ -51,12 +51,7 @@ package org.broadinstitute.gatk.tools.walkers.annotator; -import htsjdk.samtools.Cigar; -import htsjdk.samtools.CigarElement; -import htsjdk.samtools.CigarOperator; -import htsjdk.samtools.SAMRecord; import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.StandardAnnotation; -import org.broadinstitute.gatk.tools.walkers.indels.PairHMMIndelErrorModel; import htsjdk.variant.vcf.VCFInfoHeaderLine; import org.broadinstitute.gatk.utils.pileup.PileupElement; import org.broadinstitute.gatk.utils.sam.AlignmentUtils; @@ -109,7 +104,7 @@ public class ReadPosRankSumTest extends RankSumTest implements StandardAnnotatio @Override protected Double getElementForPileupElement(final PileupElement p) { final int offset = AlignmentUtils.calcAlignmentByteArrayOffset(p.getRead().getCigar(), p, 0, 0); - return (double)getFinalReadPosition(p.getRead(), offset); + return (double)AnnotationUtils.getFinalVariantReadPosition(p.getRead(), offset); } @Override @@ -122,69 +117,5 @@ public class ReadPosRankSumTest extends RankSumTest implements StandardAnnotatio return super.isUsableRead(read, refLoc) && read.getSoftStart() + read.getCigar().getReadLength() > refLoc; } - private int getFinalReadPosition(final GATKSAMRecord read, final int initialReadPosition) { - final int numAlignedBases = getNumAlignedBases(read); - int readPos = initialReadPosition; - if (initialReadPosition > numAlignedBases / 2) { - readPos = numAlignedBases - (initialReadPosition + 1); - } - return readPos; - - } - - private int getNumClippedBasesAtStart(final SAMRecord read) { - // compute total number of clipped bases (soft or hard clipped) - // check for hard clips (never consider these bases): - final Cigar c = read.getCigar(); - final CigarElement first = c.getCigarElement(0); - - int numStartClippedBases = 0; - if (first.getOperator() == CigarOperator.H) { - numStartClippedBases = first.getLength(); - } - final byte[] unclippedReadBases = read.getReadBases(); - final byte[] unclippedReadQuals = read.getBaseQualities(); - - // Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative, - // and may leave a string of Q2 bases still hanging off the reads. - for (int i = numStartClippedBases; i < unclippedReadBases.length; i++) { - if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD) - numStartClippedBases++; - else - break; - - } - - return numStartClippedBases; - } - - private int getNumAlignedBases(final GATKSAMRecord read) { - return read.getReadLength() - getNumClippedBasesAtStart(read) - getNumClippedBasesAtEnd(read); - } - - private int getNumClippedBasesAtEnd(final GATKSAMRecord read) { - // compute total number of clipped bases (soft or hard clipped) - // check for hard clips (never consider these bases): - final Cigar c = read.getCigar(); - CigarElement last = c.getCigarElement(c.numCigarElements() - 1); - - int numEndClippedBases = 0; - if (last.getOperator() == CigarOperator.H) { - numEndClippedBases = last.getLength(); - } - final byte[] unclippedReadBases = read.getReadBases(); - final byte[] unclippedReadQuals = read.getBaseQualities(); - - // Do a stricter base clipping than provided by CIGAR string, since this one may be too conservative, - // and may leave a string of Q2 bases still hanging off the reads. - for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) { - if (unclippedReadQuals[i] < PairHMMIndelErrorModel.BASE_QUAL_THRESHOLD) - numEndClippedBases++; - else - break; - } - - return numEndClippedBases; - } } diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasBySample.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasBySample.java index dc08f58c7..0f3496015 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasBySample.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasBySample.java @@ -52,7 +52,6 @@ package org.broadinstitute.gatk.tools.walkers.annotator; import org.apache.log4j.Logger; -import org.apache.commons.lang.StringUtils; import org.broadinstitute.gatk.utils.contexts.AlignmentContext; import org.broadinstitute.gatk.utils.contexts.ReferenceContext; import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; @@ -117,7 +116,7 @@ public class StrandBiasBySample extends GenotypeAnnotation { final int[][] table = FisherStrand.getContingencyTable(Collections.singletonMap(g.getSampleName(), alleleLikelihoodMap), vc, 0); - gb.attribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY, FisherStrand.getContingencyArray(table)); + gb.attribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY, StrandBiasTableUtils.getContingencyArray(table)); } @Override diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTableUtils.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTableUtils.java new file mode 100644 index 000000000..563bbe155 --- /dev/null +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTableUtils.java @@ -0,0 +1,250 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import cern.jet.math.Arithmetic; +import org.apache.log4j.Logger; +import org.broadinstitute.gatk.utils.QualityUtils; + +import java.util.ArrayList; +import java.util.List; + +/** + * A class containing many convenience methods used in the strand bias annotation calculations + */ +public class StrandBiasTableUtils { + + private final static Logger logger = Logger.getLogger(StrandBiasTableUtils.class); + + //For now this is only for 2x2 contingency tables + protected static final int ARRAY_DIM = 2; + protected static final int ARRAY_SIZE = ARRAY_DIM * ARRAY_DIM; + private static double MIN_PVALUE = 1E-320; + // how large do we want the normalized table to be? + private static final double TARGET_TABLE_SIZE = 200.0; + private final static double AUGMENTATION_CONSTANT = 1.0; + + /** + * Computes a two-sided p-Value for a Fisher's exact test on the contingency table, after normalizing counts so that the sum does not exceed {@value org.broadinstitute.gatk.tools.walkers.annotator.StrandBiasTableUtils#TARGET_TABLE_SIZE} + * @param originalTable + * @return + */ + public static Double FisherExactPValueForContingencyTable(int[][] originalTable) { + final int[][] normalizedTable = normalizeContingencyTable(originalTable); + + int[][] table = copyContingencyTable(normalizedTable); + + double pCutoff = computePValue(table); + + double pValue = pCutoff; + while (rotateTable(table)) { + double pValuePiece = computePValue(table); + + if (pValuePiece <= pCutoff) { + pValue += pValuePiece; + } + } + + table = copyContingencyTable(normalizedTable); + while (unrotateTable(table)) { + double pValuePiece = computePValue(table); + + if (pValuePiece <= pCutoff) { + pValue += pValuePiece; + } + } + + // min is necessary as numerical precision can result in pValue being slightly greater than 1.0 + return Math.min(pValue, 1.0); + } + + /** + * Helper function to turn the FisherStrand table into the SB annotation array + * @param table the table used by the FisherStrand annotation + * @return the array used by the per-sample Strand Bias annotation + */ + public static List getContingencyArray( final int[][] table ) { + if(table.length != ARRAY_DIM || table[0].length != ARRAY_DIM) { + logger.warn("Expecting a " + ARRAY_DIM + "x" + ARRAY_DIM + " strand bias table."); + return null; + } + + final List list = new ArrayList<>(ARRAY_SIZE); + list.add(table[0][0]); + list.add(table[0][1]); + list.add(table[1][0]); + list.add(table[1][1]); + return list; + } + + /** + * Printing information to logger.info for debugging purposes + * + * @param name the name of the table + * @param table the table itself + */ + public static void printTable(final String name, final int[][] table) { + final String pValue = String.format("%.3f", QualityUtils.phredScaleErrorRate(Math.max(FisherExactPValueForContingencyTable(table), MIN_PVALUE))); + logger.info(String.format("FS %s (REF+, REF-, ALT+, ALT-) = (%d, %d, %d, %d) = %s", + name, table[0][0], table[0][1], table[1][0], table[1][1], pValue)); + } + + /** + * Adds the small value AUGMENTATION_CONSTANT to all the entries of the table. + * + * @param table the table to augment + * @return the augmented table + */ + protected static double[][] augmentContingencyTable(final int[][] table) { + double[][] augmentedTable = new double[ARRAY_DIM][ARRAY_DIM]; + for ( int i = 0; i < ARRAY_DIM; i++ ) { + for ( int j = 0; j < ARRAY_DIM; j++ ) + augmentedTable[i][j] = table[i][j] + AUGMENTATION_CONSTANT; + } + + return augmentedTable; + } + + /** + * Normalize the table so that the entries are not too large. + * Note that this method does NOT necessarily make a copy of the table being passed in! + * + * @param table the original table + * @return a normalized version of the table or the original table if it is already normalized + */ + protected static int[][] normalizeContingencyTable(final int[][] table) { + final int sum = table[0][0] + table[0][1] + table[1][0] + table[1][1]; + if ( sum <= TARGET_TABLE_SIZE * 2 ) + return table; + + final double normalizationFactor = (double)sum / TARGET_TABLE_SIZE; + + final int[][] normalized = new int[ARRAY_DIM][ARRAY_DIM]; + for ( int i = 0; i < ARRAY_DIM; i++ ) { + for ( int j = 0; j < ARRAY_DIM; j++ ) + normalized[i][j] = (int)(table[i][j] / normalizationFactor); + } + + return normalized; + } + + public static int [][] copyContingencyTable(int [][] t) { + int[][] c = new int[ARRAY_DIM][ARRAY_DIM]; + + for ( int i = 0; i < ARRAY_DIM; i++ ) { + //System.arraycopy(t,0,c,0,ARRAY_DIM); + for (int j = 0; j < ARRAY_DIM; j++) { + c[i][j] = t[i][j]; + } + } + + return c; + } + + protected static boolean rotateTable(int[][] table) { + table[0][0]--; + table[1][0]++; + + table[0][1]++; + table[1][1]--; + + return (table[0][0] >= 0 && table[1][1] >= 0); + } + + protected static boolean unrotateTable(int[][] table) { + table[0][0]++; + table[1][0]--; + + table[0][1]--; + table[1][1]++; + + return (table[0][1] >= 0 && table[1][0] >= 0); + } + + protected static double computePValue(int[][] table) { + + int[] rowSums = { sumRow(table, 0), sumRow(table, 1) }; + int[] colSums = { sumColumn(table, 0), sumColumn(table, 1) }; + int N = rowSums[0] + rowSums[1]; + + // calculate in log space for better precision + double pCutoff = Arithmetic.logFactorial(rowSums[0]) + + Arithmetic.logFactorial(rowSums[1]) + + Arithmetic.logFactorial(colSums[0]) + + Arithmetic.logFactorial(colSums[1]) + - Arithmetic.logFactorial(table[0][0]) + - Arithmetic.logFactorial(table[0][1]) + - Arithmetic.logFactorial(table[1][0]) + - Arithmetic.logFactorial(table[1][1]) + - Arithmetic.logFactorial(N); + return Math.exp(pCutoff); + } + + private static int sumRow(int[][] table, int column) { + int sum = 0; + for (int r = 0; r < table.length; r++) { + sum += table[r][column]; + } + + return sum; + } + + private static int sumColumn(int[][] table, int row) { + int sum = 0; + for (int c = 0; c < table[row].length; c++) { + sum += table[row][c]; + } + + return sum; + } + +} diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTest.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTest.java index 11ee801b8..ac944b0ec 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTest.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTest.java @@ -58,6 +58,7 @@ import htsjdk.variant.vcf.VCFFormatHeaderLine; import htsjdk.variant.vcf.VCFHeaderLine; import org.apache.log4j.Logger; import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; +import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.ActiveRegionBasedAnnotation; import org.broadinstitute.gatk.utils.contexts.AlignmentContext; import org.broadinstitute.gatk.utils.contexts.ReferenceContext; import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; @@ -77,7 +78,8 @@ import java.util.*; /** * Class of tests to detect strand bias. */ -public abstract class StrandBiasTest extends InfoFieldAnnotation { +//TODO: will eventually implement ReducibleAnnotation -- see RMSAnnotation.java for an example of an abstract ReducibleAnnotation +public abstract class StrandBiasTest extends InfoFieldAnnotation implements ActiveRegionBasedAnnotation { private final static Logger logger = Logger.getLogger(StrandBiasTest.class); private static boolean stratifiedPerReadAlleleLikelihoodMapWarningLogged = false; private static boolean inputVariantContextWarningLogged = false; @@ -312,7 +314,6 @@ public abstract class StrandBiasTest extends InfoFieldAnnotation { private static void updateTable(final int[] table, final Allele allele, final GATKSAMRecord read, final Allele ref, final List allAlts) { final boolean matchesRef = allele.equals(ref, true); - final boolean matchesAlt = allele.equals(allAlts.get(0), true); final boolean matchesAnyAlt = allAlts.contains(allele); if ( matchesRef || matchesAnyAlt ) { diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandOddsRatio.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandOddsRatio.java index 35986bdc7..38822a055 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandOddsRatio.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandOddsRatio.java @@ -101,7 +101,6 @@ import java.util.*; * */ public class StrandOddsRatio extends StrandBiasTest implements StandardAnnotation, ActiveRegionBasedAnnotation { - private final static double AUGMENTATION_CONSTANT = 1.0; private static final int MIN_COUNT = 0; @Override @@ -132,17 +131,17 @@ public class StrandOddsRatio extends StrandBiasTest implements StandardAnnotatio } /** - * Computes the SOR value of a table after augmentation. Based on the symmetric odds ratio but modified to take on + * Computes the SOR value of a table after augmentation (adding pseudocounts). Based on the symmetric odds ratio but modified to take on * low values when the reference +/- read count ratio is skewed but the alt count ratio is not. Natural log is taken * to keep values within roughly the same range as other annotations. * - * Augmentation avoids quotient by zero. + * Adding pseudocounts prevent divide-by-zero. * * @param originalTable The table before augmentation * @return the SOR annotation value */ final protected double calculateSOR(final int[][] originalTable) { - final double[][] augmentedTable = augmentContingencyTable(originalTable); + final double[][] augmentedTable = StrandBiasTableUtils.augmentContingencyTable(originalTable); double ratio = 0; @@ -158,22 +157,6 @@ public class StrandOddsRatio extends StrandBiasTest implements StandardAnnotatio } - /** - * Adds the small value AUGMENTATION_CONSTANT to all the entries of the table. - * - * @param table the table to augment - * @return the augmented table - */ - private static double[][] augmentContingencyTable(final int[][] table) { - double[][] augmentedTable = new double[ARRAY_DIM][ARRAY_DIM]; - for ( int i = 0; i < ARRAY_DIM; i++ ) { - for ( int j = 0; j < ARRAY_DIM; j++ ) - augmentedTable[i][j] = table[i][j] + AUGMENTATION_CONSTANT; - } - - return augmentedTable; - } - /** * Returns an annotation result given a ratio * diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCaller.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCaller.java index e2a1ba850..d37f1946e 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCaller.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCaller.java @@ -311,7 +311,7 @@ public class HaplotypeCaller extends ActiveRegionWalker, In * to provide a pedigree file for a pedigree-based annotation) may cause the run to fail. */ @Argument(fullName="group", shortName="G", doc="One or more classes/groups of annotations to apply to variant calls", required=false) - protected String[] annotationClassesToUse = { "Standard" }; + protected String[] annotationGroupsToUse = { "Standard" }; @ArgumentCollection private HaplotypeCallerArgumentCollection HCAC = new HaplotypeCallerArgumentCollection(); @@ -623,10 +623,14 @@ public class HaplotypeCaller extends ActiveRegionWalker, In genotypingEngine = new HaplotypeCallerGenotypingEngine(HCAC, samplesList, genomeLocParser, FixedAFCalculatorProvider.createThreadSafeProvider(getToolkit(), HCAC,logger), !doNotRunPhysicalPhasing); // initialize the output VCF header - final VariantAnnotatorEngine annotationEngine = new VariantAnnotatorEngine(Arrays.asList(annotationClassesToUse), annotationsToUse, annotationsToExclude, this, getToolkit()); + final VariantAnnotatorEngine annotationEngine = new VariantAnnotatorEngine(Arrays.asList(annotationGroupsToUse), annotationsToUse, annotationsToExclude, this, getToolkit()); final Set headerInfo = new HashSet<>(); + //initialize the annotations (this is particularly important to turn off RankSumTest dithering in integration tests) + //do this before we write the header because SnpEff adds to header lines + annotationEngine.invokeAnnotationInitializationMethods(headerInfo); + headerInfo.addAll(genotypingEngine.getAppropriateVCFInfoHeaders()); // all annotation fields from VariantAnnotatorEngine headerInfo.addAll(annotationEngine.getVCFAnnotationDescriptions()); @@ -654,9 +658,6 @@ public class HaplotypeCaller extends ActiveRegionWalker, In vcfWriter.writeHeader(new VCFHeader(headerInfo, sampleSet)); - //now that we have all the VCF headers, initialize the annotations (this is particularly important to turn off RankSumTest dithering in integration tests) - annotationEngine.invokeAnnotationInitializationMethods(headerInfo); - try { // fasta reference reader to supplement the edges of the reference sequence referenceReader = new CachingIndexedFastaSequenceFile(getToolkit().getArguments().referenceFile); @@ -1211,7 +1212,7 @@ public class HaplotypeCaller extends ActiveRegionWalker, In * * @return true if HC must emit reference confidence. */ - private boolean emitReferenceConfidence() { + public boolean emitReferenceConfidence() { return HCAC.emitReferenceConfidence != ReferenceConfidenceMode.NONE; } diff --git a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java index 1ca9e6c9d..813a18946 100644 --- a/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java +++ b/protected/gatk-tools-protected/src/main/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGenotypingEngine.java @@ -274,7 +274,7 @@ public class HaplotypeCallerGenotypingEngine extends GenotypingEngine { +public class CombineGVCFs extends RodWalker implements AnnotatorCompatible { + + /** + * Which annotations to recompute for the combined output VCF file. + */ + @Advanced + @Argument(fullName="annotation", shortName="A", doc="One or more specific annotations to recompute. The single value 'none' removes the default annotations", required=false) + protected List annotationsToUse = new ArrayList<>(Arrays.asList(new String[]{"AS_RMSMappingQuality"})); + + /** + * Which groups of annotations to add to the output VCF file. The single value 'none' removes the default group. See + * the VariantAnnotator -list argument to view available groups. Note that this usage is not recommended because + * it obscures the specific requirements of individual annotations. Any requirements that are not met (e.g. failing + * to provide a pedigree file for a pedigree-based annotation) may cause the run to fail. + */ + @Argument(fullName="group", shortName="G", doc="One or more classes/groups of annotations to apply to variant calls", required=false) + protected String[] annotationGroupsToUse = { "Standard" }; + + + /** + * The rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. Note that dbSNP is not used in any way for the calculations themselves. + */ + @ArgumentCollection + protected DbsnpArgumentCollection dbsnp = new DbsnpArgumentCollection(); + public RodBinding getDbsnpRodBinding() { return dbsnp.dbsnp; } + public List> getCompRodBindings() { return Collections.emptyList(); } + public RodBinding getSnpEffRodBinding() { return null; } + public List> getResourceRodBindings() { return Collections.emptyList(); } + public boolean alwaysAppendDbsnpId() { return false; } + + // the annotation engine + private VariantAnnotatorEngine annotationEngine; protected final class PositionalState { final List VCs; @@ -177,6 +211,12 @@ public class CombineGVCFs extends RodWalkeremptyList(), this, getToolkit()); + + //now that we have all the VCF headers, initialize the annotations (this is particularly important to turn off RankSumTest dithering in integration tests) + annotationEngine.invokeAnnotationInitializationMethods(headerLines); + // optimization to prevent mods when we always just want to break bands if ( multipleAtWhichToBreakBands == 1 ) USE_BP_RESOLUTION = true; @@ -321,7 +361,7 @@ public class CombineGVCFs extends RodWalker annotationsToUse = new ArrayList<>(Arrays.asList(new String[]{"InbreedingCoeff", "FisherStrand", "QualByDepth", "ChromosomeCounts", "StrandOddsRatio"})); + /** + * Which groups of annotations to add to the output VCF file. The single value 'none' removes the default group. See + * the VariantAnnotator -list argument to view available groups. Note that this usage is not recommended because + * it obscures the specific requirements of individual annotations. Any requirements that are not met (e.g. failing + * to provide a pedigree file for a pedigree-based annotation) may cause the run to fail. + */ + @Argument(fullName="group", shortName="G", doc="One or more classes/groups of annotations to apply to variant calls", required=false) + protected String[] annotationGroupsToUse = {}; + + /** * The rsIDs from this file are used to populate the ID column of the output. Also, the DB INFO flag will be set when appropriate. Note that dbSNP is not used in any way for the calculations themselves. */ @@ -211,7 +221,7 @@ public class GenotypeGVCFs extends RodWalkeremptyList(), this, toolkit); + annotationEngine = new VariantAnnotatorEngine(Arrays.asList(annotationGroupsToUse), annotationsToUse, Collections.emptyList(), this, toolkit); // take care of the VCF headers final Set headerLines = VCFUtils.smartMergeHeaders(vcfRods.values(), true); @@ -230,6 +240,9 @@ public class GenotypeGVCFs extends RodWalker array ) { return MathUtils.median(array); // right now we take the median but other options could be explored } @@ -89,10 +95,10 @@ public class ReferenceConfidenceVariantContextMerger { * @return new VariantContext representing the merge of all VCs or null if it not relevant */ public static VariantContext merge(final List VCs, final GenomeLoc loc, final Byte refBase, final boolean removeNonRefSymbolicAllele, - final boolean samplesAreUniquified) { + final boolean samplesAreUniquified, final VariantAnnotatorEngine annotatorEngine) { // this can happen if e.g. you are using a dbSNP file that spans a region with no gVCFs - if ( VCs == null || VCs.isEmpty() ) - return null; + if ( VCs == null || VCs.isEmpty() ) { + return null; } // establish the baseline info (sometimes from the first VC) final VariantContext first = VCs.get(0); @@ -100,8 +106,8 @@ public class ReferenceConfidenceVariantContextMerger { // ref allele final Allele refAllele = determineReferenceAlleleGivenReferenceBase(VCs, loc, refBase); - if ( refAllele == null ) - return null; + if ( refAllele == null ) { + return null; } // FinalAlleleSet contains the alleles of the new resulting VC // Using linked set in order to guarantee a stable order @@ -112,7 +118,7 @@ public class ReferenceConfidenceVariantContextMerger { final Map attributes = new LinkedHashMap<>(); final Set rsIDs = new LinkedHashSet<>(1); // most of the time there's one id int depth = 0; - final Map> annotationMap = new LinkedHashMap<>(); + final Map> annotationMap = new LinkedHashMap<>(); final GenotypesContext genotypes = GenotypesContext.create(); // In this list we hold the mapping of each variant context alleles. @@ -156,28 +162,42 @@ public class ReferenceConfidenceVariantContextMerger { } } - if ( loc.getStart() != vc.getStart() ) + if ( loc.getStart() != vc.getStart() ) { continue; + } // special case ID (just preserve it) if ( vc.hasID() ) rsIDs.add(vc.getID()); - // add attributes - addReferenceConfidenceAttributes(vc.getAttributes(), annotationMap); + // add attributes to annotationMap, store all info field annotations as AlleleSpecificAnnotationData in case they can be parsed that way + addReferenceConfidenceAttributes(pair, annotationMap); } - // when combining annotations use the median value from all input VCs which had annotations provided - for ( final Map.Entry> p : annotationMap.entrySet() ) { + //combine the annotations that are reducible and remove them from annotationMap + Map combinedAnnotations = new HashMap<>(); + if (annotatorEngine != null) { + combinedAnnotations = annotatorEngine.combineAnnotations(allelesList, annotationMap); + } + attributes.putAll(combinedAnnotations); + + // remove stale AC and AF based attributes (including MLEAC and MLEAF lists) + //these will be recalculated after genotyping + removeStaleAttributesAfterMerge(annotationMap); + + //annotatorEngine.combineAnnotations removed the successfully combined annotations, so now parse those that are left + //here we're assuming that things that are left are scalars per sample + Map> parsedAnnotationMap = parseRemainingAnnotations(annotationMap); + + // when combining remaining annotations use the median value from all input VCs which had annotations provided + for ( final Map.Entry> p : parsedAnnotationMap.entrySet() ) { if ( ! p.getValue().isEmpty() ) { attributes.put(p.getKey(), combineAnnotationValues(p.getValue())); } } - if ( depth > 0 ) + if ( depth > 0 ) { attributes.put(VCFConstants.DEPTH_KEY, String.valueOf(depth)); - - // remove stale AC and AF based attributes - removeStaleAttributesAfterMerge(attributes); + } final String ID = rsIDs.isEmpty() ? VCFConstants.EMPTY_ID_FIELD : Utils.join(",", rsIDs); @@ -189,6 +209,39 @@ public class ReferenceConfidenceVariantContextMerger { return builder.make(); } + /** + * parse the annotations that were not identified as reducible annotations and combined by the annotation engine + * @param annotationMap the map of info field annotation names and the list of their data from the merged VCs + * @return info field data parsed as ints or doubles + */ + private static Map> parseRemainingAnnotations(final Map> annotationMap) { + final Map> parsedAnnotations = new HashMap<>(); + for (Map.Entry> currentData : annotationMap.entrySet()) { + List annotationValues = new ArrayList<>(); + for (ReducibleAnnotationData value : currentData.getValue()) { + try { + final String stringValue = value.getRawData(); + if (stringValue.contains(".")) { + annotationValues.add(Double.parseDouble(stringValue)); + } else if (Character.isDigit(stringValue.charAt(0))){ + annotationValues.add(Integer.parseInt(stringValue)); + //TODO: uncomment this to parse dbSNP membership annotation once allele-specific merging for that attribute is added + /*} else if (Character.isLetter(stringValue.charAt(0))) { + if (stringValue.equalsIgnoreCase("true")) + annotationValues.add(true); + else if (stringValue.equalsIgnoreCase("false")) + annotationValues.add(false);*/ + } + + } catch (final NumberFormatException e) { + logger.warn("WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but " + value.getRawData() + " doesn't parse and will not be annotated in the final VC."); + } + } + parsedAnnotations.put(currentData.getKey(),annotationValues); + } + return parsedAnnotations; + } + /** * @param list the original alleles list * @return a non-null list of non-symbolic alleles @@ -196,8 +249,9 @@ public class ReferenceConfidenceVariantContextMerger { private static List nonSymbolicAlleles(final List list) { final List result = new ArrayList<>(list.size()); for ( final Allele allele : list ) { - if ( !allele.isSymbolic() ) + if ( !allele.isSymbolic() ) { result.add(allele); + } } return result; } @@ -212,8 +266,9 @@ public class ReferenceConfidenceVariantContextMerger { */ private static Allele determineReferenceAlleleGivenReferenceBase(final List VCs, final GenomeLoc loc, final Byte refBase) { final Allele refAllele = GATKVariantContextUtils.determineReferenceAllele(VCs, loc); - if ( refAllele == null ) - return ( refBase == null ? null : Allele.create(refBase, true) ); + if ( refAllele == null ) { + return (refBase == null ? null : Allele.create(refBase, true)); + } return refAllele; } @@ -222,7 +277,7 @@ public class ReferenceConfidenceVariantContextMerger { * * @param attributes the attribute map */ - private static void removeStaleAttributesAfterMerge(final Map attributes) { + private static void removeStaleAttributesAfterMerge(final Map> attributes) { attributes.remove(VCFConstants.ALLELE_COUNT_KEY); attributes.remove(VCFConstants.ALLELE_FREQUENCY_KEY); attributes.remove(VCFConstants.ALLELE_NUMBER_KEY); @@ -234,31 +289,35 @@ public class ReferenceConfidenceVariantContextMerger { /** * Adds attributes to the global map from the new context in a sophisticated manner * - * @param myAttributes attributes to add from + * @param pair VariantContext/Allele list pair from which to get attributes * @param annotationMap map of annotations for combining later */ - private static void addReferenceConfidenceAttributes(final Map myAttributes, - final Map> annotationMap) { + private static void addReferenceConfidenceAttributes(Pair> pair, + final Map> annotationMap) { + final Map myAttributes = pair.getFirst().getAttributes(); //these are the info field attributes of the VC in pair + final List sampleAlleles = pair.getSecond(); + for ( final Map.Entry p : myAttributes.entrySet() ) { final String key = p.getKey(); - final Object value = p.getValue(); + //allele-specific attributes will always be in list form because they've already been parsed per-allele + //non-allele-specific attributes (DP, etc.) will be a list of length 1 + final List valueList = pair.getFirst().getAttributeAsList(key); - // add the annotation values to a list for combining later - List values = annotationMap.get(key); - if( values == null ) { - values = new ArrayList<>(); - annotationMap.put(key, values); + // add the existing annotation values to a list for combining later + List rawValuesList = annotationMap.get(key); + if( rawValuesList == null ) { + rawValuesList = new ArrayList<>(); + annotationMap.put(key, rawValuesList); } - try { - final String stringValue = value.toString(); - // Branch to avoid unintentional, implicit type conversions that occur with the ? operator. - if (stringValue.contains(".")) - values.add(Double.parseDouble(stringValue)); - else - values.add(Integer.parseInt(stringValue)); - } catch (final NumberFormatException e) { - // nothing to do + String combinedString = ""; + for(int i=0; i < valueList.size(); i++) { + if (i > 0) + combinedString += ","; + combinedString += valueList.get(i); } + ReducibleAnnotationData pairData = new AlleleSpecificAnnotationData(sampleAlleles, combinedString); + rawValuesList.add(pairData); + annotationMap.put(key, rawValuesList); } } @@ -364,30 +423,34 @@ public class ReferenceConfidenceVariantContextMerger { // the map is different depending on the ploidy, so in order to keep this method flexible (mixed ploidies) // we need to get a map done (lazily inside the loop) for each ploidy, up to the maximum possible. final int[][] genotypeIndexMapsByPloidy = new int[maximumPloidy + 1][]; - final int maximumAlleleCount = Math.max(remappedAlleles.size(), targetAlleles.size()); + final int maximumAlleleCount = Math.max(remappedAlleles.size(),targetAlleles.size()); + int[] perSampleIndexesOfRelevantAlleles; - for ( final Genotype g : vc.getGenotypes() ) { + for (final Genotype g : vc.getGenotypes()) { final String name; if (samplesAreUniquified) - name = g.getSampleName() + "." + vc.getSource(); + name = g.getSampleName() + "." + vc.getSource(); else - name = g.getSampleName(); + name = g.getSampleName(); final int ploidy = g.getPloidy(); final GenotypeBuilder genotypeBuilder = new GenotypeBuilder(g).alleles(GATKVariantContextUtils.noCallAlleles(g.getPloidy())); genotypeBuilder.name(name); final boolean hasPL = g.hasPL(); final boolean hasSAC = g.hasExtendedAttribute(GATKVCFConstants.STRAND_COUNT_BY_SAMPLE_KEY); - if ( hasPL || hasSAC ) { - final int[] perSampleIndexesOfRelevantAlleles = getIndexesOfRelevantAlleles(remappedAlleles, targetAlleles, vc.getStart(), g); + if (hasPL || hasSAC) { + perSampleIndexesOfRelevantAlleles = getIndexesOfRelevantAlleles(remappedAlleles, targetAlleles, vc.getStart(), g); if (g.hasPL()) { - // genotype index map by ploidy. - final int[] genotypeIndexMapByPloidy = GenotypeLikelihoodCalculators.getInstance(ploidy, maximumAlleleCount).genotypeIndexMap(perSampleIndexesOfRelevantAlleles); + // lazy initialization of the genotype index map by ploidy. + + final int[] genotypeIndexMapByPloidy = genotypeIndexMapsByPloidy[ploidy] == null + ? GenotypeLikelihoodCalculators.getInstance(ploidy, maximumAlleleCount).genotypeIndexMap(perSampleIndexesOfRelevantAlleles) + : genotypeIndexMapsByPloidy[ploidy]; final int[] PLs = generatePL(g, genotypeIndexMapByPloidy); final int[] AD = g.hasAD() ? generateAD(g.getAD(), perSampleIndexesOfRelevantAlleles) : null; genotypeBuilder.PL(PLs).AD(AD); } if (g.hasExtendedAttribute(GATKVCFConstants.STRAND_COUNT_BY_SAMPLE_KEY)) { - final List sacIndexesToUse = adaptToSACIndexes(perSampleIndexesOfRelevantAlleles); + final List sacIndexesToUse = adaptToSACIndexes(perSampleIndexesOfRelevantAlleles); final int[] SACs = GATKVariantContextUtils.makeNewSACs(g, sacIndexesToUse); genotypeBuilder.attribute(GATKVCFConstants.STRAND_COUNT_BY_SAMPLE_KEY, SACs); } @@ -397,25 +460,26 @@ public class ReferenceConfidenceVariantContextMerger { } /** - * Adapt the relevant alleles to the SAC indexes - * - * @param perSampleIndexesOfRelevantAlleles - * @return SAC indexes + * Adapt the relevant alleles to the SAC indexes + * + * @param perSampleIndexesOfRelevantAlleles + * @return SAC indexes */ - private static List adaptToSACIndexes(final int[] perSampleIndexesOfRelevantAlleles){ - if ( perSampleIndexesOfRelevantAlleles == null ) + private static List adaptToSACIndexes(final int[] perSampleIndexesOfRelevantAlleles) { + if (perSampleIndexesOfRelevantAlleles == null) throw new IllegalArgumentException("The per sample index of relevant alleles must not be null"); - final List sacIndexesToUse = new ArrayList(2*perSampleIndexesOfRelevantAlleles.length); + final List sacIndexesToUse = new ArrayList(2 * perSampleIndexesOfRelevantAlleles.length); - for ( int item : perSampleIndexesOfRelevantAlleles ) { - sacIndexesToUse.add(new Integer(2*item)); - sacIndexesToUse.add(new Integer(2*item+1)); + for (int item : perSampleIndexesOfRelevantAlleles) { + sacIndexesToUse.add(new Integer(2 * item)); + sacIndexesToUse.add(new Integer(2 * item + 1)); } - return sacIndexesToUse; + return sacIndexesToUse; } + /** * Composes a new likelihood array given the original genotype and the genotype index map. * @@ -439,8 +503,8 @@ public class ReferenceConfidenceVariantContextMerger { } /** - * Determines the allele mapping from remappedAlleles to the targetAlleles, substituting the generic "" as appropriate. - * If the myAlleles set does not contain "" as an allele, it throws an exception. + * Determines the allele mapping from myAlleles to the targetAlleles, substituting the generic "" as appropriate. + * If the remappedAlleles set does not contain "" as an allele, it throws an exception. * * @param remappedAlleles the list of alleles to evaluate * @param targetAlleles the target list of alleles @@ -450,7 +514,7 @@ public class ReferenceConfidenceVariantContextMerger { */ protected static int[] getIndexesOfRelevantAlleles(final List remappedAlleles, final List targetAlleles, final int position, final Genotype g) { - if ( remappedAlleles == null || remappedAlleles.isEmpty() ) throw new IllegalArgumentException("The list of input alleles must not be null or empty"); + if ( remappedAlleles == null || remappedAlleles.isEmpty()) throw new IllegalArgumentException("The list of input alleles must not be null or empty"); if ( targetAlleles == null || targetAlleles.isEmpty() ) throw new IllegalArgumentException("The list of target alleles must not be null or empty"); if ( !remappedAlleles.contains(GATKVCFConstants.NON_REF_SYMBOLIC_ALLELE) ) diff --git a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/AnnotationUtilsTest.java b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/AnnotationUtilsTest.java new file mode 100644 index 000000000..671b31989 --- /dev/null +++ b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/AnnotationUtilsTest.java @@ -0,0 +1,220 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.samtools.Cigar; +import htsjdk.samtools.CigarElement; +import htsjdk.samtools.CigarOperator; +import org.broadinstitute.gatk.utils.sam.ArtificialSAMUtils; +import org.broadinstitute.gatk.utils.sam.GATKSAMRecord; +import org.testng.annotations.BeforeClass; +import org.testng.Assert; +import org.testng.annotations.DataProvider; +import org.testng.annotations.Test; + +import java.util.ArrayList; +import java.util.LinkedList; +import java.util.List; + +public class AnnotationUtilsTest { + + //Basic aligned read + private GATKSAMRecord allMatch; + + //Read with insertion and deletion + private GATKSAMRecord twoIndels; + + //Read with soft clips at start + private GATKSAMRecord softClipStart; + + //Read with hard clips at start + private GATKSAMRecord hardClipStart; + + //Read with low quality tail + private GATKSAMRecord lowQualTail; + + //Read with low quality tail, partially soft clipped + private GATKSAMRecord lowQualClippedTail; + + //Read with low quality bases at start + private GATKSAMRecord lowQualStart; + + //Read with low quality bases, partially soft clipped at both ends + private GATKSAMRecord lowQualBothEnds; + + @BeforeClass + public void init() { + List cigarElements_allMatch = new LinkedList<>(); + cigarElements_allMatch.add(new CigarElement(151, CigarOperator.M)); + allMatch = ArtificialSAMUtils.createArtificialRead(new Cigar(cigarElements_allMatch)); + + List cigarElements_2indels = new LinkedList<>(); + cigarElements_2indels.add(new CigarElement(66, CigarOperator.M)); + cigarElements_2indels.add(new CigarElement(10, CigarOperator.I)); + cigarElements_2indels.add(new CigarElement(7, CigarOperator.M)); + cigarElements_2indels.add(new CigarElement(10, CigarOperator.D)); + cigarElements_2indels.add(new CigarElement(68, CigarOperator.M)); + twoIndels = ArtificialSAMUtils.createArtificialRead(new Cigar(cigarElements_2indels)); + + List cigarElements_softClipStart = new LinkedList<>(); + cigarElements_softClipStart.add(new CigarElement(17, CigarOperator.S)); + cigarElements_softClipStart.add(new CigarElement(134, CigarOperator.M)); + softClipStart = ArtificialSAMUtils.createArtificialRead(new Cigar(cigarElements_softClipStart)); + + List cigarElements_hardClipStart = new LinkedList<>(); + cigarElements_hardClipStart.add(new CigarElement(17, CigarOperator.H)); + cigarElements_hardClipStart.add(new CigarElement(134, CigarOperator.M)); + hardClipStart = ArtificialSAMUtils.createArtificialRead(new Cigar(cigarElements_hardClipStart)); + + + final byte [] bases_lowQualTail = {'A', 'C', 'T', 'G', 'A', 'A', 'A', 'A', 'A', 'A'}; + final byte [] quals_lowQualTail = {30, 15, 25, 30, 2, 2, 2, 2, 2, 2}; + lowQualTail = ArtificialSAMUtils.createArtificialRead(bases_lowQualTail, quals_lowQualTail, "10M"); + + final byte [] bases_lowQualClippedTail = {'A', 'C', 'T', 'G', 'A', 'A', 'A', 'A', 'A', 'A'}; + final byte [] quals_lowQualClippedTail = {30, 15, 25, 30, 2, 2, 2, 2, 2, 2}; + lowQualClippedTail = ArtificialSAMUtils.createArtificialRead(bases_lowQualClippedTail, quals_lowQualClippedTail, "8M2S"); + + final byte [] bases_lowQualStart = {'A', 'A', 'A', 'A', 'A', 'A', 'A', 'C', 'T', 'G'}; + final byte [] quals_lowQualStart = {2, 2, 2, 2, 2, 2, 30, 15, 25, 30}; + lowQualStart = ArtificialSAMUtils.createArtificialRead(bases_lowQualStart, quals_lowQualStart, "10M"); + + final byte [] bases_lowQualBothEnds = {'A', 'A', 'A', 'A', 'A', 'A', 'G', 'C', 'T', 'G', 'A', 'A', 'A', 'A', 'A', 'A'}; + final byte [] quals_lowQualBothEnds = { 2, 2, 2, 2, 2, 2, 30, 15, 25, 30, 2, 2, 2, 2, 2, 2}; + lowQualBothEnds = ArtificialSAMUtils.createArtificialRead(bases_lowQualBothEnds, quals_lowQualBothEnds, "2S12M2S"); + + } + + @DataProvider(name = "makeGetFinalVariantReadPositionTestReads") + public Object[][] makeFinalPosTestReads() { + final List tests = new ArrayList<>(); + tests.add(new Object[] {allMatch, 10, 10}); + tests.add(new Object[] {allMatch, 140, 10}); + tests.add(new Object[] {twoIndels, 10, 10}); + tests.add(new Object[] {twoIndels, 140, 10}); + tests.add(new Object[] {hardClipStart, 20, 20}); + tests.add(new Object[] {hardClipStart, 110, 6}); //this is what the code produces as-is + tests.add(new Object[] {softClipStart, 10, 10}); + tests.add(new Object[] {softClipStart, 140, 10}); + tests.add(new Object[] {lowQualTail, 3, 0}); + tests.add(new Object[] {lowQualTail, 2, 2}); + tests.add(new Object[] {lowQualClippedTail, 3, 0}); + tests.add(new Object[] {lowQualClippedTail, 2, 2}); + tests.add(new Object[] {lowQualStart, 7, -4}); //this is what the code produces as-is, but should be 1 + tests.add(new Object[] {lowQualStart, 8, -5}); //this is what the code produces as-is, but should be 1 + tests.add(new Object[] {lowQualBothEnds, 7, -4}); //this is what the code produces as-is, but should be 1 + tests.add(new Object[] {lowQualBothEnds, 8, -5}); //this is what the code produces as-is, but should be 1 + return tests.toArray(new Object[][]{}); + } + + @Test(dataProvider = "makeGetFinalVariantReadPositionTestReads") + public void testGetFinalVariantReadPosition(GATKSAMRecord read, int variantPosition, int expected) throws Exception { + Assert.assertEquals(AnnotationUtils.getFinalVariantReadPosition(read, variantPosition), expected); + } + + @DataProvider(name = "getNumClippedBasesAtStartTestReads") + public Object[][] numClipStart() { + final List tests = new ArrayList<>(); + tests.add(new Object[] {allMatch, 0}); + tests.add(new Object[] {twoIndels, 0}); + tests.add(new Object[] {softClipStart, 0}); + tests.add(new Object[] {hardClipStart, 17}); + tests.add(new Object[] {lowQualTail, 0}); + tests.add(new Object[] {lowQualClippedTail, 0}); + tests.add(new Object[] {lowQualStart, 6}); + tests.add(new Object[] {lowQualBothEnds, 6}); + return tests.toArray(new Object[][]{}); + } + + @Test(dataProvider = "getNumClippedBasesAtStartTestReads") + public void testGetNumClippedBasesAtStart(GATKSAMRecord read, int expected) throws Exception { + Assert.assertEquals(AnnotationUtils.getNumClippedBasesAtStart(read),expected); + } + + @DataProvider(name = "getNumAlignedBasesTestReads") + public Object[][] numAligned() { + final List tests = new ArrayList<>(); + tests.add(new Object[] {allMatch, 151}); + tests.add(new Object[] {twoIndels, 151}); + tests.add(new Object[] {softClipStart, 151}); + tests.add(new Object[] {hardClipStart, 117}); //This is what the code produces, but it's wrong + tests.add(new Object[] {lowQualTail, 4}); + tests.add(new Object[] {lowQualClippedTail, 4}); + tests.add(new Object[] {lowQualStart, 4}); + tests.add(new Object[] {lowQualBothEnds, 4}); + return tests.toArray(new Object[][]{}); + } + + @Test(dataProvider = "getNumAlignedBasesTestReads") + public void testGetNumAlignedBases(GATKSAMRecord read, int expected) throws Exception { + Assert.assertEquals(AnnotationUtils.getNumAlignedBases(read),expected); + } + + @DataProvider(name = "getNumClippedBasesAtEndTestReads") + public Object[][] numClipEnd() { + final List tests = new ArrayList<>(); + tests.add(new Object[] {allMatch, 0}); + tests.add(new Object[] {twoIndels, 0}); + tests.add(new Object[] {softClipStart, 0}); + tests.add(new Object[] {hardClipStart, 0}); + tests.add(new Object[] {lowQualTail, 6}); + tests.add(new Object[] {lowQualClippedTail, 6}); + tests.add(new Object[] {lowQualStart, 0}); + tests.add(new Object[] {lowQualBothEnds, 6}); + return tests.toArray(new Object[][]{}); + } + + @Test(dataProvider = "getNumClippedBasesAtEndTestReads") + public void testGetNumClippedBasesAtEnd(GATKSAMRecord read, int expected) throws Exception { + Assert.assertEquals(AnnotationUtils.getNumClippedBasesAtEnd(read), expected); + } +} \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTableUtilsTest.java b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTableUtilsTest.java new file mode 100644 index 000000000..440f14aac --- /dev/null +++ b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/StrandBiasTableUtilsTest.java @@ -0,0 +1,127 @@ +/* +* By downloading the PROGRAM you agree to the following terms of use: +* +* BROAD INSTITUTE +* SOFTWARE LICENSE AGREEMENT +* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY +* +* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”). +* +* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and +* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions. +* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows: +* +* 1. DEFINITIONS +* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE. +* +* 2. LICENSE +* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation. +* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement. +* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement. +* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM. +* +* 3. PHONE-HOME FEATURE +* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation. +* +* 4. OWNERSHIP OF INTELLECTUAL PROPERTY +* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication. +* Copyright 2012-2015 Broad Institute, Inc. +* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc. +* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes. +* +* 5. INDEMNIFICATION +* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement. +* +* 6. NO REPRESENTATIONS OR WARRANTIES +* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME. +* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING. +* +* 7. ASSIGNMENT +* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void. +* +* 8. MISCELLANEOUS +* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries. +* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes. +* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4. +* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt. +* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter. +* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement. +* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import org.testng.Assert; +import org.testng.annotations.DataProvider; +import org.testng.annotations.Test; + +import java.util.ArrayList; +import java.util.List; + +public class StrandBiasTableUtilsTest { + private static final double DELTA_PRECISION = 10e-7; + + @DataProvider(name = "UsingTable") + public Object[][] makeUsingTable() { + + /* NOTE: the expected P values were computed in R as follows + fisher = function(v) { + return(fisher.test(matrix(v, nrow=2, ncol=2))$p.value) + } + */ + //> fisher(c(2068, 6796, 1133, 0)) + + final List tests = new ArrayList<>(); + tests.add(new Object[]{0, 0, 0, 0, 1.0}); + tests.add(new Object[]{100000, 100000, 100000, 100000, 1.0}); + tests.add(new Object[]{1, 2, 3, 4, 1.0}); + tests.add(new Object[]{0, 0, 100000, 100000, 1.0}); + tests.add(new Object[]{100000, 100000, 100000, 0, 0.0}); //below R's or Java's precision + + tests.add(new Object[]{200000, 100000, 1, 2, 1.0}); //differs from GATK4 implementation + tests.add(new Object[]{100, 100, 100, 0, 3.730187e-23}); + tests.add(new Object[]{13736, 9047, 41, 1433, 1.232E-4}); //differs from GATK4 implementation + tests.add(new Object[]{66, 14, 64, 4, 0.0688244}); + tests.add(new Object[]{351169, 306836, 153739, 2379, 0.0}); //below R's or Java's precision + tests.add(new Object[]{116449, 131216, 289, 16957, 0.0026801}); //differs from GATK4 implementation + tests.add(new Object[]{137, 159, 9, 23, 0.10752410}); //differs from GATK4 implementation + tests.add(new Object[]{129, 90, 21, 20, 0.6450772}); //differs from GATK4 implementation + tests.add(new Object[]{14054, 9160, 16, 7827, 0.0}); //below R's or Java's precision + tests.add(new Object[]{32803, 9184, 32117, 3283, 0.0289540}); //differs from GATK4 implementation + tests.add(new Object[]{2068, 6796, 1133, 0, 0.0}); //below R's or Java's precision + + return tests.toArray(new Object[][]{}); + } + + @Test(dataProvider = "UsingTable") + public void testPValueForContingencyTable(final int refpos, final int refneg, final int altpos, final int altneg, double expectedPvalue) throws Exception { + final int[][] contingencyTable = new int[2][2]; + contingencyTable[0][0] = refpos; + contingencyTable[0][1] = refneg; + contingencyTable[1][0] = altpos; + contingencyTable[1][1] = altneg; + final double pvalue = StrandBiasTableUtils.FisherExactPValueForContingencyTable(contingencyTable); + Assert.assertEquals(pvalue, expectedPvalue, DELTA_PRECISION, "Pvalues"); + } + + @Test + public void testGetContingencyArray() throws Exception { + final int[][] t = new int[2][2]; + t[0][0] = 1; t[0][1] = 2; t[1][0] = 3; t[1][1] = 4; + final List tList = StrandBiasTableUtils.getContingencyArray(t); + final List truthList = new ArrayList(); + truthList.add(1); truthList.add(2); truthList.add(3); truthList.add(4); + Assert.assertEquals(tList, truthList); + } + + @Test + public void testCopyContingencyTable() throws Exception { + final int[][] t = new int[2][2]; + t[0][0] = 1; t[0][1] = 2; t[1][0] = 3; t[1][1] = 4; + final int[][] t2 = StrandBiasTableUtils.copyContingencyTable(t); + Assert.assertTrue(t[0][0] == t2[0][0]); + Assert.assertTrue(t[1][0] == t2[1][0]); + Assert.assertTrue(t[0][1] == t2[0][1]); + Assert.assertTrue(t[1][1] == t2[1][1]); + } +} \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGVCFIntegrationTest.java b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGVCFIntegrationTest.java index 1cb257c9e..d13905cfb 100644 --- a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGVCFIntegrationTest.java +++ b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/haplotypecaller/HaplotypeCallerGVCFIntegrationTest.java @@ -76,11 +76,11 @@ public class HaplotypeCallerGVCFIntegrationTest extends WalkerTest { // this functionality can be adapted to provide input data for whatever you might want in your data tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.NONE, PCRFreeIntervals, "7f09c261950bf86e435edfa69ed2ec71"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "1ecdfe6eeaa1bd8fe1ba690945af9fbf"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "274b1e1016ca62f5ff81e398d17b63ab"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "70605a2070d6d614d70b8547eedad2cd"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "c439e51453a031ca60fe96b75423e589"}); tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.NONE, WExIntervals, "03e5815cc351f1ec2feed89d2aed8268"}); - tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "5854a0067b888947fb79ceed8a303fa7"}); - tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.GVCF, WExIntervals, "46e7eb6a031cdcbe512aedac1e35f81e"}); + tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "5d5cca382bdf6987b2aef87918ed374c"}); + tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.GVCF, WExIntervals, "532888aff96c356397a21ef790636818"}); return tests.toArray(new Object[][]{}); } @@ -95,12 +95,12 @@ public class HaplotypeCallerGVCFIntegrationTest extends WalkerTest { // this functionality can be adapted to provide input data for whatever you might want in your data tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.NONE, PCRFreeIntervals, "cfe629c5a3be3b6524258ad1f9145488"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "b7952299b34db309026c8cbaaec2aa1e"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "fa7363b2adbd459bc7acb75b3beeb13c"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "12fbfffa4bb2b8d520f8021a40b37d19"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "a5ea6d4052bbf9e8bba9011bc6f0d203"}); tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.NONE, WExIntervals, "31da2254620f4a9c34ccf7c311cc133f"}); - tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "78b1d207f37ce17183461dd9b6814ec8"}); + tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "c95f52fe395392331dc3102902d54408"}); - final String NA12878bandedResolutionMD5 = "2d54e7ed33001848fda3df66ee6e1999"; + final String NA12878bandedResolutionMD5 = "fbe6099d138a069a65e4713bcae1e873"; tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.GVCF, WExIntervals, NA12878bandedResolutionMD5}); tests.add(new Object[]{NA12878_WEx + " -I " + privateTestDir + "NA20313.highCoverageRegion.bam -sn NA12878", ReferenceConfidenceMode.GVCF, WExIntervals, NA12878bandedResolutionMD5}); @@ -118,11 +118,11 @@ public class HaplotypeCallerGVCFIntegrationTest extends WalkerTest { // this functionality can be adapted to provide input data for whatever you might want in your data tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.NONE, PCRFreeIntervals, "4c33ec3f6b9dce74e545405588c76ed1"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "c65d97f6eac66769fe1dbcf7b671459d"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "7bcd06d7ada7495b641df6545b76bf24"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "d38becbac8a2e0c4b77e9cc11c1ba891"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "57c0f8baaa34f4e0300eaec7b938970b"}); tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.NONE, WExIntervals, "4f8e3e249509a24da21d5dd8e3594f92"}); - tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "61b225aaf53c85e755ff3790f1a86818"}); - tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.GVCF, WExIntervals, "d150defe805f74b65011c7ec02a3e952"}); + tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "aeb64e3cb8d141aa3f8ac7506db15e19"}); + tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.GVCF, WExIntervals, "03af03feb96e402048801ecfa75344ad"}); return tests.toArray(new Object[][]{}); } @@ -136,11 +136,11 @@ public class HaplotypeCallerGVCFIntegrationTest extends WalkerTest { // this functionality can be adapted to provide input data for whatever you might want in your data tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.NONE, PCRFreeIntervals, "0bec872d0dc72cc68538cfbd236e9933"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "10cb558e49de284f1e559c7ed114f9ab"}); - tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "3512803a103c7c2da86a18609bcd7061"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.BP_RESOLUTION, PCRFreeIntervals, "694405c8f6497b5ce99145538c325cba"}); + tests.add(new Object[]{NA12878_PCRFREE, ReferenceConfidenceMode.GVCF, PCRFreeIntervals, "47ddf81ed143451aeb27969d397ce4ac"}); tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.NONE, WExIntervals, "c724ecdaea7eab5a6239ff4daaa6e034"}); - tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "170bdab72eef971598a0f2289a7ee4e3"}); - tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.GVCF, WExIntervals, "825da547dc8100a5a47cd41dd31d2aef"}); + tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.BP_RESOLUTION, WExIntervals, "f80cc551fe7763bacc46ea6d56582bba"}); + tests.add(new Object[]{NA12878_WEx, ReferenceConfidenceMode.GVCF, WExIntervals, "43ef1e04d015c2bd9d513a67a6fc7a02"}); return tests.toArray(new Object[][]{}); } @@ -266,7 +266,7 @@ public class HaplotypeCallerGVCFIntegrationTest extends WalkerTest { public void testWrongGVCFNonVariantRecordOrderBugFix() { final String commandLine = String.format("-T HaplotypeCaller --pcr_indel_model NONE -pairHMMSub %s %s -R %s -I %s -L %s -ERC GVCF --no_cmdline_in_header -variant_index_type %s -variant_index_parameter %d", HMM_SUB_IMPLEMENTATION, ALWAYS_LOAD_VECTOR_HMM, b37KGReference, WRONG_GVCF_RECORD_ORDER_BUGFIX_BAM, WRONG_GVCF_RECORD_ORDER_BUGFIX_INTERVALS, GATKVCFUtils.DEFAULT_GVCF_INDEX_TYPE, GATKVCFUtils.DEFAULT_GVCF_INDEX_PARAMETER); - final WalkerTestSpec spec = new WalkerTestSpec(commandLine + " -o %s", Arrays.asList("5dbb42e0140fc21315d58495030f56a2")); + final WalkerTestSpec spec = new WalkerTestSpec(commandLine + " -o %s", Arrays.asList("9b3b232ceeb109f2624826ea20825a82")); spec.disableShadowBCF(); executeTest("testMissingGVCFIndexingStrategyException", spec); } @@ -283,7 +283,7 @@ public class HaplotypeCallerGVCFIntegrationTest extends WalkerTest { public void testNoCallGVCFMissingPLsBugFix() { final String commandLine = String.format("-T HaplotypeCaller --pcr_indel_model NONE -pairHMMSub %s %s -R %s -I %s -L %s -ERC GVCF --no_cmdline_in_header -variant_index_type %s -variant_index_parameter %d", HMM_SUB_IMPLEMENTATION, ALWAYS_LOAD_VECTOR_HMM, b37KGReference, NOCALL_GVCF_BUGFIX_BAM, NOCALL_GVCF_BUGFIX_INTERVALS, GATKVCFUtils.DEFAULT_GVCF_INDEX_TYPE, GATKVCFUtils.DEFAULT_GVCF_INDEX_PARAMETER); - final WalkerTestSpec spec = new WalkerTestSpec(commandLine + " -o %s", Arrays.asList("0eba41fa151a9079852fe9c77becde50")); + final WalkerTestSpec spec = new WalkerTestSpec(commandLine + " -o %s", Arrays.asList("c5a017f1cbd60219506be76f30fc4468")); spec.disableShadowBCF(); executeTest("testNoCallGVCFMissingPLsBugFix", spec); } @@ -312,4 +312,13 @@ public class HaplotypeCallerGVCFIntegrationTest extends WalkerTest { executeTest(" testGeneralPloidyArrayIndexBug2Fix", spec); } + @Test + public void testAlleleSpecificAnnotations() { + final String commandLine = String.format("-T HaplotypeCaller --pcr_indel_model NONE -pairHMMSub %s %s -R %s -I %s -L %s -ERC GVCF --no_cmdline_in_header -variant_index_type %s -variant_index_parameter %d -G Standard -G AS_Standard --disableDithering", + HMM_SUB_IMPLEMENTATION, ALWAYS_LOAD_VECTOR_HMM, b37KGReference, privateTestDir + "NA12878.HiSeq.b37.chr20.10_11mb.bam", "20:10433000-10437000", GATKVCFUtils.DEFAULT_GVCF_INDEX_TYPE, GATKVCFUtils.DEFAULT_GVCF_INDEX_PARAMETER); + final WalkerTestSpec spec = new WalkerTestSpec(commandLine + " -o %s", Arrays.asList("fc10e56ffa5e230c1a1c8bdf9673bdaa")); + spec.disableShadowBCF(); + executeTest(" testAlleleSpecificAnnotations", spec); + } + } diff --git a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/CombineGVCFsIntegrationTest.java b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/CombineGVCFsIntegrationTest.java index 84ab76698..080c8da02 100644 --- a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/CombineGVCFsIntegrationTest.java +++ b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/CombineGVCFsIntegrationTest.java @@ -272,4 +272,13 @@ public class CombineGVCFsIntegrationTest extends WalkerTest { spec.disableShadowBCF(); executeTest("testBasepairResolutionInput", spec); } + + @Test + public void testAlleleSpecificAnnotations() throws Exception { + final String cmd = "-T CombineGVCFs -R " + b37KGReference + " -o %s --no_cmdline_in_header -G Standard -G AS_Standard -V " + + privateTestDir + "NA12878.AS.chr20snippet.g.vcf -V " + privateTestDir + "NA12891.AS.chr20snippet.g.vcf"; + final WalkerTestSpec spec = new WalkerTestSpec(cmd, 1, Arrays.asList("cbb2571eeb95e661acee8f9e1d1cbfbd")); + spec.disableShadowBCF(); + executeTest("testAlleleSpecificAnnotations", spec); + } } diff --git a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/GenotypeGVCFsIntegrationTest.java b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/GenotypeGVCFsIntegrationTest.java index f24b18ae2..1909060a0 100644 --- a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/GenotypeGVCFsIntegrationTest.java +++ b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/GenotypeGVCFsIntegrationTest.java @@ -259,7 +259,7 @@ public class GenotypeGVCFsIntegrationTest extends WalkerTest { final WalkerTestSpec spec = new WalkerTestSpec( baseTestString(" -V " + gVCF.getAbsolutePath(), b37KGReference), 1, - Arrays.asList("b81a94763ea9714ed6289acd3d39c4cf")); + Arrays.asList("81ae1bc09f9cd34b0b37ae3cb4b09bc8")); spec.disableShadowBCF(); //TODO: Remove when BaseTest.assertAttributesEquals() works with SAC executeTest("testStrandAlleleCountsBySample", spec); } @@ -560,4 +560,23 @@ public class GenotypeGVCFsIntegrationTest extends WalkerTest { Arrays.asList("d617884b08ee85816f1ba1acf11f1738")); executeTest("testSetZeroRGQsToNoCall", spec); } + + @Test + public void testAlleleSpecificAnnotations() { + final String cmd = "-T GenotypeGVCFs -R " + b37KGReference + " -o %s --no_cmdline_in_header -G Standard -G AS_Standard --disableDithering -V " + + privateTestDir + "NA12878.AS.chr20snippet.g.vcf -V " + privateTestDir + "NA12891.AS.chr20snippet.g.vcf"; + final WalkerTestSpec spec = new WalkerTestSpec(cmd, 1, Arrays.asList("f62ae282ec8e32f6104ef84237b8a5a4")); + spec.disableShadowBCF(); + executeTest("testAlleleSpecificAnnotations", spec); + } + + @Test + //make sure none of the assumptions about things being merged as lists break the single-sample case + public void testAlleleSpecificAnnotations_oneSample() { + final String cmd = "-T GenotypeGVCFs -R " + b37KGReference + " -o %s --no_cmdline_in_header -G Standard -G AS_Standard --disableDithering -V " + + privateTestDir + "NA12878.AS.chr20snippet.g.vcf"; + final WalkerTestSpec spec = new WalkerTestSpec(cmd, 1, Arrays.asList("1eefb0ed407b9071f09f9189c9ad45cf")); + spec.disableShadowBCF(); + executeTest("testAlleleSpecificAnnotations", spec); + } } \ No newline at end of file diff --git a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/VariantContextMergerUnitTest.java b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/VariantContextMergerUnitTest.java index d34859b07..a2b1b3f8b 100644 --- a/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/VariantContextMergerUnitTest.java +++ b/protected/gatk-tools-protected/src/test/java/org/broadinstitute/gatk/tools/walkers/variantutils/VariantContextMergerUnitTest.java @@ -52,6 +52,8 @@ package org.broadinstitute.gatk.tools.walkers.variantutils; import htsjdk.variant.variantcontext.*; +import org.broadinstitute.gatk.engine.GenomeAnalysisEngine; +import org.broadinstitute.gatk.tools.walkers.annotator.VariantAnnotatorEngine; import org.broadinstitute.gatk.utils.*; import org.broadinstitute.gatk.utils.exceptions.UserException; import org.broadinstitute.gatk.utils.fasta.CachingIndexedFastaSequenceFile; @@ -65,6 +67,7 @@ import java.io.File; import java.io.IOException; import java.util.ArrayList; import java.util.Arrays; +import java.util.Collections; import java.util.List; /** @@ -103,7 +106,7 @@ public class VariantContextMergerUnitTest extends BaseTest { @Test(dataProvider = "referenceConfidenceMergeData") public void testReferenceConfidenceMerge(final String testID, final List toMerge, final GenomeLoc loc, final boolean returnSiteEvenIfMonomorphic, final boolean uniquifySamples, final VariantContext expectedResult) { - final VariantContext result = ReferenceConfidenceVariantContextMerger.merge(toMerge, loc, returnSiteEvenIfMonomorphic ? (byte) 'A' : null, true, uniquifySamples); + final VariantContext result = ReferenceConfidenceVariantContextMerger.merge(toMerge, loc, returnSiteEvenIfMonomorphic ? (byte) 'A' : null, true, uniquifySamples, null); if ( result == null ) { Assert.assertTrue(expectedResult == null); return; diff --git a/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AlleleSpecificAnnotationData.java b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AlleleSpecificAnnotationData.java new file mode 100644 index 000000000..df3944214 --- /dev/null +++ b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/AlleleSpecificAnnotationData.java @@ -0,0 +1,96 @@ +/* +* Copyright 2012-2015 Broad Institute, Inc. +* +* Permission is hereby granted, free of charge, to any person +* obtaining a copy of this software and associated documentation +* files (the "Software"), to deal in the Software without +* restriction, including without limitation the rights to use, +* copy, modify, merge, publish, distribute, sublicense, and/or sell +* copies of the Software, and to permit persons to whom the +* Software is furnished to do so, subject to the following +* conditions: +* +* The above copyright notice and this permission notice shall be +* included in all copies or substantial portions of the Software. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES +* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, +* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR +* THE USE OR OTHER DEALINGS IN THE SOFTWARE. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; +import org.broadinstitute.gatk.utils.exceptions.GATKException; + +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +/** + * A class to encapsulate the raw data for allele-specific classes compatible with the ReducibleAnnotation interface + * @param the type of raw data to be stored for later annotation calculation + */ +public class AlleleSpecificAnnotationData extends ReducibleAnnotationData{ + final private List alleleList; + private Allele refAllele; + + public AlleleSpecificAnnotationData(final List inputAlleles, final String inputData) { + super(inputData); + attributeMap = new HashMap<>(); + for(final Allele a : inputAlleles) { + attributeMap.put(a, null); + } + alleleList = inputAlleles; + for(Allele a : alleleList) { + if(a.isReference()) { + refAllele = a; + } + } + } + + @Override + public List getAlleles() {return Collections.unmodifiableList(alleleList);} + + /** + * Get the reference allele for this allele-specific data. + * (Used in cases where annotations compare some attribute of the alt alleles to that of the reference.) + * @return the reference allele for this data + */ + public Allele getRefAllele() {return refAllele;} + + public void setAttributeMap(Map inputMap) { + super.setAttributeMap(inputMap); + checkRefAlleles(); + } + + private void checkRefAlleles() { + boolean foundRef = false; + for (Allele a : alleleList) { + if (a.isReference()) { + if (foundRef) + throw new GATKException("ERROR: multiple reference alleles found in annotation data\n"); + foundRef = true; + } + } + if (!foundRef) + throw new GATKException("ERROR: no reference alleles found in annotation data\n"); + } + + public String makeRawAnnotationString(String printDelim) { + String annotationString = ""; + for (final Allele current : alleleList) { + if (!annotationString.isEmpty()) + annotationString += printDelim; + if(attributeMap.get(current) != null) + annotationString += attributeMap.get(current).toString(); + } + return annotationString.replaceAll("[\\[\\]\\s]", ""); + } +} diff --git a/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/CompressedDataList.java b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/CompressedDataList.java new file mode 100644 index 000000000..a89859fa3 --- /dev/null +++ b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/CompressedDataList.java @@ -0,0 +1,117 @@ +/* +* Copyright 2012-2015 Broad Institute, Inc. +* +* Permission is hereby granted, free of charge, to any person +* obtaining a copy of this software and associated documentation +* files (the "Software"), to deal in the Software without +* restriction, including without limitation the rights to use, +* copy, modify, merge, publish, distribute, sublicense, and/or sell +* copies of the Software, and to permit persons to whom the +* Software is furnished to do so, subject to the following +* conditions: +* +* The above copyright notice and this permission notice shall be +* included in all copies or substantial portions of the Software. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES +* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, +* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR +* THE USE OR OTHER DEALINGS IN THE SOFTWARE. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import java.util.Arrays; +import java.util.HashMap; +import java.util.Iterator; +import java.util.Map; + +/** + * A class to represent data as a list of pairs. For example, the list 2,2,2,2,2,2,3,4,4,4,5,5 + * would be compressed as 2,6,3,1,4,3,5,2. The compressed list should be sorted in ascending order by value. + * + * Created by gauthier on 9/25/15. + */ +public class CompressedDataList implements Iterable { + protected Map valueCounts = new HashMap<>(); + + public Map getValueCounts(){ + return valueCounts; + } + + public boolean isEmpty(){ + return valueCounts.isEmpty(); + } + + @Override + public Iterator iterator(){ + Iterator it = new Iterator() { + private Iterator keySetIterator = valueCounts.keySet().iterator(); + private T currentKey = valueCounts.isEmpty() ? null : keySetIterator.next(); + private int currentValueIndex = 0; + private int currentValueSize = valueCounts.isEmpty() ? 0 : valueCounts.get(currentKey); + + @Override + public boolean hasNext() { + return !valueCounts.isEmpty() && (keySetIterator.hasNext() || currentValueIndex < currentValueSize); + } + + @Override + public T next() { + T retKey = currentKey; + currentValueIndex++; + if(currentValueIndex==currentValueSize){ + if(keySetIterator.hasNext()) { + currentKey = keySetIterator.next(); + currentValueIndex = 0; + currentValueSize = valueCounts.get(currentKey); + } + } + return retKey; + } + + @Override + public void remove() { + throw new UnsupportedOperationException(); + } + }; + return it; + } + + @Override + public String toString(){ + String str = ""; + Object[] keys = valueCounts.keySet().toArray(); + Arrays.sort(keys); + for (Object i: keys){ + if(!str.isEmpty()) + str+=","; + str+=(i+","+valueCounts.get(i)); + } + return str; + } + + public void add(final T val){ + add(val, 1); + } + + public void add(final T val, final int count){ + if(valueCounts.containsKey(val)){ + valueCounts.put(val, valueCounts.get(val)+count); + } + else + valueCounts.put(val, count); + + } + + public void add(final CompressedDataList obj){ + for(Map.Entry pair : obj.getValueCounts().entrySet()){ + this.add(pair.getKey(),pair.getValue()); + } + } + +} diff --git a/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ReducibleAnnotationData.java b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ReducibleAnnotationData.java new file mode 100644 index 000000000..0c18bb187 --- /dev/null +++ b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/ReducibleAnnotationData.java @@ -0,0 +1,105 @@ +/* +* Copyright 2012-2015 Broad Institute, Inc. +* +* Permission is hereby granted, free of charge, to any person +* obtaining a copy of this software and associated documentation +* files (the "Software"), to deal in the Software without +* restriction, including without limitation the rights to use, +* copy, modify, merge, publish, distribute, sublicense, and/or sell +* copies of the Software, and to permit persons to whom the +* Software is furnished to do so, subject to the following +* conditions: +* +* The above copyright notice and this permission notice shall be +* included in all copies or substantial portions of the Software. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES +* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, +* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR +* THE USE OR OTHER DEALINGS IN THE SOFTWARE. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import htsjdk.variant.variantcontext.Allele; + +import java.util.*; + +/** + * A class to encapsulate the raw data for classes compatible with the ReducibleAnnotation interface + */ +public class ReducibleAnnotationData { + protected String rawData; + protected Map attributeMap; + + /** + * Create a new ReducibleAnnotationData using the raw data string from a VCF + * @param inputData the raw data as read in from a VCF + */ + public ReducibleAnnotationData(final String inputData) { + rawData = inputData; attributeMap = new HashMap<>(); + attributeMap.put(Allele.NO_CALL, null); + } + + /** + * + * @return the string of raw data as represented in the VCF + */ + public String getRawData() {return rawData;} + + /** + * Note: parent class ReducibleAnnotationData is non-allele specific and stores all values with the no-call allele + * @return the list of alleles for which we have raw annotation data + */ + public List getAlleles() { + List ret = new ArrayList(); + ret.addAll(attributeMap.keySet()); + return ret; + } + + /** + * + * @param key the allele of interest + * @return do we have data for the allele of interest? + */ + public boolean hasAttribute(Allele key) { + return attributeMap.containsKey(key); + } + + /** + * + * @param key the allele of interest + * @return data for the allele of interest + */ + public T getAttribute(Allele key) { + return attributeMap.get(key); + } + + /** + * + * @param key the allele of interest + * @param value raw data corresponding to the allele of interest + */ + public void putAttribute(Allele key, T value) { + attributeMap.put(key, value); + } + + /** + * Assign all of the per-allele raw data at once + * @param inputMap the pre-calculated per-allele data + */ + public void setAttributeMap(Map inputMap) { + attributeMap = inputMap; + } + + /** + * Get the stored raw per-allele data + * @return + */ + public Map getAttributeMap() {return Collections.unmodifiableMap(attributeMap);} + +} diff --git a/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/VariantAnnotatorEngine.java b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/VariantAnnotatorEngine.java index a9d14c2ce..840ecd808 100644 --- a/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/VariantAnnotatorEngine.java +++ b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/VariantAnnotatorEngine.java @@ -48,6 +48,8 @@ import java.util.*; public class VariantAnnotatorEngine { private final static Logger logger = Logger.getLogger(VariantAnnotatorEngine.class); private List requestedInfoAnnotations = Collections.emptyList(); + private List requestedReducibleInfoAnnotations = new ArrayList<>(); + private List requestedNonReducibleInfoAnnotations = new ArrayList<>(); private List requestedGenotypeAnnotations = Collections.emptyList(); private List requestedExpressions = new ArrayList<>(); private boolean expressionAlleleConcordance = false; @@ -90,6 +92,7 @@ public class VariantAnnotatorEngine { requestedInfoAnnotations = AnnotationInterfaceManager.createAllInfoFieldAnnotations(); requestedGenotypeAnnotations = AnnotationInterfaceManager.createAllGenotypeAnnotations(); excludeAnnotations(annotationsToExclude); + setReducibleAnnotations(); initializeDBs(toolkit); } @@ -98,6 +101,7 @@ public class VariantAnnotatorEngine { this.walker = walker; this.toolkit = toolkit; initializeAnnotations(annotationGroupsToUse, annotationsToUse, annotationsToExclude); + setReducibleAnnotations(); initializeDBs(toolkit); } @@ -224,36 +228,83 @@ public class VariantAnnotatorEngine { return annotateDBs(tracker, annotated); } + /** + * + * @param referenceContext + * @param tracker + * @param readLikelihoods + * @param vc + * @param useRaw output annotation data as raw data? (Yes in the case of gVCF mode for HaplotypeCaller) + * @return + */ public VariantContext annotateContextForActiveRegion(final ReferenceContext referenceContext, final RefMetaDataTracker tracker, final ReadLikelihoods readLikelihoods, - final VariantContext vc) { + final VariantContext vc, + final boolean useRaw) { //TODO we transform the read-likelihood into the Map^2 previous version for the sake of not changing of not changing annotation interface. //TODO should we change those interfaces? final Map annotationLikelihoods = readLikelihoods.toPerReadAlleleLikelihoodMap(); - return annotateContextForActiveRegion(referenceContext, tracker, annotationLikelihoods, vc); + return annotateContextForActiveRegion(referenceContext, tracker, annotationLikelihoods, vc, useRaw); } + /** + * + * @param referenceContext + * @param tracker + * @param perReadAlleleLikelihoodMap + * @param vc + * @param useRaw output annotation data as raw data? (Yes in the case of gVCF mode for HaplotypeCaller) + * @return + */ public VariantContext annotateContextForActiveRegion(final ReferenceContext referenceContext, final RefMetaDataTracker tracker, final Map perReadAlleleLikelihoodMap, - final VariantContext vc) { + final VariantContext vc, + final boolean useRaw) { // annotate genotypes final VariantContextBuilder builder = new VariantContextBuilder(vc).genotypes(annotateGenotypes(null, null, null, vc, perReadAlleleLikelihoodMap)); VariantContext newGenotypeAnnotatedVC = builder.make(); final Map infoAnnotations = new LinkedHashMap<>(newGenotypeAnnotatedVC.getAttributes()); - // go through all the requested info annotationTypes - for ( final InfoFieldAnnotation annotationType : requestedInfoAnnotations ) { + // go through all the requested info annotationTypes that are reducible + if (useRaw) { + for (final InfoFieldAnnotation annotationType : requestedReducibleInfoAnnotations) { + if (!(annotationType instanceof ActiveRegionBasedAnnotation)) + continue; + + + ReducibleAnnotation currentASannotation = (ReducibleAnnotation) annotationType; + final Map annotationsFromCurrentType = currentASannotation.annotateRawData(null, null, referenceContext, null, newGenotypeAnnotatedVC, perReadAlleleLikelihoodMap); + if (annotationsFromCurrentType != null) { + infoAnnotations.putAll(annotationsFromCurrentType); + } + } + } + //if not in reference-confidence mode, do annotate with reducible annotations, but skip the raw data and go straight to the finalized values + else { + for (final InfoFieldAnnotation annotationType : requestedReducibleInfoAnnotations) { + if (!(annotationType instanceof ActiveRegionBasedAnnotation)) + continue; + + final Map annotationsFromCurrentType = annotationType.annotate(null, null, referenceContext, null, newGenotypeAnnotatedVC, perReadAlleleLikelihoodMap); + if (annotationsFromCurrentType != null) { + infoAnnotations.putAll(annotationsFromCurrentType); + } + } + } + //leave this in or else the median will overwrite until we do truly allele-specific + //// for now do both allele-specific and not + for ( final InfoFieldAnnotation annotationType : requestedNonReducibleInfoAnnotations ) { if ( !(annotationType instanceof ActiveRegionBasedAnnotation) ) continue; - final Map annotationsFromCurrentType = annotationType.annotate(referenceContext, perReadAlleleLikelihoodMap, newGenotypeAnnotatedVC); - if ( annotationsFromCurrentType != null ) { - infoAnnotations.putAll(annotationsFromCurrentType); - } + final Map annotationsFromCurrentType = annotationType.annotate(referenceContext, perReadAlleleLikelihoodMap, newGenotypeAnnotatedVC); + if (annotationsFromCurrentType != null) { + infoAnnotations.putAll(annotationsFromCurrentType); + } } // create a new VC with info and genotype annotations @@ -263,6 +314,61 @@ public class VariantAnnotatorEngine { return annotateDBs(tracker, annotated); } + /** + * Combine (raw) data for reducible annotations (those that use raw data in gVCFs) + * Mutates annotationMap by removing the annotations that were combined + * @param allelesList the list of merged alleles across all variants being combined + * @param annotationMap attributes of merged variant contexts -- is modifying by removing successfully combined annotations + * @return a map containing the keys and raw values for the combined annotations + */ + public Map combineAnnotations(final List allelesList, Map> annotationMap) { + Map combinedAnnotations = new HashMap<>(); + + // go through all the requested reducible info annotationTypes + for (final InfoFieldAnnotation annotationType : requestedReducibleInfoAnnotations) { + ReducibleAnnotation currentASannotation = (ReducibleAnnotation) annotationType; + if (annotationMap.containsKey(currentASannotation.getRawKeyName())) { + final List annotationValue = annotationMap.get(currentASannotation.getRawKeyName()); + final Map annotationsFromCurrentType = currentASannotation.combineRawData(allelesList, annotationValue); + combinedAnnotations.putAll(annotationsFromCurrentType); + //remove the combined annotations so that the next method only processes the non-reducible ones + annotationMap.remove(currentASannotation.getRawKeyName()); + } + } + return combinedAnnotations; + } + + /** + * Finalize reducible annotations (those that use raw data in gVCFs) + * @param vc the merged VC with the final set of alleles, possibly subset to the number of maxAltAlleles for genotyping + * @param originalVC the merged but non-subset VC that contains the full list of merged alleles + * @return a VariantContext with the final annotation values for reducible annotations + */ + public VariantContext finalizeAnnotations(VariantContext vc, VariantContext originalVC) { + final Map infoAnnotations = new LinkedHashMap<>(vc.getAttributes()); + + // go through all the requested info annotationTypes + for ( final InfoFieldAnnotation annotationType : requestedReducibleInfoAnnotations ) { + if ( !(annotationType instanceof ActiveRegionBasedAnnotation) ) + continue; + + ReducibleAnnotation currentASannotation = (ReducibleAnnotation)annotationType; + final Map annotationsFromCurrentType = currentASannotation.finalizeRawData(vc, originalVC); + if ( annotationsFromCurrentType != null ) { + infoAnnotations.putAll(annotationsFromCurrentType); + //clean up raw annotation data after annotations are finalized + infoAnnotations.remove(currentASannotation.getRawKeyName()); + } + } + + // generate a new annotated VC + final VariantContextBuilder builder = new VariantContextBuilder(vc).attributes(infoAnnotations); + + // annotate genotypes, creating another new VC in the process + final VariantContext annotated = builder.make(); + return annotated; + } + /** * Annotate the ID field and other DBs for the given Variant Context * @@ -412,13 +518,13 @@ public class VariantAnnotatorEngine { * @param vc variant context to annotate * @return list of biallelics trimmed to a minimum representation */ - private List getMinRepresentationBiallelics(final VariantContext vc){ + private List getMinRepresentationBiallelics(final VariantContext vc) { final List minRepresentationBiallelicVCs = new ArrayList(); final boolean isMultiAllelic = vc.getNAlleles() > 2; - if ( isMultiAllelic ){ + if (isMultiAllelic) { final List vcList = GATKVariantContextUtils.splitVariantContextToBiallelics(vc); - for (final VariantContext biallelicVC: vcList) { - if ( !biallelicVC.isSNP() ) + for (final VariantContext biallelicVC : vcList) { + if (!biallelicVC.isSNP()) minRepresentationBiallelicVCs.add(GATKVariantContextUtils.trimAlleles(biallelicVC, true, true)); else minRepresentationBiallelicVCs.add(biallelicVC); @@ -429,4 +535,13 @@ public class VariantAnnotatorEngine { return minRepresentationBiallelicVCs; } + + private void setReducibleAnnotations() { + for(final InfoFieldAnnotation annotationType : requestedInfoAnnotations) { + if (annotationType instanceof ReducibleAnnotation) + requestedReducibleInfoAnnotations.add(annotationType); + else + requestedNonReducibleInfoAnnotations.add(annotationType); + } + } } diff --git a/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/interfaces/AS_StandardAnnotation.java b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/interfaces/AS_StandardAnnotation.java new file mode 100644 index 000000000..55c3e47cc --- /dev/null +++ b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/interfaces/AS_StandardAnnotation.java @@ -0,0 +1,31 @@ +/* +* Copyright 2012-2015 Broad Institute, Inc. +* +* Permission is hereby granted, free of charge, to any person +* obtaining a copy of this software and associated documentation +* files (the "Software"), to deal in the Software without +* restriction, including without limitation the rights to use, +* copy, modify, merge, publish, distribute, sublicense, and/or sell +* copies of the Software, and to permit persons to whom the +* Software is furnished to do so, subject to the following +* conditions: +* +* The above copyright notice and this permission notice shall be +* included in all copies or substantial portions of the Software. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES +* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, +* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR +* THE USE OR OTHER DEALINGS IN THE SOFTWARE. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator.interfaces; + +/** + * Created by gauthier on 9/28/15. + */ +public interface AS_StandardAnnotation extends AnnotationType {} diff --git a/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/interfaces/ReducibleAnnotation.java b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/interfaces/ReducibleAnnotation.java new file mode 100644 index 000000000..25d2d8f41 --- /dev/null +++ b/public/gatk-tools-public/src/main/java/org/broadinstitute/gatk/tools/walkers/annotator/interfaces/ReducibleAnnotation.java @@ -0,0 +1,88 @@ +/* +* Copyright 2012-2015 Broad Institute, Inc. +* +* Permission is hereby granted, free of charge, to any person +* obtaining a copy of this software and associated documentation +* files (the "Software"), to deal in the Software without +* restriction, including without limitation the rights to use, +* copy, modify, merge, publish, distribute, sublicense, and/or sell +* copies of the Software, and to permit persons to whom the +* Software is furnished to do so, subject to the following +* conditions: +* +* The above copyright notice and this permission notice shall be +* included in all copies or substantial portions of the Software. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES +* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, +* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR +* THE USE OR OTHER DEALINGS IN THE SOFTWARE. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator.interfaces; + +import htsjdk.variant.variantcontext.Allele; +import htsjdk.variant.variantcontext.VariantContext; +import org.broadinstitute.gatk.tools.walkers.annotator.ReducibleAnnotationData; +import org.broadinstitute.gatk.utils.contexts.AlignmentContext; +import org.broadinstitute.gatk.utils.contexts.ReferenceContext; +import org.broadinstitute.gatk.utils.genotyper.PerReadAlleleLikelihoodMap; +import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker; + +import java.util.List; +import java.util.Map; + +/** + * An interface for annotations that are calculated using raw data across samples, rather than the median (or median of median) of samples values + */ +public interface ReducibleAnnotation extends AnnotationType { + public abstract String getRawKeyName(); + + /** + * Generate the raw data necessary to calculate the annotation. Raw data is the final endpoint for gVCFs. + * + * @param tracker + * @param walker + * @param ref + * @param stratifiedContexts + * @param vc + * @param stratifiedPerReadAlleleLikelihoodMap + * @return + */ + public abstract Map annotateRawData(final RefMetaDataTracker tracker, + final AnnotatorCompatible walker, + final ReferenceContext ref, + final Map stratifiedContexts, + final VariantContext vc, + final Map stratifiedPerReadAlleleLikelihoodMap); + + /** + * Combine raw data, typically during the merging of raw data contained in multiple gVCFs as in CombineGVCFs and the + * preliminary merge for GenotypeGVCFs + * @param allelesList The merged allele list across all variants being combined/merged + * @param listOfRawData The raw data for all the variants being combined/merged + * @return + */ + public abstract Map combineRawData(final List allelesList, final List listOfRawData); + + + /** + * Calculate the final annotation value from the raw data + * @param vc -- contains the final set of alleles, possibly subset by GenotypeGVCFs + * @param originalVC -- used to get all the alleles for all gVCFs + * @return + */ + public abstract Map finalizeRawData(final VariantContext vc, final VariantContext originalVC); + + /** + * + * @param vc + * @param pralm + * @param rawAnnotations + */ + public abstract void calculateRawData(VariantContext vc, Map pralm, ReducibleAnnotationData rawAnnotations); +} diff --git a/public/gatk-tools-public/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/CompressedDataListUnitTest.java b/public/gatk-tools-public/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/CompressedDataListUnitTest.java new file mode 100644 index 000000000..521f4c237 --- /dev/null +++ b/public/gatk-tools-public/src/test/java/org/broadinstitute/gatk/tools/walkers/annotator/CompressedDataListUnitTest.java @@ -0,0 +1,118 @@ +/* +* Copyright 2012-2015 Broad Institute, Inc. +* +* Permission is hereby granted, free of charge, to any person +* obtaining a copy of this software and associated documentation +* files (the "Software"), to deal in the Software without +* restriction, including without limitation the rights to use, +* copy, modify, merge, publish, distribute, sublicense, and/or sell +* copies of the Software, and to permit persons to whom the +* Software is furnished to do so, subject to the following +* conditions: +* +* The above copyright notice and this permission notice shall be +* included in all copies or substantial portions of the Software. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES +* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT +* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, +* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING +* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR +* THE USE OR OTHER DEALINGS IN THE SOFTWARE. +*/ + +package org.broadinstitute.gatk.tools.walkers.annotator; + +import org.testng.Assert; +import org.testng.annotations.Test; + +public class CompressedDataListUnitTest { + + @Test + public void testAddSingly(){ + CompressedDataList intList = new CompressedDataList<>(); + intList.add(2); + intList.add(5); + intList.add(5); + intList.add(2); + intList.add(2); + intList.add(3); + intList.add(2); + intList.add(2); + intList.add(4); + intList.add(4); + intList.add(4); + intList.add(2); + + Assert.assertEquals(intList.isEmpty(), false); + Assert.assertEquals(intList.toString(), "2,6,3,1,4,3,5,2"); + } + + @Test + public void testAddValueCounts(){ + CompressedDataList intList = new CompressedDataList<>(); + intList.add(5,2); + intList.add(2,6); + intList.add(3,1); + intList.add(4,3); + + Assert.assertEquals(intList.isEmpty(), false); + Assert.assertEquals(intList.toString(), "2,6,3,1,4,3,5,2"); + } + + @Test + public void testAddBothWays(){ + CompressedDataList intList = new CompressedDataList<>(); + intList.add(2); + intList.add(5,2); + intList.add(2); + intList.add(2); + intList.add(3); + intList.add(2); + intList.add(2); + intList.add(4,2); + intList.add(2); + intList.add(4,1); + + Assert.assertEquals(intList.toString(), "2,6,3,1,4,3,5,2"); + } + + @Test + public void testCombineLists(){ + CompressedDataList intList1 = new CompressedDataList<>(); + intList1.add(5,2); + intList1.add(2,6); + intList1.add(3,1); + intList1.add(4,3); + + CompressedDataList intList2 = new CompressedDataList<>(); + intList2.add(2,5); + intList2.add(6,2); + intList2.add(1,3); + intList2.add(3,4); + + intList1.add(intList2); + + Assert.assertEquals(intList1.toString(), "1,3,2,11,3,5,4,3,5,2,6,2"); + + } + + @Test + public void testIterator(){ + CompressedDataList intList1 = new CompressedDataList<>(); + intList1.add(5,2); + intList1.add(2,6); + intList1.add(3,1); + intList1.add(4,3); + + CompressedDataList intList2 = new CompressedDataList<>(); + for(Integer i : intList1) { + intList2.add(i); + } + + Assert.assertEquals(intList1.toString(),intList2.toString()); + } + +} \ No newline at end of file diff --git a/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFConstants.java b/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFConstants.java index f76924cb9..f871bdc1b 100644 --- a/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFConstants.java +++ b/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFConstants.java @@ -35,6 +35,9 @@ import htsjdk.variant.variantcontext.Allele; public final class GATKVCFConstants { //INFO keys + public static final String RAW_RMS_MAPPING_QUALITY_KEY = "RAW_MQ"; + public static final String AS_RMS_MAPPING_QUALITY_KEY = "AS_MQ"; + public static final String AS_RAW_RMS_MAPPING_QUALITY_KEY = "AS_RAW_MQ"; public static final String ALLELE_BALANCE_HET_KEY = "ABHet"; public static final String ALLELE_BALANCE_HOM_KEY = "ABHom"; public static final String ORIGINAL_AC_KEY = "AC_Orig"; //SelectVariants @@ -45,6 +48,8 @@ public final class GATKVCFConstants { public static final String BEAGLE_AN_COMP_KEY = "ANH"; //BeagleOutputToVCF public static final String BASE_COUNTS_KEY = "BaseCounts"; public static final String BASE_QUAL_RANK_SUM_KEY = "BaseQRankSum"; + public static final String AS_BASE_QUAL_RANK_SUM_KEY = "AS_BaseQRankSum"; + public static final String AS_RAW_BASE_QUAL_RANK_SUM_KEY = "AS_RAW_BaseQRankSum"; public static final String GENOTYPE_AND_VALIDATE_STATUS_KEY = "callStatus"; public static final String CLIPPING_RANK_SUM_KEY = "ClippingRankSum"; public static final String CULPRIT_KEY = "culprit"; @@ -55,6 +60,8 @@ public final class GATKVCFConstants { public static final String EVENT_DISTANCE_MAX_KEY = "MAX_ED"; //M2 public static final String EVENT_DISTANCE_MIN_KEY = "MIN_ED"; //M2 public static final String FISHER_STRAND_KEY = "FS"; + public static final String AS_FISHER_STRAND_KEY = "AS_FS"; + public static final String AS_SB_TABLE_KEY = "AS_SB_TABLE"; public static final String GC_CONTENT_KEY = "GC"; public static final String GQ_MEAN_KEY = "GQ_MEAN"; public static final String GQ_STDEV_KEY = "GQ_STDDEV"; @@ -67,6 +74,7 @@ public final class GATKVCFConstants { public static final String INTERVAL_GC_CONTENT_KEY = "IGC"; public static final String INBREEDING_COEFFICIENT_KEY = "InbreedingCoeff"; public static final String EXCESS_HET_KEY = "ExcessHet"; + public static final String AS_HETEROZYGOSITY_KEY = "AS_InbreedingCoeff"; public static final String LIKELIHOOD_RANK_SUM_KEY = "LikelihoodRankSum"; public static final String LO_CONF_DENOVO_KEY = "loConfDeNovo"; public static final String LOW_MQ_KEY = "LowMQ"; @@ -75,6 +83,9 @@ public final class GATKVCFConstants { public static final String MLE_PER_SAMPLE_ALLELE_COUNT_KEY = "MLPSAC"; public static final String MLE_PER_SAMPLE_ALLELE_FRACTION_KEY = "MLPSAF"; public static final String MAP_QUAL_RANK_SUM_KEY = "MQRankSum"; + public static final String RAW_MAP_QUAL_RANK_SUM_KEY = "RAW_MQRankSum"; + public static final String AS_MAP_QUAL_RANK_SUM_KEY = "AS_MQRankSum"; + public static final String AS_RAW_MAP_QUAL_RANK_SUM_KEY = "AS_RAW_MQRankSum"; public static final String MENDEL_VIOLATION_LR_KEY = "MVLR"; public static final String NOCALL_CHROM_KEY = "NCC"; public static final String NUMBER_OF_DISCOVERED_ALLELES_KEY = "NDA"; @@ -92,12 +103,15 @@ public final class GATKVCFConstants { public static final String POSITIVE_LABEL_KEY = "POSITIVE_TRAIN_SITE"; public static final String QUAL_BY_DEPTH_KEY = "QD"; public static final String BEAGLE_R2_KEY = "R2"; //BeagleOutputToVCF + public static final String AS_READ_POS_RANK_SUM_KEY = "AS_ReadPosRankSum"; + public static final String AS_RAW_READ_POS_RANK_SUM_KEY = "AS_RAW_ReadPosRankSum"; public static final String READ_POS_RANK_SUM_KEY = "ReadPosRankSum"; public static final String REFSAMPLE_DEPTH_KEY = "REFDEPTH"; public static final String REPEATS_PER_ALLELE_KEY = "RPA"; public static final String REPEAT_UNIT_KEY = "RU"; public static final String SAMPLE_LIST_KEY = "Samples"; public static final String STRAND_ODDS_RATIO_KEY = "SOR"; + public static final String AS_STRAND_ODDS_RATIO_KEY = "AS_SOR"; public static final String STR_PRESENT_KEY = "STR"; public static final String TRANSMISSION_DISEQUILIBRIUM_KEY = "TDT"; public static final String TUMOR_LOD_KEY = "TLOD"; //M2 diff --git a/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFHeaderLines.java b/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFHeaderLines.java index 48884e5c3..634da68f8 100644 --- a/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFHeaderLines.java +++ b/public/gatk-utils/src/main/java/org/broadinstitute/gatk/utils/variant/GATKVCFHeaderLines.java @@ -115,8 +115,12 @@ public class GATKVCFHeaderLines { addInfoLine(new VCFInfoHeaderLine(LOW_MQ_KEY, 3, VCFHeaderLineType.Float, "3-tuple: ,,")); addInfoLine(new VCFInfoHeaderLine(N_BASE_COUNT_KEY, 1, VCFHeaderLineType.Float, "Percentage of N bases in the pileup")); addInfoLine(new VCFInfoHeaderLine(BASE_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities")); + addInfoLine(new VCFInfoHeaderLine(AS_BASE_QUAL_RANK_SUM_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific Z-score from Wilcoxon rank sum test of each Alt Vs. Ref base qualities")); + addInfoLine(new VCFInfoHeaderLine(AS_RAW_BASE_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.String, "raw data for allele specific rank sum test of base qualities")); addInfoLine(new VCFInfoHeaderLine(CLIPPING_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases")); addInfoLine(new VCFInfoHeaderLine(FISHER_STRAND_KEY, 1, VCFHeaderLineType.Float, "Phred-scaled p-value using Fisher's exact test to detect strand bias")); + addInfoLine(new VCFInfoHeaderLine(AS_FISHER_STRAND_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific phred-scaled p-value using Fisher's exact test to detect strand bias of each alt allele")); + addInfoLine(new VCFInfoHeaderLine(AS_SB_TABLE_KEY, 1, VCFHeaderLineType.String, "Allele-specific forward/reverse read counts for strand bias tests")); addInfoLine(new VCFInfoHeaderLine(GC_CONTENT_KEY, 1, VCFHeaderLineType.Float, "GC content around the variant (see docs for window size details)")); addInfoLine(new VCFInfoHeaderLine(NOCALL_CHROM_KEY, 1, VCFHeaderLineType.Integer, "Number of no-called samples")); addInfoLine(new VCFInfoHeaderLine(GQ_MEAN_KEY, 1, VCFHeaderLineType.Float, "Mean of all GQ values")); @@ -126,16 +130,28 @@ public class GATKVCFHeaderLines { addInfoLine(new VCFInfoHeaderLine(HOMOPOLYMER_RUN_KEY, 1, VCFHeaderLineType.Integer, "Largest Contiguous Homopolymer Run of Variant Allele In Either Direction")); addInfoLine(new VCFInfoHeaderLine(INBREEDING_COEFFICIENT_KEY, 1, VCFHeaderLineType.Float, "Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation")); addInfoLine(new VCFInfoHeaderLine(EXCESS_HET_KEY, 1, VCFHeaderLineType.Float, "Phred-scaled p-value for exact test of excess heterozygosity")); + addInfoLine(new VCFInfoHeaderLine(AS_HETEROZYGOSITY_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific heterozygosity as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation; relate to inbreeding coefficient")); addInfoLine(new VCFInfoHeaderLine(LIKELIHOOD_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score from Wilcoxon rank sum test of Alt Vs. Ref haplotype likelihoods")); addInfoLine(new VCFInfoHeaderLine(MAP_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities")); + addInfoLine(new VCFInfoHeaderLine(AS_MAP_QUAL_RANK_SUM_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific Z-score From Wilcoxon rank sum test of each Alt vs. Ref read mapping qualities")); + addInfoLine(new VCFInfoHeaderLine(RAW_RMS_MAPPING_QUALITY_KEY, 1, VCFHeaderLineType.Float, "Raw data for RMS Mapping Quality")); + addInfoLine(new VCFInfoHeaderLine(AS_RAW_RMS_MAPPING_QUALITY_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "Allele-specfic raw data for RMS Mapping Quality")); + addInfoLine(new VCFInfoHeaderLine(AS_RMS_MAPPING_QUALITY_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "Allele-specific RMS Mapping Quality")); + addInfoLine(new VCFInfoHeaderLine(RAW_MAP_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Raw data for Mapping Quality Rank Sum")); + addInfoLine(new VCFInfoHeaderLine(AS_RAW_MAP_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.String, "Allele-specfic raw data for Mapping Quality Rank Sum")); + addInfoLine(new VCFInfoHeaderLine(AS_MAP_QUAL_RANK_SUM_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "Allele-specific Mapping Quality Rank Sum")); + addInfoLine(new VCFInfoHeaderLine(MENDEL_VIOLATION_LR_KEY, 1, VCFHeaderLineType.Float, "Mendelian violation likelihood ratio: L[MV] - L[No MV]")); addInfoLine(new VCFInfoHeaderLine(HI_CONF_DENOVO_KEY, 1, VCFHeaderLineType.String, "High confidence possible de novo mutation (GQ >= 20 for all trio members)=[comma-delimited list of child samples]")); addInfoLine(new VCFInfoHeaderLine(LO_CONF_DENOVO_KEY, 1, VCFHeaderLineType.String, "Low confidence possible de novo mutation (GQ >= 10 for child, GQ > 0 for parents)=[comma-delimited list of child samples]")); addInfoLine(new VCFInfoHeaderLine(QUAL_BY_DEPTH_KEY, 1, VCFHeaderLineType.Float, "Variant Confidence/Quality by Depth")); addInfoLine(new VCFInfoHeaderLine(READ_POS_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias")); + addInfoLine(new VCFInfoHeaderLine(AS_READ_POS_RANK_SUM_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias")); + addInfoLine(new VCFInfoHeaderLine(AS_RAW_READ_POS_RANK_SUM_KEY, 1, VCFHeaderLineType.String, "allele specific raw data for rank sum test of read position bias")); addInfoLine(new VCFInfoHeaderLine(SAMPLE_LIST_KEY, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.String, "List of polymorphic samples")); addInfoLine(new VCFInfoHeaderLine(SPANNING_DELETIONS_KEY, 1, VCFHeaderLineType.Float, "Fraction of Reads Containing Spanning Deletions")); addInfoLine(new VCFInfoHeaderLine(STRAND_ODDS_RATIO_KEY, 1, VCFHeaderLineType.Float, "Symmetric Odds Ratio of 2x2 contingency table to detect strand bias")); + addInfoLine(new VCFInfoHeaderLine(AS_STRAND_ODDS_RATIO_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "Allele specific strand Odds Ratio of 2x|Alts| contingency table to detect allele specific strand bias")); addInfoLine(new VCFInfoHeaderLine(STR_PRESENT_KEY, 0, VCFHeaderLineType.Flag, "Variant is a short tandem repeat")); addInfoLine(new VCFInfoHeaderLine(REPEAT_UNIT_KEY, 1, VCFHeaderLineType.String, "Tandem repeat unit (bases)")); addInfoLine(new VCFInfoHeaderLine(REPEATS_PER_ALLELE_KEY, VCFHeaderLineCount.UNBOUNDED, VCFHeaderLineType.Integer, "Number of times tandem repeat unit is repeated, for each allele (including reference)"));