Removed walkers for handling Beagle data
Added deprecation statements to DeprecatedToolChecks.java
Removed integration test for Beagle walker
Added URL for Beagle documentation
This commit is contained in:
parent
178bf12b27
commit
3bd988825f
|
|
@ -1,109 +0,0 @@
|
|||
/*
|
||||
* By downloading the PROGRAM you agree to the following terms of use:
|
||||
*
|
||||
* BROAD INSTITUTE
|
||||
* SOFTWARE LICENSE AGREEMENT
|
||||
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
|
||||
*
|
||||
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (“BROAD”) and the LICENSEE and is effective at the date the downloading is completed (“EFFECTIVE DATE”).
|
||||
*
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
|
||||
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
|
||||
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
|
||||
*
|
||||
* 1. DEFINITIONS
|
||||
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
|
||||
*
|
||||
* 2. LICENSE
|
||||
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
|
||||
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
|
||||
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
|
||||
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
|
||||
*
|
||||
* 3. PHONE-HOME FEATURE
|
||||
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (“PHONE-HOME”) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEE’S user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
|
||||
*
|
||||
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
|
||||
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
|
||||
* Copyright 2012-2014 Broad Institute, Inc.
|
||||
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
|
||||
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
|
||||
*
|
||||
* 5. INDEMNIFICATION
|
||||
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
|
||||
*
|
||||
* 6. NO REPRESENTATIONS OR WARRANTIES
|
||||
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
|
||||
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
|
||||
*
|
||||
* 7. ASSIGNMENT
|
||||
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
|
||||
*
|
||||
* 8. MISCELLANEOUS
|
||||
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
|
||||
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
|
||||
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
|
||||
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
|
||||
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
|
||||
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
|
||||
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.beagle;
|
||||
|
||||
import org.broadinstitute.gatk.engine.walkers.WalkerTest;
|
||||
import org.testng.annotations.Test;
|
||||
|
||||
import java.util.Arrays;
|
||||
|
||||
public class BeagleIntegrationTest extends WalkerTest {
|
||||
|
||||
private static final String beagleValidationDataLocation = privateTestDir + "/Beagle/";
|
||||
@Test
|
||||
public void testBeagleOutput() {
|
||||
WalkerTestSpec spec = new WalkerTestSpec(
|
||||
"-T BeagleOutputToVCF -R " + hg19Reference + " " +
|
||||
"--variant:VCF3 " + beagleValidationDataLocation + "inttestbgl.input.vcf " +
|
||||
"--beagleR2:BEAGLE " + beagleValidationDataLocation + "inttestbgl.r2 " +
|
||||
"--beagleProbs:BEAGLE " + beagleValidationDataLocation + "inttestbgl.gprobs " +
|
||||
"--beaglePhased:BEAGLE " + beagleValidationDataLocation + "inttestbgl.phased " +
|
||||
"-o %s --no_cmdline_in_header -U LENIENT_VCF_PROCESSING", 1, Arrays.asList("1c4f2fed1d452368fa4dfe3e209ebb57"));
|
||||
spec.disableShadowBCF();
|
||||
executeTest("test BeagleOutputToVCF", spec);
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testBeagleInput() {
|
||||
WalkerTestSpec spec = new WalkerTestSpec(
|
||||
"-T ProduceBeagleInput -R " + hg19Reference + " " +
|
||||
"--variant:VCF3 " + beagleValidationDataLocation + "inttestbgl.input.vcf " +
|
||||
"-o %s -U LENIENT_VCF_PROCESSING", 1, Arrays.asList("f301b089d21da259873f04bdc468835d"));
|
||||
spec.disableShadowBCF();
|
||||
executeTest("test BeagleInput", spec);
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testBeagleInput2() {
|
||||
WalkerTestSpec spec = new WalkerTestSpec(
|
||||
"-T ProduceBeagleInput --variant:VCF /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878_HSQ_chr22_14-16m.vcf "+
|
||||
"--validation:VCF /humgen/gsa-hpprojects/GATK/data/Validation_Data/NA12878_OMNI_chr22_14-16m.vcf "+
|
||||
"-L 22:14000000-16000000 -o %s -bvcf %s -bs 0.8 -U LENIENT_VCF_PROCESSING -valp 0.98 -R /humgen/1kg/reference/human_g1k_v37.fasta --no_cmdline_in_header ",2,
|
||||
Arrays.asList("660986891b30cdc937e0f2a3a5743faa","4b6417f892ccfe5c63b8a60cb0ef3740"));
|
||||
spec.disableShadowBCF();
|
||||
executeTest("test BeagleInputWithBootstrap",spec);
|
||||
}
|
||||
|
||||
@Test
|
||||
public void testBeagleOutput2() {
|
||||
WalkerTestSpec spec = new WalkerTestSpec(
|
||||
"-T BeagleOutputToVCF -R "+hg19Reference+" "+
|
||||
"--variant:VCF /humgen/gsa-hpprojects/GATK/data/Validation_Data/EUR_beagle_in_test.vcf "+
|
||||
"--beagleR2:beagle /humgen/gsa-hpprojects/GATK/data/Validation_Data/EUR_beagle_in_test.r2 "+
|
||||
"--beagleProbs:beagle /humgen/gsa-hpprojects/GATK/data/Validation_Data/EUR_beagle_in_test.gprobs.bgl "+
|
||||
"--beaglePhased:beagle /humgen/gsa-hpprojects/GATK/data/Validation_Data/EUR_beagle_in_test.phased.bgl "+
|
||||
"-L 20:1-70000 -o %s --no_cmdline_in_header -U LENIENT_VCF_PROCESSING",1,Arrays.asList("e036636fcd6a748ede4a70ea47941d47"));
|
||||
spec.disableShadowBCF();
|
||||
executeTest("testBeagleChangesSitesToRef",spec);
|
||||
}
|
||||
|
||||
}
|
||||
|
|
@ -1,391 +0,0 @@
|
|||
/*
|
||||
* Copyright (c) 2012 The Broad Institute
|
||||
*
|
||||
* Permission is hereby granted, free of charge, to any person
|
||||
* obtaining a copy of this software and associated documentation
|
||||
* files (the "Software"), to deal in the Software without
|
||||
* restriction, including without limitation the rights to use,
|
||||
* copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
* copies of the Software, and to permit persons to whom the
|
||||
* Software is furnished to do so, subject to the following
|
||||
* conditions:
|
||||
*
|
||||
* The above copyright notice and this permission notice shall be
|
||||
* included in all copies or substantial portions of the Software.
|
||||
*
|
||||
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
||||
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
|
||||
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
|
||||
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
|
||||
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
|
||||
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
|
||||
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.beagle;
|
||||
|
||||
import org.broadinstitute.gatk.utils.commandline.*;
|
||||
import org.broadinstitute.gatk.engine.CommandLineGATK;
|
||||
import org.broadinstitute.gatk.engine.arguments.StandardVariantContextInputArgumentCollection;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.engine.walkers.RodWalker;
|
||||
import org.broadinstitute.gatk.utils.GenomeLoc;
|
||||
import org.broadinstitute.gatk.engine.SampleUtils;
|
||||
import org.broadinstitute.gatk.utils.codecs.beagle.BeagleFeature;
|
||||
import org.broadinstitute.gatk.utils.help.HelpConstants;
|
||||
import org.broadinstitute.gatk.engine.GATKVCFUtils;
|
||||
import htsjdk.variant.vcf.*;
|
||||
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
|
||||
import htsjdk.variant.variantcontext.*;
|
||||
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
|
||||
|
||||
import java.util.*;
|
||||
|
||||
import static java.lang.Math.log10;
|
||||
|
||||
|
||||
/**
|
||||
* Takes files produced by Beagle imputation engine and creates a vcf with modified annotations.
|
||||
*
|
||||
* <p>This walker is intended to be run after Beagle has successfully executed. The full calling sequence for using Beagle along with the GATK is: </p>
|
||||
*
|
||||
* <p>1. Run ProduceBeagleInputWalker. </p>
|
||||
* <p>2. Run Beagle</p>
|
||||
* <p>3. Uncompress output files</p>
|
||||
* <p>4. Run BeagleOutputToVCFWalker.</p>
|
||||
*
|
||||
*
|
||||
* Note that this walker requires all input files produced by Beagle.
|
||||
*
|
||||
*
|
||||
* <h3>Example</h3>
|
||||
* <pre>
|
||||
* java -Xmx4000m -jar dist/GenomeAnalysisTK.jar \
|
||||
* -R reffile.fasta -T BeagleOutputToVCF \
|
||||
* -V input_vcf.vcf \
|
||||
* -beagleR2:BEAGLE /myrun.beagle_output.r2 \
|
||||
* -beaglePhased:BEAGLE /myrun.beagle_output.phased \
|
||||
* -beagleProbs:BEAGLE /myrun.beagle_output.gprobs \
|
||||
* -o output_vcf.vcf
|
||||
* </pre>
|
||||
|
||||
<p> Note that Beagle produces some of these files compressed as .gz, so gunzip must be run on them before walker is run in order to decompress them </p>
|
||||
|
||||
*/
|
||||
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARDISC, extraDocs = {CommandLineGATK.class} )
|
||||
public class BeagleOutputToVCF extends RodWalker<Integer, Integer> {
|
||||
|
||||
@ArgumentCollection
|
||||
protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
|
||||
|
||||
/**
|
||||
* If this argument is present, the original allele frequencies and counts from this vcf are added as annotations ACH,AFH and ANH. at each record present in this vcf
|
||||
*/
|
||||
@Input(fullName="comp", shortName = "comp", doc="Comparison VCF file", required=false)
|
||||
public RodBinding<VariantContext> comp;
|
||||
|
||||
|
||||
/**
|
||||
* This required argument is used to annotate each site in the vcf INFO field with R2 annotation. Will be NaN if Beagle determined there are no variant samples.
|
||||
*/
|
||||
@Input(fullName="beagleR2", shortName = "beagleR2", doc="Beagle-produced .r2 file containing R^2 values for all markers", required=true)
|
||||
public RodBinding<BeagleFeature> beagleR2;
|
||||
|
||||
/**
|
||||
* These values will populate the GL field for each sample and contain the posterior probability of each genotype given the data after phasing and imputation.
|
||||
*/
|
||||
@Input(fullName="beagleProbs", shortName = "beagleProbs", doc="Beagle-produced .probs file containing posterior genotype probabilities", required=true)
|
||||
public RodBinding<BeagleFeature> beagleProbs;
|
||||
|
||||
/**
|
||||
* By default, all genotypes will be marked in the VCF as "phased", using the "|" separator after Beagle.
|
||||
*/
|
||||
@Input(fullName="beaglePhased", shortName = "beaglePhased", doc="Beagle-produced .phased file containing phased genotypes", required=true)
|
||||
public RodBinding<BeagleFeature> beaglePhased;
|
||||
|
||||
@Output(doc="VCF File to which variants should be written")
|
||||
protected VariantContextWriter vcfWriter = null;
|
||||
|
||||
/**
|
||||
* If this argument is absent, and if Beagle determines that there is no sample in a site that has a variant genotype, the site will be marked as filtered (Default behavior).
|
||||
* If the argument is present, the site won't be marked as filtered under this condition even if there are no variant genotypes.
|
||||
*/
|
||||
@Argument(fullName="dont_mark_monomorphic_sites_as_filtered", shortName="keep_monomorphic", doc="If provided, we won't filter sites that beagle tags as monomorphic. Useful for imputing a sample's genotypes from a reference panel" ,required=false)
|
||||
public boolean DONT_FILTER_MONOMORPHIC_SITES = false;
|
||||
|
||||
/**
|
||||
* Value between 0 and 1. If the probability of getting a genotype correctly (based on the posterior genotype probabilities and the actual genotype) is below this threshold,
|
||||
* a genotype will be substitute by a no-call.
|
||||
*/
|
||||
@Argument(fullName="no" +
|
||||
"call_threshold", shortName="ncthr", doc="Threshold of confidence at which a genotype won't be called", required=false)
|
||||
private double noCallThreshold = 0.0;
|
||||
|
||||
protected static String line = null;
|
||||
|
||||
private final double MIN_PROB_ERROR = 0.000001;
|
||||
private final double MAX_GENOTYPE_QUALITY = -6.0;
|
||||
|
||||
public void initialize() {
|
||||
|
||||
// setup the header fields
|
||||
|
||||
final Set<VCFHeaderLine> hInfo = new HashSet<>();
|
||||
hInfo.addAll(GATKVCFUtils.getHeaderFields(getToolkit()));
|
||||
hInfo.add(GATKVCFHeaderLines.getFormatLine(GATKVCFConstants.ORIGINAL_GENOTYPE_KEY));
|
||||
hInfo.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.BEAGLE_R2_KEY));
|
||||
hInfo.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.NUM_GENOTYPES_CHANGED_KEY));
|
||||
hInfo.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.ORIGINAL_ALT_ALLELE_INFO_KEY));
|
||||
hInfo.add(GATKVCFHeaderLines.getFilterLine(GATKVCFConstants.BEAGLE_MONO_FILTER_NAME));
|
||||
|
||||
if ( comp.isBound() ) {
|
||||
hInfo.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.BEAGLE_AC_COMP_KEY));
|
||||
hInfo.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.BEAGLE_AF_COMP_KEY));
|
||||
hInfo.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.BEAGLE_AN_COMP_KEY));
|
||||
}
|
||||
|
||||
Set<String> samples = SampleUtils.getSampleListWithVCFHeader(getToolkit(), Arrays.asList(variantCollection.variants.getName()));
|
||||
|
||||
final VCFHeader vcfHeader = new VCFHeader(hInfo, samples);
|
||||
vcfWriter.writeHeader(vcfHeader);
|
||||
}
|
||||
|
||||
public Integer map( RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context ) {
|
||||
|
||||
if ( tracker == null )
|
||||
return 0;
|
||||
|
||||
GenomeLoc loc = context.getLocation();
|
||||
VariantContext vc_input = tracker.getFirstValue(variantCollection.variants, loc);
|
||||
|
||||
VariantContext vc_comp = tracker.getFirstValue(comp, loc);
|
||||
|
||||
if ( vc_input == null )
|
||||
return 0;
|
||||
|
||||
if (vc_input.isFiltered()) {
|
||||
vcfWriter.add(vc_input);
|
||||
return 1;
|
||||
}
|
||||
|
||||
BeagleFeature beagleR2Feature = tracker.getFirstValue(beagleR2);
|
||||
BeagleFeature beagleProbsFeature = tracker.getFirstValue(beagleProbs);
|
||||
BeagleFeature beaglePhasedFeature = tracker.getFirstValue(beaglePhased);
|
||||
|
||||
// ignore places where we don't have a variant
|
||||
if ( beagleR2Feature == null || beagleProbsFeature == null || beaglePhasedFeature == null)
|
||||
{
|
||||
vcfWriter.add(vc_input);
|
||||
return 1;
|
||||
}
|
||||
|
||||
|
||||
// get reference base for current position
|
||||
byte refByte = ref.getBase();
|
||||
|
||||
// make new Genotypes based on Beagle results
|
||||
GenotypesContext genotypes = GenotypesContext.create(vc_input.getGenotypes().size());
|
||||
|
||||
// for each genotype, create a new object with Beagle information on it
|
||||
|
||||
int numGenotypesChangedByBeagle = 0;
|
||||
Integer alleleCountH = 0, chrCountH = 0;
|
||||
Double alleleFrequencyH = 0.0;
|
||||
int beagleVarCounts = 0;
|
||||
|
||||
GenotypesContext hapmapGenotypes = null;
|
||||
|
||||
if (vc_comp != null) {
|
||||
hapmapGenotypes = vc_comp.getGenotypes();
|
||||
}
|
||||
|
||||
for ( final Genotype g : vc_input.getGenotypes() ) {
|
||||
boolean genotypeIsPhased = true;
|
||||
String sample = g.getSampleName();
|
||||
|
||||
// If we have a Hapmap (comp) ROD, compute Hapmap AC, AN and AF
|
||||
// use sample as key into genotypes structure
|
||||
if (vc_comp != null) {
|
||||
|
||||
if (vc_input.getGenotypes().containsSample(sample) && hapmapGenotypes.containsSample(sample)) {
|
||||
|
||||
Genotype hapmapGenotype = hapmapGenotypes.get(sample);
|
||||
if (hapmapGenotype.isCalled()){
|
||||
chrCountH += 2;
|
||||
if (hapmapGenotype.isHet()) {
|
||||
alleleCountH += 1;
|
||||
} else if (hapmapGenotype.isHomVar()) {
|
||||
alleleCountH += 2;
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
ArrayList<String> beagleProbabilities = beagleProbsFeature.getProbLikelihoods().get(sample);
|
||||
ArrayList<String> beagleGenotypePairs = beaglePhasedFeature.getGenotypes().get(sample);
|
||||
|
||||
// original alleles at this genotype
|
||||
Allele originalAlleleA = g.getAllele(0);
|
||||
|
||||
Allele originalAlleleB = (g.getAlleles().size() == 2) ? g.getAllele(1) : g.getAllele(0); // hack to deal with no-call genotypes
|
||||
|
||||
|
||||
// We have phased genotype in hp. Need to set the isRef field in the allele.
|
||||
List<Allele> alleles = new ArrayList<>();
|
||||
|
||||
String alleleA = beagleGenotypePairs.get(0);
|
||||
String alleleB = beagleGenotypePairs.get(1);
|
||||
|
||||
if ( alleleA.equals("null") || alleleB.equals("null") ) {
|
||||
logger.warn("Beagle produced 'null' alleles at location "+ref.getLocus().toString()+". Ignoring.");
|
||||
return 0;
|
||||
}
|
||||
|
||||
// Beagle always produces genotype strings based on the strings we input in the likelihood file.
|
||||
String refString = vc_input.getReference().getDisplayString();
|
||||
|
||||
Allele bglAlleleA, bglAlleleB;
|
||||
|
||||
if (alleleA.matches(refString))
|
||||
bglAlleleA = Allele.create(alleleA,true);
|
||||
else
|
||||
bglAlleleA = Allele.create(alleleA,false);
|
||||
|
||||
if (alleleB.matches(refString))
|
||||
bglAlleleB = Allele.create(alleleB,true);
|
||||
else
|
||||
bglAlleleB = Allele.create(alleleB,false);
|
||||
|
||||
|
||||
alleles.add(bglAlleleA);
|
||||
alleles.add(bglAlleleB);
|
||||
|
||||
// Compute new GQ field = -10*log10Pr(Genotype call is wrong)
|
||||
// Beagle gives probability that genotype is AA, AB and BB.
|
||||
// Which, by definition, are prob of hom ref, het and hom var.
|
||||
double probWrongGenotype, genotypeQuality;
|
||||
Double homRefProbability = Double.valueOf(beagleProbabilities.get(0));
|
||||
Double hetProbability = Double.valueOf(beagleProbabilities.get(1));
|
||||
Double homVarProbability = Double.valueOf(beagleProbabilities.get(2));
|
||||
|
||||
if (bglAlleleA.isReference() && bglAlleleB.isReference()) // HomRef call
|
||||
probWrongGenotype = hetProbability + homVarProbability;
|
||||
else if ((bglAlleleB.isReference() && bglAlleleA.isNonReference()) || (bglAlleleA.isReference() && bglAlleleB.isNonReference()))
|
||||
probWrongGenotype = homRefProbability + homVarProbability;
|
||||
else // HomVar call
|
||||
probWrongGenotype = hetProbability + homRefProbability;
|
||||
|
||||
// deal with numerical errors coming from limited formatting value on Beagle output files
|
||||
if (probWrongGenotype > 1 - MIN_PROB_ERROR)
|
||||
probWrongGenotype = 1 - MIN_PROB_ERROR;
|
||||
|
||||
if (1-probWrongGenotype < noCallThreshold) {
|
||||
// quality is bad: don't call genotype
|
||||
alleles.clear();
|
||||
alleles.add(originalAlleleA);
|
||||
alleles.add(originalAlleleB);
|
||||
genotypeIsPhased = false;
|
||||
}
|
||||
|
||||
if (probWrongGenotype < MIN_PROB_ERROR)
|
||||
genotypeQuality = MAX_GENOTYPE_QUALITY;
|
||||
else
|
||||
genotypeQuality = log10(probWrongGenotype);
|
||||
|
||||
HashMap<String,Object> originalAttributes = new HashMap<>(g.getExtendedAttributes());
|
||||
|
||||
// get original encoding and add to keynotype attributes
|
||||
String a1, a2, og;
|
||||
if (originalAlleleA.isNoCall())
|
||||
a1 = ".";
|
||||
else if (originalAlleleA.isReference())
|
||||
a1 = "0";
|
||||
else
|
||||
a1 = "1";
|
||||
|
||||
if (originalAlleleB.isNoCall())
|
||||
a2 = ".";
|
||||
else if (originalAlleleB.isReference())
|
||||
a2 = "0";
|
||||
else
|
||||
a2 = "1";
|
||||
|
||||
og = a1+"/"+a2;
|
||||
|
||||
// See if Beagle switched genotypes
|
||||
if (! originalAlleleA.equals(Allele.NO_CALL) && beagleSwitchedGenotypes(bglAlleleA,originalAlleleA,bglAlleleB,originalAlleleB)){
|
||||
originalAttributes.put(GATKVCFConstants.ORIGINAL_GENOTYPE_KEY, og);
|
||||
numGenotypesChangedByBeagle++;
|
||||
}
|
||||
else {
|
||||
originalAttributes.put(GATKVCFConstants.ORIGINAL_GENOTYPE_KEY, ".");
|
||||
}
|
||||
Genotype imputedGenotype = new GenotypeBuilder(g).alleles(alleles).log10PError(genotypeQuality).attributes(originalAttributes).phased(genotypeIsPhased).make();
|
||||
if ( imputedGenotype.isHet() || imputedGenotype.isHomVar() ) {
|
||||
beagleVarCounts++;
|
||||
}
|
||||
|
||||
genotypes.add(imputedGenotype);
|
||||
}
|
||||
|
||||
final VariantContextBuilder builder = new VariantContextBuilder(vc_input).source("outputvcf").genotypes(genotypes);
|
||||
if ( ! ( beagleVarCounts > 0 || DONT_FILTER_MONOMORPHIC_SITES ) ) {
|
||||
builder.attribute(GATKVCFConstants.ORIGINAL_ALT_ALLELE_INFO_KEY, vc_input.getAlternateAllele(0));
|
||||
builder.alleles(Collections.singleton(vc_input.getReference())).filter(GATKVCFConstants.BEAGLE_MONO_FILTER_NAME);
|
||||
}
|
||||
|
||||
// re-compute chromosome counts
|
||||
VariantContextUtils.calculateChromosomeCounts(builder, false);
|
||||
|
||||
// Get Hapmap AC and AF
|
||||
if (vc_comp != null) {
|
||||
builder.attribute(GATKVCFConstants.BEAGLE_AC_COMP_KEY, alleleCountH.toString() );
|
||||
builder.attribute(GATKVCFConstants.BEAGLE_AN_COMP_KEY, chrCountH.toString() );
|
||||
builder.attribute(GATKVCFConstants.BEAGLE_AF_COMP_KEY, String.format("%4.2f", (double)alleleCountH/chrCountH) );
|
||||
|
||||
}
|
||||
|
||||
builder.attribute(GATKVCFConstants.NUM_GENOTYPES_CHANGED_KEY, numGenotypesChangedByBeagle );
|
||||
if( !beagleR2Feature.getR2value().equals(Double.NaN) ) {
|
||||
builder.attribute(GATKVCFConstants.BEAGLE_R2_KEY, beagleR2Feature.getR2value().toString() );
|
||||
}
|
||||
|
||||
vcfWriter.add(builder.make());
|
||||
|
||||
return 1;
|
||||
}
|
||||
|
||||
private boolean beagleSwitchedGenotypes(Allele bglAlleleA, Allele originalAlleleA, Allele bglAlleleB, Allele originalAlleleB) {
|
||||
return !((bglAlleleA.equals(originalAlleleA) && bglAlleleB.equals(originalAlleleB) ||
|
||||
(bglAlleleA.equals(originalAlleleB) && bglAlleleB.equals(originalAlleleA))));
|
||||
}
|
||||
|
||||
public Integer reduceInit() {
|
||||
return 0; // Nothing to do here
|
||||
}
|
||||
|
||||
/**
|
||||
* Increment the number of loci processed.
|
||||
*
|
||||
* @param value result of the map.
|
||||
* @param sum accumulator for the reduce.
|
||||
* @return the new number of loci processed.
|
||||
*/
|
||||
public Integer reduce(Integer value, Integer sum) {
|
||||
return sum + value;
|
||||
}
|
||||
|
||||
/**
|
||||
* Tell the user the number of loci processed and close out the new variants file.
|
||||
*
|
||||
* @param result the number of loci seen.
|
||||
*/
|
||||
public void onTraversalDone(Integer result) {
|
||||
System.out.printf("Processed %d loci.\n", result);
|
||||
}
|
||||
}
|
||||
|
|
@ -1,463 +0,0 @@
|
|||
/*
|
||||
* Copyright (c) 2012 The Broad Institute
|
||||
*
|
||||
* Permission is hereby granted, free of charge, to any person
|
||||
* obtaining a copy of this software and associated documentation
|
||||
* files (the "Software"), to deal in the Software without
|
||||
* restriction, including without limitation the rights to use,
|
||||
* copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
* copies of the Software, and to permit persons to whom the
|
||||
* Software is furnished to do so, subject to the following
|
||||
* conditions:
|
||||
*
|
||||
* The above copyright notice and this permission notice shall be
|
||||
* included in all copies or substantial portions of the Software.
|
||||
*
|
||||
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
||||
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
|
||||
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
|
||||
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
|
||||
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
|
||||
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
|
||||
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.beagle;
|
||||
|
||||
import org.broadinstitute.gatk.utils.Utils;
|
||||
import org.broadinstitute.gatk.utils.commandline.*;
|
||||
import org.broadinstitute.gatk.engine.CommandLineGATK;
|
||||
import org.broadinstitute.gatk.engine.arguments.StandardVariantContextInputArgumentCollection;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.engine.samples.Gender;
|
||||
import org.broadinstitute.gatk.engine.walkers.RodWalker;
|
||||
import org.broadinstitute.gatk.tools.walkers.variantrecalibration.VQSRCalibrationCurve;
|
||||
import org.broadinstitute.gatk.utils.GenomeLoc;
|
||||
import org.broadinstitute.gatk.utils.MathUtils;
|
||||
import org.broadinstitute.gatk.engine.SampleUtils;
|
||||
import org.broadinstitute.gatk.utils.help.HelpConstants;
|
||||
import org.broadinstitute.gatk.engine.GATKVCFUtils;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils;
|
||||
import htsjdk.variant.vcf.VCFFilterHeaderLine;
|
||||
import htsjdk.variant.vcf.VCFHeader;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import org.broadinstitute.gatk.utils.exceptions.GATKException;
|
||||
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
|
||||
import htsjdk.variant.variantcontext.*;
|
||||
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
|
||||
|
||||
import java.io.File;
|
||||
import java.io.PrintStream;
|
||||
import java.util.*;
|
||||
|
||||
/**
|
||||
* Converts the input VCF into a format accepted by the Beagle imputation/analysis program.
|
||||
* <p>
|
||||
*
|
||||
* <h3>Input</h3>
|
||||
* <p>
|
||||
* A VCF with variants to convert to Beagle format
|
||||
* </p>
|
||||
*
|
||||
* <h2>Outputs</h2>
|
||||
* <p>
|
||||
* A single text file which can be fed to Beagle
|
||||
* </p>
|
||||
* <p>
|
||||
* Optional: A file with a list of markers
|
||||
* </p>
|
||||
*
|
||||
* <h3>Examples</h3>
|
||||
* <pre>
|
||||
* java -Xmx2g -jar dist/GenomeAnalysisTK.jar -L 20 \
|
||||
* -R reffile.fasta -T ProduceBeagleInput \
|
||||
* -V path_to_input_vcf/inputvcf.vcf -o path_to_beagle_output/beagle_output
|
||||
* </pre>
|
||||
*
|
||||
*/
|
||||
|
||||
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARDISC, extraDocs = {CommandLineGATK.class} )
|
||||
public class ProduceBeagleInput extends RodWalker<Integer, Integer> {
|
||||
|
||||
@ArgumentCollection protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
|
||||
|
||||
@Hidden
|
||||
@Input(fullName="validation", shortName = "validation", doc="Validation VCF file", required=false)
|
||||
public RodBinding<VariantContext> validation;
|
||||
|
||||
|
||||
@Output(doc="File to which BEAGLE input should be written")
|
||||
protected PrintStream beagleWriter = null;
|
||||
|
||||
@Hidden
|
||||
@Output(doc="File to which BEAGLE markers should be written", shortName="markers", fullName = "markers", required = false, defaultToStdout = false)
|
||||
protected PrintStream markers = null;
|
||||
int markerCounter = 1;
|
||||
|
||||
@Hidden
|
||||
@Input(doc="VQSqual calibration file", shortName = "cc", required=false)
|
||||
protected File VQSRCalibrationFile = null;
|
||||
protected VQSRCalibrationCurve VQSRCalibrator = null;
|
||||
|
||||
@Hidden
|
||||
@Argument(doc="VQSqual key", shortName = "vqskey", required=false)
|
||||
protected String VQSLOD_KEY = "VQSqual";
|
||||
|
||||
@Hidden
|
||||
@Argument(fullName = "inserted_nocall_rate", shortName = "nc_rate", doc = "Rate (0-1) at which genotype no-calls will be randomly inserted, for testing", required = false)
|
||||
public double insertedNoCallRate = 0;
|
||||
@Hidden
|
||||
@Argument(fullName = "validation_genotype_ptrue", shortName = "valp", doc = "Flat probability to assign to validation genotypes. Will override GL field.", required = false)
|
||||
public double validationPrior = -1.0;
|
||||
@Hidden
|
||||
@Argument(fullName = "validation_bootstrap", shortName = "bs", doc = "Proportion of records to be used in bootstrap set", required = false)
|
||||
public double bootstrap = 0.0;
|
||||
@Hidden
|
||||
@Argument(fullName = "bootstrap_vcf",shortName = "bvcf", doc = "Output a VCF with the records used for bootstrapping filtered out", required = false)
|
||||
VariantContextWriter bootstrapVCFOutput = null;
|
||||
|
||||
/**
|
||||
* If sample gender is known, this flag should be set to true to ensure that Beagle treats male Chr X properly.
|
||||
*/
|
||||
@Argument(fullName = "checkIsMaleOnChrX", shortName = "checkIsMaleOnChrX", doc = "Set to true when Beagle-ing chrX and want to ensure male samples don't have heterozygous calls.", required = false)
|
||||
public boolean CHECK_IS_MALE_ON_CHR_X = false;
|
||||
|
||||
@Hidden
|
||||
@Argument(fullName = "variant_genotype_ptrue", shortName = "varp", doc = "Flat probability prior to assign to variant (not validation) genotypes. Does not override GL field.", required = false)
|
||||
public double variantPrior = 0.96;
|
||||
|
||||
private Set<String> samples = null;
|
||||
private Set<String> BOOTSTRAP_FILTER = new HashSet<String>( Arrays.asList("bootstrap") );
|
||||
private int bootstrapSetSize = 0;
|
||||
private int testSetSize = 0;
|
||||
private CachingFormatter formatter = new CachingFormatter("%5.4f ", 100000);
|
||||
private int certainFPs = 0;
|
||||
|
||||
public void initialize() {
|
||||
|
||||
samples = SampleUtils.getSampleListWithVCFHeader(getToolkit(), Arrays.asList(variantCollection.variants.getName()));
|
||||
|
||||
beagleWriter.print("marker alleleA alleleB");
|
||||
for ( String sample : samples )
|
||||
beagleWriter.print(String.format(" %s %s %s", sample, sample, sample));
|
||||
|
||||
beagleWriter.println();
|
||||
|
||||
if ( bootstrapVCFOutput != null ) {
|
||||
initializeVcfWriter();
|
||||
}
|
||||
|
||||
if ( VQSRCalibrationFile != null ) {
|
||||
VQSRCalibrator = VQSRCalibrationCurve.readFromFile(VQSRCalibrationFile);
|
||||
logger.info("Read calibration curve");
|
||||
VQSRCalibrator.printInfo(logger);
|
||||
}
|
||||
}
|
||||
|
||||
public Integer map( RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context ) {
|
||||
if( tracker != null ) {
|
||||
GenomeLoc loc = context.getLocation();
|
||||
VariantContext variant_eval = tracker.getFirstValue(variantCollection.variants, loc);
|
||||
VariantContext validation_eval = tracker.getFirstValue(validation, loc);
|
||||
|
||||
if ( goodSite(variant_eval,validation_eval) ) {
|
||||
if ( useValidation(validation_eval, ref) ) {
|
||||
writeBeagleOutput(validation_eval, variant_eval, true, validationPrior);
|
||||
return 1;
|
||||
} else {
|
||||
if ( goodSite(variant_eval) ) {
|
||||
writeBeagleOutput(variant_eval,validation_eval,false,variantPrior);
|
||||
return 1;
|
||||
} else { // todo -- if the variant site is bad, validation is good, but not in bootstrap set -- what do?
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
} else {
|
||||
return 0;
|
||||
}
|
||||
} else {
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
public boolean goodSite(VariantContext a, VariantContext b) {
|
||||
return goodSite(a) || goodSite(b);
|
||||
}
|
||||
|
||||
public boolean goodSite(VariantContext v) {
|
||||
if ( canBeOutputToBeagle(v) ) {
|
||||
if ( VQSRCalibrator != null && VQSRCalibrator.certainFalsePositive(VQSLOD_KEY, v) ) {
|
||||
certainFPs++;
|
||||
return false;
|
||||
} else {
|
||||
return true;
|
||||
}
|
||||
} else {
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
public static boolean canBeOutputToBeagle(VariantContext v) {
|
||||
return v != null && ! v.isFiltered() && v.isBiallelic() && v.hasGenotypes();
|
||||
}
|
||||
|
||||
public boolean useValidation(VariantContext validation, ReferenceContext ref) {
|
||||
if( goodSite(validation) ) {
|
||||
// if using record keeps us below expected proportion, use it
|
||||
logger.debug(String.format("boot: %d, test: %d, total: %d", bootstrapSetSize, testSetSize, bootstrapSetSize+testSetSize+1));
|
||||
if ( (bootstrapSetSize+1.0)/(1.0+bootstrapSetSize+testSetSize) <= bootstrap ) {
|
||||
if ( bootstrapVCFOutput != null ) {
|
||||
bootstrapVCFOutput.add(new VariantContextBuilder(validation).filters(BOOTSTRAP_FILTER).make());
|
||||
}
|
||||
bootstrapSetSize++;
|
||||
return true;
|
||||
} else {
|
||||
if ( bootstrapVCFOutput != null ) {
|
||||
bootstrapVCFOutput.add(validation);
|
||||
}
|
||||
testSetSize++;
|
||||
return false;
|
||||
}
|
||||
} else {
|
||||
if ( validation != null && bootstrapVCFOutput != null ) {
|
||||
bootstrapVCFOutput.add(validation);
|
||||
}
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
private final static double[] HAPLOID_FLAT_LOG10_LIKELIHOODS = MathUtils.toLog10(new double[]{ 0.5, 0.0, 0.5 });
|
||||
private final static double[] DIPLOID_FLAT_LOG10_LIKELIHOODS = MathUtils.toLog10(new double[]{ 0.33, 0.33, 0.33 });
|
||||
|
||||
public void writeBeagleOutput(VariantContext preferredVC, VariantContext otherVC, boolean isValidationSite, double prior) {
|
||||
GenomeLoc currentLoc = GATKVariantContextUtils.getLocation(getToolkit().getGenomeLocParser(), preferredVC);
|
||||
StringBuffer beagleOut = new StringBuffer();
|
||||
|
||||
String marker = String.format("%s:%d ",currentLoc.getContig(),currentLoc.getStart());
|
||||
beagleOut.append(marker);
|
||||
if ( markers != null ) markers.append(marker).append("\t").append(Integer.toString(markerCounter++)).append("\t");
|
||||
for ( Allele allele : preferredVC.getAlleles() ) {
|
||||
String bglPrintString;
|
||||
if (allele.isNoCall())
|
||||
bglPrintString = "-";
|
||||
else
|
||||
bglPrintString = allele.getBaseString(); // get rid of * in case of reference allele
|
||||
|
||||
beagleOut.append(String.format("%s ", bglPrintString));
|
||||
if ( markers != null ) markers.append(bglPrintString).append("\t");
|
||||
}
|
||||
if ( markers != null ) markers.append("\n");
|
||||
|
||||
GenotypesContext preferredGenotypes = preferredVC.getGenotypes();
|
||||
GenotypesContext otherGenotypes = goodSite(otherVC) ? otherVC.getGenotypes() : null;
|
||||
for ( String sample : samples ) {
|
||||
boolean isMaleOnChrX = CHECK_IS_MALE_ON_CHR_X && getSample(sample).getGender() == Gender.MALE;
|
||||
|
||||
Genotype genotype;
|
||||
boolean isValidation;
|
||||
// use sample as key into genotypes structure
|
||||
if ( preferredGenotypes.containsSample(sample) ) {
|
||||
genotype = preferredGenotypes.get(sample);
|
||||
isValidation = isValidationSite;
|
||||
} else if ( otherGenotypes != null && otherGenotypes.containsSample(sample) ) {
|
||||
genotype = otherGenotypes.get(sample);
|
||||
isValidation = ! isValidationSite;
|
||||
} else {
|
||||
// there is magically no genotype for this sample.
|
||||
throw new GATKException("Sample "+sample+" arose with no genotype in variant or validation VCF. This should never happen.");
|
||||
}
|
||||
|
||||
/*
|
||||
* Use likelihoods if: is validation, prior is negative; or: is not validation, has genotype key
|
||||
*/
|
||||
double [] log10Likelihoods = null;
|
||||
if ( (isValidation && prior < 0.0) || genotype.hasLikelihoods() ) {
|
||||
log10Likelihoods = genotype.getLikelihoods().getAsVector();
|
||||
|
||||
// see if we need to randomly mask out genotype in this position.
|
||||
if ( Utils.getRandomGenerator().nextDouble() <= insertedNoCallRate ) {
|
||||
// we are masking out this genotype
|
||||
log10Likelihoods = isMaleOnChrX ? HAPLOID_FLAT_LOG10_LIKELIHOODS : DIPLOID_FLAT_LOG10_LIKELIHOODS;
|
||||
}
|
||||
|
||||
if( isMaleOnChrX ) {
|
||||
log10Likelihoods[1] = -255; // todo -- warning this is dangerous for multi-allele case
|
||||
}
|
||||
}
|
||||
/**
|
||||
* otherwise, use the prior uniformly
|
||||
*/
|
||||
else if (! isValidation && genotype.isCalled() && ! genotype.hasLikelihoods() ) {
|
||||
// hack to deal with input VCFs with no genotype likelihoods. Just assume the called genotype
|
||||
// is confident. This is useful for Hapmap and 1KG release VCFs.
|
||||
double AA = (1.0-prior)/2.0;
|
||||
double AB = (1.0-prior)/2.0;
|
||||
double BB = (1.0-prior)/2.0;
|
||||
|
||||
if (genotype.isHomRef()) { AA = prior; }
|
||||
else if (genotype.isHet()) { AB = prior; }
|
||||
else if (genotype.isHomVar()) { BB = prior; }
|
||||
|
||||
log10Likelihoods = MathUtils.toLog10(new double[]{ AA, isMaleOnChrX ? 0.0 : AB, BB });
|
||||
}
|
||||
else {
|
||||
log10Likelihoods = isMaleOnChrX ? HAPLOID_FLAT_LOG10_LIKELIHOODS : DIPLOID_FLAT_LOG10_LIKELIHOODS;
|
||||
}
|
||||
|
||||
writeSampleLikelihoods(beagleOut, preferredVC, log10Likelihoods);
|
||||
}
|
||||
|
||||
beagleWriter.println(beagleOut.toString());
|
||||
}
|
||||
|
||||
private void writeSampleLikelihoods( StringBuffer out, VariantContext vc, double[] log10Likelihoods ) {
|
||||
if ( VQSRCalibrator != null ) {
|
||||
log10Likelihoods = VQSRCalibrator.includeErrorRateInLikelihoods(VQSLOD_KEY, vc, log10Likelihoods);
|
||||
}
|
||||
|
||||
double[] normalizedLikelihoods = MathUtils.normalizeFromLog10(log10Likelihoods);
|
||||
// see if we need to randomly mask out genotype in this position.
|
||||
for (double likeVal: normalizedLikelihoods) {
|
||||
out.append(formatter.format(likeVal));
|
||||
// out.append(String.format("%5.4f ",likeVal));
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
public Integer reduceInit() {
|
||||
return 0; // Nothing to do here
|
||||
}
|
||||
|
||||
public Integer reduce( Integer value, Integer sum ) {
|
||||
return value + sum; // count up the sites
|
||||
}
|
||||
|
||||
public void onTraversalDone( Integer includedSites ) {
|
||||
logger.info("Sites included in beagle likelihoods file : " + includedSites);
|
||||
logger.info(String.format("Certain false positive found from recalibration curve : %d (%.2f%%)",
|
||||
certainFPs, (100.0 * certainFPs) / (Math.max(certainFPs + includedSites, 1))));
|
||||
}
|
||||
|
||||
private void initializeVcfWriter() {
|
||||
final List<String> inputNames = Arrays.asList(validation.getName());
|
||||
|
||||
// setup the header fields
|
||||
Set<VCFHeaderLine> hInfo = new HashSet<VCFHeaderLine>();
|
||||
hInfo.addAll(GATKVCFUtils.getHeaderFields(getToolkit(), inputNames));
|
||||
hInfo.add(new VCFFilterHeaderLine("bootstrap","This site used for genotype bootstrapping with ProduceBeagleInputWalker"));
|
||||
|
||||
bootstrapVCFOutput.writeHeader(new VCFHeader(hInfo, SampleUtils.getUniqueSamplesFromRods(getToolkit(), inputNames)));
|
||||
}
|
||||
|
||||
public static class CachingFormatter {
|
||||
private String format;
|
||||
private LRUCache<Double, String> cache;
|
||||
|
||||
public String getFormat() {
|
||||
return format;
|
||||
}
|
||||
|
||||
public String format(double value) {
|
||||
String f = cache.get(value);
|
||||
if ( f == null ) {
|
||||
f = String.format(format, value);
|
||||
cache.put(value, f);
|
||||
// if ( cache.usedEntries() < maxCacheSize ) {
|
||||
// System.out.printf("CACHE size %d%n", cache.usedEntries());
|
||||
// } else {
|
||||
// System.out.printf("CACHE is full %f%n", value);
|
||||
// }
|
||||
// }
|
||||
// } else {
|
||||
// System.out.printf("CACHE hit %f%n", value);
|
||||
// }
|
||||
}
|
||||
|
||||
return f;
|
||||
}
|
||||
|
||||
public CachingFormatter(String format, int maxCacheSize) {
|
||||
this.format = format;
|
||||
this.cache = new LRUCache<Double, String>(maxCacheSize);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* An LRU cache, based on <code>LinkedHashMap</code>.
|
||||
*
|
||||
* <p>
|
||||
* This cache has a fixed maximum number of elements (<code>cacheSize</code>).
|
||||
* If the cache is full and another entry is added, the LRU (least recently used) entry is dropped.
|
||||
*
|
||||
* <p>
|
||||
* This class is thread-safe. All methods of this class are synchronized.
|
||||
*
|
||||
* <p>
|
||||
* Author: Christian d'Heureuse, Inventec Informatik AG, Zurich, Switzerland<br>
|
||||
* Multi-licensed: EPL / LGPL / GPL / AL / BSD.
|
||||
*/
|
||||
public static class LRUCache<K,V> {
|
||||
|
||||
private static final float hashTableLoadFactor = 0.75f;
|
||||
|
||||
private LinkedHashMap<K,V> map;
|
||||
private int cacheSize;
|
||||
|
||||
/**
|
||||
* Creates a new LRU cache.
|
||||
* @param cacheSize the maximum number of entries that will be kept in this cache.
|
||||
*/
|
||||
public LRUCache (int cacheSize) {
|
||||
this.cacheSize = cacheSize;
|
||||
int hashTableCapacity = (int)Math.ceil(cacheSize / hashTableLoadFactor) + 1;
|
||||
map = new LinkedHashMap<K,V>(hashTableCapacity, hashTableLoadFactor, true) {
|
||||
// (an anonymous inner class)
|
||||
private static final long serialVersionUID = 1;
|
||||
@Override protected boolean removeEldestEntry (Map.Entry<K,V> eldest) {
|
||||
return size() > LRUCache.this.cacheSize; }}; }
|
||||
|
||||
/**
|
||||
* Retrieves an entry from the cache.<br>
|
||||
* The retrieved entry becomes the MRU (most recently used) entry.
|
||||
* @param key the key whose associated value is to be returned.
|
||||
* @return the value associated to this key, or null if no value with this key exists in the cache.
|
||||
*/
|
||||
public synchronized V get (K key) {
|
||||
return map.get(key); }
|
||||
|
||||
/**
|
||||
* Adds an entry to this cache.
|
||||
* The new entry becomes the MRU (most recently used) entry.
|
||||
* If an entry with the specified key already exists in the cache, it is replaced by the new entry.
|
||||
* If the cache is full, the LRU (least recently used) entry is removed from the cache.
|
||||
* @param key the key with which the specified value is to be associated.
|
||||
* @param value a value to be associated with the specified key.
|
||||
*/
|
||||
public synchronized void put (K key, V value) {
|
||||
map.put (key, value); }
|
||||
|
||||
/**
|
||||
* Clears the cache.
|
||||
*/
|
||||
public synchronized void clear() {
|
||||
map.clear(); }
|
||||
|
||||
/**
|
||||
* Returns the number of used entries in the cache.
|
||||
* @return the number of entries currently in the cache.
|
||||
*/
|
||||
public synchronized int usedEntries() {
|
||||
return map.size(); }
|
||||
|
||||
/**
|
||||
* Returns a <code>Collection</code> that contains a copy of all cache entries.
|
||||
* @return a <code>Collection</code> with a copy of the cache content.
|
||||
*/
|
||||
public synchronized Collection<Map.Entry<K,V>> getAll() {
|
||||
return new ArrayList<Map.Entry<K,V>>(map.entrySet()); }
|
||||
|
||||
} // end class LRUCache
|
||||
}
|
||||
|
|
@ -1,184 +0,0 @@
|
|||
/*
|
||||
* Copyright (c) 2012 The Broad Institute
|
||||
*
|
||||
* Permission is hereby granted, free of charge, to any person
|
||||
* obtaining a copy of this software and associated documentation
|
||||
* files (the "Software"), to deal in the Software without
|
||||
* restriction, including without limitation the rights to use,
|
||||
* copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
* copies of the Software, and to permit persons to whom the
|
||||
* Software is furnished to do so, subject to the following
|
||||
* conditions:
|
||||
*
|
||||
* The above copyright notice and this permission notice shall be
|
||||
* included in all copies or substantial portions of the Software.
|
||||
*
|
||||
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
|
||||
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
|
||||
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
|
||||
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
|
||||
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
|
||||
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
|
||||
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
|
||||
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
*/
|
||||
|
||||
package org.broadinstitute.gatk.tools.walkers.beagle;
|
||||
|
||||
import org.broadinstitute.gatk.utils.commandline.Argument;
|
||||
import org.broadinstitute.gatk.utils.commandline.Input;
|
||||
import org.broadinstitute.gatk.utils.commandline.Output;
|
||||
import org.broadinstitute.gatk.utils.commandline.RodBinding;
|
||||
import org.broadinstitute.gatk.engine.CommandLineGATK;
|
||||
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
|
||||
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
|
||||
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
|
||||
import org.broadinstitute.gatk.engine.walkers.RodWalker;
|
||||
import org.broadinstitute.gatk.utils.GenomeLoc;
|
||||
import org.broadinstitute.gatk.engine.SampleUtils;
|
||||
import org.broadinstitute.gatk.utils.help.HelpConstants;
|
||||
import org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils;
|
||||
import htsjdk.variant.vcf.VCFHeader;
|
||||
import htsjdk.variant.vcf.VCFHeaderLine;
|
||||
import org.broadinstitute.gatk.engine.GATKVCFUtils;
|
||||
import org.broadinstitute.gatk.utils.exceptions.UserException;
|
||||
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
|
||||
import htsjdk.variant.variantcontext.Allele;
|
||||
import htsjdk.variant.variantcontext.Genotype;
|
||||
import htsjdk.variant.variantcontext.VariantContext;
|
||||
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
|
||||
|
||||
import java.io.PrintStream;
|
||||
import java.util.Arrays;
|
||||
import java.util.Set;
|
||||
|
||||
/**
|
||||
* Produces an input file to Beagle imputation engine, listing unphased, hard-called genotypes for a single sample
|
||||
* in input variant file. Will additionally hold back a fraction of the sites for evaluation, marking the
|
||||
* genotypes at that sites as missing, and writing the truth of these sites to a second VCF file
|
||||
*/
|
||||
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARDISC, extraDocs = {CommandLineGATK.class} )
|
||||
public class VariantsToBeagleUnphased extends RodWalker<Integer, Integer> {
|
||||
@Input(fullName="variants", shortName = "V", doc="Input VCF file", required=true)
|
||||
public RodBinding<VariantContext> variants;
|
||||
|
||||
@Output(doc="File to which BEAGLE unphased genotypes should be written")
|
||||
protected PrintStream beagleWriter = null;
|
||||
|
||||
@Argument(fullName = "bootstrap_fraction", shortName = "bs", doc = "Proportion of records to be used in bootstrap set", required = false)
|
||||
public double bootstrap = 0.0;
|
||||
|
||||
@Argument(fullName = "bootstrap_vcf",shortName = "bsvcf", doc = "Output a VCF with the records used for bootstrapping filtered out", required = false)
|
||||
VariantContextWriter bootstrapVCFOutput = null;
|
||||
|
||||
@Argument(fullName = "missing", shortName = "missing", doc = "String to identify missing data in beagle output", required = false)
|
||||
public String MISSING = "?";
|
||||
|
||||
private Set<String> samples = null;
|
||||
private int bootstrapSetSize = 0;
|
||||
private int testSetSize = 0;
|
||||
|
||||
public void initialize() {
|
||||
samples = SampleUtils.getSampleListWithVCFHeader(getToolkit(), Arrays.asList(variants.getName()));
|
||||
|
||||
beagleWriter.print("I marker alleleA alleleB");
|
||||
for ( String sample : samples )
|
||||
beagleWriter.print(String.format(" %s %s", sample, sample));
|
||||
|
||||
beagleWriter.println();
|
||||
|
||||
if ( bootstrap < 0.0 | bootstrap > 1.0 )
|
||||
throw new UserException.BadArgumentValue("bootstrap", "Bootstrap value must be fraction between 0 and 1");
|
||||
|
||||
if ( bootstrapVCFOutput != null ) {
|
||||
Set<VCFHeaderLine> hInfo = GATKVCFUtils.getHeaderFields(getToolkit());
|
||||
bootstrapVCFOutput.writeHeader(new VCFHeader(hInfo, SampleUtils.getUniqueSamplesFromRods(getToolkit())));
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Iterate over each site, emitting the BEAGLE unphased genotypes file format
|
||||
* @param tracker
|
||||
* @param ref
|
||||
* @param context
|
||||
* @return
|
||||
*/
|
||||
public Integer map( RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context ) {
|
||||
if( tracker != null ) {
|
||||
GenomeLoc loc = context.getLocation();
|
||||
VariantContext vc = tracker.getFirstValue(variants, loc);
|
||||
|
||||
if ( ProduceBeagleInput.canBeOutputToBeagle(vc) ) {
|
||||
// do we want to hold back this site?
|
||||
boolean makeMissing = dropSite(vc);
|
||||
|
||||
// if we are holding it back and we are writing a bootstrap VCF, write it out
|
||||
if ( makeMissing && bootstrapVCFOutput != null ) {
|
||||
bootstrapVCFOutput.add(vc);
|
||||
}
|
||||
|
||||
// regardless, all sites are written to the unphased genotypes file, marked as missing if appropriate
|
||||
writeUnphasedBeagleOutput(vc, makeMissing);
|
||||
}
|
||||
}
|
||||
|
||||
return 0;
|
||||
}
|
||||
|
||||
/**
|
||||
* Do we want to hold back this site for bootstrap? Considers the bootstrap fraction member variable
|
||||
*
|
||||
* @param vc
|
||||
* @return
|
||||
*/
|
||||
public boolean dropSite(VariantContext vc) {
|
||||
if ( (bootstrapSetSize+1.0)/(1.0+bootstrapSetSize+testSetSize) <= bootstrap ) {
|
||||
bootstrapSetSize++;
|
||||
return true;
|
||||
} else {
|
||||
testSetSize++;
|
||||
return false;
|
||||
}
|
||||
}
|
||||
|
||||
public void writeUnphasedBeagleOutput(VariantContext vc, boolean makeMissing) {
|
||||
GenomeLoc currentLoc = GATKVariantContextUtils.getLocation(getToolkit().getGenomeLocParser(), vc);
|
||||
StringBuffer beagleOut = new StringBuffer();
|
||||
|
||||
String marker = String.format("%s:%d ",currentLoc.getContig(), currentLoc.getStart());
|
||||
beagleOut.append("M ").append(marker);
|
||||
|
||||
// write out the alleles at this site
|
||||
for ( Allele allele : vc.getAlleles() ) {
|
||||
beagleOut.append(allele.isNoCall() ? "-" : allele.getBaseString()).append(" ");
|
||||
}
|
||||
|
||||
// write out sample level genotypes
|
||||
for ( String sample : samples ) {
|
||||
Genotype genotype = vc.getGenotype(sample);
|
||||
if ( ! makeMissing && genotype.isCalled() ) {
|
||||
addAlleles(beagleOut, genotype);
|
||||
} else {
|
||||
addAlleles(beagleOut, MISSING, MISSING);
|
||||
}
|
||||
}
|
||||
|
||||
beagleWriter.println(beagleOut.toString());
|
||||
}
|
||||
|
||||
private void addAlleles(StringBuffer buf, Genotype g) {
|
||||
addAlleles(buf, g.getAllele(0).getBaseString(), g.getAllele(1).getBaseString());
|
||||
|
||||
}
|
||||
|
||||
private void addAlleles(StringBuffer buf, String a, String b) {
|
||||
buf.append(a).append(" ").append(b);
|
||||
}
|
||||
|
||||
public Integer reduceInit() { return 0; }
|
||||
public Integer reduce( Integer value, Integer sum ) { return value + sum; }
|
||||
|
||||
public void onTraversalDone( Integer includedSites ) {
|
||||
logger.info("Sites included in beagle genotypes file : " + includedSites);
|
||||
}
|
||||
}
|
||||
|
|
@ -48,6 +48,9 @@ public class DeprecatedToolChecks {
|
|||
deprecatedGATKWalkers.put("AlignmentWalker", "2.2 (no replacement)");
|
||||
deprecatedGATKWalkers.put("CountBestAlignments", "2.2 (no replacement)");
|
||||
deprecatedGATKWalkers.put("SomaticIndelDetector", "2.0 (replaced by the standalone tool Indelocator; see Cancer Tools documentation)");
|
||||
deprecatedGATKWalkers.put("BeagleOutputToVCF", "3,4 (replaced by Beagle native functions; see Beagle 4 documentation at https://faculty.washington.edu/browning/beagle/beagle.html)");
|
||||
deprecatedGATKWalkers.put("VariantsToBeagleUnphased", "3.4 (replaced by Beagle native functions; see Beagle 4 documentation at https://faculty.washington.edu/browning/beagle/beagle.html)");
|
||||
deprecatedGATKWalkers.put("ProduceBeagleInput", "3.4 (replaced by Beagle native functions; see Beagle 4 documentation at https://faculty.washington.edu/browning/beagle/beagle.html)");
|
||||
}
|
||||
|
||||
// Mapping from walker name to major version number where the walker first disappeared and optional replacement options
|
||||
|
|
|
|||
Loading…
Reference in New Issue