Merge pull request #1235 from broadinstitute/gvda_deprecate_useless_tools_1192

Deprecate tools that were outdated or redundant
This commit is contained in:
Geraldine Van der Auwera 2015-11-21 14:58:00 -05:00
commit 22fa1511be
15 changed files with 10 additions and 2305 deletions

View File

@ -1,308 +0,0 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE
* SOFTWARE LICENSE AGREEMENT
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. PHONE-HOME FEATURE
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (PHONE-HOME) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEES user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
*
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012-2015 Broad Institute, Inc.
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 5. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 6. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 7. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 8. MISCELLANEOUS
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.gatk.tools.walkers.diagnostics;
import org.broadinstitute.gatk.utils.commandline.Argument;
import org.broadinstitute.gatk.utils.commandline.Output;
import org.broadinstitute.gatk.engine.CommandLineGATK;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.utils.report.GATKReport;
import org.broadinstitute.gatk.engine.walkers.LocusWalker;
import org.broadinstitute.gatk.utils.GenomeLoc;
import org.broadinstitute.gatk.utils.GenomeLocParser;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import org.broadinstitute.gatk.utils.help.HelpConstants;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;
/**
* Evaluate coverage distribution per base
*
* <p>
* This tool reports the distribution of coverage per base. It includes reads with deletions in the counts unless
* otherwise specified. Quality filters can be applied before the coverage is calculated.
* </p>
*
* <h3>Input</h3>
* <p>
* The BAM file and an optional interval list
* </p>
*
* <h3>Output</h3>
* <p>
* A GATK Report with the coverage distribution per base
*
* <p/>
* <h3>Usage example</h3>
* <pre>
* java -jar GenomeAnalysisTK.jar \
* -R reference.fasta \
* -T BaseCoverageDistribution \
* -I myData.bam \
* -L intervals.list \
* -fd \
* -o report.grp
* </pre>
*
* @author carneiro
* @since 1/27/13
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
public class BaseCoverageDistribution extends LocusWalker<ArrayList<Integer>, Map<Integer, ArrayList<Long>>> {
/**
* The name of the file to output the GATK Report table. See the FAQs for more information on the GATK Report format.
*/
@Output(doc = "Output filename")
private PrintStream out;
/**
* Whether or not a deletion should be counted towards the coverage of a site
*/
@Argument(required = false, shortName="del", fullName = "include_deletions", doc ="Include reads with deletions")
private boolean includeDeletions = true;
/**
* Whether or not to apply quality filters before calculating coverage distribution. Filtering will use the
* minimum_mapping_quality and minimum_base_quality parameters below.
*/
@Argument(required = false, shortName="fd", fullName = "filtered_distribution", doc ="Apply quality filters")
private boolean calculateFilteredDistribution = false;
/**
* The minimum mapping quality a read must have to be counted towards the filtered coverage of a site
*/
@Argument(required = false, shortName="mmq", fullName = "minimum_mapping_quality", doc ="Minimum read mapping quality of a read to pass filters")
private byte minMappingQuality = 20;
/**
* The minimum base quality a base must have to be counted towards the filtered coverage of a site
*/
@Argument(required = false, shortName="mbq", fullName = "minimum_base_quality", doc ="Minimum base quality to pass filters")
private byte minBaseQuality = 17;
private GenomeLoc previousLocus = null;
private long uncoveredBases = 0L;
private final LinkedList<GenomeLoc> intervalList = new LinkedList<GenomeLoc>();
@Override
public boolean includeReadsWithDeletionAtLoci() {
return includeDeletions;
}
@Override
public void initialize() {
if (getToolkit().getIntervals() != null)
intervalList.addAll(getToolkit().getIntervals()); // if the user provided intervals, keep track of them for uncovered bases calculation
}
@Override
public ArrayList<Integer> map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
ArrayList<Integer> result = new ArrayList<Integer>(2);
GenomeLoc currentLocus = ref.getLocus();
tallyUncoveredBases(currentLocus);
previousLocus = currentLocus;
result.add(context.getBasePileup().getReads().size()); // I want the reads instead of the base pileup because I want to count deletions.
if (calculateFilteredDistribution)
result.add(context.getBasePileup().getBaseAndMappingFilteredPileup(minBaseQuality, minMappingQuality).getReads().size()); // filtered pileup
else {
result.add(result.get(0)); // repeat the same value as the unfiltered pileup if filters are not on
}
return result;
}
@Override
public Map<Integer, ArrayList<Long>> reduceInit() {
return new HashMap<Integer, ArrayList<Long>>(10000);
}
@Override
public Map<Integer, ArrayList<Long>> reduce(ArrayList<Integer> value, Map<Integer, ArrayList<Long>> sum) {
final int unfilteredCoverage = value.get(0);
final int filteredCoverage = value.get(1);
incrementSumArray(sum, unfilteredCoverage, 0);
incrementSumArray(sum, filteredCoverage, 1);
return sum;
}
@Override
public void onTraversalDone(Map<Integer, ArrayList<Long>> result) {
tallyUncoveredBasesTillEndOfTraversal();
GATKReport report;
if (calculateFilteredDistribution) {
report = GATKReport.newSimpleReport("BaseCoverageDistribution", "Coverage", "Count", "Filtered");
} else {
report = GATKReport.newSimpleReport("BaseCoverageDistribution", "Coverage", "Count");
report.addRow(0, uncoveredBases); // preemptively add the uncovered bases row (since they'll never exist in the Map)
}
for (Map.Entry<Integer, ArrayList<Long>> entry : result.entrySet()) {
final ArrayList<Long> values = entry.getValue();
final int coverage = entry.getKey();
if (calculateFilteredDistribution) {
if (coverage == 0) { // special case for the uncovered bases. The filtered pileups may have an entry, but the unfiltered ones won't.
report.addRow(coverage, uncoveredBases, uncoveredBases + values.get(1));
} else {
report.addRow(coverage, values.get(0), values.get(1));
}
} else {
report.addRow(coverage, values.get(0));
}
}
// In case the filtered distribution never had a pileup filtered down to zero coverage, output the overall uncovered bases for both
if (calculateFilteredDistribution && !result.containsKey(0)) {
report.addRow(0, uncoveredBases, uncoveredBases);
}
report.print(out);
}
/**
* Initializes the ArrayList if needed. Returns the initialized element (or previously initialized)
* this method is used directly by the incrementSumArray.
*
* @param sum the map
* @param coverage the key to the map to extract the array list
* @return if the ArrayList exists, return it. Otherwise, initialize it with 0 counters.
*/
private ArrayList<Long> initializeSumArray(final Map<Integer, ArrayList<Long>> sum, final int coverage) {
ArrayList<Long> curr = sum.get(coverage);
if (curr == null) {
curr = new ArrayList<Long>(2);
curr.add(0L); // number of bases with this unfiltered coverage
curr.add(0L); // number of bases with this filtered coverage
sum.put(coverage, curr);
}
return curr;
}
/**
* Increments the counter for the given arrayindex (type of coverage : filtered or unfiltered) initializing if necessary
*
* @param sum the hash
* @param coverage the hash key
* @param arrayIndex which distribution to increment, 0 for unfiltered, 1 for filtered.
*/
private void incrementSumArray(final Map<Integer, ArrayList<Long>> sum, final int coverage, final int arrayIndex) {
final ArrayList<Long> currentTally = initializeSumArray(sum, coverage);
currentTally.set(arrayIndex, currentTally.get(arrayIndex) + 1);
}
/**
* Counts all the uncovered loci after the end of traversal.
*
* - Modifies the global variable uncoveredBases
* - Uses global variables: intervalList and previousLocus
*
* takes into account that the traversal may have been due over a set of intervals, or over the whole genome.
*/
private void tallyUncoveredBasesTillEndOfTraversal() {
GenomeLocParser parser = getToolkit().getGenomeLocParser();
GenomeLoc lastLocus;
if (intervalList.isEmpty()) { // whole genome, add up all contigs past previousLocus
final int lastContigIndex = getToolkit().getSAMFileHeader().getSequenceDictionary().size() - 1;
final int lastContigLength = getToolkit().getSAMFileHeader().getSequence(lastContigIndex).getSequenceLength();
final String lastContigName = getToolkit().getSAMFileHeader().getSequence(lastContigIndex).getSequenceName();
lastLocus = parser.createGenomeLoc(lastContigName, lastContigIndex, lastContigLength, lastContigLength);
} else {
GenomeLoc lastInterval = intervalList.getLast();
lastLocus = parser.createGenomeLoc(lastInterval.getContig(), lastInterval.getContigIndex(), lastInterval.getStop(), lastInterval.getStop());
}
tallyUncoveredBases(lastLocus);
}
/**
* Counts all the uncovered loci that have been skipped since the last visited locus. This method allows coverage
* tools to run with @By(DataSource.READS) instead of @By(DataSource.REFERENCE), while still accurately calculating
* uncovered bases
*
* //todo -- make this a generic capability of Coverage and DiagnoseTargets
*
* - Modifies the global variable uncoveredBases
* - Uses global variables: intervalList and previousLocus
*
* takes into account that the traversal may have been due over a set of intervals, or over the whole genome.
*
* @param currentLocus the locus we are visiting right now
*/
private void tallyUncoveredBases(GenomeLoc currentLocus) {
long distance = 0;
if (previousLocus == null) { // first base visited
GenomeLocParser parser = getToolkit().getGenomeLocParser();
if (intervalList.isEmpty()) { // if this is whole genome (no intervals requested), add what we missed.
final GenomeLoc zeroLoc = parser.createGenomeLoc(getToolkit().getSAMFileHeader().getSequence(0).getSequenceName(), 0, 1, 1);
distance += currentLocus.distanceAcrossContigs(zeroLoc, getToolkit().getSAMFileHeader());
} else { // if we are running on an interval list, add all intervals before the current locus to the uncovered bases counter
while (!intervalList.peek().containsP(currentLocus)) {
GenomeLoc interval = intervalList.removeFirst();
distance += interval.size();
}
distance += currentLocus.getStart() - intervalList.peek().getStart(); // now this is the interval that contains the current locus. Discount the bases from the beginning.
}
} else {
final GenomeLoc previousInterval = intervalList.peekFirst(); // peekFirst returns null if interval list is empty (WGS).
distance = currentLocus.distanceAcrossContigs(previousLocus, getToolkit().getSAMFileHeader()) - 1;
if (previousInterval != null && !previousInterval.containsP(currentLocus)) {
intervalList.removeFirst(); // we're done with the previous interval
final GenomeLoc currentInterval = intervalList.peekFirst();
distance -= currentInterval.distanceAcrossContigs(previousInterval, getToolkit().getSAMFileHeader()) - 1;
}
}
uncoveredBases += distance;
}
}

View File

@ -1,160 +0,0 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE
* SOFTWARE LICENSE AGREEMENT
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. PHONE-HOME FEATURE
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (PHONE-HOME) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEES user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
*
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012-2015 Broad Institute, Inc.
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 5. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 6. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 7. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 8. MISCELLANEOUS
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.gatk.tools.walkers.diagnostics;
import org.broadinstitute.gatk.utils.commandline.Argument;
import org.broadinstitute.gatk.utils.commandline.Output;
import org.broadinstitute.gatk.engine.CommandLineGATK;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.engine.walkers.ActiveRegionTraversalParameters;
import org.broadinstitute.gatk.engine.walkers.ActiveRegionWalker;
import org.broadinstitute.gatk.engine.walkers.PartitionBy;
import org.broadinstitute.gatk.engine.walkers.PartitionType;
import org.broadinstitute.gatk.utils.GenomeLoc;
import org.broadinstitute.gatk.utils.activeregion.ActivityProfileState;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import org.broadinstitute.gatk.utils.help.HelpConstants;
import java.io.PrintStream;
/**
* Outputs a list of intervals that are covered above a given threshold
*
* <p>The output list can be used as an interval list for other tools. Note that if the -uncovered argument is given, the
* logic will be inverted and the tool will instead output intervals that fail the coverage threshold.</p>
*
* <h3>Input</h3>
* <p>
* One or more BAM files.
* </p>
*
* <h3>Output</h3>
* <p>
* List of covered (or uncovered) intervals.
* </p>
*
* <h3>Example</h3>
* <pre>
* java -jar GenomeAnalysisTK.jar \
* -T FindCoveredIntervals \
* -R reference.fasta \
* -I my_file.bam \
* -o output.list
* </pre>
*
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
@PartitionBy(value = PartitionType.CONTIG)
@ActiveRegionTraversalParameters(extension = 0, maxRegion = 50000)
public class FindCoveredIntervals extends ActiveRegionWalker<GenomeLoc, Long> {
@Output
private PrintStream out;
@Argument(fullName = "uncovered", shortName = "u", required = false, doc = "output intervals that fail the coverage threshold instead")
private boolean outputUncovered = false;
@Argument(fullName = "coverage_threshold", shortName = "cov", doc = "The minimum allowable coverage to be considered covered", required = false)
private int coverageThreshold = 20;
@Argument(fullName = "minBaseQuality", shortName = "minBQ", doc = "The minimum allowable base quality score to be counted for coverage",required = false)
private int minBaseQuality = 0;
@Argument(fullName = "minMappingQuality", shortName = "minMQ", doc = "The minimum allowable mapping quality score to be counted for coverage",required = false)
private int minMappingQuality = 0;
@Override
// Look to see if the region has sufficient coverage
public ActivityProfileState isActive(final RefMetaDataTracker tracker, final ReferenceContext ref, final AlignmentContext context) {
int depth;
if(minBaseQuality == 0 && minMappingQuality == 0)
depth = context.getBasePileup().getBaseFilteredPileup(coverageThreshold).depthOfCoverage();
else
depth = context.getBasePileup().getBaseAndMappingFilteredPileup(minBaseQuality,minMappingQuality).depthOfCoverage();
// note the linear probability scale
return new ActivityProfileState(ref.getLocus(), Math.min(depth / coverageThreshold, 1));
}
@Override
public GenomeLoc map(final org.broadinstitute.gatk.utils.activeregion.ActiveRegion activeRegion, final RefMetaDataTracker tracker) {
if ((!outputUncovered && activeRegion.isActive()) || (outputUncovered && !activeRegion.isActive()))
return activeRegion.getLocation();
return null;
}
@Override
public Long reduceInit() {
return 0L;
}
@Override
public Long reduce(final GenomeLoc value, Long reduce) {
if (value != null) {
out.println(value.toString());
reduce++;
}
return reduce;
}
@Override
public void onTraversalDone(Long reduce) {
logger.info(String.format("Found %d intervals", reduce));
}
}

View File

@ -1,95 +0,0 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE
* SOFTWARE LICENSE AGREEMENT
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. PHONE-HOME FEATURE
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (PHONE-HOME) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEES user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
*
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012-2015 Broad Institute, Inc.
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 5. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 6. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 7. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 8. MISCELLANEOUS
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.gatk.tools.walkers.diagnostics;
import org.broadinstitute.gatk.engine.walkers.WalkerTest;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;
import java.util.Arrays;
/**
* @author Mauricio Carneiro
* @since 2/6/13
*/
public class BaseCoverageDistributionIntegrationTest extends WalkerTest {
final static String REF = hg18Reference;
final String bam = validationDataLocation + "small_bam_for_countloci.withRG.bam";
@DataProvider(name = "BasicArguments")
public Object[][] basicArgumentsDataProvider() {
return new Object[][] {
// Tests simple counting on one interval with everything in the same contig including tallying of uncovered bases.
{"testSingleInterval ", "-L chr1:90000-100000", "45368696dc008d1a07fb2b05fbafd1f4"},
// Tests specially the tallying of uncovered bases across multiple intervals. Makes sure it's only adding the bases present in the intervals requested.
{"testMultipleIntervals ", "-L chr1:10-20 -L chr1:40-100 -L chr1:10,000-11,000 -L chr1:40,000-60,000 -L chr1:76,000-99,000 ", "45dafe59e5e54451b88c914d6ecbddc6"},
// Tests adding the entire genome around every covered base as uncovered. Especially tests the tally in the beginning and end of the run, adding up all chromosomes not visited (this test file only has reads on chr1).
{"testNoIntervals ", "", "c399f780f0b7da6be2614d837c368d1c"},
// the following three tests are equivalent but now include the filtered distribution option. These tests are aimed at the filtered distribution output.
{"testFilteredSingleInterval ", "-fd -L chr1:90000-100000", "7017cf191bf54e85111972a882e1d5fa"},
{"testFilteredMultipleIntervals ", "-fd -L chr1:10-20 -L chr1:40-100 -L chr1:10,000-11,000 -L chr1:40,000-60,000 -L chr1:76,000-99,000 ", "75d11cc02210676d6c19939fb0b9ab2e"},
{"testFilteredNoIntervals ", "-fd ", "e7abfa6c7be493de4557a64f66688148"},
};
}
@Test(dataProvider = "BasicArguments", enabled = true)
private void BaseCoverageDistributionTest(String testName, String args, String md5) {
String base = String.format("-T BaseCoverageDistribution -R %s -I %s ", REF, bam) + " -o %s ";
WalkerTestSpec spec = new WalkerTestSpec(base + args, Arrays.asList(md5));
executeTest(testName, spec);
}
}

View File

@ -1,109 +0,0 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE
* SOFTWARE LICENSE AGREEMENT
* FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 415 Main Street, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK3 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute.org/gatk on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM. LICENSEE hereby automatically grants to BROAD a non-exclusive, royalty-free, irrevocable license to any LICENSEE bug fixes or modifications to the PROGRAM with unlimited rights to sublicense and/or distribute. LICENSEE agrees to provide any such modifications and bug fixes to BROAD promptly upon their creation.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. PHONE-HOME FEATURE
* LICENSEE expressly acknowledges that the PROGRAM contains an embedded automatic reporting system (PHONE-HOME) which is enabled by default upon download. Unless LICENSEE requests disablement of PHONE-HOME, LICENSEE agrees that BROAD may collect limited information transmitted by PHONE-HOME regarding LICENSEE and its use of the PROGRAM. Such information shall include LICENSEES user identification, version number of the PROGRAM and tools being run, mode of analysis employed, and any error reports generated during run-time. Collection of such information is used by BROAD solely to monitor usage rates, fulfill reporting requirements to BROAD funding agencies, drive improvements to the PROGRAM, and facilitate adjustments to PROGRAM-related documentation.
*
* 4. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012-2015 Broad Institute, Inc.
* Notice of attribution: The GATK3 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 5. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 6. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 7. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 8. MISCELLANEOUS
* 8.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 8.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 8.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 8.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 8.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 8.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 8.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.gatk.tools.walkers.variantutils;
import org.broadinstitute.gatk.engine.walkers.WalkerTest;
import org.broadinstitute.gatk.utils.exceptions.UserException;
import org.testng.annotations.Test;
import java.util.Arrays;
/**
* Tests LiftoverVariants
*/
public class LiftoverVariantsIntegrationTest extends WalkerTest {
@Test
public void testb36Tohg19() {
WalkerTestSpec spec = new WalkerTestSpec(
"-T LiftoverVariants -o %s -R " + b36KGReference + " --variant " + privateTestDir + "yri.trio.gatk_glftrio.intersection.annotated.filtered.chr1.500.noheader.vcf -chain " + validationDataLocation + "b36ToHg19.broad.over.chain -dict /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19.dict",
1,
Arrays.asList("eb38adfffb7913fa761c5c579b885ea2"));
executeTest("test b36 to hg19", spec);
}
@Test
public void testb36Tohg19UnsortedSamples() {
WalkerTestSpec spec = new WalkerTestSpec(
"-T LiftoverVariants -o %s -R " + b36KGReference + " --variant " + privateTestDir + "yri.trio.gatk_glftrio.intersection.annotated.filtered.chr1.500.noheader.unsortedSamples.vcf -chain " + validationDataLocation + "b36ToHg19.broad.over.chain -dict /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19.dict",
1,
Arrays.asList("f40e078fd5fb1d4107738a4f2f602902"));
executeTest("test b36 to hg19, unsorted samples", spec);
}
@Test
public void testhg18Tohg19Unsorted() {
WalkerTestSpec spec = new WalkerTestSpec(
"-T LiftoverVariants -o %s -R " + hg18Reference + " --variant:vcf " + privateTestDir + "liftover_test.vcf -chain " + validationDataLocation + "hg18ToHg19.broad.over.chain -dict /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19.dict",
1,
Arrays.asList("186a675c9758c1b83a1232399e18d8fe"));
executeTest("test hg18 to hg19, unsorted", spec);
}
@Test
public void testLiftoverFilteringOfIndels() {
WalkerTestSpec spec = new WalkerTestSpec(
"-T FilterLiftedVariants -o %s -R " + b37KGReference + " --variant:vcf " + privateTestDir + "liftover_indel_test.vcf --no_cmdline_in_header",
1,
Arrays.asList("303e3df3bf737feaff397b4eeb037ecb"));
executeTest("test liftover filtering of indels", spec);
}
@Test
public void testLiftoverFailsWithNoOutput() {
WalkerTestSpec spec = new WalkerTestSpec(
"-T LiftoverVariants -R " + hg18Reference + " --variant:vcf " + privateTestDir + "liftover_test.vcf -chain " + validationDataLocation + "hg18ToHg19.broad.over.chain -dict /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19.dict",
0,
UserException.class);
executeTest("test liftover fails with no output", spec);
}
}

View File

@ -1,85 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools;
import org.broadinstitute.gatk.utils.commandline.CommandLineProgram;
import org.broadinstitute.gatk.utils.exceptions.UserException;
import org.broadinstitute.gatk.tools.walkers.annotator.interfaces.AnnotationHelpUtils;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import org.broadinstitute.gatk.utils.help.HelpConstants;
/**
* Utility program to print a list of available annotations
*
* <p>This is a very simple utility tool that retrieves available annotations for use with tools such as
* UnifiedGenotyper, HaplotypeCaller and VariantAnnotator.</p>
*
* <h3>Important note</h3>
* <p>This is a command-line utility that bypasses the GATK engine. As a result, the command-line you must use to
* invoke it is a little different from other GATK tools (see usage below), and it does not accept any of the
* classic "CommandLineGATK" arguments.</p>
*
* <h3>Usage</h3>
* <pre>java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.ListAnnotations</pre>
*
* @author vdauwera
* @since 3/14/13
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_HELPUTILS )
public class ListAnnotations extends CommandLineProgram {
/*
* Print usage information
*
* TODO: would be more convenient if we could just call the program by name instead of the full classpath
*/
private static void printUsage() {
System.err.println("Usage: java -cp dist/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.ListAnnotations");
System.err.println(" Prints a list of available annotations and exits.");
}
// TODO: override CommandLineProgram bit that offers version, logging etc arguments. We don't need that stuff here and it makes the doc confusing.
@Override
protected int execute() throws Exception {
AnnotationHelpUtils.listAnnotations();
return 0;
}
public static void main(String[] args){
try {
ListAnnotations instance = new ListAnnotations();
start(instance, args);
System.exit(CommandLineProgram.result);
} catch ( UserException e ) {
printUsage();
exitSystemWithUserError(e);
} catch ( Exception e ) {
exitSystemWithError(e);
}
}
}

View File

@ -1,154 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools.walkers.diagnostics;
import org.broadinstitute.gatk.engine.walkers.By;
import org.broadinstitute.gatk.engine.walkers.DataSource;
import org.broadinstitute.gatk.engine.walkers.RodWalker;
import org.broadinstitute.gatk.engine.walkers.TreeReducible;
import org.broadinstitute.gatk.utils.commandline.Argument;
import org.broadinstitute.gatk.utils.commandline.ArgumentCollection;
import org.broadinstitute.gatk.utils.commandline.Output;
import org.broadinstitute.gatk.engine.CommandLineGATK;
import org.broadinstitute.gatk.engine.arguments.StandardVariantContextInputArgumentCollection;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.utils.GenomeLoc;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import org.broadinstitute.gatk.utils.help.HelpConstants;
import htsjdk.variant.variantcontext.Genotype;
import htsjdk.variant.variantcontext.GenotypesContext;
import htsjdk.variant.variantcontext.VariantContext;
import java.io.*;
import java.util.Collection;
/**
* Report well-covered intervals
*
* <p>
* This tool evaluates whether sites are well-covered or not according to specific coverage quality parameters, and
* outputs a list of intervals that are considered well-covered, i.e. where most samples have good coverage. This is
* useful for masking out poorly-covered sites where we cannot expect meaningful results in downstream analyses. See
* argument defaults for what constitutes "most" samples and "good" coverage.
* </p>
*
* <h3>Input</h3>
* <p>
* A variant file and optionally, minimum coverage and sample percentage values.
* </p>
*
* <h3>Output</h3>
* <p>
* An list of well-covered intervals.
* </p>
*
* <h3>Usage example</h3>
* <pre>
* java -jar GenomeAnalysisTK.jar \
* -T CoveredByNSamplesSites \
* -R reference.fasta \
* -V input.vcf \
* -out output.intervals \
* -minCov 15
* </pre>
*
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_QC, extraDocs = {CommandLineGATK.class} )
@By(DataSource.REFERENCE_ORDERED_DATA)
public class CoveredByNSamplesSites extends RodWalker<GenomeLoc, Integer> implements TreeReducible<Integer> {
@Output(fullName = "OutputIntervals", shortName = "out", doc = "Name of file for output intervals")
PrintStream outputStream;
@ArgumentCollection
protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
@Argument(fullName = "minCoverage", shortName = "minCov",doc = "only samples that have coverage bigger than minCoverage will be counted",required = false)
int minCoverage = 10;
@Argument(fullName = "percentageOfSamples", shortName = "percentage", doc = "only sites where at least percentageOfSamples of the samples have good coverage, will be emitted", required = false)
double percentageOfSamples = 0.9;
@Override
public GenomeLoc map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
if ( tracker == null )
return null;
Collection<VariantContext> VCs = tracker.getValues(variantCollection.variants, context.getLocation());
if ( VCs.size() == 0 )
return null;
boolean emitSite = false;
for(VariantContext vc : VCs){
int coveredSamples = 0;
final GenotypesContext genotypes = vc.getGenotypes();
final int numOfGenotypes = genotypes.size();
for(Genotype g : genotypes){
if(g.getDP() >= minCoverage)
coveredSamples++;
}
if((double)coveredSamples/numOfGenotypes > percentageOfSamples){
emitSite = true;
}
}
if (emitSite)
return ref.getLocus();
else
return null;
}
@Override
public Integer reduceInit() { return 0; }
@Override
public Integer reduce(GenomeLoc value, Integer sum) {
if ( value != null ) {
outputStream.println(value);
sum++;
}
return sum;
}
@Override
public Integer treeReduce(Integer lhs, Integer rhs) {
return lhs + rhs;
}
/**
*
* @param result the number of sites that passed the filter.
*/
public void onTraversalDone(Integer result) {
logger.info(result+" sites that have "+(percentageOfSamples*100)+"% of the samples with at least "+minCoverage+" coverage.\n");
}
}

View File

@ -1,397 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools.walkers.readutils;
import com.google.java.contract.Ensures;
import com.google.java.contract.Requires;
import htsjdk.samtools.SAMFileWriter;
import org.apache.log4j.Logger;
import org.broadinstitute.gatk.engine.walkers.NanoSchedulable;
import org.broadinstitute.gatk.engine.walkers.PartitionBy;
import org.broadinstitute.gatk.engine.walkers.PartitionType;
import org.broadinstitute.gatk.engine.walkers.ReadWalker;
import org.broadinstitute.gatk.utils.commandline.Advanced;
import org.broadinstitute.gatk.utils.commandline.Argument;
import org.broadinstitute.gatk.utils.commandline.Hidden;
import org.broadinstitute.gatk.utils.commandline.Output;
import org.broadinstitute.gatk.engine.CommandLineGATK;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.utils.BaseUtils;
import org.broadinstitute.gatk.utils.collections.Pair;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import org.broadinstitute.gatk.utils.help.HelpConstants;
import org.broadinstitute.gatk.utils.sam.GATKSAMRecord;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
/**
* Utility tool to blindly strip base adaptors
*
* <p>This tool is mainly intended to be applied to FASTQ/unaligned BAM pre-processing where libraries
* have very short inserts, and hence a substantial part of the sequencing data will have adaptor sequence present. By
* design, tool will only work for Illumina-like library constructs, where the typical library architecture is:
* [Adaptor 1]-[Genomic Insert]-[Adaptor 2 (index/barcode)]</p>
*
* <p>We assume that when data is paired, one read will span the forward strand and one read will span the reverse strand.
* Hence, adaptors should be specified as both forward and reverse-complement to ensure they are removed in all cases.
* By design, as well, "circular" constructions where a read can have an insert, then adaptor, then more genomic insert, are not supported.
* When an adaptor is detected, all bases downstream from it (i.e. in the 3' direction) will be removed.
* Adaptor detection is carried out by looking for overlaps between forward and reverse reads in a pair.
* If a sufficiently high overlap is found, the insert size is computed and if insert size < read lengths adaptor bases are removed from reads.
* </p>
*
* <p> Advantage over ReadClipper: No previous knowledge of adaptors or library structure is necessary.</p>
*
* <p>Advantages over 3rd party tools like SeqPrep:</p>
* <ul>
* <li>Can do BAM streaming instead of having to convert to fastq</li>
* <li>No need to merge reads; merging reads can have some advantages, but complicates downstream processing and loses information that can be used,
* e.g. in variant calling</li>
* </ul>
*
* <h3>Input</h3>
* <p>
* The input read data in BAM format. Read data MUST be in query name ordering as produced, for example with Picard's FastqToBam
* </p>
*
* <h3>Output</h3>
* <p>
* A merged BAM file with unaligned reads
* </p>
*
* <h3Usage example</h3>
* <pre>
* java -jar GenomeAnalysisTK.jar \
* -R reference.fasta \
* -T ReadAdaptorTrimmer \
* -I my_reads.bam \
* -o trimmed_Reads.bam
* </pre>
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_DATA, extraDocs = {CommandLineGATK.class} )
@PartitionBy(PartitionType.READ)
public class ReadAdaptorTrimmer extends ReadWalker<List<GATKSAMRecord>, SAMFileWriter> implements NanoSchedulable {
@Output(doc="Write output to this BAM filename instead of STDOUT", required = false)
SAMFileWriter out;
/**
* Only prints the first n reads of the file - for short testing
*/
@Hidden
@Argument(fullName = "number", shortName = "n", doc="Print the first n reads from the file, discarding the rest", required = false)
int nReadsToPrint = -1;
/**
* Argument to control strictness of match between forward and reverse reads - by default, we require 15 matches between them to declare
* an overlap.
*/
@Advanced
@Argument(fullName = "minMatches", shortName = "minMatches", doc="Minimum number of substring matches to detect pair overlaps", required = false)
int minMatchesForOverlap = 15;
/**
* If true, this argument will make the walker discard unpaired reads instead of erroring out.
*/
@Advanced
@Argument(fullName = "removeUnpairedReads", shortName = "removeUnpairedReads", doc="Remove unpaired reads instead of erroring out", required = false)
boolean cleanUnpairedReads = false;
/**
* private class members
*/
private GATKSAMRecord firstReadInPair;
private TrimStats trimStats = new TrimStats();
static class TrimStats {
long numReadsProcessed;
long numReadsWithAdaptorTrimmed;
long numUnpairedReadsFound;
}
/**
* The reads filter function.
*
* @param ref the reference bases that correspond to our read, if a reference was provided
* @param read the read itself, as a GATKSAMRecord
* @return true if the read passes the filter, false if it doesn't
*/
public boolean filter(ReferenceContext ref, GATKSAMRecord read) {
// check if we've reached the output limit
if ( nReadsToPrint == 0 ) {
return false; // n == 0 means we've printed all we needed.
}
else if (nReadsToPrint > 0) {
nReadsToPrint--; // n > 0 means there are still reads to be printed.
}
return true;
}
/**
* reduceInit is called once before any calls to the map function. We use it here to setup the output
* bam file, if it was specified on the command line
*
* @return SAMFileWriter, set to the BAM output file if the command line option was set, null otherwise
*/
public SAMFileWriter reduceInit() {
return out;
}
public List<GATKSAMRecord> map( final ReferenceContext ref, final GATKSAMRecord readIn, final RefMetaDataTracker metaDataTracker ) {
final List<GATKSAMRecord> readsToEmit = new ArrayList<GATKSAMRecord>();
// cache first read in pair if flag set.
if (readIn.getFirstOfPairFlag()) {
firstReadInPair = GATKSAMRecord.emptyRead(readIn);
firstReadInPair.setReadString(readIn.getReadString());
firstReadInPair.setReadName(readIn.getReadName());
firstReadInPair.setBaseQualities(readIn.getBaseQualities());
}
else {
if (!readIn.getReadName().equals(firstReadInPair.getReadName())) {
if (cleanUnpairedReads) {
trimStats.numUnpairedReadsFound++;
return readsToEmit;
}
else // by default require that reads be completely paired
throw new IllegalStateException("Second read in pair must follow first read in pair: data not ordered?");
}
final int oldLength1 = firstReadInPair.getReadLength();
final int oldLength2 = readIn.getReadLength();
// try to strip any adaptor sequence in read pair
final Integer result = trimReads(firstReadInPair, readIn, minMatchesForOverlap, logger);
if (logger.isDebugEnabled()) {
if (result == null)
logger.debug("No overlap found, insert size cannot be computed");
else
logger.debug("Insert size estimate = " + result);
}
readsToEmit.add(firstReadInPair);
readsToEmit.add(readIn);
if (oldLength1 != firstReadInPair.getReadLength())
trimStats.numReadsWithAdaptorTrimmed++;
if (oldLength2 != readIn.getReadLength())
trimStats.numReadsWithAdaptorTrimmed++;
}
trimStats.numReadsProcessed++;
return readsToEmit;
}
/**
* given a read and a output location, reduce by emitting the read
*
* @param readsToEmit the read itself
* @param output the output source
* @return the SAMFileWriter, so that the next reduce can emit to the same source
*/
public SAMFileWriter reduce( final List<GATKSAMRecord> readsToEmit, final SAMFileWriter output ) {
for (final GATKSAMRecord read : readsToEmit)
output.addAlignment(read);
return output;
}
@Override
public void onTraversalDone(SAMFileWriter output) {
logger.info("Finished Trimming:");
logger.info("Number of processed reads: "+ trimStats.numReadsProcessed);
logger.info("Number of reads with adaptor sequence trimmed: "+ trimStats.numReadsWithAdaptorTrimmed);
if (cleanUnpairedReads)
logger.info("Number of unpaired reads thrown out: "+ trimStats.numUnpairedReadsFound);
}
/**
*
* Workhorse routines...
*
*/
/**
* Core routine that does most underlying work for walker. Takes two reads and looks for overlaps in them.
* An overlap is defined as a contiguous chunk of N bases that matches reverse-complement between reads.
* Currently, the only insert structure that it will look for overlaps is as follows:
* CASE 1: Insert shorter than read length:
* 3' XXXXXXXXXXXXXXXX 5' (second read)
* 5' YYYYYYYYYYYYYYYY 3' (first read)
* ***********
*
* In this case, if X and Y are complements at the 11 positions marked by *, routine will do the following
* iff minMatchesForOverlap <= 11:
* a) Cleave adaptor from end of second read (leftmost dangling part in diagram above)
* b) Cleave adaptor from end of first read (rightmost part in diagram).
*
* CASE 2: Insert size >= read length:
* 3' XXXXXXXXXXXXXXXX 5' (second read)
* 5' YYYYYYYYYYYYYYYY 3' (first read)
* ********* (overlap)
*
* In this case, no trimming is done and reads are left unchanged
* @param first (I/O) First read in pair - read contents (bases/quals) can be modified if adaptor is detected
* @param second (I/O) Second read in pair - read contents (bases/quals) can be modified if adaptor is detected
* @param minMatchesForOverlap Reads need to match in these # of bases to be joined
* @return Offset between second and first read.
* If there's no detectable offset, return Null
*/
@Requires({"first != null","second != null","minMatchesForOverlap>0"})
protected static Integer trimReads(final GATKSAMRecord first,
final GATKSAMRecord second,
final int minMatchesForOverlap,
final Logger logger) {
final Integer insertSize = estimateInsertSize(first.getReadBases(), second.getReadBases(),
minMatchesForOverlap, logger);
if (insertSize == null)
return insertSize;
if (insertSize < first.getReadLength()) {
// trim adaptor sequence from read
first.setReadBases(Arrays.copyOfRange(first.getReadBases(),0,insertSize));
first.setBaseQualities(Arrays.copyOfRange(first.getBaseQualities(),0,insertSize));
}
if (insertSize < second.getReadLength()) {
// trim adaptor sequence from read
second.setReadBases(Arrays.copyOfRange(second.getReadBases(),0,insertSize));
second.setBaseQualities(Arrays.copyOfRange(second.getBaseQualities(),0,insertSize));
}
return insertSize;
}
/**
* Brain-dead implementation of an aligner of two sequences, where it's assumed that there might be an overlap
* from the first into the second. From this, an estimate of insert size is performed and returned
* Assumes that reads come in reverse direction, so one of the base sequences needs to be reverse-complemented.]
*
* @param firstRead Bytes from first read
* @param secondRead Bytes from second read (reverse direction)
* @return Estimated insert size based on offset between first and second read.
* If no overlap can be detected, return null
*/
@Requires({"firstRead != null","secondRead != null","minMatches>0","firstRead.length == secondRead.length"})
protected static Integer estimateInsertSize(final byte[] firstRead,
final byte[] secondRead,
final int minMatches,
final Logger logger) {
final byte[] firstBases = firstRead;
final byte[] secondBases = BaseUtils.simpleReverseComplement(secondRead);
final Pair<Integer,Integer> overlaps = findOverlappingSequence(firstBases, secondBases);
final int bestOffset = overlaps.first;
final int maxScore = overlaps.second;
if ( logger.isDebugEnabled()) {
String sb="", s1 = new String(firstBases), s2 = new String(secondBases);
for (int k=0; k < Math.abs(bestOffset); k++) sb+=" ";
if (maxScore >= minMatches) {
logger.debug(String.format("Match, Max Score = %d, best offset = %d\n",maxScore, bestOffset));
if (bestOffset>0)
s2 = sb+s2;
else
s1 = sb+s1;
}
else logger.debug("NoMatch:");
logger.debug("R1:"+s1);
logger.debug("R2:"+s2);
}
if (maxScore < minMatches)
return null; // no overlap detected
return bestOffset+secondRead.length;
}
/**
* Tries to find overlapping sequence between two reads, and computes offset between them
* For each possible offset, computes matching score, which is = MATCH_SCORE*Num_matches + MISMATCH_SCORE*num_mismatches
* (like SW with infinite gap penalties).
* @param first First read bytes
* @param second Second read bytes
* @return Pair of integers (x,y). x = best offset between reads, y = corresponding score
*/
@Requires({"first != null","second != null"})
@Ensures("result != null")
protected static Pair<Integer,Integer> findOverlappingSequence(final byte[] first,
final byte[] second) {
final int MATCH_SCORE = 1;
final int MISMATCH_SCORE = -1;
// try every possible offset - O(N^2) algorithm
// In case of following structure,
// 111111111
// 222222222
// computed offset will be negative (=-5 in this case).
// If however,
// 111111111
// 222222222
// then offset will be positive (=3 in this case)
int maxScore = 0, bestOffset =0;
for (int offset = -second.length; offset < first.length; offset++) {
int score = 0;
// compute start index for each array
int ind1 = (offset<0)?0:offset;
int ind2 = (offset<0)?-offset:0;
for (int k=0; k < Math.min(first.length, second.length) ; k++) {
if (ind1 >= first.length)
break;
if (ind2 >= second.length )
break;
if (first[ind1] != 'N' && second[ind2] != 'N') {
if (first[ind1] == second[ind2])
score += MATCH_SCORE;
else
score += MISMATCH_SCORE;
}
ind1++;
ind2++;
}
if (score > maxScore) {
maxScore = score;
bestOffset = offset;
}
}
return new Pair<Integer, Integer>(bestOffset,maxScore);
}
}

View File

@ -1,159 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools.walkers.variantutils;
import org.broadinstitute.gatk.engine.walkers.Reference;
import org.broadinstitute.gatk.engine.walkers.RodWalker;
import org.broadinstitute.gatk.engine.walkers.Window;
import org.broadinstitute.gatk.utils.commandline.ArgumentCollection;
import org.broadinstitute.gatk.utils.commandline.Output;
import org.broadinstitute.gatk.engine.CommandLineGATK;
import org.broadinstitute.gatk.engine.arguments.StandardVariantContextInputArgumentCollection;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.engine.SampleUtils;
import org.broadinstitute.gatk.utils.help.HelpConstants;
import htsjdk.variant.vcf.VCFHeader;
import htsjdk.variant.vcf.VCFHeaderLine;
import org.broadinstitute.gatk.engine.GATKVCFUtils;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
import htsjdk.variant.variantcontext.VariantContext;
import java.util.*;
/**
* Filters a lifted-over VCF file for reference bases that have been changed
*
* <p>"Lifting over" variants means adjusting variant calls from one reference to another. Specifically, the process
* adjusts the position of the call to match the corresponding position on the target reference. For example, if you
* have variants called from reads aligned to the hg19 reference, and you want to compare them to calls made based on
* the b37 reference, you need to liftover one of the callsets to the other reference.</p>
*
* <p>This tool is intended to be the second of two processing steps for the liftover process. The first step is to
* run LiftoverVariants on your VCF file. The second step is to run FilterLiftedVariants on the output of
* LiftoverVariants. This will produce valid well-behaved VCF files, where you'll see that the contig names in the
* header have all been correctly replaced.</p>
*
* <h3>Input</h3>
* <p>
* A lifted-over variant call set to filter.
* </p>
*
* <h3>Output</h3>
* <p>
* The filtered call set.
* </p>
*
* <h3>Usage example</h3>
* <pre>
* java -jar GenomeAnalysisTK.jar \
* -T FilterLiftedVariants \
* -R reference.fasta \
* -V liftedover_input.vcf \
* -o filtered_output.vcf
* </pre>
*
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARMANIP, extraDocs = {CommandLineGATK.class} )
@Reference(window=@Window(start=0,stop=100))
public class FilterLiftedVariants extends RodWalker<Integer, Integer> {
@ArgumentCollection
protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
private static final int MAX_VARIANT_SIZE = 100;
@Output(doc="File to which variants should be written")
protected VariantContextWriter writer = null;
private long failedLocs = 0, totalLocs = 0;
public void initialize() {
String trackName = variantCollection.variants.getName();
Set<String> samples = SampleUtils.getSampleListWithVCFHeader(getToolkit(), Arrays.asList(trackName));
Map<String, VCFHeader> vcfHeaders = GATKVCFUtils.getVCFHeadersFromRods(getToolkit(), Arrays.asList(trackName));
final VCFHeader vcfHeader = new VCFHeader(vcfHeaders.containsKey(trackName) ? vcfHeaders.get(trackName).getMetaDataInSortedOrder() : Collections.<VCFHeaderLine>emptySet(), samples);
writer.writeHeader(vcfHeader);
}
/**
* Determines whether records should be filtered; if not, writes them to the output
*
* @param ref the reference context
* @param vc the VariantContext to process
* @return true if the record is not filtered, false otherwise
*/
protected boolean filterOrWrite(final byte[] ref, final VariantContext vc) {
if ( ref == null ) throw new IllegalArgumentException("Cannot filter based on a null reference array");
if ( vc == null ) throw new IllegalArgumentException("Cannot filter a null Variant Context");
totalLocs++;
boolean filter = false;
final byte[] recordRef = vc.getReference().getBases();
// this can happen for records that get placed at the ends of chromosomes
if ( recordRef.length > ref.length ) {
filter = true;
} else {
for (int i = 0; i < recordRef.length && i < MAX_VARIANT_SIZE; i++) {
if ( recordRef[i] != ref[i] ) {
filter = true;
break;
}
}
}
if ( filter )
failedLocs++;
else
writer.add(vc);
return !filter;
}
public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
if ( tracker == null )
return 0;
final Collection<VariantContext> VCs = tracker.getValues(variantCollection.variants, context.getLocation());
for ( final VariantContext vc : VCs )
filterOrWrite(ref.getBases(), vc);
return 0;
}
public Integer reduceInit() { return 0; }
public Integer reduce(Integer value, Integer sum) { return 0; }
public void onTraversalDone(Integer result) {
System.out.println("Filtered " + failedLocs + " records out of " + totalLocs + " total records.");
}
}

View File

@ -1,209 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools.walkers.variantutils;
import htsjdk.samtools.liftover.LiftOver;
import htsjdk.samtools.util.Interval;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMFileReader;
import htsjdk.variant.vcf.VCFHeader;
import htsjdk.variant.vcf.VCFHeaderLine;
import org.broadinstitute.gatk.utils.commandline.Argument;
import org.broadinstitute.gatk.utils.commandline.ArgumentCollection;
import org.broadinstitute.gatk.utils.commandline.Output;
import org.broadinstitute.gatk.engine.CommandLineGATK;
import org.broadinstitute.gatk.engine.arguments.StandardVariantContextInputArgumentCollection;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.engine.walkers.RodWalker;
import org.broadinstitute.gatk.engine.SampleUtils;
import org.broadinstitute.gatk.utils.help.HelpConstants;
import org.broadinstitute.gatk.engine.GATKVCFUtils;
import org.broadinstitute.gatk.utils.variant.GATKVCFConstants;
import org.broadinstitute.gatk.utils.variant.GATKVCFHeaderLines;
import org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils;
import htsjdk.variant.variantcontext.writer.Options;
import org.broadinstitute.gatk.utils.exceptions.UserException;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.variantcontext.VariantContextBuilder;
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
import htsjdk.variant.variantcontext.writer.VariantContextWriterFactory;
import java.io.File;
import java.util.*;
/**
* Lifts a VCF file over from one build to another
*
* <p>"Lifting over" variants means adjusting variant calls from one reference to another. Specifically, the process
* adjusts the position of the call to match the corresponding position on the target reference. For example, if you
* have variants called from reads aligned to the hg19 reference, and you want to compare them to calls made based on
* the b37 reference, you need to liftover one of the callsets to the other reference.</p>
*
* <p>LiftoverVariants is intended to be the first of two processing steps for the liftover process.
* The second step is to run FilterLiftedVariants on the output of LiftoverVariants. This will produce valid
* well-behaved VCF files, where you'll see that the contig names in the header have all been correctly replaced.</p>
*
* <h3>Caveat</h3>
* <p>To be clear, the VCF resulting from the LiftoverVariants run is not guaranteed to be valid according to the official specification. The file could
* possibly be mis-sorted and the header may not be complete. That is why you need to run FilterLiftedVariants on it.</p>
*
* <h3>Input</h3>
* <p>
* A variant call set to lift over, the sequence dictionary of the new reference build and the appropriate liftover
* chain file.
* </p>
*
* <h3>Output</h3>
* <p>
* The lifted-over call set.
* </p>
*
* <h3>Usage example</h3>
* <pre>
* java -jar GenomeAnalysisTK.jar \
* -T LiftoverVariants \
* -R reference_hg19.fasta \
* -V input_hg19.vcf \
* -chain liftover_hg19_to_b37.txt \
* -dict reference_b37.dict \
* -o liftedover_output_b37.vcf
* </pre>
*
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VARMANIP, extraDocs = {CommandLineGATK.class} )
public class LiftoverVariants extends RodWalker<Integer, Integer> {
@ArgumentCollection
protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
@Output(doc="File to which variants should be written", required=true, defaultToStdout=false)
protected File file = null;
protected VariantContextWriter writer = null;
@Argument(fullName="chain", shortName="chain", doc="Chain file", required=true)
protected File CHAIN = null;
@Argument(fullName="newSequenceDictionary", shortName="dict", doc="Sequence .dict file for the new build", required=true)
protected File NEW_SEQ_DICT = null;
@Argument(fullName="recordOriginalLocation", shortName="recordOriginalLocation", doc="Should we record what the original location was in the INFO field?", required=false)
protected Boolean RECORD_ORIGINAL_LOCATION = false;
private LiftOver liftOver;
private long successfulIntervals = 0, failedIntervals = 0;
public void initialize() {
try {
liftOver = new LiftOver(CHAIN);
} catch (RuntimeException e) {
throw new UserException.BadInput("there is a problem with the chain file you are using: " + e.getMessage());
}
liftOver.setLiftOverMinMatch(LiftOver.DEFAULT_LIFTOVER_MINMATCH);
try {
final SAMFileHeader toHeader = new SAMFileReader(NEW_SEQ_DICT).getFileHeader();
liftOver.validateToSequences(toHeader.getSequenceDictionary());
} catch (RuntimeException e) {
throw new UserException.BadInput("the chain file you are using is not compatible with the reference you are trying to lift over to; please use the appropriate chain file for the given reference");
}
String trackName = variantCollection.variants.getName();
Set<String> samples = SampleUtils.getSampleListWithVCFHeader(getToolkit(), Arrays.asList(trackName));
Map<String, VCFHeader> vcfHeaders = GATKVCFUtils.getVCFHeadersFromRods(getToolkit(), Arrays.asList(trackName));
Set<VCFHeaderLine> metaData = new HashSet<>();
if ( vcfHeaders.containsKey(trackName) )
metaData.addAll(vcfHeaders.get(trackName).getMetaDataInSortedOrder());
if ( RECORD_ORIGINAL_LOCATION ) {
metaData.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.ORIGINAL_CONTIG_KEY));
metaData.add(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.ORIGINAL_START_KEY));
}
final VCFHeader vcfHeader = new VCFHeader(metaData, samples);
writer = VariantContextWriterFactory.create(file, getMasterSequenceDictionary(), EnumSet.of(Options.ALLOW_MISSING_FIELDS_IN_HEADER));
writer.writeHeader(vcfHeader);
}
private void convertAndWrite(VariantContext vc, ReferenceContext ref) {
final Interval fromInterval = new Interval(vc.getChr(), vc.getStart(), vc.getStart(), false, String.format("%s:%d", vc.getChr(), vc.getStart()));
final int length = vc.getEnd() - vc.getStart();
final Interval toInterval = liftOver.liftOver(fromInterval);
VariantContext originalVC = vc;
if ( toInterval != null ) {
// check whether the strand flips, and if so reverse complement everything
if ( fromInterval.isPositiveStrand() != toInterval.isPositiveStrand() && vc.isPointEvent() ) {
vc = GATKVariantContextUtils.reverseComplement(vc);
}
vc = new VariantContextBuilder(vc).loc(toInterval.getSequence(), toInterval.getStart(), toInterval.getStart() + length).make();
if ( RECORD_ORIGINAL_LOCATION ) {
vc = new VariantContextBuilder(vc)
.attribute(GATKVCFConstants.ORIGINAL_CONTIG_KEY, fromInterval.getSequence())
.attribute(GATKVCFConstants.ORIGINAL_START_KEY, fromInterval.getStart()).make();
}
if ( originalVC.isSNP() && originalVC.isBiallelic() && GATKVariantContextUtils.getSNPSubstitutionType(originalVC) != GATKVariantContextUtils.getSNPSubstitutionType(vc) ) {
logger.warn(String.format("VCF at %s / %d => %s / %d is switching substitution type %s/%s to %s/%s",
originalVC.getChr(), originalVC.getStart(), vc.getChr(), vc.getStart(),
originalVC.getReference(), originalVC.getAlternateAllele(0), vc.getReference(), vc.getAlternateAllele(0)));
}
writer.add(vc);
successfulIntervals++;
} else {
failedIntervals++;
}
}
public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
if ( tracker == null )
return 0;
Collection<VariantContext> VCs = tracker.getValues(variantCollection.variants, context.getLocation());
for ( VariantContext vc : VCs )
convertAndWrite(vc, ref);
return 0;
}
public Integer reduceInit() { return 0; }
public Integer reduce(Integer value, Integer sum) { return 0; }
public void onTraversalDone(Integer result) {
System.out.println("Converted " + successfulIntervals + " records; failed to convert " + failedIntervals + " records.");
writer.close();
}
}

View File

@ -1,304 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools.walkers.variantutils;
import org.broadinstitute.gatk.engine.walkers.Reference;
import org.broadinstitute.gatk.engine.walkers.RodWalker;
import org.broadinstitute.gatk.engine.walkers.Window;
import org.broadinstitute.gatk.utils.commandline.*;
import org.broadinstitute.gatk.engine.CommandLineGATK;
import org.broadinstitute.gatk.engine.arguments.StandardVariantContextInputArgumentCollection;
import org.broadinstitute.gatk.utils.contexts.AlignmentContext;
import org.broadinstitute.gatk.utils.contexts.ReferenceContext;
import org.broadinstitute.gatk.utils.refdata.RefMetaDataTracker;
import org.broadinstitute.gatk.utils.QualityUtils;
import org.broadinstitute.gatk.engine.SampleUtils;
import org.broadinstitute.gatk.utils.help.HelpConstants;
import org.broadinstitute.gatk.engine.GATKVCFUtils;
import org.broadinstitute.gatk.utils.variant.GATKVariantContextUtils;
import htsjdk.variant.vcf.*;
import org.broadinstitute.gatk.utils.help.DocumentedGATKFeature;
import htsjdk.variant.variantcontext.writer.VariantContextWriter;
import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.variantcontext.VariantContextBuilder;
import java.util.*;
/**
* Annotate a validation VCF with QC metrics
*
* <p>
* This tool is intended for vetting/assessing validation data (containing genotypes).
* The tool produces a VCF that is annotated with information pertaining to plate quality control and by
* default is soft-filtered by high no-call rate or low Hardy-Weinberg probability.
* If you have .ped files, please first convert them to VCF format.</p>
*
* <h3>Input</h3>
* <p>
* A validation VCF to annotate.
* </p>
*
* <h3>Output</h3>
* <p>
* An annotated VCF. Additionally, a table like the following will be output:
* </p>
* <pre>
* Total number of samples assayed: 185
* Total number of records processed: 152
* Number of Hardy-Weinberg violations: 34 (22%)
* Number of no-call violations: 12 (7%)
* Number of homozygous variant violations: 0 (0%)
* Number of records passing all filters: 106 (69%)
* Number of passing records that are polymorphic: 98 (92%)
* </pre>
*
* <h3>Usage example</h3>
* <pre>
* java -jar GenomeAnalysisTK.jar \
* -T VariantValidationAssessor \
* -R reference.fasta \
* -V input.vcf \
* -o output.vcf
* </pre>
*
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_VALIDATION, extraDocs = {CommandLineGATK.class} )
@Reference(window=@Window(start=0,stop=40))
public class VariantValidationAssessor extends RodWalker<VariantContext,Integer> {
@ArgumentCollection
protected StandardVariantContextInputArgumentCollection variantCollection = new StandardVariantContextInputArgumentCollection();
@Output(doc="File to which variants should be written")
protected VariantContextWriter vcfwriter = null;
@Argument(fullName="maxHardy", doc="Maximum phred-scaled Hardy-Weinberg violation pvalue to consider an assay valid", required=false)
protected double maxHardy = 20.0;
/**
* To disable, set to a value greater than 1.
*/
@Argument(fullName="maxNoCall", doc="Maximum no-call rate (as a fraction) to consider an assay valid", required=false)
protected double maxNoCall = 0.05;
/**
* To disable, set to a value greater than 1.
*/
@Argument(fullName="maxHomVar", doc="Maximum homozygous variant rate (as a fraction) to consider an assay valid", required=false)
protected double maxHomNonref = 1.1;
//@Argument(fullName="populationFile", shortName="populations", doc="A tab-delimited file relating individuals to populations,"+
// "used for smart Hardy-Weinberg annotation",required = false)
//private File popFile = null;
// sample names
private TreeSet<String> sampleNames = null;
// variant context records
private ArrayList<VariantContext> records = new ArrayList<VariantContext>();
// statistics
private int numRecords = 0;
private int numHWViolations = 0;
private int numNoCallViolations = 0;
private int numHomVarViolations = 0;
private int numTrueVariants = 0;
//private HashMap<String,String> samplesToPopulation;
public void initialize() {
//if ( popFile != null ) {
// samplesToPopulation = parsePopulationFile(popFile);
//}
}
public Integer reduceInit() {
return 0;
}
public VariantContext map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
if ( tracker == null )
return null;
VariantContext vc = tracker.getFirstValue(variantCollection.variants, ref.getLocus());
// ignore places where we don't have a variant
if ( vc == null )
return null;
if ( sampleNames == null )
sampleNames = new TreeSet<String>(vc.getSampleNames());
return addVariantInformationToCall(vc);
}
public Integer reduce(VariantContext call, Integer numVariants) {
if ( call != null ) {
numVariants++;
records.add(call);
}
return numVariants;
}
public void onTraversalDone(Integer finalReduce) {
final List<String> inputNames = Arrays.asList(variantCollection.variants.getName());
// setup the header fields
Set<VCFHeaderLine> hInfo = new HashSet<VCFHeaderLine>();
hInfo.addAll(GATKVCFUtils.getHeaderFields(getToolkit(), inputNames));
// set up the info and filter headers
hInfo.add(new VCFInfoHeaderLine("NoCallPct", 1, VCFHeaderLineType.Float, "Percent of no-calls"));
hInfo.add(new VCFInfoHeaderLine("HomRefPct", 1, VCFHeaderLineType.Float, "Percent of homozygous reference genotypes"));
hInfo.add(new VCFInfoHeaderLine("HetPct", 1, VCFHeaderLineType.Float, "Percent of heterozygous genotypes"));
hInfo.add(new VCFInfoHeaderLine("HomVarPct", 1, VCFHeaderLineType.Float, "Percent homozygous variant genotypes"));
hInfo.add(new VCFInfoHeaderLine("HW", 1, VCFHeaderLineType.Float, "Phred-scaled Hardy-Weinberg violation p-value"));
hInfo.add(VCFStandardHeaderLines.getInfoLine(VCFConstants.ALLELE_COUNT_KEY));
hInfo.add(VCFStandardHeaderLines.getInfoLine(VCFConstants.ALLELE_NUMBER_KEY));
hInfo.add(new VCFFilterHeaderLine("HardyWeinbergViolation", "The validation is in Hardy-Weinberg violation"));
hInfo.add(new VCFFilterHeaderLine("HighNoCallRate", "The validation no-call rate is too high"));
hInfo.add(new VCFFilterHeaderLine("TooManyHomVars", "The validation homozygous variant rate is too high"));
// print out (and add to headers) the validation metrics
System.out.println(String.format("Total number of samples assayed:\t\t\t%d", sampleNames.size()));
hInfo.add(new VCFHeaderLine("ValidationMetrics_SamplesAssayed", String.format("%d", sampleNames.size())));
System.out.println(String.format("Total number of records processed:\t\t\t%d", numRecords));
hInfo.add(new VCFHeaderLine("ValidationMetrics_RecordsProcessed", String.format("%d", numRecords)));
if ( numRecords > 0 ) {
System.out.println(String.format("Number of Hardy-Weinberg violations:\t\t\t%d (%d%%)", numHWViolations, 100*numHWViolations/numRecords));
hInfo.add(new VCFHeaderLine("ValidationMetrics_HardyWeinbergViolations", String.format("\"%d (%d%%)\"", numHWViolations, 100*numHWViolations/numRecords)));
System.out.println(String.format("Number of no-call violations:\t\t\t\t%d (%d%%)", numNoCallViolations, 100*numNoCallViolations/numRecords));
hInfo.add(new VCFHeaderLine("ValidationMetrics_NoCallViolations", String.format("\"%d (%d%%)\"", numNoCallViolations, 100*numNoCallViolations/numRecords)));
System.out.println(String.format("Number of homozygous variant violations:\t\t%d (%d%%)", numHomVarViolations, 100*numHomVarViolations/numRecords));
hInfo.add(new VCFHeaderLine("ValidationMetrics_HomVarViolations", String.format("\"%d (%d%%)\"", numHomVarViolations, 100*numHomVarViolations/numRecords)));
int goodRecords = numRecords - numHWViolations - numNoCallViolations - numHomVarViolations;
System.out.println(String.format("Number of records passing all filters:\t\t\t%d (%d%%)", goodRecords, 100*goodRecords/numRecords));
hInfo.add(new VCFHeaderLine("ValidationMetrics_RecordsPassingFilters", String.format("\"%d (%d%%)\"", goodRecords, 100*goodRecords/numRecords)));
if ( goodRecords > 0 ) {
System.out.println(String.format("Number of passing records that are polymorphic:\t\t%d (%d%%)", numTrueVariants, 100*numTrueVariants/goodRecords));
hInfo.add(new VCFHeaderLine("ValidationMetrics_PolymorphicPassingRecords", String.format("\"%d (%d%%)\"", numTrueVariants, 100*numTrueVariants/goodRecords)));
}
}
vcfwriter.writeHeader(new VCFHeader(hInfo, SampleUtils.getUniqueSamplesFromRods(getToolkit(), inputNames)));
for ( VariantContext record : records )
vcfwriter.add(record);
}
private VariantContext addVariantInformationToCall(VariantContext vContext) {
// check possible filters
double hwPvalue = hardyWeinbergCalculation(vContext);
double hwScore = Math.abs(QualityUtils.phredScaleErrorRate(hwPvalue));
double noCallProp = (double)vContext.getNoCallCount() / (double)vContext.getNSamples();
double homRefProp = (double)vContext.getHomRefCount() / (double)vContext.getNSamples();
double hetProp = (double)vContext.getHetCount() / (double)vContext.getNSamples();
double homVarProp = (double)vContext.getHomVarCount() / (double)vContext.getNSamples();
boolean isViolation = false;
Set<String> filters = new HashSet<String>();
if ( noCallProp > maxNoCall ) {
filters.add("HighNoCallRate");
numNoCallViolations++;
isViolation = true;
} else if ( hwScore > maxHardy ) {
filters.add("HardyWeinbergViolation");
numHWViolations++;
isViolation = true;
} else if ( homVarProp > maxHomNonref) {
filters.add("TooManyHomVars");
numHomVarViolations++;
isViolation = true;
}
VariantContextBuilder builder = new VariantContextBuilder(vContext).filters(filters);
numRecords++;
// add the info fields
builder.attribute("NoCallPct", String.format("%.1f", 100.0 * noCallProp));
builder.attribute("HomRefPct", String.format("%.1f", 100.0 * homRefProp));
builder.attribute("HomVarPct", String.format("%.1f", 100.0 * homVarProp));
builder.attribute("HetPct", String.format("%.1f", 100.0 * hetProp));
builder.attribute("HW", String.format("%.2f", hwScore));
Collection<Allele> altAlleles = vContext.getAlternateAlleles();
int altAlleleCount = altAlleles.size() == 0 ? 0 : vContext.getCalledChrCount(altAlleles.iterator().next());
if ( !isViolation && altAlleleCount > 0 )
numTrueVariants++;
builder.attribute(VCFConstants.ALLELE_COUNT_KEY, String.format("%d", altAlleleCount));
builder.attribute(VCFConstants.ALLELE_NUMBER_KEY, String.format("%d", vContext.getCalledChrCount()));
return builder.make();
}
private double hardyWeinbergCalculation(VariantContext vc) {
//if ( popFile != null ) {
// throw new GATKException("We still need to implement this!");
//} else {
return GATKVariantContextUtils.computeHardyWeinbergPvalue(vc);
//}
}
// TODO -- REWRITE THIS TO WORK WITH VARIANT CONTEXT
/******
private String smartHardy(ReferenceContext ref, VCFRecord rec) {
HashMap<String,ArrayList<Genotype>> genotypesByPopulation = new HashMap<String,ArrayList<Genotype>>(10);
HashMap<String,String> hardyWeinbergByPopulation = new HashMap<String,String>(10);
for ( String population : samplesToPopulation.values() ) {
genotypesByPopulation.put(population,new ArrayList<Genotype>());
}
//for ( String name : sampleNames ) {
// String pop = samplesToPopulation.get(name);
// if ( rec.getGenotype(name) != null ) {
// genotypesByPopulation.get(pop).add(rec.getGenotype(name));
// }
//}
for ( String population : samplesToPopulation.values() ) {
VCFVariationCall v = new VCFVariationCall(ref.getBase(),ref.getLocus(),VCFVariationCall.VARIANT_TYPE.SNP);
v.setGenotypeCalls(genotypesByPopulation.get(population));
hardyWeinbergByPopulation.put(population,HWCalc.annotate(null,ref,null,v));
}
return smartHardyString(hardyWeinbergByPopulation);
}
private String smartHardyString(HashMap<String,String> hwByPop) {
// for now just return the maximum:
int maxH = -100;
for ( String pop : samplesToPopulation.values() ) {
maxH = Integer.parseInt(hwByPop.get(pop)) > maxH ? Integer.parseInt(hwByPop.get(pop)) : maxH;
}
return String.format("%s",maxH);
}
*********/
}

View File

@ -1,60 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools.walkers.readutils;
import org.broadinstitute.gatk.engine.walkers.WalkerTest;
import org.testng.annotations.Test;
import java.util.Arrays;
import java.util.Collections;
/**
* Created with IntelliJ IDEA.
* User: delangel
* Date: 4/13/13
* Time: 7:28 AM
* To change this template use File | Settings | File Templates.
*/
public class ReadAdaptorTrimmerIntegrationTest extends WalkerTest {
private String getBaseCommand(final String BAM) {
return "-T ReadAdaptorTrimmer -R " + b37KGReference +
" -I " + privateTestDir + BAM +
" -o %s";
}
@Test
public void testBasicTrimmer() {
WalkerTestSpec spec = new WalkerTestSpec( getBaseCommand("shortInsertTest.bam"), 1, Arrays.asList("c7d7f69e6b532ec693bfbd821c2e9766"));
executeTest(String.format("testBasicTrimmer"), spec);
}
@Test
public void testSkippingBadPairs() {
WalkerTestSpec spec = new WalkerTestSpec( getBaseCommand("shortInsertTest2.bam")+" -removeUnpairedReads", 1, Arrays.asList("f7aa76d1a2535774764e06ba610c21de"));
executeTest(String.format("testSkippingBadPairs"), spec);
}
}

View File

@ -1,54 +0,0 @@
/*
* Copyright 2012-2015 Broad Institute, Inc.
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.gatk.tools.walkers.variantutils;
import org.broadinstitute.gatk.utils.BaseTest;
import htsjdk.variant.variantcontext.Allele;
import htsjdk.variant.variantcontext.VariantContext;
import htsjdk.variant.variantcontext.VariantContextBuilder;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.util.ArrayList;
import java.util.List;
public class FilterLiftedVariantsUnitTest extends BaseTest {
@Test
public void testIndelAtEndOfContig() {
final List<Allele> alleles = new ArrayList<>(2);
alleles.add(Allele.create("AAAAA", true));
alleles.add(Allele.create("A", false));
final VariantContext vc = new VariantContextBuilder("test", "1", 10, 14, alleles).make();
final FilterLiftedVariants filter = new FilterLiftedVariants();
Assert.assertFalse(filter.filterOrWrite(new byte[]{'A'}, vc));
}
}

View File

@ -47,10 +47,19 @@ public class DeprecatedToolChecks {
deprecatedGATKWalkers.put("TableRecalibration", "2.0 (use PrintReads with -BQSR instead; see documentation for usage)");
deprecatedGATKWalkers.put("AlignmentWalker", "2.2 (no replacement)");
deprecatedGATKWalkers.put("CountBestAlignments", "2.2 (no replacement)");
deprecatedGATKWalkers.put("SomaticIndelDetector", "2.0 (replaced by the standalone tool Indelocator; see Cancer Tools documentation)");
deprecatedGATKWalkers.put("SomaticIndelDetector", "2.0 (replaced by MuTect2; see documentation for usage)");
deprecatedGATKWalkers.put("BeagleOutputToVCF", "3,4 (replaced by Beagle native functions; see Beagle 4 documentation at https://faculty.washington.edu/browning/beagle/beagle.html)");
deprecatedGATKWalkers.put("VariantsToBeagleUnphased", "3.4 (replaced by Beagle native functions; see Beagle 4 documentation at https://faculty.washington.edu/browning/beagle/beagle.html)");
deprecatedGATKWalkers.put("ProduceBeagleInput", "3.4 (replaced by Beagle native functions; see Beagle 4 documentation at https://faculty.washington.edu/browning/beagle/beagle.html)");
deprecatedGATKWalkers.put("ReadAdaptorTrimmer","3.5 (this tool was unsound and untested -- no specific replacement, see Picard tools for alternatives)");
deprecatedGATKWalkers.put("BaseCoverageDistribution","3.5 (use DiagnoseTargets instead; see documentation for usage)");
deprecatedGATKWalkers.put("CoveredByNSamplesSites","3.5 (use DiagnoseTargets instead; see documentation for usage)");
deprecatedGATKWalkers.put("FindCoveredIntervals","3.5 (use DiagnoseTargets instead; see documentation for usage)");
deprecatedGATKWalkers.put("VariantValidationAssessor","3.5 (this tool was unsound and untested -- no replacement)");
deprecatedGATKWalkers.put("LiftOverVariants","3.5 (use Picard LiftoverVCF instead; see documentation for usage)");
deprecatedGATKWalkers.put("FilterLiftedVariants","3.5 (use Picard LiftoverVCF instead; see documentation for usage)");
deprecatedGATKWalkers.put("ListAnnotations","3.5 (this tool was impractical; see the online documentation instead)");
}
// Mapping from walker name to major version number where the walker first disappeared and optional replacement options

View File

@ -1,83 +0,0 @@
#!/usr/bin/perl -w
# Runs the liftover tool on a VCF and properly handles the output
use strict;
use Getopt::Long;
my $in = undef;
my $gatk = undef;
my $chain = undef;
my $newRef = undef;
my $oldRef = undef;
my $out = undef;
my $tmp = "/tmp";
my $recordOriginalLocation = 0;
GetOptions( "vcf=s" => \$in,
"gatk=s" => \$gatk,
"chain=s" => \$chain,
"newRef=s" => \$newRef,
"oldRef=s" => \$oldRef,
"out=s" => \$out,
"tmp=s" => \$tmp,
"recordOriginalLocation" => \$recordOriginalLocation);
if ( !$in || !$gatk || !$chain || !$newRef || !$oldRef || !$out ) {
print "Usage: liftOverVCF.pl\n\t-vcf \t\t<input vcf>\n\t-gatk \t\t<path to gatk trunk>\n\t-chain \t\t<chain file>\n\t-newRef \t<path to new reference prefix; we will need newRef.dict, .fasta, and .fasta.fai>\n\t-oldRef \t<path to old reference prefix; we will need oldRef.fasta>\n\t-out \t\t<output vcf>\n\t-tmp \t\t<temp file location; defaults to /tmp>\n\t-recordOriginalLocation \t\t<Should we record what the original location was in the INFO field?; defaults to false>\n";
print "Example: ./liftOverVCF.pl\n\t-vcf /humgen/gsa-hpprojects/GATK/data/Comparisons/Validated/1kg_snp_validation/all_validation_batches.b36.vcf\n\t-chain b36ToHg19.broad.over.chain\n\t-out lifted.hg19.vcf\n\t-gatk /humgen/gsa-scr1/ebanks/Sting_dev\n\t-newRef /seq/references/Homo_sapiens_assembly19/v0/Homo_sapiens_assembly19\n\t-oldRef /humgen/1kg/reference/human_b36_both\n";
exit(1);
}
# generate a random number
my $random_number = rand();
my $tmp_prefix = "$tmp/$random_number";
print "Writing temporary files to prefix: $tmp_prefix\n";
my $unsorted_vcf = "$tmp_prefix.unsorted.vcf";
# lift over the file
print "Lifting over the vcf...";
my $cmd = "java -jar $gatk/dist/GenomeAnalysisTK.jar -T LiftoverVariants -R $oldRef.fasta -V:variant $in -o $unsorted_vcf -chain $chain -dict $newRef.dict -U LENIENT_VCF_PROCESSING";
if ($recordOriginalLocation) {
$cmd .= " -recordOriginalLocation";
}
system($cmd) == 0 or quit("The liftover step failed. Please correct the necessary errors before retrying.");
# we need to sort the lifted over file now
print "\nRe-sorting the vcf...\n";
my $sorted_vcf = "$tmp_prefix.sorted.vcf";
open(SORTED, ">$sorted_vcf") or die "can't open $sorted_vcf: $!";
# write the header
open(UNSORTED, "< $unsorted_vcf") or die "can't open $unsorted_vcf: $!";
my $inHeader = 1;
while ( $inHeader == 1 ) {
my $line = <UNSORTED>;
if ( $line !~ m/^#/ ) {
$inHeader = 0;
} else {
print SORTED "$line";
}
}
close(UNSORTED);
close(SORTED);
$cmd = "grep \"^#\" -v $unsorted_vcf | sort -n -k2 -T $tmp | $gatk/public/perl/sortByRef.pl --tmp $tmp - $newRef.fasta.fai >> $sorted_vcf";
system($cmd) == 0 or quit("The sorting step failed. Please correct the necessary errors before retrying.");
# Filter the VCF for bad records
print "\nFixing/removing bad records...\n";
$cmd = "java -jar $gatk/dist/GenomeAnalysisTK.jar -T FilterLiftedVariants -R $newRef.fasta -V:variant $sorted_vcf -o $out -U LENIENT_VCF_PROCESSING";
system($cmd) == 0 or quit("The filtering step failed. Please correct the necessary errors before retrying.");
# clean up
unlink $unsorted_vcf;
unlink $sorted_vcf;
my $sorted_index = "$sorted_vcf.idx";
unlink $sorted_index;
print "\nDone!\n";
sub quit {
print "\n$_[0]\n";
exit(1);
}

View File

@ -1,127 +0,0 @@
#!/usr/bin/perl -w
use strict;
use Getopt::Long;
sub usage {
print "\nUsage:\n";
print "sortByRef.pl [--k POS] [--tmp dir] INPUT REF_DICT\n\n";
print " Sorts lines of the input file INFILE according\n";
print " to the reference contig order specified by the\n";
print " reference dictionary REF_DICT (.fai file).\n";
print " The sort is stable. If -k option is not specified,\n";
print " it is assumed that the contig name is the first\n";
print " field in each line.\n\n";
print " INPUT input file to sort. If '-' is specified, \n";
print " then reads from STDIN.\n";
print " REF_DICT .fai file, or ANY file that has contigs, in the\n";
print " desired soting order, as its first column.\n";
print " --k POS : contig name is in the field POS (1-based)\n";
print " of input lines.\n\n";
print " --tmp DIR : temp directory [default=/tmp]\n\n";
exit(1);
}
my $pos = 1;
my $tmp = "/tmp";
GetOptions( "k:i" => \$pos,
"tmp=s" => \$tmp);
$pos--;
usage() if ( scalar(@ARGV) == 0 );
if ( scalar(@ARGV) != 2 ) {
print "Wrong number of arguments\n";
usage();
}
my $input_file = $ARGV[0];
my $dict_file = $ARGV[1];
open(DICT, "< $dict_file") or die("Can not open $dict_file: $!");
my %ref_order;
my $n = 0;
while ( <DICT> ) {
chomp;
my ($contig, $rest) = split '\s';
die("Dictionary file is probably corrupt: multiple instances of contig $contig") if ( defined $ref_order{$contig} );
$ref_order{$contig} = $n;
$n++;
}
close DICT;
#we have loaded contig ordering now
my $INPUT;
if ($input_file eq "-" ) {
$INPUT = "STDIN";
} else {
open($INPUT, "< $input_file") or die("Can not open $input_file: $!");
}
my %temp_outputs;
while ( <$INPUT> ) {
my @fields = split '\s';
die("Specified field position exceeds the number of fields:\n$_")
if ( $pos >= scalar(@fields) );
my $contig = $fields[$pos];
if ( $contig =~ m/:/ ) {
my @loc = split(/:/, $contig);
# print $contig . " " . $loc[0] . "\n";
$contig = $loc[0]
}
chomp $contig if ( $pos == scalar(@fields) - 1 ); # if last field in line
my $order;
if ( defined $ref_order{$contig} ) { $order = $ref_order{$contig}; }
else {
$ref_order{$contig} = $n;
$order = $n; # input line has contig that was not in the dict;
$n++; # this contig will go at the end of the output,
# after all known contigs
}
my $fhandle;
if ( defined $temp_outputs{$order} ) { $fhandle = $temp_outputs{$order} }
else {
#print "opening $order $$ $_\n";
open( $fhandle, " > $tmp/sortByRef.$$.$order.tmp" ) or
die ( "Can not open temporary file $order: $!");
$temp_outputs{$order} = $fhandle;
}
# we got the handle to the temp file that keeps all
# lines with contig $contig
print $fhandle $_; # send current line to its corresponding temp file
}
close $INPUT;
foreach my $f ( values %temp_outputs ) { close $f; }
# now collect back into single output stream:
for ( my $i = 0 ; $i < $n ; $i++ ) {
# if we did not have any lines on contig $i, then there's
# no temp file and nothing to do
next if ( ! defined $temp_outputs{$i} ) ;
my $f;
open ( $f, "< $tmp/sortByRef.$$.$i.tmp" );
while ( <$f> ) { print ; }
close $f;
unlink "$tmp/sortByRef.$$.$i.tmp";
}