Write a new tool for spliting reads that have N cigar string.

For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elemments is plit to k+1 reads. The split is done by hard clipping the bases rest of the bases.

In order to do it, few changes were introduced to some other clipping methods:
- make a segnificant change in ClippingOp.hardClip() that prevent the spliting of read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns

create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils

create integration test for that class

small change in a comment in FullProcessingPipeline

last commit:

Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficiant
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)

- fix ReadUtilsUnitTest.testReadWithNs acoording to the defult behaviour of getReadCoordinateForReferenceCoordinate (in case of refernce index that fall into deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs

- Allow the user to print the split positions (not working proparly currently)
This commit is contained in:
Ami Levy-Moonshine 2013-11-14 13:09:38 -05:00
parent ece346689c
commit 6da53aea09
10 changed files with 803 additions and 153 deletions

View File

@ -0,0 +1,266 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012 Broad Institute, Inc.
* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 4. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 5. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 6. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 7. MISCELLANEOUS
* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.sting.gatk.walkers.rnaseq;
import net.sf.samtools.*;
import org.broadinstitute.sting.commandline.Argument;
import org.broadinstitute.sting.commandline.Output;
import org.broadinstitute.sting.gatk.CommandLineGATK;
import org.broadinstitute.sting.gatk.GenomeAnalysisEngine;
import org.broadinstitute.sting.gatk.contexts.ReferenceContext;
import org.broadinstitute.sting.gatk.io.StingSAMFileWriter;
import org.broadinstitute.sting.gatk.iterators.ReadTransformer;
import org.broadinstitute.sting.gatk.iterators.ReadTransformersMode;
import org.broadinstitute.sting.gatk.refdata.RefMetaDataTracker;
import org.broadinstitute.sting.gatk.walkers.BAQMode;
import org.broadinstitute.sting.gatk.walkers.DataSource;
import org.broadinstitute.sting.gatk.walkers.ReadWalker;
import org.broadinstitute.sting.gatk.walkers.Requires;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.baq.BAQ;
import org.broadinstitute.sting.utils.clipping.ReadClipper;
import org.broadinstitute.sting.utils.exceptions.UserException;
import org.broadinstitute.sting.utils.help.DocumentedGATKFeature;
import org.broadinstitute.sting.utils.help.HelpConstants;
import org.broadinstitute.sting.utils.sam.CigarUtils;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
import java.io.PrintStream;
import java.util.*;
/**
*
* Splits reads that contain Ns in their cigar string (e.g. spanning splicing events).
*
* Identifies all N cigar elements and creates k+1 new reads (where k is the number of N cigar elements).
* The first read includes the bases that are to the left of the first N element, while the part of the read that is to the right of the N
* (including the Ns) is hard clipped and so on for the rest of the new reads.
*
*
* User: ami
* Date: 11/14/13
* Time: 11:52 AM
*/
@DocumentedGATKFeature( groupName = HelpConstants.DOCS_CAT_DATA, extraDocs = {CommandLineGATK.class} )
@Requires({DataSource.READS, DataSource.REFERENCE})
public class SplitNCigarReads extends ReadWalker<List<GATKSAMRecord>, SAMFileWriter> {
@Output(doc="Write output to this BAM filename instead of STDOUT")
StingSAMFileWriter out;
@Argument(required = false)
PrintStream splitPositionsOutput = System.out;
@Argument(fullName="outputAsBED", shortName="bed", doc="Output as BED file", required=false)
boolean outputAsBED = false;
@Argument(fullName="printSplitPositions", shortName="splitPosition", doc="print the split positions", required=false)
boolean printSplitPositions = false;
public static final String PROGRAM_RECORD_NAME = "GATK SplitNCigarReads"; // The name that will go in the @PG tag
// public static SplitPositions splitPositions = null;
public static String results = "";
/**
* The initialize function.
*/
public void initialize() {
final GenomeAnalysisEngine toolkit = getToolkit();
final boolean preSorted = false;
if (getToolkit() != null) {
Utils.setupWriter(out, toolkit, toolkit.getSAMFileHeader(), preSorted, this, PROGRAM_RECORD_NAME);
}
out.setPresorted(preSorted);
// splitPositions = new SplitPositions();
}
/**
* The reads map function.
*
* @param ref the reference bases that correspond to our read, if a reference was provided
* @param read the read itself, as a GATKSAMRecord
*
* @return a list of split read if there are N's in the cigar string, or the read itself.
*/
public List<GATKSAMRecord> map(final ReferenceContext ref,final GATKSAMRecord read,final RefMetaDataTracker metaDataTracker) {
return splitNCigarRead(read);
}
/**
* Goes through the cigar string of the read and create new reads for each consecutive non-N elements (while hard clipping the rest of the read).
* For example: for a read with cigar '1H2M2D1M2N1M2I1N1M2S' 3 new reads will be created with cigar strings: 1H2M2D1M, 1M2I and 1M2S
*
* @param read the read to split, as a GATKSAMRecord
* @return a list of split read if there are N's in the cigar string, or the read itself.
*/
public static List<GATKSAMRecord> splitNCigarRead(final GATKSAMRecord read){
final List<GATKSAMRecord> splitReads = new ArrayList<>();
int firstCigarIndex = 0;
for (int i = 0; i < read.getCigar().numCigarElements(); i ++){
final CigarElement cigarElement = read.getCigar().getCigarElement(i);
if(cigarElement.getOperator() == CigarOperator.N){
final boolean addToSplitPositions = true;
splitReads.add(splitReadBasedOnCigar(read,firstCigarIndex,i, addToSplitPositions));
firstCigarIndex = i+1;
}
}
//if there are no N's in the read
if (firstCigarIndex == 0)
splitReads.add(read);
//add the last section of the read: from the last N to the the end of the read
// (it will be done for all the usual cigar string that does not end with N)
else if(firstCigarIndex < read.getCigar().numCigarElements()){
final boolean addToSplitPositions = false;
splitReads.add(splitReadBasedOnCigar(read,firstCigarIndex,read.getCigar().numCigarElements(), addToSplitPositions));
}
return splitReads;
}
/**
* reduceInit is called once before any calls to the map function. We use it here to setup the splitPositionsOutput
* bam file, if it was specified on the command line
*
* @return SAMFileWriter, set to the BAM splitPositionsOutput file if the command line option was set, null otherwise
*/
public SAMFileWriter reduceInit() {
return out;
}
/**
* given a read and a splitPositionsOutput location, reduce by emitting the read
*
* @param reads the split reads itself
* @param output the splitPositionsOutput source
* @return the SAMFileWriter, so that the next reduce can emit to the same source
*/
public SAMFileWriter reduce(final List<GATKSAMRecord> reads,final SAMFileWriter output ) {
for (final GATKSAMRecord read: reads)
output.addAlignment(read);
return output;
}
public void onTraversalDone(SAMFileWriter readResult) {
super.onTraversalDone(readResult);
if(printSplitPositions)
splitPositionsOutput.println(results);
// splitPositionsOutput.println(splitPositions);
}
private static GATKSAMRecord splitReadBasedOnCigar(final GATKSAMRecord read, final int cigarStartIndex, final int cigarEndIndex, final boolean addToSplitPositions){
int cigarFirstIndex = cigarStartIndex;
int cigarSecondIndex = cigarEndIndex;
//in case a section of the read is end or start with D (for example the first section in 1M1D1N1M is 1M1D), we should trim this cigar element
// it can be if, but it was kept as while to make sure the code can work with Cigar string that were not "cleaned"
while(read.getCigar().getCigarElement(cigarFirstIndex).getOperator().equals(CigarOperator.D))
cigarFirstIndex++;
while(read.getCigar().getCigarElement(cigarSecondIndex-1).getOperator().equals(CigarOperator.D))
cigarSecondIndex--;
if(cigarFirstIndex > cigarSecondIndex)
throw new UserException.BadInput("Cannot split this read (might be an empty section between Ns, for example 1N1D1N): "+read.getCigarString());
// we keep only the section of the read that is aligned to the reference between startRefIndex and stopRefIndex (inclusive).
// the other sections of the read are clipped:
final int startRefIndex = read.getOriginalAlignmentStart() + CigarUtils.countRefBasesBasedOnCigar(read,0,cigarFirstIndex); //goes through the prefix of the cigar (up to cigarStartIndex) and move the reference index.
final int stopRefIndex = startRefIndex + CigarUtils.countRefBasesBasedOnCigar(read,cigarFirstIndex,cigarSecondIndex)-1; //goes through a consecutive non-N section of the cigar (up to cigarEndIndex) and move the reference index.
if(addToSplitPositions){
final int splitPosition = startRefIndex + CigarUtils.countRefBasesBasedOnCigar(read,cigarFirstIndex,cigarEndIndex); //we use cigarEndIndex instead of cigarSecondIndex so we won't take into account the D's at the end.
final String contig = read.getReferenceName();
// results += String.format("%s:%d-%d\n", contig, splitPosition, splitPosition );
// splitPositions.addSplitPosition(contig,splitPosition);
}
return ReadClipper.hardClipToRegionIncludingClippedBases(read, startRefIndex, stopRefIndex);
}
private class SplitPosition {
public final String contig;
public final int start;
public final int end;
public SplitPosition(final String c, final int position) {
contig = c;
start = position;
end = position;
}
}
private class SplitPositions {
private final HashSet<SplitPosition> splitPositions;
public SplitPositions() {
splitPositions = new HashSet<>();
}
public void addSplitPosition(final String contig, final int position) {
final SplitPosition newSplitPosition = new SplitPosition(contig, position);
splitPositions.add(newSplitPosition);
}
public String toString() {
String result = ""; // = "Contig\tstart\tstop\n";
for (SplitPosition position: splitPositions) {
if (outputAsBED)
result += String.format("%s\t%d\t%d\n", position.contig, position.start-1, position.end );
else
result += String.format("%s:%d-%d\n", position.contig, position.start, position.end );
}
return result;
}
}
}

View File

@ -0,0 +1,83 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012 Broad Institute, Inc.
* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 4. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 5. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 6. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 7. MISCELLANEOUS
* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.sting.gatk.walkers.rnaseq;
import org.broadinstitute.sting.WalkerTest;
import org.testng.annotations.Test;
import org.broadinstitute.sting.BaseTest;
import java.util.Arrays;
/**
* Created with IntelliJ IDEA.
* User: ami
* Date: 12/5/13
* Time: 1:04 PM
*/
public class SplitNCigarReadsIntegrationTests extends WalkerTest {
@Test
// contain reads without N's, with N's and with N's and I's
public void testSplitWithInsertions() {
WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
"-T SplitNCigarReads -R " + BaseTest.b37KGReference + " -I " + BaseTest.privateTestDir + "SplitNCigarReads.integrationTest.unsplitReads.withI.bam -o %s -U ALLOW_N_CIGAR_READS", 1,
Arrays.asList("037c72fe1572efb63cccbe0a8dda3cb1"));
executeTest("test split N cigar reads with insertions", spec);
}
@Test
// contain reads without N's, with N's and with N's and D's, and also with more then one N element in the cigar.
public void testSplitWithDeletions() {
WalkerTest.WalkerTestSpec spec = new WalkerTest.WalkerTestSpec(
"-T SplitNCigarReads -R " + BaseTest.b37KGReference + " -I " + BaseTest.privateTestDir + "SplitNCigarReads.integrationTest.unsplitReads.withD.bam -o %s -U ALLOW_N_CIGAR_READS", 1,
Arrays.asList("8472005c16353715025353d6d453faf4"));
executeTest("test split N cigar reads with deletions", spec);
}
}

View File

@ -0,0 +1,182 @@
/*
* By downloading the PROGRAM you agree to the following terms of use:
*
* BROAD INSTITUTE - SOFTWARE LICENSE AGREEMENT - FOR ACADEMIC NON-COMMERCIAL RESEARCH PURPOSES ONLY
*
* This Agreement is made between the Broad Institute, Inc. with a principal address at 7 Cambridge Center, Cambridge, MA 02142 (BROAD) and the LICENSEE and is effective at the date the downloading is completed (EFFECTIVE DATE).
*
* WHEREAS, LICENSEE desires to license the PROGRAM, as defined hereinafter, and BROAD wishes to have this PROGRAM utilized in the public interest, subject only to the royalty-free, nonexclusive, nontransferable license rights of the United States Government pursuant to 48 CFR 52.227-14; and
* WHEREAS, LICENSEE desires to license the PROGRAM and BROAD desires to grant a license on the following terms and conditions.
* NOW, THEREFORE, in consideration of the promises and covenants made herein, the parties hereto agree as follows:
*
* 1. DEFINITIONS
* 1.1 PROGRAM shall mean copyright in the object code and source code known as GATK2 and related documentation, if any, as they exist on the EFFECTIVE DATE and can be downloaded from http://www.broadinstitute/GATK on the EFFECTIVE DATE.
*
* 2. LICENSE
* 2.1 Grant. Subject to the terms of this Agreement, BROAD hereby grants to LICENSEE, solely for academic non-commercial research purposes, a non-exclusive, non-transferable license to: (a) download, execute and display the PROGRAM and (b) create bug fixes and modify the PROGRAM.
* The LICENSEE may apply the PROGRAM in a pipeline to data owned by users other than the LICENSEE and provide these users the results of the PROGRAM provided LICENSEE does so for academic non-commercial purposes only. For clarification purposes, academic sponsored research is not a commercial use under the terms of this Agreement.
* 2.2 No Sublicensing or Additional Rights. LICENSEE shall not sublicense or distribute the PROGRAM, in whole or in part, without prior written permission from BROAD. LICENSEE shall ensure that all of its users agree to the terms of this Agreement. LICENSEE further agrees that it shall not put the PROGRAM on a network, server, or other similar technology that may be accessed by anyone other than the LICENSEE and its employees and users who have agreed to the terms of this agreement.
* 2.3 License Limitations. Nothing in this Agreement shall be construed to confer any rights upon LICENSEE by implication, estoppel, or otherwise to any computer software, trademark, intellectual property, or patent rights of BROAD, or of any other entity, except as expressly granted herein. LICENSEE agrees that the PROGRAM, in whole or part, shall not be used for any commercial purpose, including without limitation, as the basis of a commercial software or hardware product or to provide services. LICENSEE further agrees that the PROGRAM shall not be copied or otherwise adapted in order to circumvent the need for obtaining a license for use of the PROGRAM.
*
* 3. OWNERSHIP OF INTELLECTUAL PROPERTY
* LICENSEE acknowledges that title to the PROGRAM shall remain with BROAD. The PROGRAM is marked with the following BROAD copyright notice and notice of attribution to contributors. LICENSEE shall retain such notice on all copies. LICENSEE agrees to include appropriate attribution if any results obtained from use of the PROGRAM are included in any publication.
* Copyright 2012 Broad Institute, Inc.
* Notice of attribution: The GATK2 program was made available through the generosity of Medical and Population Genetics program at the Broad Institute, Inc.
* LICENSEE shall not use any trademark or trade name of BROAD, or any variation, adaptation, or abbreviation, of such marks or trade names, or any names of officers, faculty, students, employees, or agents of BROAD except as states above for attribution purposes.
*
* 4. INDEMNIFICATION
* LICENSEE shall indemnify, defend, and hold harmless BROAD, and their respective officers, faculty, students, employees, associated investigators and agents, and their respective successors, heirs and assigns, (Indemnitees), against any liability, damage, loss, or expense (including reasonable attorneys fees and expenses) incurred by or imposed upon any of the Indemnitees in connection with any claims, suits, actions, demands or judgments arising out of any theory of liability (including, without limitation, actions in the form of tort, warranty, or strict liability and regardless of whether such action has any factual basis) pursuant to any right or license granted under this Agreement.
*
* 5. NO REPRESENTATIONS OR WARRANTIES
* THE PROGRAM IS DELIVERED AS IS. BROAD MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE PROGRAM OR THE COPYRIGHT, EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, WHETHER OR NOT DISCOVERABLE. BROAD EXTENDS NO WARRANTIES OF ANY KIND AS TO PROGRAM CONFORMITY WITH WHATEVER USER MANUALS OR OTHER LITERATURE MAY BE ISSUED FROM TIME TO TIME.
* IN NO EVENT SHALL BROAD OR ITS RESPECTIVE DIRECTORS, OFFICERS, EMPLOYEES, AFFILIATED INVESTIGATORS AND AFFILIATES BE LIABLE FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES OF ANY KIND, INCLUDING, WITHOUT LIMITATION, ECONOMIC DAMAGES OR INJURY TO PROPERTY AND LOST PROFITS, REGARDLESS OF WHETHER BROAD SHALL BE ADVISED, SHALL HAVE OTHER REASON TO KNOW, OR IN FACT SHALL KNOW OF THE POSSIBILITY OF THE FOREGOING.
*
* 6. ASSIGNMENT
* This Agreement is personal to LICENSEE and any rights or obligations assigned by LICENSEE without the prior written consent of BROAD shall be null and void.
*
* 7. MISCELLANEOUS
* 7.1 Export Control. LICENSEE gives assurance that it will comply with all United States export control laws and regulations controlling the export of the PROGRAM, including, without limitation, all Export Administration Regulations of the United States Department of Commerce. Among other things, these laws and regulations prohibit, or require a license for, the export of certain types of software to specified countries.
* 7.2 Termination. LICENSEE shall have the right to terminate this Agreement for any reason upon prior written notice to BROAD. If LICENSEE breaches any provision hereunder, and fails to cure such breach within thirty (30) days, BROAD may terminate this Agreement immediately. Upon termination, LICENSEE shall provide BROAD with written assurance that the original and all copies of the PROGRAM have been destroyed, except that, upon prior written authorization from BROAD, LICENSEE may retain a copy for archive purposes.
* 7.3 Survival. The following provisions shall survive the expiration or termination of this Agreement: Articles 1, 3, 4, 5 and Sections 2.2, 2.3, 7.3, and 7.4.
* 7.4 Notice. Any notices under this Agreement shall be in writing, shall specifically refer to this Agreement, and shall be sent by hand, recognized national overnight courier, confirmed facsimile transmission, confirmed electronic mail, or registered or certified mail, postage prepaid, return receipt requested. All notices under this Agreement shall be deemed effective upon receipt.
* 7.5 Amendment and Waiver; Entire Agreement. This Agreement may be amended, supplemented, or otherwise modified only by means of a written instrument signed by all parties. Any waiver of any rights or failure to act in a specific instance shall relate only to such instance and shall not be construed as an agreement to waive any rights or fail to act in any other instance, whether or not similar. This Agreement constitutes the entire agreement among the parties with respect to its subject matter and supersedes prior agreements or understandings between the parties relating to its subject matter.
* 7.6 Binding Effect; Headings. This Agreement shall be binding upon and inure to the benefit of the parties and their respective permitted successors and assigns. All headings are for convenience only and shall not affect the meaning of any provision of this Agreement.
* 7.7 Governing Law. This Agreement shall be construed, governed, interpreted and applied in accordance with the internal laws of the Commonwealth of Massachusetts, U.S.A., without regard to conflict of laws principles.
*/
package org.broadinstitute.sting.gatk.walkers.rnaseq;
import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.CigarOperator;
import org.broadinstitute.sting.gatk.walkers.rnaseq.SplitNCigarReads;
import org.broadinstitute.sting.utils.clipping.ReadClipperTestUtils;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
import org.testng.Assert;
import org.testng.annotations.Test;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;
/**
*
* Tests all possible (and valid) cigar strings that might contain any cigar elements. It uses a code that were written to test the ReadClipper walker.
* For valid cigar sting in length 8 there are few thousands options, with N in every possible option and with more than one N (for example 1M1N1M1N1M1N2M).
* The cigarElements array is used to provide all the possible cigar element that might be included.
*
* User: ami
* Date: 11/14/13
* Time: 6:49 PM
*/
public class SplitNCigarReadsUnitTest {
final static CigarElement[] cigarElements = {
new CigarElement(1, CigarOperator.HARD_CLIP),
new CigarElement(1, CigarOperator.SOFT_CLIP),
new CigarElement(1, CigarOperator.INSERTION),
new CigarElement(1, CigarOperator.DELETION),
new CigarElement(1, CigarOperator.MATCH_OR_MISMATCH),
new CigarElement(1, CigarOperator.SKIPPED_REGION)
};
@Test(enabled = true)
public void splitReadAtN() {
final int cigarStringLength = 10;
final List<Cigar> cigarList = ReadClipperTestUtils.generateCigarList(cigarStringLength,cigarElements);
// For Debugging use those lines (instead of above cigarList) to create specific read:
//------------------------------------------------------------------------------------
// final GATKSAMRecord tmpRead = GATKSAMRecord.createRandomRead(6);
// tmpRead.setCigarString("1M1N1M");
// final List<Cigar> cigarList = new ArrayList<>();
// cigarList.add(tmpRead.getCigar());
for(Cigar cigar: cigarList){
final int numOfSplits = numOfNElements(cigar.getCigarElements());
if(numOfSplits != 0 && isCigarDoesNotHaveEmptyRegionsBetweenNs(cigar)){
GATKSAMRecord read = ReadClipperTestUtils.makeReadFromCigar(cigar);
List<GATKSAMRecord> splitReads = SplitNCigarReads.splitNCigarRead(read);
final int expectedReads = numOfSplits+1;
Assert.assertEquals(splitReads.size(),expectedReads,"wrong number of reads after split read with cigar: "+cigar+" at Ns [expected]: "+expectedReads+" [actual value]: "+splitReads.size());
final List<Integer> readLengths = consecutiveNonNElements(read.getCigar().getCigarElements());
int index = 0;
int offsetFromStart = 0;
for(GATKSAMRecord splitRead: splitReads){
int expectedLength = readLengths.get(index);
Assert.assertTrue(splitRead.getReadLength() == expectedLength,
"the "+index+" (starting with 0) split read has a wrong length.\n" +
"cigar of original read: "+cigar+"\n"+
"expected length: "+expectedLength+"\n"+
"actual length: "+splitRead.getReadLength()+"\n");
assertBases(splitRead.getReadBases(), read.getReadBases(), offsetFromStart);
index++;
offsetFromStart += expectedLength;
}
}
}
}
private int numOfNElements(final List<CigarElement> cigarElements){
int numOfNElements = 0;
for (CigarElement element: cigarElements){
if (element.getOperator() == CigarOperator.SKIPPED_REGION)
numOfNElements++;
}
return numOfNElements;
}
private static boolean isCigarDoesNotHaveEmptyRegionsBetweenNs(final Cigar cigar) {
boolean sawM = false;
boolean sawS = false;
for (CigarElement cigarElement : cigar.getCigarElements()) {
if (cigarElement.getOperator().equals(CigarOperator.SKIPPED_REGION)) {
if(!sawM && !sawS)
return false;
sawM = false;
sawS = false;
}
if (cigarElement.getOperator().equals(CigarOperator.MATCH_OR_MISMATCH))
sawM = true;
if (cigarElement.getOperator().equals(CigarOperator.SOFT_CLIP))
sawS = true;
}
if(!sawS && !sawM)
return false;
return true;
}
private List<Integer> consecutiveNonNElements(final List<CigarElement> cigarElements){
final LinkedList<Integer> results = new LinkedList<>();
int consecutiveLength = 0;
for(CigarElement element: cigarElements){
final CigarOperator op = element.getOperator();
if(op.equals(CigarOperator.MATCH_OR_MISMATCH) || op.equals(CigarOperator.SOFT_CLIP) || op.equals(CigarOperator.INSERTION)){
consecutiveLength += element.getLength();
}
else if(op.equals(CigarOperator.SKIPPED_REGION))
{
if(consecutiveLength != 0){
results.addLast(consecutiveLength);
consecutiveLength = 0;
}
}
}
if(consecutiveLength != 0)
results.addLast(consecutiveLength);
return results;
}
private void assertBases(final byte[] actualBase, final byte[] expectedBase, final int startIndex) {
for (int i = 0; i < actualBase.length; i++) {
Assert.assertEquals(actualBase[i], expectedBase[startIndex + i],"unmatched bases between: "+ Arrays.toString(actualBase)+"\nand:\n"+Arrays.toString(expectedBase)+"\nat position: "+i);
}
}
}

View File

@ -362,21 +362,16 @@ public class ClippingOp {
* @return a cloned version of read that has been properly trimmed down
*/
private GATKSAMRecord hardClip(GATKSAMRecord read, int start, int stop) {
final int firstBaseAfterSoftClips = read.getAlignmentStart() - read.getSoftStart();
final int lastBaseBeforeSoftClips = read.getSoftEnd() - read.getSoftStart();
if (start == firstBaseAfterSoftClips && stop == lastBaseBeforeSoftClips) // note that if the read has no soft clips, these constants will be 0 and read length - 1 (beauty of math).
return GATKSAMRecord.emptyRead(read);
// If the read is unmapped there is no Cigar string and neither should we create a new cigar string
CigarShift cigarShift = (read.getReadUnmappedFlag()) ? new CigarShift(new Cigar(), 0, 0) : hardClipCigar(read.getCigar(), start, stop);
final CigarShift cigarShift = (read.getReadUnmappedFlag()) ? new CigarShift(new Cigar(), 0, 0) : hardClipCigar(read.getCigar(), start, stop);
// the cigar may force a shift left or right (or both) in case we are left with insertions
// starting or ending the read after applying the hard clip on start/stop.
int newLength = read.getReadLength() - (stop - start + 1) - cigarShift.shiftFromStart - cigarShift.shiftFromEnd;
byte[] newBases = new byte[newLength];
byte[] newQuals = new byte[newLength];
int copyStart = (start == 0) ? stop + 1 + cigarShift.shiftFromStart : cigarShift.shiftFromStart;
final int newLength = read.getReadLength() - (stop - start + 1) - cigarShift.shiftFromStart - cigarShift.shiftFromEnd;
final byte[] newBases = new byte[newLength];
final byte[] newQuals = new byte[newLength];
final int copyStart = (start == 0) ? stop + 1 + cigarShift.shiftFromStart : cigarShift.shiftFromStart;
System.arraycopy(read.getReadBases(), copyStart, newBases, 0, newLength);
System.arraycopy(read.getBaseQualities(), copyStart, newQuals, 0, newLength);
@ -396,8 +391,8 @@ public class ClippingOp {
hardClippedRead.setAlignmentStart(read.getAlignmentStart() + calculateAlignmentStartShift(read.getCigar(), cigarShift.cigar));
if (read.hasBaseIndelQualities()) {
byte[] newBaseInsertionQuals = new byte[newLength];
byte[] newBaseDeletionQuals = new byte[newLength];
final byte[] newBaseInsertionQuals = new byte[newLength];
final byte[] newBaseDeletionQuals = new byte[newLength];
System.arraycopy(read.getBaseInsertionQualities(), copyStart, newBaseInsertionQuals, 0, newLength);
System.arraycopy(read.getBaseDeletionQualities(), copyStart, newBaseDeletionQuals, 0, newLength);
hardClippedRead.setBaseQualities(newBaseInsertionQuals, EventType.BASE_INSERTION);
@ -516,20 +511,20 @@ public class ClippingOp {
}
/**
* Checks if a hard clipped cigar left a read starting or ending with insertions/deletions
* Checks if a hard clipped cigar left a read starting or ending with deletions or gap (N)
* and cleans it up accordingly.
*
* @param cigar the original cigar
* @return an object with the shifts (see CigarShift class)
*/
private CigarShift cleanHardClippedCigar(Cigar cigar) {
Cigar cleanCigar = new Cigar();
private CigarShift cleanHardClippedCigar(final Cigar cigar) {
final Cigar cleanCigar = new Cigar();
int shiftFromStart = 0;
int shiftFromEnd = 0;
Stack<CigarElement> cigarStack = new Stack<CigarElement>();
Stack<CigarElement> inverseCigarStack = new Stack<CigarElement>();
final Stack<CigarElement> inverseCigarStack = new Stack<CigarElement>();
for (CigarElement cigarElement : cigar.getCigarElements())
for (final CigarElement cigarElement : cigar.getCigarElements())
cigarStack.push(cigarElement);
for (int i = 1; i <= 2; i++) {
@ -542,8 +537,8 @@ public class ClippingOp {
CigarElement cigarElement = cigarStack.pop();
if (!readHasStarted &&
// cigarElement.getOperator() != CigarOperator.INSERTION &&
cigarElement.getOperator() != CigarOperator.DELETION &&
cigarElement.getOperator() != CigarOperator.SKIPPED_REGION &&
cigarElement.getOperator() != CigarOperator.HARD_CLIP)
readHasStarted = true;
@ -553,6 +548,9 @@ public class ClippingOp {
else if (!readHasStarted && cigarElement.getOperator() == CigarOperator.DELETION)
totalHardClip += cigarElement.getLength();
else if (!readHasStarted && cigarElement.getOperator() == CigarOperator.SKIPPED_REGION)
totalHardClip += cigarElement.getLength();
if (readHasStarted) {
if (i == 1) {
if (!addedHardClips) {

View File

@ -30,6 +30,7 @@ import net.sf.samtools.CigarElement;
import net.sf.samtools.CigarOperator;
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import org.broadinstitute.sting.utils.recalibration.EventType;
import org.broadinstitute.sting.utils.sam.CigarUtils;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
import org.broadinstitute.sting.utils.sam.ReadUtils;
@ -348,20 +349,38 @@ public class ReadClipper {
public static GATKSAMRecord hardClipToRegion( final GATKSAMRecord read, final int refStart, final int refStop ) {
final int start = read.getAlignmentStart();
final int stop = read.getAlignmentEnd();
return hardClipToRegion(read, refStart, refStop,start,stop);
}
/**
* Hard clip the read to the variable region (from refStart to refStop) processing also the clipped bases
*
* @param read the read to be clipped
* @param refStart the beginning of the variant region (inclusive)
* @param refStop the end of the variant region (inclusive)
* @return the read hard clipped to the variant region
*/
public static GATKSAMRecord hardClipToRegionIncludingClippedBases( final GATKSAMRecord read, final int refStart, final int refStop ) {
final int start = read.getOriginalAlignmentStart();
final int stop = start + CigarUtils.countRefBasesBasedOnCigar(read,0,read.getCigarLength()) - 1;
return hardClipToRegion(read, refStart, refStop,start,stop);
}
private static GATKSAMRecord hardClipToRegion( final GATKSAMRecord read, final int refStart, final int refStop, final int alignmentStart, final int alignmentStop){
// check if the read is contained in region
if (start <= refStop && stop >= refStart) {
if (start < refStart && stop > refStop)
if (alignmentStart <= refStop && alignmentStop >= refStart) {
if (alignmentStart < refStart && alignmentStop > refStop)
return hardClipBothEndsByReferenceCoordinates(read, refStart - 1, refStop + 1);
else if (start < refStart)
else if (alignmentStart < refStart)
return hardClipByReferenceCoordinatesLeftTail(read, refStart - 1);
else if (stop > refStop)
else if (alignmentStop > refStop)
return hardClipByReferenceCoordinatesRightTail(read, refStop + 1);
return read;
} else
return GATKSAMRecord.emptyRead(read);
}
public static List<GATKSAMRecord> hardClipToRegion( final List<GATKSAMRecord> reads, final int refStart, final int refStop ) {
final List<GATKSAMRecord> returnList = new ArrayList<GATKSAMRecord>( reads.size() );
for( final GATKSAMRecord read : reads ) {

View File

@ -0,0 +1,167 @@
/*
* Copyright (c) 2012 The Broad Institute
*
* Permission is hereby granted, free of charge, to any person
* obtaining a copy of this software and associated documentation
* files (the "Software"), to deal in the Software without
* restriction, including without limitation the rights to use,
* copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following
* conditions:
*
* The above copyright notice and this permission notice shall be
* included in all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
* OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
* HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
* WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
* THE USE OR OTHER DEALINGS IN THE SOFTWARE.
*/
package org.broadinstitute.sting.utils.sam;
import net.sf.samtools.Cigar;
import net.sf.samtools.CigarElement;
import net.sf.samtools.CigarOperator;
import net.sf.samtools.TextCigarCodec;
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import java.util.Stack;
/**
* Created with IntelliJ IDEA.
* User: ami
* Date: 11/26/13
* Time: 11:33 AM
* To change this template use File | Settings | File Templates.
*/
public class CigarUtils {
/**
* Combines equal adjacent elements of a Cigar object
*
* @param rawCigar the cigar object
* @return a combined cigar object
*/
public static Cigar combineAdjacentCigarElements(Cigar rawCigar) {
Cigar combinedCigar = new Cigar();
CigarElement lastElement = null;
int lastElementLength = 0;
for (CigarElement cigarElement : rawCigar.getCigarElements()) {
if (lastElement != null &&
((lastElement.getOperator() == cigarElement.getOperator()) ||
(lastElement.getOperator() == CigarOperator.I && cigarElement.getOperator() == CigarOperator.D) ||
(lastElement.getOperator() == CigarOperator.D && cigarElement.getOperator() == CigarOperator.I)))
lastElementLength += cigarElement.getLength();
else
{
if (lastElement != null)
combinedCigar.add(new CigarElement(lastElementLength, lastElement.getOperator()));
lastElement = cigarElement;
lastElementLength = cigarElement.getLength();
}
}
if (lastElement != null)
combinedCigar.add(new CigarElement(lastElementLength, lastElement.getOperator()));
return combinedCigar;
}
public static Cigar invertCigar (Cigar cigar) {
Stack<CigarElement> cigarStack = new Stack<CigarElement>();
for (CigarElement cigarElement : cigar.getCigarElements())
cigarStack.push(cigarElement);
Cigar invertedCigar = new Cigar();
while (!cigarStack.isEmpty())
invertedCigar.add(cigarStack.pop());
return invertedCigar;
}
/**
* Checks whether or not the read has any cigar element that is not H or S
*
* @param read the read
* @return true if it has any M, I or D, false otherwise
*/
public static boolean readHasNonClippedBases(GATKSAMRecord read) {
for (CigarElement cigarElement : read.getCigar().getCigarElements())
if (cigarElement.getOperator() != CigarOperator.SOFT_CLIP && cigarElement.getOperator() != CigarOperator.HARD_CLIP)
return true;
return false;
}
public static Cigar cigarFromString(String cigarString) {
return TextCigarCodec.getSingleton().decode(cigarString);
}
/**
* A valid cigar object obeys the following rules:
* - No Hard/Soft clips in the middle of the read
* - No deletions in the beginning / end of the read
* - No repeated adjacent element (e.g. 1M2M -> this should be 3M)
* - No consecutive I/D elements
**/
public static boolean isCigarValid(Cigar cigar) {
if (cigar.isValid(null, -1) == null) { // This should take care of most invalid Cigar Strings (picard's "exhaustive" implementation)
Stack<CigarElement> cigarElementStack = new Stack<CigarElement>(); // Stack to invert cigar string to find ending operator
CigarOperator startingOp = null;
CigarOperator endingOp = null;
// check if it doesn't start with deletions
boolean readHasStarted = false; // search the list of elements for the starting operator
for (CigarElement cigarElement : cigar.getCigarElements()) {
if (!readHasStarted) {
if (cigarElement.getOperator() != CigarOperator.SOFT_CLIP && cigarElement.getOperator() != CigarOperator.HARD_CLIP) {
readHasStarted = true;
startingOp = cigarElement.getOperator();
}
}
cigarElementStack.push(cigarElement);
}
while (!cigarElementStack.empty()) {
CigarElement cigarElement = cigarElementStack.pop();
if (cigarElement.getOperator() != CigarOperator.SOFT_CLIP && cigarElement.getOperator() != CigarOperator.HARD_CLIP) {
endingOp = cigarElement.getOperator();
break;
}
}
if (startingOp != CigarOperator.DELETION && endingOp != CigarOperator.DELETION && startingOp != CigarOperator.SKIPPED_REGION && endingOp != CigarOperator.SKIPPED_REGION)
return true; // we don't accept reads starting or ending in deletions (add any other constraint here)
}
return false;
}
public static final int countRefBasesBasedOnCigar(final GATKSAMRecord read, final int cigarStartIndex, final int cigarEndIndex){
int result = 0;
for(int i = cigarStartIndex; i<cigarEndIndex;i++){
final CigarElement cigarElement = read.getCigar().getCigarElement(i);
switch (cigarElement.getOperator()) {
case M:
case S:
case D:
case N:
case H:
result += cigarElement.getLength();
break;
case I:
break;
default:
throw new ReviewedStingException("Unsupported cigar operator: " + cigarElement.getOperator());
}
}
return result;
}
}

View File

@ -457,21 +457,22 @@ public class ReadUtils {
/**
* Returns the read coordinate corresponding to the requested reference coordinate.
*
* WARNING: if the requested reference coordinate happens to fall inside a deletion in the read, this function
* will return the last read base before the deletion. This function returns a
* Pair(int readCoord, boolean fallsInsideDeletion) so you can choose which readCoordinate to use when faced with
* a deletion.
* WARNING: if the requested reference coordinate happens to fall inside or just before a deletion (or skipped region) in the read, this function
* will return the last read base before the deletion (or skipped region). This function returns a
* Pair(int readCoord, boolean fallsInsideOrJustBeforeDeletionOrSkippedRegion) so you can choose which readCoordinate to use when faced with
* a deletion (or skipped region).
*
* SUGGESTION: Use getReadCoordinateForReferenceCoordinate(GATKSAMRecord, int, ClippingTail) instead to get a
* pre-processed result according to normal clipping needs. Or you can use this function and tailor the
* behavior to your needs.
*
* @param read
* @param refCoord
* @param refCoord the requested reference coordinate
* @return the read coordinate corresponding to the requested reference coordinate. (see warning!)
*/
@Requires({"refCoord >= read.getSoftStart()", "refCoord <= read.getSoftEnd()"})
@Ensures({"result.getFirst() >= 0", "result.getFirst() < read.getReadLength()"})
//TODO since we do not have contracts any more, should we check for the requirements in the method code?
public static Pair<Integer, Boolean> getReadCoordinateForReferenceCoordinate(GATKSAMRecord read, int refCoord) {
return getReadCoordinateForReferenceCoordinate(read.getSoftStart(), read.getCigar(), refCoord, false);
}
@ -479,9 +480,11 @@ public class ReadUtils {
public static Pair<Integer, Boolean> getReadCoordinateForReferenceCoordinate(final int alignmentStart, final Cigar cigar, final int refCoord, final boolean allowGoalNotReached) {
int readBases = 0;
int refBases = 0;
boolean fallsInsideDeletion = false;
boolean fallsInsideDeletionOrSkippedRegion = false;
boolean endJustBeforeDeletionOrSkippedRegion = false;
boolean fallsInsideOrJustBeforeDeletionOrSkippedRegion = false;
int goal = refCoord - alignmentStart; // The goal is to move this many reference bases
final int goal = refCoord - alignmentStart; // The goal is to move this many reference bases
if (goal < 0) {
if (allowGoalNotReached) {
return new Pair<Integer, Boolean>(CLIPPING_GOAL_NOT_REACHED, false);
@ -493,7 +496,7 @@ public class ReadUtils {
Iterator<CigarElement> cigarElementIterator = cigar.getCigarElements().iterator();
while (!goalReached && cigarElementIterator.hasNext()) {
CigarElement cigarElement = cigarElementIterator.next();
final CigarElement cigarElement = cigarElementIterator.next();
int shift = 0;
if (cigarElement.getOperator().consumesReferenceBases() || cigarElement.getOperator() == CigarOperator.SOFT_CLIP) {
@ -511,7 +514,7 @@ public class ReadUtils {
if (goalReached) {
// Is this base's reference position within this cigar element? Or did we use it all?
boolean endsWithinCigar = shift < cigarElement.getLength();
final boolean endsWithinCigar = shift < cigarElement.getLength();
// If it isn't, we need to check the next one. There should *ALWAYS* be a next one
// since we checked if the goal coordinate is within the read length, so this is just a sanity check.
@ -523,13 +526,13 @@ public class ReadUtils {
}
}
CigarElement nextCigarElement;
CigarElement nextCigarElement = null;
// if we end inside the current cigar element, we just have to check if it is a deletion
// if we end inside the current cigar element, we just have to check if it is a deletion (or skipped region)
if (endsWithinCigar)
fallsInsideDeletion = cigarElement.getOperator() == CigarOperator.DELETION;
fallsInsideDeletionOrSkippedRegion = (cigarElement.getOperator() == CigarOperator.DELETION || cigarElement.getOperator() == CigarOperator.SKIPPED_REGION) ;
// if we end outside the current cigar element, we need to check if the next element is an insertion or deletion.
// if we end outside the current cigar element, we need to check if the next element is an insertion, deletion or skipped region.
else {
nextCigarElement = cigarElementIterator.next();
@ -547,22 +550,27 @@ public class ReadUtils {
nextCigarElement = cigarElementIterator.next();
}
// if it's a deletion, we will pass the information on to be handled downstream.
fallsInsideDeletion = nextCigarElement.getOperator() == CigarOperator.DELETION;
// if it's a deletion (or skipped region), we will pass the information on to be handled downstream.
endJustBeforeDeletionOrSkippedRegion = (nextCigarElement.getOperator() == CigarOperator.DELETION || nextCigarElement.getOperator() == CigarOperator.SKIPPED_REGION);
}
// If we reached our goal outside a deletion, add the shift
if (!fallsInsideDeletion && cigarElement.getOperator().consumesReadBases())
fallsInsideOrJustBeforeDeletionOrSkippedRegion = endJustBeforeDeletionOrSkippedRegion || fallsInsideDeletionOrSkippedRegion;
// If we reached our goal outside a deletion (or skipped region), add the shift
if (!fallsInsideOrJustBeforeDeletionOrSkippedRegion && cigarElement.getOperator().consumesReadBases())
readBases += shift;
// If we reached our goal inside a deletion, but the deletion is the next cigar element then we need
// If we reached our goal just before a deletion (or skipped region) we need
// to add the shift of the current cigar element but go back to it's last element to return the last
// base before the deletion (see warning in function contracts)
else if (fallsInsideDeletion && !endsWithinCigar && cigarElement.getOperator().consumesReadBases())
// base before the deletion (or skipped region) (see warning in function contracts)
else if (endJustBeforeDeletionOrSkippedRegion && cigarElement.getOperator().consumesReadBases())
readBases += shift - 1;
// If we reached our goal inside a deletion then we must backtrack to the last base before the deletion
else if (fallsInsideDeletion && endsWithinCigar)
// If we reached our goal inside a deletion (or skipped region), or just between a deletion and a skipped region,
// then we must backtrack to the last base before the deletion (or skipped region)
else if (fallsInsideDeletionOrSkippedRegion ||
(endJustBeforeDeletionOrSkippedRegion && nextCigarElement.getOperator().equals(CigarOperator.N)) ||
(endJustBeforeDeletionOrSkippedRegion && nextCigarElement.getOperator().equals(CigarOperator.D)))
readBases--;
}
}
@ -575,7 +583,7 @@ public class ReadUtils {
}
}
return new Pair<Integer, Boolean>(readBases, fallsInsideDeletion);
return new Pair<Integer, Boolean>(readBases, fallsInsideOrJustBeforeDeletionOrSkippedRegion);
}
/**

View File

@ -31,6 +31,7 @@ import net.sf.samtools.CigarOperator;
import net.sf.samtools.TextCigarCodec;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.sam.ArtificialSAMUtils;
import org.broadinstitute.sting.utils.sam.CigarUtils;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
import org.testng.Assert;
@ -57,11 +58,15 @@ public class ReadClipperTestUtils {
}
public static GATKSAMRecord makeReadFromCigar(String cigarString) {
return makeReadFromCigar(cigarFromString(cigarString));
return makeReadFromCigar(CigarUtils.cigarFromString(cigarString));
}
public static List<Cigar> generateCigarList(int maximumLength) {
return generateCigarList(maximumLength, cigarElements);
}
/**
* This function generates every valid permutation of cigar strings with a given length.
* This function generates every valid permutation of cigar strings (with a given set of cigarElement) with a given length.
*
* A valid cigar object obeys the following rules:
* - No Hard/Soft clips in the middle of the read
@ -72,7 +77,7 @@ public class ReadClipperTestUtils {
* @param maximumLength the maximum number of elements in the cigar
* @return a list with all valid Cigar objects
*/
public static List<Cigar> generateCigarList(int maximumLength) {
public static List<Cigar> generateCigarList(int maximumLength, CigarElement[] cigarElements) {
int numCigarElements = cigarElements.length;
LinkedList<Cigar> cigarList = new LinkedList<Cigar>();
byte [] cigarCombination = new byte[maximumLength];
@ -80,9 +85,9 @@ public class ReadClipperTestUtils {
Utils.fillArrayWithByte(cigarCombination, (byte) 0); // we start off with all 0's in the combination array.
int currentIndex = 0;
while (true) {
Cigar cigar = createCigarFromCombination(cigarCombination); // create the cigar
cigar = combineAdjacentCigarElements(cigar); // combine adjacent elements
if (isCigarValid(cigar)) { // check if it's valid
Cigar cigar = createCigarFromCombination(cigarCombination, cigarElements); // create the cigar
cigar = CigarUtils.combineAdjacentCigarElements(cigar); // combine adjacent elements
if (CigarUtils.isCigarValid(cigar)) { // check if it's valid
cigarList.add(cigar); // add it
}
@ -107,41 +112,7 @@ public class ReadClipperTestUtils {
return cigarList;
}
private static boolean isCigarValid(Cigar cigar) {
if (cigar.isValid(null, -1) == null) { // This should take care of most invalid Cigar Strings (picard's "exhaustive" implementation)
Stack<CigarElement> cigarElementStack = new Stack<CigarElement>(); // Stack to invert cigar string to find ending operator
CigarOperator startingOp = null;
CigarOperator endingOp = null;
// check if it doesn't start with deletions
boolean readHasStarted = false; // search the list of elements for the starting operator
for (CigarElement cigarElement : cigar.getCigarElements()) {
if (!readHasStarted) {
if (cigarElement.getOperator() != CigarOperator.SOFT_CLIP && cigarElement.getOperator() != CigarOperator.HARD_CLIP) {
readHasStarted = true;
startingOp = cigarElement.getOperator();
}
}
cigarElementStack.push(cigarElement);
}
while (!cigarElementStack.empty()) {
CigarElement cigarElement = cigarElementStack.pop();
if (cigarElement.getOperator() != CigarOperator.SOFT_CLIP && cigarElement.getOperator() != CigarOperator.HARD_CLIP) {
endingOp = cigarElement.getOperator();
break;
}
}
if (startingOp != CigarOperator.DELETION && endingOp != CigarOperator.DELETION)
return true; // we don't accept reads starting or ending in deletions (add any other constraint here)
}
return false;
}
private static Cigar createCigarFromCombination(byte[] cigarCombination) {
private static Cigar createCigarFromCombination(byte[] cigarCombination, CigarElement[] cigarElements) {
Cigar cigar = new Cigar();
for (byte i : cigarCombination) {
cigar.add(cigarElements[i]);
@ -149,38 +120,6 @@ public class ReadClipperTestUtils {
return cigar;
}
/**
* Combines equal adjacent elements of a Cigar object
*
* @param rawCigar the cigar object
* @return a combined cigar object
*/
private static Cigar combineAdjacentCigarElements(Cigar rawCigar) {
Cigar combinedCigar = new Cigar();
CigarElement lastElement = null;
int lastElementLength = 0;
for (CigarElement cigarElement : rawCigar.getCigarElements()) {
if (lastElement != null &&
((lastElement.getOperator() == cigarElement.getOperator()) ||
(lastElement.getOperator() == CigarOperator.I && cigarElement.getOperator() == CigarOperator.D) ||
(lastElement.getOperator() == CigarOperator.D && cigarElement.getOperator() == CigarOperator.I)))
lastElementLength += cigarElement.getLength();
else
{
if (lastElement != null)
combinedCigar.add(new CigarElement(lastElementLength, lastElement.getOperator()));
lastElement = cigarElement;
lastElementLength = cigarElement.getLength();
}
}
if (lastElement != null)
combinedCigar.add(new CigarElement(lastElementLength, lastElement.getOperator()));
return combinedCigar;
}
public static GATKSAMRecord makeRead() {
return ArtificialSAMUtils.createArtificialRead(BASES, QUALS, CIGAR);
}
@ -201,34 +140,5 @@ public class ReadClipperTestUtils {
// Otherwise test if they're both empty
else
Assert.assertEquals(actual.isEmpty(), expected.isEmpty());
}
public static Cigar invertCigar (Cigar cigar) {
Stack<CigarElement> cigarStack = new Stack<CigarElement>();
for (CigarElement cigarElement : cigar.getCigarElements())
cigarStack.push(cigarElement);
Cigar invertedCigar = new Cigar();
while (!cigarStack.isEmpty())
invertedCigar.add(cigarStack.pop());
return invertedCigar;
}
/**
* Checks whether or not the read has any cigar element that is not H or S
*
* @param read the read
* @return true if it has any M, I or D, false otherwise
*/
public static boolean readHasNonClippedBases(GATKSAMRecord read) {
for (CigarElement cigarElement : read.getCigar().getCigarElements())
if (cigarElement.getOperator() != CigarOperator.SOFT_CLIP && cigarElement.getOperator() != CigarOperator.HARD_CLIP)
return true;
return false;
}
public static Cigar cigarFromString(String cigarString) {
return TextCigarCodec.getSingleton().decode(cigarString);
}
}
}

View File

@ -30,6 +30,7 @@ import net.sf.samtools.CigarElement;
import net.sf.samtools.CigarOperator;
import org.broadinstitute.sting.BaseTest;
import org.broadinstitute.sting.utils.Utils;
import org.broadinstitute.sting.utils.sam.CigarUtils;
import org.broadinstitute.sting.utils.sam.GATKSAMRecord;
import org.testng.Assert;
import org.testng.annotations.BeforeClass;
@ -241,7 +242,7 @@ public class ReadClipperUnitTest extends BaseTest {
int expectedLength = read.getReadLength() - leadingCigarElementLength(read.getCigar(), CigarOperator.INSERTION);
if (cigarHasElementsDifferentThanInsertionsAndHardClips(read.getCigar()))
expectedLength -= leadingCigarElementLength(ReadClipperTestUtils.invertCigar(read.getCigar()), CigarOperator.INSERTION);
expectedLength -= leadingCigarElementLength(CigarUtils.invertCigar(read.getCigar()), CigarOperator.INSERTION);
if (!clippedRead.isEmpty()) {
Assert.assertEquals(expectedLength, clippedRead.getReadLength(), String.format("%s -> %s", read.getCigarString(), clippedRead.getCigarString())); // check that everything else is still there
@ -256,7 +257,7 @@ public class ReadClipperUnitTest extends BaseTest {
public void testRevertSoftClippedBases() {
for (Cigar cigar : cigarList) {
final int leadingSoftClips = leadingCigarElementLength(cigar, CigarOperator.SOFT_CLIP);
final int tailSoftClips = leadingCigarElementLength(ReadClipperTestUtils.invertCigar(cigar), CigarOperator.SOFT_CLIP);
final int tailSoftClips = leadingCigarElementLength(CigarUtils.invertCigar(cigar), CigarOperator.SOFT_CLIP);
final GATKSAMRecord read = ReadClipperTestUtils.makeReadFromCigar(cigar);
final GATKSAMRecord unclipped = ReadClipper.revertSoftClippedBases(read);
@ -278,7 +279,7 @@ public class ReadClipperUnitTest extends BaseTest {
public void testRevertSoftClippedBasesWithThreshold() {
for (Cigar cigar : cigarList) {
final int leadingSoftClips = leadingCigarElementLength(cigar, CigarOperator.SOFT_CLIP);
final int tailSoftClips = leadingCigarElementLength(ReadClipperTestUtils.invertCigar(cigar), CigarOperator.SOFT_CLIP);
final int tailSoftClips = leadingCigarElementLength(CigarUtils.invertCigar(cigar), CigarOperator.SOFT_CLIP);
final GATKSAMRecord read = ReadClipperTestUtils.makeReadFromCigar(cigar);
final GATKSAMRecord unclipped = ReadClipper.revertSoftClippedBases(read);
@ -348,7 +349,7 @@ public class ReadClipperUnitTest extends BaseTest {
* @param clipped clipped read
*/
private void assertUnclippedLimits(GATKSAMRecord original, GATKSAMRecord clipped) {
if (ReadClipperTestUtils.readHasNonClippedBases(clipped)) {
if (CigarUtils.readHasNonClippedBases(clipped)) {
Assert.assertEquals(original.getUnclippedStart(), clipped.getUnclippedStart());
Assert.assertEquals(original.getUnclippedEnd(), clipped.getUnclippedEnd());
}

View File

@ -220,7 +220,7 @@ public class ReadUtilsUnitTest extends BaseTest {
}
@Test (enabled = true)
public void testReadWithNs() throws FileNotFoundException {
public void testReadWithNsRefIndexInDeletion() throws FileNotFoundException {
final IndexedFastaSequenceFile seq = new CachingIndexedFastaSequenceFile(new File(b37KGReference));
final SAMFileHeader header = ArtificialSAMUtils.createArtificialSamHeader(seq.getSequenceDictionary());
@ -232,6 +232,22 @@ public class ReadUtilsUnitTest extends BaseTest {
read.setCigarString("3M414N1D73M");
final int result = ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, 9392, ReadUtils.ClippingTail.LEFT_TAIL);
Assert.assertEquals(result, 2);
}
@Test (enabled = true)
public void testReadWithNsRefAfterDeletion() throws FileNotFoundException {
final IndexedFastaSequenceFile seq = new CachingIndexedFastaSequenceFile(new File(b37KGReference));
final SAMFileHeader header = ArtificialSAMUtils.createArtificialSamHeader(seq.getSequenceDictionary());
final int readLength = 76;
final GATKSAMRecord read = ArtificialSAMUtils.createArtificialRead(header, "myRead", 0, 8975, readLength);
read.setReadBases(Utils.dupBytes((byte) 'A', readLength));
read.setBaseQualities(Utils.dupBytes((byte)30, readLength));
read.setCigarString("3M414N1D73M");
final int result = ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, 9393, ReadUtils.ClippingTail.LEFT_TAIL);
Assert.assertEquals(result, 3);
}