/*
 * Copyright (c) 2010 The Broad Institute
 *
 * Permission is hereby granted, free of charge, to any person
 * obtaining a copy of this software and associated documentation
 * files (the "Software"), to deal in the Software without
 * restriction, including without limitation the rights to use,
 * copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following
 * conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
 * THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 */
package org.broadinstitute.sting.utils;

import com.google.java.contract.Ensures;
import com.google.java.contract.Invariant;
import com.google.java.contract.Requires;
import com.google.java.contract.ThrowEnsures;
import net.sf.picard.reference.ReferenceSequenceFile;
import net.sf.samtools.SAMRecord;
import net.sf.samtools.SAMSequenceDictionary;
import net.sf.samtools.SAMSequenceRecord;
import org.apache.log4j.Logger;
import org.broad.tribble.Feature;
import org.broadinstitute.sting.utils.codecs.vcf.VCFConstants;
import org.broadinstitute.sting.utils.exceptions.ReviewedStingException;
import org.broadinstitute.sting.utils.exceptions.UserException;
import org.broadinstitute.sting.utils.variantcontext.VariantContext;

/**
 * Factory class for creating GenomeLocs
 *
 * Change notes (2011-05-20):
 * Contracts for Java are now written for GenomeLoc and GenomeLocParser. The semantics of GenomeLoc
 * are now much clearer. It is no longer possible to create invalid GenomeLocs: they can only be
 * created with a well-formed start, end, and contig with respect to the master dictionary. Where one
 * previously created an invalid GenomeLoc and then asked whether it was valid, one must now pass the
 * raw arguments to helper functions to assess this. Providing bad arguments to GenomeLoc now generates
 * UserExceptions. Added utility functions contigIsInDictionary and indexIsInDictionary to help with this.
 * Refactored several interval utilities from GenomeLocParser to IntervalUtils, where one might expect them to live.
 * Removed the GenomeLoc.clone() method, as it was not correctly implemented and is unnecessary because
 * GenomeLocs are immutable. Several iterator classes have changed to remove their use of clone().
 * Removed misc. unnecessary imports.
 * Temporarily disabled the validating pileup integration test, as it uses reads mapped to a different
 * reference sequence for ecoli, which no longer satisfies the contracts for GenomeLoc.
 * git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5827 348d0f76-0448-11de-a6fe-93d51630548a
 */
@Invariant({
        "logger != null",
        "contigInfo != null"})
public class GenomeLocParser {
    private static Logger logger = Logger.getLogger(GenomeLocParser.class);

    // --------------------------------------------------------------------------------------------------------------
    //
    // Ugly global variable defining the optional ordering of contig elements
    //
    // --------------------------------------------------------------------------------------------------------------
    private final MasterSequenceDictionary contigInfo;

    /**
     * A wrapper class that provides efficient last used caching for the global
     * SAMSequenceDictionary underlying all of the GATK engine capabilities
     */
    // todo -- enable when CoFoJa developers identify the problem (likely thread unsafe invariants)
    // @Invariant({
    //         "dict != null",
    //         "dict.size() > 0",
    //         "lastSSR == null || dict.getSequence(lastContig).getSequenceIndex() == lastIndex",
    //         "lastSSR == null || dict.getSequence(lastContig).getSequenceName() == lastContig",
    //         "lastSSR == null || dict.getSequence(lastContig) == lastSSR"})
    private final class MasterSequenceDictionary {
        final private SAMSequenceDictionary dict;

        // cache
        SAMSequenceRecord lastSSR = null;
        String lastContig = "";
        int lastIndex = -1;

        @Requires({"dict != null", "dict.size() > 0"})
        public MasterSequenceDictionary(SAMSequenceDictionary dict) {
            this.dict = dict;
        }

        @Ensures("result > 0")
        public final int getNSequences() {
            return dict.size();
        }

        @Requires("contig != null")
        public synchronized boolean hasContig(final String contig) {
            return lastContig == contig || dict.getSequence(contig) != null;
        }

        @Requires("index >= 0")
        public synchronized boolean hasContig(final int index) {
            return lastIndex == index || dict.getSequence(index) != null;
        }

        @Requires("contig != null")
        @Ensures("result != null")
        public synchronized final SAMSequenceRecord getSequence(final String contig) {
            if (isCached(contig))
                return lastSSR;
            else
                return updateCache(contig, -1);
        }

        @Requires("index >= 0")
        @Ensures("result != null")
        public synchronized final SAMSequenceRecord getSequence(final int index) {
            if (isCached(index))
                return lastSSR;
            else
                return updateCache(null, index);
        }

        @Requires("contig != null")
        @Ensures("result >= 0")
        public synchronized final int getSequenceIndex(final String contig) {
            if (!isCached(contig)) {
                updateCache(contig, -1);
            }

            return lastIndex;
        }

        @Requires({"contig != null", "lastContig != null"})
        private synchronized boolean isCached(final String contig) {
            return lastContig.equals(contig);
        }

        @Requires({"lastIndex != -1", "index >= 0"})
        private synchronized boolean isCached(final int index) {
            return lastIndex == index;
        }

        /**
         * The key algorithm.  Given a new record, update the last used record, contig
         * name, and index.
         *
         * @param contig the contig name to look up, or null to look up by index instead
         * @param index the contig index to look up; only consulted when contig is null
         * @return the sequence record now held in the cache; never null
         */
        @Requires("contig != null || index >= 0")
        @Ensures("result != null")
        private synchronized SAMSequenceRecord updateCache(final String contig, int index) {
            SAMSequenceRecord rec = contig == null ? dict.getSequence(index) : dict.getSequence(contig);
            if (rec == null) {
                throw new ReviewedStingException("BUG: requested unknown contig=" + contig + " index=" + index);
            } else {
                lastSSR = rec;
                lastContig = rec.getSequenceName();
                lastIndex = rec.getSequenceIndex();
                return rec;
            }
        }
}
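
/*
 * Illustration only, not part of GenomeLocParser: a minimal, self-contained sketch of the
 * last-used caching idea that MasterSequenceDictionary applies above. It assumes a plain
 * list of contig names in place of SAMSequenceDictionary; the class LastUsedContigCache and
 * its methods indexOf and missCount are hypothetical names introduced for this sketch.
 *
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for MasterSequenceDictionary; not part of GATK.
final class LastUsedContigCache {
    private final List<String> names;
    private final Map<String, Integer> indexByName = new HashMap<>();

    // Single-entry cache: the most recently resolved contig and its index.
    private String lastContig = "";
    private int lastIndex = -1;

    // Counts how often the backing dictionary had to be consulted.
    private int misses = 0;

    LastUsedContigCache(final List<String> contigNames) {
        if (contigNames == null || contigNames.isEmpty())
            throw new IllegalArgumentException("dictionary must contain at least one contig");
        this.names = new ArrayList<>(contigNames);
        for (int i = 0; i < names.size(); i++)
            indexByName.put(names.get(i), i);
    }

    // Returns the index of the contig, consulting the dictionary only on a cache miss.
    synchronized int indexOf(final String contig) {
        if (!lastContig.equals(contig))
            updateCache(contig);
        return lastIndex;
    }

    synchronized int missCount() {
        return misses;
    }

    // On a miss, look the contig up once and remember the result for the next call.
    private void updateCache(final String contig) {
        final Integer index = indexByName.get(contig);
        if (index == null)
            throw new IllegalArgumentException("unknown contig=" + contig);
        misses++;
        lastContig = contig;
        lastIndex = index;
    }
}
```
 *
 * Repeated lookups of the same contig, the common case when walking sorted reads, touch
 * only the cached fields; the dictionary is consulted again only when the contig changes.
 */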

    /**
     * set our internal reference contig order
     * @param refFile the reference file
     */
    @Requires("refFile != null")
    public GenomeLocParser(final ReferenceSequenceFile refFile) {
        this(refFile.getSequenceDictionary());
    }

    public GenomeLocParser(SAMSequenceDictionary seqDict) {
        if (seqDict == null) { // we couldn't load the reference dictionary
            //logger.info("Failed to load reference dictionary, falling back to lexicographic order for contigs");
            throw new UserException.CommandLineException("Failed to load reference dictionary");
        }

        contigInfo = new MasterSequenceDictionary(seqDict);
        logger.debug(String.format("Prepared reference sequence contig dictionary"));
        for (SAMSequenceRecord contig : seqDict.getSequences()) {
            logger.debug(String.format(" %s (%d bp)", contig.getSequenceName(), contig.getSequenceLength()));
        }
    }

2011-05-20 23:43:27 +08:00
/**
* Determines whether the given contig is valid with respect to the sequence dictionary
* already installed in this GenomeLocParser.
*
* @param contig the contig name to look up; may be null
*
* @return True if the contig is non-null and present in the dictionary. False otherwise.
*/
public boolean contigIsInDictionary ( String contig ) {
2011-05-21 10:01:59 +08:00
return contig != null && contigInfo . hasContig ( contig ) ;
2011-05-20 23:43:27 +08:00
}
public boolean indexIsInDictionary ( final int index ) {
2011-05-21 10:01:59 +08:00
return index >= 0 && contigInfo . hasContig ( index ) ;
}
/**
* get the contig's SAMSequenceRecord
*
* @param contig the string name of the contig
*
* @return the sam sequence record
*/
@Ensures ( "result != null" )
@ThrowEnsures ( { "UserException.MalformedGenomeLoc" , "!contigIsInDictionary(contig) || contig == null" } )
public SAMSequenceRecord getContigInfo ( final String contig ) {
if ( contig == null || ! contigIsInDictionary ( contig ) )
throw new UserException . MalformedGenomeLoc ( String . format ( "Contig %s given as location, but this contig isn't present in the Fasta sequence dictionary" , contig ) ) ;
return contigInfo . getSequence ( contig ) ;
2011-05-20 23:43:27 +08:00
}
2009-06-22 22:39:41 +08:00
/**
* Returns the contig index of a specified string version of the contig
*
* @param contig the contig string
*
* @return the contig index, always >= 0
* @throws UserException.MalformedGenomeLoc if the contig is null or not in the sequence dictionary
*/
2011-05-20 23:43:27 +08:00
@Ensures ( "result >= 0" )
2011-05-21 10:01:59 +08:00
@ThrowEnsures ( { "UserException.MalformedGenomeLoc" , "!contigIsInDictionary(contig) || contig == null" } )
2011-05-20 23:43:27 +08:00
public int getContigIndex ( final String contig ) {
2011-05-21 10:01:59 +08:00
return getContigInfo ( contig ) . getSequenceIndex ( ) ;
2011-05-20 23:43:27 +08:00
}
@Requires ( "contig != null" )
protected int getContigIndexWithoutException ( final String contig ) {
2011-05-21 10:01:59 +08:00
if ( contig == null || ! contigInfo . hasContig ( contig ) )
2011-05-20 23:43:27 +08:00
return - 1 ;
return contigInfo . getSequenceIndex ( contig ) ;
2009-06-22 22:39:41 +08:00
}
2011-05-21 10:01:59 +08:00
// --------------------------------------------------------------------------------------------------------------
//
// Low-level creation functions
//
// --------------------------------------------------------------------------------------------------------------
2010-06-11 04:54:36 +08:00
/**
2011-05-21 10:01:59 +08:00
* create a genome loc, given the contig name, start, and stop
2010-04-01 20:47:48 +08:00
*
2011-05-21 10:01:59 +08:00
* @param contig the contig name
* @param start the starting position
* @param stop the stop position
2010-04-01 20:47:48 +08:00
*
2011-05-21 10:01:59 +08:00
* @return a new genome loc
*/
@Ensures ( "result != null" )
@ThrowEnsures ( { "UserException.MalformedGenomeLoc" , "!isValidGenomeLoc(contig, start, stop)" } )
public GenomeLoc createGenomeLoc ( String contig , final int start , final int stop ) {
return createGenomeLoc ( contig , getContigIndex ( contig ) , start , stop ) ;
}
public GenomeLoc createGenomeLoc ( String contig , final int start , final int stop , boolean mustBeOnReference ) {
return createGenomeLoc ( contig , getContigIndex ( contig ) , start , stop , mustBeOnReference ) ;
}
@ThrowEnsures ( { "UserException.MalformedGenomeLoc" , "!isValidGenomeLoc(contig, start, stop, false)" } )
public GenomeLoc createGenomeLoc ( String contig , int index , final int start , final int stop ) {
return createGenomeLoc ( contig , index , start , stop , false ) ;
}
@ThrowEnsures ( { "UserException.MalformedGenomeLoc" , "!isValidGenomeLoc(contig, start, stop,mustBeOnReference)" } )
public GenomeLoc createGenomeLoc ( String contig , int index , final int start , final int stop , boolean mustBeOnReference ) {
validateGenomeLoc ( contig , index , start , stop , mustBeOnReference , true ) ;
return new GenomeLoc ( contig , index , start , stop ) ;
}
/**
* Validates a position or interval on the genome.
2010-04-01 20:47:48 +08:00
*
2011-05-21 10:01:59 +08:00
* Requires that contig exist in the master sequence dictionary, and that contig index be valid as well. Requires
* that start <= stop.
*
* If mustBeOnReference is true, also performs boundary validation for genome loc INTERVALS:
* start and stop must be non-negative and must not exceed the contig length.
2010-04-01 20:47:48 +08:00
*
2011-05-21 10:01:59 +08:00
* @param contig the contig name
* @param contigIndex the index of the contig in the master sequence dictionary
* @param start the start position
* @param stop the stop position
* @param mustBeOnReference if true, start and stop must lie within the contig
* @param exceptOnError if true, throw a UserException.MalformedGenomeLoc instead of returning false
*
* @return true if it's valid, false otherwise. If exceptOnError, then throws a UserException if invalid
2009-09-22 09:32:35 +08:00
*/
2011-05-21 10:01:59 +08:00
private boolean validateGenomeLoc ( String contig , int contigIndex , int start , int stop , boolean mustBeOnReference , boolean exceptOnError ) {
if ( ! contigInfo . hasContig ( contig ) )
return vglHelper ( exceptOnError , String . format ( "Unknown contig %s" , contig ) ) ;
if ( stop < start )
return vglHelper ( exceptOnError , String . format ( "The stop position %d is less than start %d" , stop , start ) ) ;
if ( contigIndex < 0 )
return vglHelper ( exceptOnError , String . format ( "The contig index %d is less than 0" , contigIndex ) ) ;
if ( contigIndex >= contigInfo . getNSequences ( ) )
return vglHelper ( exceptOnError , String . format ( "The contig index %d is greater than the stored sequence count (%d)" , contigIndex , contigInfo . getNSequences ( ) ) ) ;
if ( mustBeOnReference ) {
if ( start < 0 )
return vglHelper ( exceptOnError , String . format ( "The start position %d is less than 0" , start ) ) ;
if ( stop < 0 )
return vglHelper ( exceptOnError , String . format ( "The stop position %d is less than 0" , stop ) ) ;
int contigSize = contigInfo . getSequence ( contigIndex ) . getSequenceLength ( ) ;
if ( start > contigSize || stop > contigSize )
return vglHelper ( exceptOnError , String . format ( "The genome loc coordinates %d-%d exceed the contig size (%d)" , start , stop , contigSize ) ) ;
}
// we passed
return true ;
}
public boolean isValidGenomeLoc ( String contig , int start , int stop , boolean mustBeOnReference ) {
return validateGenomeLoc ( contig , getContigIndexWithoutException ( contig ) , start , stop , mustBeOnReference , false ) ;
}
public boolean isValidGenomeLoc ( String contig , int start , int stop ) {
return validateGenomeLoc ( contig , getContigIndexWithoutException ( contig ) , start , stop , true , false ) ;
}
private boolean vglHelper ( boolean exceptOnError , String msg ) {
if ( exceptOnError )
throw new UserException . MalformedGenomeLoc ( "Parameters to GenomeLocParser are incorrect: " + msg ) ;
else
return false ;
}
// --------------------------------------------------------------------------------------------------------------
//
// Parsing genome locs
//
// --------------------------------------------------------------------------------------------------------------
2009-09-22 09:32:35 +08:00
2009-06-22 22:39:41 +08:00
/**
2011-05-21 10:01:59 +08:00
* parse a genome interval from a location string
2010-06-11 04:54:36 +08:00
*
2011-05-21 10:01:59 +08:00
* Performs interval-style validation:
2009-06-22 22:39:41 +08:00
*
2011-05-21 10:01:59 +08:00
* contig is valid; start and stop do not exceed the contig length; start <= stop, and start / stop are on the contig
2009-06-22 22:39:41 +08:00
* @param str the string to parse
*
* @return a GenomeLoc representing the String
2010-04-01 20:47:48 +08:00
*
2009-06-22 22:39:41 +08:00
*/
2011-05-20 23:43:27 +08:00
@Requires ( "str != null" )
@Ensures ( "result != null" )
2010-11-11 01:59:50 +08:00
public GenomeLoc parseGenomeLoc ( final String str ) {
2009-06-22 22:39:41 +08:00
// 'chr2', 'chr2:1000000' or 'chr2:1,000,000-2,000,000'
//System.out.printf("Parsing location '%s'%n", str);
2010-06-11 04:54:36 +08:00
2009-06-22 22:39:41 +08:00
String contig = null ;
2010-11-11 01:59:50 +08:00
int start = 1 ;
int stop = - 1 ;
2010-06-11 04:54:36 +08:00
final int colonIndex = str . indexOf ( ":" ) ;
if ( colonIndex == -1 ) {
contig = str . substring ( 0 , str . length ( ) ) ; // chr1
stop = Integer . MAX_VALUE ;
} else {
contig = str . substring ( 0 , colonIndex ) ;
final int dashIndex = str . indexOf ( '-' , colonIndex ) ;
try {
if ( dashIndex == -1 ) {
if ( str . charAt ( str . length ( ) - 1 ) == '+' ) {
start = parsePosition ( str . substring ( colonIndex + 1 , str . length ( ) - 1 ) ) ; // chr1:1+
stop = Integer . MAX_VALUE ;
} else {
start = parsePosition ( str . substring ( colonIndex + 1 ) ) ; // chr1:1
stop = start ;
}
} else {
start = parsePosition ( str . substring ( colonIndex + 1 , dashIndex ) ) ; // chr1:1-1
stop = parsePosition ( str . substring ( dashIndex + 1 ) ) ;
2010-03-11 00:25:16 +08:00
}
2010-06-11 04:54:36 +08:00
} catch ( Exception e ) {
2010-09-12 22:02:43 +08:00
throw new UserException ( "Failed to parse Genome Location string: " + str , e ) ;
2010-03-11 00:25:16 +08:00
}
2009-06-22 22:39:41 +08:00
}
2010-04-01 20:47:48 +08:00
// is the contig valid?
2011-05-20 23:43:27 +08:00
if ( ! contigIsInDictionary ( contig ) )
2011-05-21 10:01:59 +08:00
throw new UserException . MalformedGenomeLoc ( "Contig '" + contig + "' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?" ) ;
2009-09-22 06:37:47 +08:00
2010-11-11 01:59:50 +08:00
if ( stop == Integer . MAX_VALUE )
2009-06-22 22:39:41 +08:00
// look up the actual stop position!
stop = getContigInfo ( contig ) . getSequenceLength ( ) ;
2011-05-21 10:01:59 +08:00
return createGenomeLoc ( contig , getContigIndex ( contig ) , start , stop , true ) ;
2009-06-22 22:39:41 +08:00
}
2010-06-11 04:54:36 +08:00
/**
* Parses a number like 1,000,000 into an int.
* @param pos the position string; may contain commas as thousands separators
* @return the parsed position
*/
2011-05-20 23:43:27 +08:00
@Requires ( "pos != null" )
@Ensures ( "result >= 0" )
2010-11-11 01:59:50 +08:00
private int parsePosition ( final String pos ) {
2011-05-20 23:43:27 +08:00
if ( pos . indexOf ( '-' ) != -1 ) {
throw new NumberFormatException ( "Position: '" + pos + "' can't contain '-'." ) ;
}
2010-06-11 04:54:36 +08:00
if ( pos . indexOf ( ',' ) != -1 ) {
final StringBuilder buffer = new StringBuilder ( ) ;
for ( int i = 0 ; i < pos . length ( ) ; i ++ ) {
final char c = pos . charAt ( i ) ;
if ( c == ',' ) {
continue ;
} else if ( c < '0' || c > '9' ) {
throw new NumberFormatException ( "Position: '" + pos + "' contains invalid chars." ) ;
2010-11-11 01:59:50 +08:00
} else {
2010-06-11 04:54:36 +08:00
buffer . append ( c ) ;
}
}
2010-11-11 01:59:50 +08:00
return Integer . parseInt ( buffer . toString ( ) ) ;
2010-06-11 04:54:36 +08:00
} else {
2010-11-11 01:59:50 +08:00
return Integer . parseInt ( pos ) ;
2010-06-11 04:54:36 +08:00
}
2009-06-22 22:39:41 +08:00
}
2011-05-21 10:01:59 +08:00
// --------------------------------------------------------------------------------------------------------------
//
// Creating genome locs from reads, features, and variant contexts
//
// --------------------------------------------------------------------------------------------------------------
2009-06-22 22:39:41 +08:00
/**
2011-05-20 23:43:27 +08:00
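* create a genome loc, given a read. If the read is mapped, a GenomeLoc spanning its alignment start to
* alignment end is returned. If the read is unmapped, *and* yet the read has a contig and start position,
* then a GenomeLoc is returned for contig:start-start; otherwise the UNMAPPED GenomeLoc is returned.
2009-06-22 22:39:41 +08:00
*
* @param read the read to create a location for
*
* @return the location of the read, or GenomeLoc.UNMAPPED if the read is unmapped and not placed on the genome
*/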
2011-05-21 10:01:59 +08:00
@Requires ( "read != null" )
@Ensures ( "result != null" )
2010-11-11 01:59:50 +08:00
public GenomeLoc createGenomeLoc ( final SAMRecord read ) {
2011-05-20 23:43:27 +08:00
if ( read . getReadUnmappedFlag ( ) && read . getReferenceIndex ( ) == -1 )
// read is unmapped and not placed anywhere on the genome
2011-05-21 10:01:59 +08:00
return GenomeLoc . UNMAPPED ;
2011-05-20 23:43:27 +08:00
else {
2011-06-04 02:06:41 +08:00
// Use Math.max to ensure that end >= start (Picard assigns the end to reads that are entirely within an insertion as start-1)
2011-06-03 04:40:56 +08:00
int end = read . getReadUnmappedFlag ( ) ? read . getAlignmentStart ( ) : Math . max ( read . getAlignmentEnd ( ) , read . getAlignmentStart ( ) ) ;
2011-05-21 10:01:59 +08:00
return createGenomeLoc ( read . getReferenceName ( ) , read . getReferenceIndex ( ) , read . getAlignmentStart ( ) , end , false ) ;
2011-05-20 23:43:27 +08:00
}
2009-06-22 22:39:41 +08:00
}
2011-08-04 04:04:51 +08:00
/**
* Creates a GenomeLoc from a Tribble feature
* @param feature the Tribble feature to convert
* @return a GenomeLoc spanning the feature's chr, start, and end
*/
public GenomeLoc createGenomeLoc ( final Feature feature ) {
return createGenomeLoc ( feature . getChr ( ) , feature . getStart ( ) , feature . getEnd ( ) ) ;
}
2011-11-10 23:58:40 +08:00
/**
* Creates a GenomeLoc corresponding to the variant context vc. If includeSymbolicEndIfPossible
* is true, and vc is a symbolic allele, the end of the created genome loc will be the value
* of the END info field key, if it exists, or vc.getEnd() if not.
*
* @param vc the variant context
* @param includeSymbolicEndIfPossible if true, use the END info field for symbolic alleles when present
* @return the genome loc spanning vc
*/
public GenomeLoc createGenomeLoc ( final VariantContext vc , boolean includeSymbolicEndIfPossible ) {
if ( includeSymbolicEndIfPossible && vc . isSymbolic ( ) ) {
int end = vc . getAttributeAsInt ( VCFConstants . END_KEY , vc . getEnd ( ) ) ;
return createGenomeLoc ( vc . getChr ( ) , vc . getStart ( ) , end ) ;
}
else
return createGenomeLoc ( vc . getChr ( ) , vc . getStart ( ) , vc . getEnd ( ) ) ;
}
public GenomeLoc createGenomeLoc ( final VariantContext vc ) {
return createGenomeLoc ( vc , false ) ;
}
2009-06-22 22:39:41 +08:00
/**
2011-05-21 10:01:59 +08:00
 * Creates a new genome loc, given the contig name and a single position. Must be on the reference.
2009-06-22 22:39:41 +08:00
 *
 * @param contig the contig name
 * @param pos the position
 *
 * @return a genome loc representing a single base at the specified position on the contig
 */
2011-05-21 10:01:59 +08:00
@Ensures ( "result != null" )
@ThrowEnsures ( { "UserException.MalformedGenomeLoc" , "!isValidGenomeLoc(contig, pos, pos, true)" } )
2010-11-11 01:59:50 +08:00
public GenomeLoc createGenomeLoc ( final String contig , final int pos ) {
2011-05-21 10:01:59 +08:00
return createGenomeLoc ( contig , getContigIndex ( contig ) , pos , pos ) ;
2009-06-22 22:39:41 +08:00
}
2009-07-01 03:17:24 +08:00
/**
 * Creates a new genome loc from an existing loc, with a new start position.
2010-03-18 03:39:30 +08:00
 * Note that this function will NOT explicitly check the ending offset, in case someone wants to
 * set the start of a new GenomeLoc pertaining to a read that goes off the end of the contig.
2009-07-01 03:17:24 +08:00
 *
 * @param loc the old location
 * @param start a new start position
 *
 * @return the newly created genome loc
 */
2010-11-11 01:59:50 +08:00
public GenomeLoc setStart ( GenomeLoc loc , int start ) {
2011-05-21 10:01:59 +08:00
return createGenomeLoc ( loc . getContig ( ) , loc . getContigIndex ( ) , start , loc . getStop ( ) ) ;
2009-07-01 03:17:24 +08:00
}
/**
 * Creates a new genome loc from an existing loc, with a new stop position.
2010-03-18 03:39:30 +08:00
 * Note that this function will NOT explicitly check the ending offset, in case someone wants to
 * set the stop of a new GenomeLoc pertaining to a read that goes off the end of the contig.
2009-07-01 03:17:24 +08:00
 *
 * @param loc the old location
 * @param stop a new stop position
 *
 * @return the newly created genome loc
 */
2010-11-11 01:59:50 +08:00
public GenomeLoc setStop ( GenomeLoc loc , int stop ) {
2011-05-21 10:01:59 +08:00
return createGenomeLoc ( loc . getContig ( ) , loc . getContigIndex ( ) , loc . start , stop ) ;
2009-07-01 03:17:24 +08:00
}
/**
 * Returns a new genome loc with its start and stop shifted forward by one base.
2009-08-21 22:40:57 +08:00
 *
2009-07-01 03:17:24 +08:00
 * @param loc the old location
2009-08-21 22:40:57 +08:00
 *
2009-07-01 03:17:24 +08:00
 * @return a new genome loc
2009-06-22 22:39:41 +08:00
 */
2010-11-11 01:59:50 +08:00
public GenomeLoc incPos ( GenomeLoc loc ) {
2009-07-01 03:17:24 +08:00
return incPos ( loc , 1 ) ;
}
/**
 * Returns a new genome loc with its start and stop shifted by the given amount.
2009-08-21 22:40:57 +08:00
 *
2009-07-01 03:17:24 +08:00
 * @param loc the old location
2009-08-21 22:40:57 +08:00
 * @param by the number of bases by which to shift both the start and the stop
 *
2009-07-01 03:17:24 +08:00
 * @return a new genome loc
 */
2010-11-11 01:59:50 +08:00
public GenomeLoc incPos ( GenomeLoc loc , int by ) {
2011-05-21 10:01:59 +08:00
return createGenomeLoc ( loc . getContig ( ) , loc . getContigIndex ( ) , loc . start + by , loc . stop + by ) ;
2009-07-01 03:17:24 +08:00
}
/**
2010-11-11 01:59:50 +08:00
 * Creates a GenomeLoc that spans the entire contig.
 * @param contigName Name of the contig.
 * @return A locus spanning the entire contig.
2009-07-01 03:17:24 +08:00
 */
2011-05-20 23:43:27 +08:00
@Requires ( "contigName != null" )
@Ensures ( "result != null" )
2010-11-11 01:59:50 +08:00
public GenomeLoc createOverEntireContig ( String contigName ) {
SAMSequenceRecord contig = contigInfo . getSequence ( contigName ) ;
2011-05-21 10:01:59 +08:00
return createGenomeLoc ( contigName , contig . getSequenceIndex ( ) , 1 , contig . getSequenceLength ( ) , true ) ;
}
2009-06-22 22:39:41 +08:00
}