Commit Graph

80 Commits (2cd33b2f1f20abc481e7ab77d9a6650933a6e249)

Author SHA1 Message Date
Eric Banks 2799a1b686 Catch exception for bad type and throw as a TribbleException 2012-01-25 12:15:51 -05:00
Mark DePristo 763c81d520 No longer enforce MAX_ALLELE_SIZE in VCF codec
-- Instead issue a warning when a large (>1MB) record is encountered
-- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()
2012-01-18 07:35:11 -05:00
Mark DePristo 0c7865fdb5 UnitTest for reverseAlleleClipping
-- No code modified yet, just implementing a unit test to ensure correctness of the existing code
2012-01-18 07:35:11 -05:00
Mark DePristo 62801e430a Bugfix for unnecessary optimization
-- don't cache the ref bytes
2012-01-17 16:40:26 -05:00
Mark DePristo f2b0575dee Detect unreasonably large allele strings (>2^16) and throw an error
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele.  Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
2012-01-17 16:40:26 -05:00
Mark DePristo b06074d6e7 Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe
-- Uses PriorityBlockingQueue instead of PriorityQueue
-- synchronized keywords added to all key functions that modify internal state

Note that this hasn't been tested extensivesly.  Based on report:

http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic
2012-01-13 09:33:16 -05:00
Christopher Hartl 69661da37d Moving ValidationSiteSelector to validation package in public under my ownership. JunctionGenotyper added and modified several times, this commit is due to merging conflix fixes. 2011-12-19 10:57:28 -05:00
Eric Banks 9497e9492c Bug fix for complex records: do not ever reverse clip out a complete allele. 2011-12-14 11:21:28 -05:00
Eric Banks 09a5a9eac0 Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number. 2011-12-14 10:43:52 -05:00
Eric Banks d3f4a5a901 Fail gracefully when encountering malformed VCFs without enough data columns 2011-12-14 10:37:38 -05:00
Eric Banks d7d8b8e380 Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now. 2011-11-28 14:18:28 -05:00
Mark DePristo 9ea7b70a02 Added decode method to LazyGenotypesContext
-- AbstractVCFCodec calls this if the samples are not sorted.  Previously called getGenotypes() which didn't actually trigger the decode
2011-11-21 16:21:23 -05:00
Mark DePristo 2e9ecf639e Generalized interface to LazyGenotypesContext
-- Now you provide a LazyParsing object
-- LazyGenotypesContext now knows nothing about the VCF parser itself.  The parser holds all of the necessary data to parse the VCF genotypes when necessarily, and the LGC only has a pointer to this object
-- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version
-- Deleted VCFParser interface, as it was no longer necessary
2011-11-21 09:30:40 -05:00
Mark DePristo 9cb3fe3a59 Vastly better way of doing on-demand genotyping loading
-- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading.
-- This new class was replaced all of the old, complex functionality
-- Better still, there were many cases were the genotypes were being loaded unnecessarily, resulting in efficiency.  This was detected because some of the integration tests changed as the genotypes were no longer being parsing unnecessarily
-- Misc. bug fixes throughout the system
-- Bug fixes for PhaseByTransmission with new GenotypesContext
2011-11-20 08:23:09 -05:00
Mark DePristo 73119c8e3c Merge with master
-- A few bug fixes
2011-11-19 09:56:06 -05:00
Mark DePristo 6cf315e17b Change interface to getNegLog10PError to getLog10PError 2011-11-18 21:07:30 -05:00
Mark DePristo 660d6009a2 Documentation and contracts for GenotypesContext and VariantContextBuilder 2011-11-18 13:59:30 -05:00
Mark DePristo f54afc19b4 VariantContextBuilder
-- New approach to making VariantContexts modeled on StringBuilder
-- No more modify routines -- use VariantContextBuilder
-- Renamed isPolymorphic to isPolymorphicInSamples.   Same for mono
-- getChromosomeCount -> getCalledChrCount
-- Walkers changed to use new VariantContext.  Some deprecated new VariantContext calls remain
-- VCFCodec now uses optimized cached information to create GenotypesContext.
2011-11-18 12:39:10 -05:00
Mark DePristo 7490dbb6eb First version of VariantContextBuilder 2011-11-18 11:06:15 -05:00
Mark DePristo aa0610ea92 GenotypeCollection renamed to GenotypesContext 2011-11-16 16:24:05 -05:00
Mark DePristo 460a51f473 ID field now stored in the VariantContext itself, not the attributes 2011-11-15 14:56:33 -05:00
Mark DePristo f0234ab67f GenotypeMap -> GenotypeCollection part 2
-- Code actually builds
2011-11-14 17:42:55 -05:00
Mark DePristo 1fbdcb4f43 GenotypeMap -> GenotypeCollection 2011-11-14 15:32:03 -05:00
Mark DePristo fee9b367e4 VariantContext genotypes are now stored as GenotypeMap objects
-- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples
-- All instances in the gatk that used Map<String, Genotype> now use GenotypeMap type.
-- Amazingly, there were many places where HashMap<String, Genotype> is used, so that the order of the genotypes is technically undefined and could be dangerous.  Now everything uses GenotypeMap with a specific ordering of samples (by name)
-- Integrationtests updated and all pass
2011-11-11 15:00:35 -05:00
Mark DePristo ee40791776 Attributes are now Map<String,Object> not Map<String,?>
-- Allows us to avoid an unnecessary copy when creating InferredGeneticContext (whose name really needs to change).
2011-11-11 09:55:42 -05:00
Mark DePristo 67b022c34b Cleanup for new SampleUtils function
-- getVCFHeadersFromRods(rods) is now available so that you don't have getVCFHeadersFromRods(rods, null) throughout the codebase
2011-11-10 13:27:13 -05:00
Mark DePristo 35fe9c8a06 Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-10 11:11:33 -05:00
Eric Banks 21bf43f3bb Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2011-11-09 15:34:40 -05:00
Christopher Hartl d828eba7f4 Allow comments in a table-formatted file to precede the header line. 2011-11-09 15:27:38 -05:00
Eric Banks 8205efbb29 Merge branch 'master' into intervals 2011-11-09 15:27:15 -05:00
Eric Banks d64f8a89a9 Instead of the SelfScopingFeatureCodec interface, pushed this functionality into Tribble itself. Now we can e.g. determine that a file can be parsed by the BedCodec on the fly. 2011-11-09 15:24:29 -05:00
Mark DePristo e639f0798e mergeEvals allows you to treat -eval 1.vcf -eval 2.vcf as a single call set
-- A bit of code cleanup in VCFUtils
-- VariantEval table to create 1000G Phase I variant summary table
-- First version of 1000G Phase I summary table Qscript
2011-11-09 14:35:50 -05:00
Christopher Hartl 091229e4db MVLikelihoodRatio now checks if the family string is provided before attempting to instantiate. Also check that variant contexts have both genotypes and genotype likelihoods.
Table codec now yells at users for not providing a HEADER with the table - parsing tables without a header line was causing the first line of the file to be eaten.
Table feature now has a toString method.

These are minor bug fixes.
2011-11-09 11:03:29 -05:00
Mark DePristo e99871f587 Bug fix for decode loc
-- decodeLoc() wasn't skipping input header lines, so the system blew up when there was an = line being split.
2011-11-04 13:20:54 -04:00
Mark DePristo a340a1aeac Bug fix. decodeLoc() should update lineNo so you get meaningful line no when indexing
due to malformed VCF files.
2011-11-04 11:44:24 -04:00
Eric Banks 31ee5432c5 Merged bug fix from Stable into Unstable 2011-10-31 14:56:59 -04:00
Eric Banks f62af0291b Check for invalid VCF records (not enough tokens) instead of assuming they are there. 2011-10-31 14:09:51 -04:00
Mark DePristo 1a92ee3593 No longer adds a binding of ID -> . when the ID field is dot in the VCF
-- Really we should make ID a primary key in VariantContext.  Putting it into the attributes is just annoying now
2011-10-18 10:57:02 -04:00
Eric Banks c3eff7451a Found a small inefficiency while profiling: we were still using String.split instead of ParsingUtils.split to break up array values in the INFO field. There was a noticeable (albeit not big) difference in the change when reading sites only files. 2011-10-03 14:20:39 -04:00
Eric Banks 8f8b59a932 My interpretation of the VCF spec is that the FORMAT field should only be present if there is genotype/sample data. So the VCFCodec now throws an exception when it encounters such a case. I had to fix one of the integration test VCFs. 2011-09-21 22:23:28 -04:00
Mark DePristo 6bcfce225f Fix for dynamic type determination for bgzip files
-- GZipInputStream handles bgzip files under linux, but not mac
-- Added BlockCompressedInputStream test as well, which works properly on bgzip files
2011-09-21 15:39:19 -04:00
Mark DePristo 9f6f0c443c Marginally cleaner isVCFStream() function
-- cleanup trying to debug minor bug.  Failed to fix the bug, but the code is nicer now
2011-09-21 15:25:01 -04:00
Eric Banks 174859fc68 Don't allow whitespace in the INFO field 2011-09-21 11:14:54 -04:00
Mark DePristo b7511c5ff3 Fixed long-standing bug in tribble index creation
-- Previously, on the fly indices didn't have dictionary set on the fly, so the GATK would read, add dictionary, and rewrite the index.  This is now fixed, so that the on the fly index contains the reference dictionary when first written, avoiding the unnecessary read and write
-- Added a GenomeAnalysisEngine and Walker function called getMasterSequenceDictionary() that fetches the reference sequence dictionary.  This can be used conveniently everywhere, and is what's written into the Tribble index
-- Refactored tribble index utilities from RMDTrackBuilder into IndexDictionaryUtils
-- VCFWriter now requires the master sequence dictionary
-- Updated walkers that create VCFWriters to provide the master sequence dictionary
2011-09-20 10:53:18 -04:00
Mark DePristo aa8afa3899 Merge 2011-09-19 21:16:47 -04:00
Christopher Hartl 8143def292 Fix the -T argument in the DepthOfCoverage docs
Add documentation for the RefSeqCodec, pointing users to the wiki page describing how to create the file
2011-09-19 12:31:47 -04:00
Christopher Hartl ca1b30e4a4 Fix the -T argument in the DepthOfCoverage docs
Add documentation for the RefSeqCodec, pointing users to the wiki page describing how to create the file
2011-09-19 10:29:06 -04:00
David Roazen 1213b2f8c6 SnpEff 2.0.2 support
-Rewrote SnpEff support in VariantAnnotator to support the latest SnpEff release (version 2.0.2)
-Removed support for SnpEff 1.9.6 (and associated tribble codec)
-Will refuse to parse SnpEff output files produced by unsupported versions (or without a version tag)
-Correctly matches ref/alt alleles before annotating a record, unlike the previous version
-Correctly handles indels (again, unlike the previous version
2011-09-14 07:09:47 -04:00
Mark DePristo fe5724b6ea Refactored indexing part of StandardVCFWriter into superclass
-- Now other implementations of the VCFWriter can easily share common functions, such as writing an index on the fly
2011-09-07 23:27:08 -04:00
Mark DePristo 03aa04e37c Simple refactoring to make formating functions public 2011-09-02 21:13:08 -04:00