gatk-3.8

Author	SHA1	Message	Date
Eric Banks	2799a1b686	Catch exception for bad type and throw as a TribbleException	2012-01-25 12:15:51 -05:00
Mark DePristo	763c81d520	No longer enforce MAX_ALLELE_SIZE in VCF codec -- Instead issue a warning when a large (>1MB) record is encountered -- Optimized ref.getBytes()[i] => (byte)ref.charAt(i), which avoids an implicit O(n) allocation each iteration through computeReverseClipping()	2012-01-18 07:35:11 -05:00
Mark DePristo	0c7865fdb5	UnitTest for reverseAlleleClipping -- No code modified yet, just implementing a unit test to ensure correctness of the existing code	2012-01-18 07:35:11 -05:00
Mark DePristo	62801e430a	Bugfix for unnecessary optimization -- don't cache the ref bytes	2012-01-17 16:40:26 -05:00
Mark DePristo	f2b0575dee	Detect unreasonably large allele strings (>2^16) and throw an error -- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places. -- Tribble was updated so we actually could read the line properly (rev. to 51 here). -- Still the parsing algorithms in the GATK aren't happy with such a long allele. Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.	2012-01-17 16:40:26 -05:00
Mark DePristo	b06074d6e7	Updated SortingVCFWriterBase to use PriorityBlockingQueue so that the class is thread-safe -- Uses PriorityBlockingQueue instead of PriorityQueue -- synchronized keywords added to all key functions that modify internal state. Note that this hasn't been tested extensively. Based on report: http://getsatisfaction.com/gsa/topics/missing_loci_output_in_multi_thread_mode_when_implement_sortingvcfwriterbase?utm_content=topic_link&utm_medium=email&utm_source=new_topic	2012-01-13 09:33:16 -05:00
Christopher Hartl	69661da37d	Moving ValidationSiteSelector to validation package in public under my ownership. JunctionGenotyper added and modified several times; this commit is due to merge conflict fixes.	2011-12-19 10:57:28 -05:00
Eric Banks	9497e9492c	Bug fix for complex records: do not ever reverse clip out a complete allele.	2011-12-14 11:21:28 -05:00
Eric Banks	09a5a9eac0	Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number.	2011-12-14 10:43:52 -05:00
Eric Banks	d3f4a5a901	Fail gracefully when encountering malformed VCFs without enough data columns	2011-12-14 10:37:38 -05:00
Eric Banks	d7d8b8e380	Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now.	2011-11-28 14:18:28 -05:00
Mark DePristo	9ea7b70a02	Added decode method to LazyGenotypesContext -- AbstractVCFCodec calls this if the samples are not sorted. Previously called getGenotypes() which didn't actually trigger the decode	2011-11-21 16:21:23 -05:00
Mark DePristo	2e9ecf639e	Generalized interface to LazyGenotypesContext -- Now you provide a LazyParsing object -- LazyGenotypesContext now knows nothing about the VCF parser itself. The parser holds all of the necessary data to parse the VCF genotypes when necessary, and the LGC only has a pointer to this object -- Using new interface added LazyGenotypesContext to unit tests with a simple lazy version -- Deleted VCFParser interface, as it was no longer necessary	2011-11-21 09:30:40 -05:00
Mark DePristo	9cb3fe3a59	Vastly better way of doing on-demand genotype loading -- With our GenotypesContext class we can naturally create a LazyGenotypesContext subclass that does the on-demand loading. -- This new class replaced all of the old, complex functionality -- Better still, there were many cases where the genotypes were being loaded unnecessarily, resulting in inefficiency. This was detected because some of the integration tests changed as the genotypes were no longer being parsed unnecessarily -- Misc. bug fixes throughout the system -- Bug fixes for PhaseByTransmission with new GenotypesContext	2011-11-20 08:23:09 -05:00
Mark DePristo	73119c8e3c	Merge with master -- A few bug fixes	2011-11-19 09:56:06 -05:00
Mark DePristo	6cf315e17b	Change interface to getNegLog10PError to getLog10PError	2011-11-18 21:07:30 -05:00
Mark DePristo	660d6009a2	Documentation and contracts for GenotypesContext and VariantContextBuilder	2011-11-18 13:59:30 -05:00
Mark DePristo	f54afc19b4	VariantContextBuilder -- New approach to making VariantContexts modeled on StringBuilder -- No more modify routines -- use VariantContextBuilder -- Renamed isPolymorphic to isPolymorphicInSamples. Same for mono -- getChromosomeCount -> getCalledChrCount -- Walkers changed to use new VariantContext. Some deprecated new VariantContext calls remain -- VCFCodec now uses optimized cached information to create GenotypesContext.	2011-11-18 12:39:10 -05:00
Mark DePristo	7490dbb6eb	First version of VariantContextBuilder	2011-11-18 11:06:15 -05:00
Mark DePristo	aa0610ea92	GenotypeCollection renamed to GenotypesContext	2011-11-16 16:24:05 -05:00
Mark DePristo	460a51f473	ID field now stored in the VariantContext itself, not the attributes	2011-11-15 14:56:33 -05:00
Mark DePristo	f0234ab67f	GenotypeMap -> GenotypeCollection part 2 -- Code actually builds	2011-11-14 17:42:55 -05:00
Mark DePristo	1fbdcb4f43	GenotypeMap -> GenotypeCollection	2011-11-14 15:32:03 -05:00
Mark DePristo	fee9b367e4	VariantContext genotypes are now stored as GenotypeMap objects -- Enables further sophisticated optimizations, as this class can be smarter about storing the data and will directly support operations like subset to samples -- All instances in the GATK that used Map<String, Genotype> now use the GenotypeMap type. -- Amazingly, there were many places where HashMap<String, Genotype> was used, so that the order of the genotypes was technically undefined and could be dangerous. Now everything uses GenotypeMap with a specific ordering of samples (by name) -- Integration tests updated and all pass	2011-11-11 15:00:35 -05:00
Mark DePristo	ee40791776	Attributes are now Map<String,Object> not Map<String,?> -- Allows us to avoid an unnecessary copy when creating InferredGeneticContext (whose name really needs to change).	2011-11-11 09:55:42 -05:00
Mark DePristo	67b022c34b	Cleanup for new SampleUtils function -- getVCFHeadersFromRods(rods) is now available so that you don't have getVCFHeadersFromRods(rods, null) throughout the codebase	2011-11-10 13:27:13 -05:00
Mark DePristo	35fe9c8a06	Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-10 11:11:33 -05:00
Eric Banks	21bf43f3bb	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-11-09 15:34:40 -05:00
Christopher Hartl	d828eba7f4	Allow comments in a table-formatted file to precede the header line.	2011-11-09 15:27:38 -05:00
Eric Banks	8205efbb29	Merge branch 'master' into intervals	2011-11-09 15:27:15 -05:00
Eric Banks	d64f8a89a9	Instead of the SelfScopingFeatureCodec interface, pushed this functionality into Tribble itself. Now we can e.g. determine that a file can be parsed by the BedCodec on the fly.	2011-11-09 15:24:29 -05:00
Mark DePristo	e639f0798e	mergeEvals allows you to treat -eval 1.vcf -eval 2.vcf as a single call set -- A bit of code cleanup in VCFUtils -- VariantEval table to create 1000G Phase I variant summary table -- First version of 1000G Phase I summary table Qscript	2011-11-09 14:35:50 -05:00
Christopher Hartl	091229e4db	MVLikelihoodRatio now checks if the family string is provided before attempting to instantiate. Also check that variant contexts have both genotypes and genotype likelihoods. Table codec now yells at users for not providing a HEADER with the table - parsing tables without a header line was causing the first line of the file to be eaten. Table feature now has a toString method. These are minor bug fixes.	2011-11-09 11:03:29 -05:00
Mark DePristo	e99871f587	Bug fix for decode loc -- decodeLoc() wasn't skipping input header lines, so the system blew up when there was an = line being split.	2011-11-04 13:20:54 -04:00
Mark DePristo	a340a1aeac	Bug fix. decodeLoc() should update lineNo so you get meaningful line no when indexing due to malformed VCF files.	2011-11-04 11:44:24 -04:00
Eric Banks	31ee5432c5	Merged bug fix from Stable into Unstable	2011-10-31 14:56:59 -04:00
Eric Banks	f62af0291b	Check for invalid VCF records (not enough tokens) instead of assuming they are there.	2011-10-31 14:09:51 -04:00
Mark DePristo	1a92ee3593	No longer adds a binding of ID -> . when the ID field is dot in the VCF -- Really we should make ID a primary key in VariantContext. Putting it into the attributes is just annoying now	2011-10-18 10:57:02 -04:00
Eric Banks	c3eff7451a	Found a small inefficiency while profiling: we were still using String.split instead of ParsingUtils.split to break up array values in the INFO field. There was a noticeable (albeit not big) difference in the change when reading sites only files.	2011-10-03 14:20:39 -04:00
Eric Banks	8f8b59a932	My interpretation of the VCF spec is that the FORMAT field should only be present if there is genotype/sample data. So the VCFCodec now throws an exception when it encounters such a case. I had to fix one of the integration test VCFs.	2011-09-21 22:23:28 -04:00
Mark DePristo	6bcfce225f	Fix for dynamic type determination for bgzip files -- GZipInputStream handles bgzip files under linux, but not mac -- Added BlockCompressedInputStream test as well, which works properly on bgzip files	2011-09-21 15:39:19 -04:00
Mark DePristo	9f6f0c443c	Marginally cleaner isVCFStream() function -- cleanup trying to debug minor bug. Failed to fix the bug, but the code is nicer now	2011-09-21 15:25:01 -04:00
Eric Banks	174859fc68	Don't allow whitespace in the INFO field	2011-09-21 11:14:54 -04:00
Mark DePristo	b7511c5ff3	Fixed long-standing bug in tribble index creation -- Previously, on the fly indices didn't have dictionary set on the fly, so the GATK would read, add dictionary, and rewrite the index. This is now fixed, so that the on the fly index contains the reference dictionary when first written, avoiding the unnecessary read and write -- Added a GenomeAnalysisEngine and Walker function called getMasterSequenceDictionary() that fetches the reference sequence dictionary. This can be used conveniently everywhere, and is what's written into the Tribble index -- Refactored tribble index utilities from RMDTrackBuilder into IndexDictionaryUtils -- VCFWriter now requires the master sequence dictionary -- Updated walkers that create VCFWriters to provide the master sequence dictionary	2011-09-20 10:53:18 -04:00
Mark DePristo	aa8afa3899	Merge	2011-09-19 21:16:47 -04:00
Christopher Hartl	8143def292	Fix the -T argument in the DepthOfCoverage docs Add documentation for the RefSeqCodec, pointing users to the wiki page describing how to create the file	2011-09-19 12:31:47 -04:00
Christopher Hartl	ca1b30e4a4	Fix the -T argument in the DepthOfCoverage docs Add documentation for the RefSeqCodec, pointing users to the wiki page describing how to create the file	2011-09-19 10:29:06 -04:00
David Roazen	1213b2f8c6	SnpEff 2.0.2 support -Rewrote SnpEff support in VariantAnnotator to support the latest SnpEff release (version 2.0.2) -Removed support for SnpEff 1.9.6 (and associated tribble codec) -Will refuse to parse SnpEff output files produced by unsupported versions (or without a version tag) -Correctly matches ref/alt alleles before annotating a record, unlike the previous version -Correctly handles indels (again, unlike the previous version)	2011-09-14 07:09:47 -04:00
Mark DePristo	fe5724b6ea	Refactored indexing part of StandardVCFWriter into superclass -- Now other implementations of the VCFWriter can easily share common functions, such as writing an index on the fly	2011-09-07 23:27:08 -04:00
Mark DePristo	03aa04e37c	Simple refactoring to make formating functions public	2011-09-02 21:13:08 -04:00

80 Commits (2cd33b2f1f20abc481e7ab77d9a6650933a6e249)