65 Commits (ffbd4d85f2e0112b32df0bbba00330b00a0806cf)

Author SHA1 Message Date
Mark DePristo 0ab4022f23 Final r119 tribble jar 2012-11-02 14:30:33 -04:00
Mark DePristo f8a0a947e3 Critical bugfix for GSA-652 / Multi-threaded VCF -> BCF writing produces invalid intermediate file that fails on merging
-- New tribble library now uses 64 bit sizes.  The 26K VCF has so much data that low-level tribble block indices were overflowing their int size values.  This includes a to-be-committed tribble jar that fixes this problem
-- See https://jira.broadinstitute.org/browse/GSA-652
-- Minor cleanup of error messages that were useful on the way to solving this monster problem
2012-11-02 09:09:59 -04:00
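The overflow described in the commit above can be illustrated with a minimal Java sketch. The class and method names are hypothetical and are not taken from tribble; the sketch only shows why a running byte count that can exceed Integer.MAX_VALUE has to be held in a long.

```java
// Hypothetical illustration of 32-bit size overflow; not tribble code.
final class BlockSizeOverflowDemo {
    // Accumulates block sizes in an int: wraps around silently past 2^31 - 1.
    static int totalSizeAsInt(long[] blockSizes) {
        int total = 0;
        for (long size : blockSizes) {
            total += (int) size; // int arithmetic, overflow is not reported
        }
        return total;
    }

    // Accumulates block sizes in a long: safe for files well beyond 2 GB.
    static long totalSizeAsLong(long[] blockSizes) {
        long total = 0L;
        for (long size : blockSizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] sizes = {1_500_000_000L, 1_500_000_000L};
        System.out.println("int total:  " + totalSizeAsInt(sizes));
        System.out.println("long total: " + totalSizeAsLong(sizes));
    }
}
```

With two blocks of 1.5 billion bytes each, the int total comes out negative while the long total is the expected 3,000,000,000.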
Mark DePristo 61f0c46423 Rev tribble to 110. Log is:
Optimization for PositionalBufferedStream with specialized read(byte[], int, int) method

-- For binary codecs having an efficient reader of lots of bytes that doesn't fall back into read() itself vastly improves performance. The old version was 10x slower than InputStream, while the new version is +30%.
-- Generalize PositionalBufferedStream main() method for performance testing, now accepts cmdline arguments for the file to read, how many iterations, etc

Generalize AsciiLineReader main() method for performance testing
-- Now accepts cmdline arguments for the file to read, how many iterations, etc

AsciiLineReaderTest and PositionBufferedStreamTest were in src not test/src
2012-06-26 15:28:32 -04:00
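The idea behind the read(byte[], int, int) optimization in the commit above can be sketched as follows. This is not the PositionalBufferedStream implementation; PositionTrackingStream and its helper are hypothetical names, and the sketch assumes a simple wrapper that tracks position and ignores mark/reset. The point is that the bulk read delegates to the wrapped stream in one call instead of looping over single-byte read().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical position-tracking wrapper; a sketch, not tribble's class.
final class PositionTrackingStream extends FilterInputStream {
    private long position = 0L;

    PositionTrackingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) position++;
        return b;
    }

    // Specialized bulk read: a single call to the wrapped stream,
    // rather than falling back to read() once per byte.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = in.read(buf, off, len);
        if (n > 0) position += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(n);
        position += skipped;
        return skipped;
    }

    long getPosition() {
        return position;
    }

    // Convenience helper for the demo: position after one bulk read of len bytes.
    static long positionAfterBulkRead(byte[] data, int len) {
        try (PositionTrackingStream s =
                 new PositionTrackingStream(new ByteArrayInputStream(data))) {
            s.read(new byte[len], 0, len);
            return s.getPosition();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(positionAfterBulkRead(new byte[10], 4));
    }
}
```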
Mark DePristo 373ae39e86 Testing of BCF codec
-- Rev.d tribble
-- Minor code cleanup
-- BCF2 encoder / decoder use Double not Float internally everywhere
-- Generalized VC testing framework
2012-05-24 10:57:01 -04:00
Mark DePristo a90482c772 Rev. tribble to v101 with another putative open file leak fix
Scalability bugfixes; can issue tens of thousands of queries to a reader
without opening too many files

-- Fixed missing close() statement in TribbleIndexedFeatureReader
-- Fixed NPE in TabixIteratorLineReader
-- Added scalability test that confirms .query() failure and subsequent fix

Note this actually fixes a tested and reproducible scalability issue.  Might not be the only one but I believe it should do the trick.  Sorry everyone for the inconvenience.  Note that we now have a test in Tribble to ensure this doesn't happen again.
2012-05-04 15:40:41 -04:00
Mark DePristo fa84d50a2b Rev. tribble for putative bugfixes for not closing streams 2012-05-04 10:20:46 -04:00
Mark DePristo 0f4cc1884d Rev to tribble 99, optimized AsciiFeatureCodec
-- Removed tmp. GeneralizedFeatureCodec
-- BCF2 Reader update to use new style, but this entire class can be deleted now
-- Rev. tribble to r99
2012-05-03 07:31:48 -04:00
Mark DePristo 43d97c2e00 Rev Tribble to r97, adding binary feature support
From tribble logs:

Binary feature support in tribble

-- Massive refactoring and cleanup
-- Many bug fixes throughout
-- FeatureCodec is now general, with decode etc. taking a PositionBufferedStream
as an argument not a String
-- See ExampleBinaryCodec for an example binary codec
-- AbstractAsciiFeatureCodec provides to its subclass the same String decode,
readHeader functionality before.  Old ASCII codecs should inherit from this base
class, and will work without additional modifications
-- Split AsciiLineReader into a position tracking stream
(PositionalBufferedStream).  The new AsciiLineReader takes as an argument a
PositionalBufferedStream and provides the readLine() functionality of before.
Could potentially use optimizations (it's a TODO in the code)
-- The Positional interface includes some more functionality that's now
necessary to support the more general decoding of binary features
-- FeatureReaders now work using the general FeatureCodec interface, so they can
index binary features
-- Bugfixes to LinearIndexCreator off by 1 error in setting the end block
position
-- Deleted VariantType, since this wasn't used anywhere and it's a particularly
clean way of thinking about the problem
-- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package
-- TabixReader requires an AsciiFeatureCodec as it's currently only implemented
to handle line oriented records
-- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles
Ascii and binary features
-- Removed unused functions here and there as encountered
-- Fixed build.xml to be truly headless
-- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a
value and the position in the file where the header ends (not inclusive).
TribbleReaders now skip the header if the position is set, so it's no longer
necessary, if one implements the general readHeader(PositionalBufferedStream)
version, to see header lines in the decode functions.  Necessary for binary
codecs but a nice side benefit for ascii codecs as well
-- Cleaned up the IndexFactory interface so there's a truly general createIndex
function that takes the enumerated index type.  Added a writeIndex() function
that writes an index to disk.
-- Vastly expanded the index unit tests and reader tests to really test linear,
interval, and tabix indexed files.  Updated test.bed, and created a tabix
version of it as well.
-- Significant BinaryFeaturesTest suite.
-- Some test files have indent changes
2012-05-03 07:31:48 -04:00
Mark DePristo 58c470a6c5 Rev'ing Tribble from 53 to 94
-- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code
-- Integration tests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase
2012-05-03 07:31:47 -04:00
Mark DePristo b0560f9440 Rev. tribble to fix BED codec bug in tribble 51 2012-01-17 16:40:26 -05:00
Mark DePristo f2b0575dee Detect unreasonably large allele strings (>2^16) and throw an error
-- samtools can emit alleles where the ref is 42M Ns and this caused the GATK (via tribble) to hang in several places.
-- Tribble was updated so we actually could read the line properly (rev. to 51 here).
-- Still the parsing algorithms in the GATK aren't happy with such a long allele.  Instead of optimizing the code around an improper use case I put in a limit of 2^16 bp for any allele, and throw a meaningful exception when encountered.
2012-01-17 16:40:26 -05:00
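The allele length guard described in the commit above might look roughly like the following Java sketch. The class name, method names, and the use of IllegalArgumentException are assumptions made for illustration; only the 2^16 bp limit comes from the commit message.

```java
// Hypothetical sketch of an allele length guard; not the GATK code.
final class AlleleLengthCheck {
    // Limit taken from the commit message: 2^16 bases.
    static final int MAX_ALLELE_LENGTH = 1 << 16;

    static boolean isTooLong(String alleleBases) {
        return alleleBases.length() > MAX_ALLELE_LENGTH;
    }

    // Fails fast with a descriptive message instead of letting
    // downstream parsing grind through an enormous allele.
    static void validate(String alleleBases) {
        if (isTooLong(alleleBases)) {
            throw new IllegalArgumentException(
                "Allele of length " + alleleBases.length()
                + " exceeds the maximum of " + MAX_ALLELE_LENGTH + " bp");
        }
    }

    public static void main(String[] args) {
        validate("ACGT"); // passes silently
        try {
            validate(new String(new char[MAX_ALLELE_LENGTH + 1]).replace('\0', 'N'));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```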
Matt Hanna c9eae32f6e Revving Tribble to actually close file handles when close() is called. 2011-11-30 22:42:21 -05:00
Eric Banks d7d8b8e380 Tribble v42 changes the Codec.canDecode method to take in a String instead of a File; this is something that Jim was adamant about (because Tribble can handle streams other than files). I didn't want the next person who needed to rev Tribble to deal with this change additionally, so I took care of updating the GATK now. 2011-11-28 14:18:28 -05:00
Eric Banks d64f8a89a9 Instead of the SelfScopingFeatureCodec interface, pushed this functionality into Tribble itself. Now we can e.g. determine that a file can be parsed by the BedCodec on the fly. 2011-11-09 15:24:29 -05:00
Eric Banks 6297561326 Adding the new jar 2011-11-07 15:08:19 -05:00
Eric Banks aa0c8c3600 Revving Tribble jar to v40. Our last jar was busted. 2011-11-07 11:30:08 -05:00
Mark DePristo 34f435565c Accidentally committed unclean tribble jar to repo 2011-09-21 10:16:17 -04:00
Mark DePristo 827c942c80 Rev tribble 2011-09-20 14:01:14 -04:00
Eric Banks da9c8ab386 Revving the Tribble jar where the DbsnpCodec class was renamed to OldDbsnpCodec. Updating GATK code accordingly. 2011-09-06 20:39:42 -04:00
Mark DePristo 0b794b5491 Revving Tribble to 23 2011-09-01 10:43:03 -04:00
Mark DePristo d604019362 Finished my broken tribble code. Updated to rev 22 2011-08-30 16:56:48 -04:00
Mark DePristo 173ca1e215 Reverting tribble temporarily while I fix my subtle problems 2011-08-30 11:08:13 -04:00
Mark DePristo 427c643ce7 The missing tribble jar 2011-08-29 18:46:40 -04:00
Mark DePristo 5defaf5fac Continuing to improve Tribble
-- ProfileRodSystem now has a just load index mode, allowing us to optimize the profiler
-- assessFarmNodes R script for making nice plots of performance of jobs on the farm
-- Rev. tribble to use new, optimized index loading (performance win when loading many many indices)
2011-08-29 17:02:57 -04:00
David Roazen bd5cdb8a43 The tribble dependency is now handled through ivy. Revved tribble to r18 and removed obsolete build targets in build.xml 2011-08-11 16:38:29 -04:00
Mark DePristo 35ec82a467 Oops, need this 2011-07-17 13:08:08 -04:00
Mark DePristo 4db2b13e9e Rev tribble.
Just added more documentation for diffEngine and pointer to new wiki:

http://www.broadinstitute.org/gsa/wiki/index.php/DiffEngine
2011-07-17 13:05:04 -04:00
Mark DePristo a5bfcb1ed9 V15 is broken. Going up to v16 in a second. 2011-07-17 10:25:34 -04:00
Mark DePristo 2b55d5b7c0 Test tribble library where equals() ignores time stamps. 2011-07-16 16:45:55 -04:00
David Roazen 07b875c779 Renaming the updated tribble jar file to match the svn revision number. 2011-07-16 09:57:46 -04:00
Mark DePristo 5e7bc862a3 Rev tribble to include new equal() method that prints out details of why two indices are not the same. 2011-07-16 08:51:21 -04:00
David Roazen 643458d7db Updated the tribble jar -- this should fix most of the integration test
failures we've been seeing.

Note that with tribble's new svn repository the revision numbers have reset,
hence "revision 3"
2011-06-29 01:11:03 -04:00
droazen 171e20a111 Updated the tribble jar to revision 351
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6068 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-22 22:55:57 +00:00
droazen ab1de3bfda Updated the tribble jar to revision 350
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6065 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-22 22:55:46 +00:00
droazen 95614ce3d6 Updated the tribble jar to revision 345
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6033 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-22 22:53:25 +00:00
droazen 32a991c4d3 Updated the tribble jar to revision 343
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6031 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-22 22:53:17 +00:00
droazen 480598842c Updated the tribble jar
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5982 348d0f76-0448-11de-a6fe-93d51630548a
2011-06-13 18:00:09 +00:00
aaron b3fd145161 fix for a bug deep in the tribble indexing: if you had a single record in the first contig, the second contig's index blocks would point to the wrong file seek location, and you'd see no features in that contig. Thanks to Mark for finding this.  I'm not rev'ing the index version (which would cause all indexes to be rebuilt), since this seems like a pretty rare edge case.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3865 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 18:39:55 +00:00
aaron 9579aace1f updates to code dependent on Tribble, as well as the following Tribble changes:
- makes writing to disk optional for indexes using the indexCreator classes (allow the user to specify the index file, if null don't write it)
- removed some system.out debugging code
- fixed version checking in interval tree 
- made indexes store and return a LinkedHashSet for sequence names (to ensure they've preserved the ordering in the file)
- index creators now read the file before creating the index
- changed the Index.write() method to take a LEDataStream instead of a file
- removed the sequence dictionary code on the header
- added utils for getting LEDataStreams
- added a base Tribble exception




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3857 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-23 01:56:10 +00:00
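The LinkedHashSet change listed in the commit above relies on a standard property of java.util.LinkedHashSet: iteration follows insertion order, which a plain HashSet does not guarantee. A minimal sketch, with a hypothetical class and method name and made-up contig names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the Tribble index code.
final class SequenceNameOrder {
    // Returns the distinct sequence names in the order they first appear.
    static List<String> distinctInFileOrder(List<String> contigPerRecord) {
        Set<String> names = new LinkedHashSet<>(contigPerRecord);
        return new ArrayList<>(names);
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("chr2", "chr2", "chr1", "chr10", "chr1");
        System.out.println(distinctInFileOrder(records));
    }
}
```

Duplicates are dropped, and the remaining names keep the order in which they were first seen in the input.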
aaron 1cba81c16f updates to tribble with fixes for some bugs I've found in some new indexing code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3842 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 22:08:04 +00:00
aaron af6b5f000e updating the Tribble library; added writing of indexes to the index interface for working with the tree index.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3836 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 07:02:08 +00:00
aaron 250ab70fed update the Tribble library too.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3827 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 05:00:37 +00:00
aaron dff4c06763 Rev'ing Tribble with a special version that has excluded VCF 3.3
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3640 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-25 18:20:51 +00:00
aaron 54ae0b8e4e some updates to tribble for the svn commit that will follow
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3621 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-23 20:20:07 +00:00
aaron 5b87a00a5f updating with associated Tribble changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3605 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-22 07:54:15 +00:00
depristo 57a13805da GATK now uses an optimized indexing scheme in Tribble. 5x or more performance gain on files with many genotypes. Updated integrationtest that was failing and was clearly wrong. DB=; isn't a valid annotation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3596 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-19 21:36:41 +00:00
aaron 32f6781ac7 updating tribble with the VCF header changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3583 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 08:20:44 +00:00
ebanks 8c28be5933 Fixing a VCF bug for Sendu: we weren't emitting flags (booleans) correctly in VCF3.3 (rev'ed tribble for this).
Updated dbsnp/hapmap membership info fields to be flags now instead of ints.
While I was there, I added the change in the Annotator for Jan to force reads to be from a specific sample.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3536 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-11 16:42:06 +00:00
aaron e27951ab39 re-updating the VCF code to handle spaces in sample names
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3528 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-10 20:18:34 +00:00
aaron ad98512f6c adding changes so that we look at the headers already loaded by the engine for samples and other VCF utils, and not create readers for each file to get them (this caused Tribble to regenerate indices if the index file can't be written to disk).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3518 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-09 17:21:12 +00:00