gatk-3.8

Commit Graph

Author	SHA1	Message	Date
hanna	7428ae338a	A fix for Marian Thieme's NPE in the new sharding system. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5675 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-21 19:47:14 +00:00
hanna	fece2167b3	Prototype implementation of protoshard merging when protoshard n and protoshard n+1 completely overlap. Gives a small but consistent performance increase in non-intervaled whole exome traversals (2.79min original, 2.69min revised). Needs a more in-depth analysis of optimal shard sizing to determine a true optimum. Also renamed a variable because Khalid disapproved of my naming choices. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5595 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-08 02:09:14 +00:00
hanna	deab9f0aa5	Initial work on proto-shard merger: - create size() method that returns an approximation of the uncompressed size in bytes of BAM span. I'll use this method as a protoshard weighting function until we determine how to normalize the weights across the different data access mechanisms (reads, reference, RODs). - Implementations of basic union/intersection/subtraction mechanisms for BAM spans; should be enough to get an accurate weight for two proto-shards put together. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5541 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-30 22:03:43 +00:00
hanna	e75366f738	Fixed performance issue in protosharding code -- turns out that the index optimizer was mutating the data stored in the indices. Protosharding still disabled by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5334 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 17:32:12 +00:00
hanna	600f73cbd6	A checkpoint commit of two BAM reading projects going on simultaneously. These two projects are works in progress, and this checkin will provide a baseline against which to gauge improvements to both projects. Low-memory BAM protoshards (disabled by default): - Currently passing ValidatingPileupIntegrationTest. - Gets progressively slower throughout the traversal, but should run at least as fast as original implementation. - Uses 10+ file handles per BAM, but should use 3. BAM performance microbenchmark test system: - Currently tests performance of BAM reading using SAM-JDK vs. GATK - Tests do not hit all GATK performance hotspots. - New tests that require input data in a slightly different form are hard to implement. - Output of test results is not easily parseable (investigating Google Caliper for possible improvements). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5317 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-25 17:50:32 +00:00
hanna	b992abb6eb	A few more unit tests plus some extra functionality for BAM index visualization. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5222 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-09 01:51:34 +00:00
hanna	5c3198520c	A few minor modifications masquerading as significant changes according to svn's logs: - Copied BAM indexing engine from Picard back into the GATK anticipating shard merging algorithm. Tried to leave most of the building blocks in Picard. If this turns into a logistical nightmare, I'll merge the building blocks into the GATK as well. - Reorganized the org.broadinstitute.sting.gatk.datasources package, giving better separation of query and management functionality for reads, ref, rmd, and samples. - Merged Shard building blocks into org.broadinstitute.sting.gatk.datasources.reads package, indicating its current strong relationship with the reads, rather than the general unifying element I wish this would be. - Collapsed BAMFormatAwareShard into Shard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5184 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-03 17:59:19 +00:00
hanna	250c18e679	Error message fixes for the following issues: nvjpM4yOwQAu3fNGxi4oXLuVpKn6aAlf,1GL0OuXK2xKQfvbu34tWYgbojSVSLo0l, ehEGBJOfgc4V7qj8W0Homf5ICuVK5Sm3,cZsreLm1CbY3aYKZhV7DOSvQNwur41zp, GlrlyGEyP9kJDIRCQNFQp7BGJBXSzdDJ,hyz1uiHXr39ANmdZu9K1epOSX8EL3mDw, q0n4EucZESCI4LZhQik306zD4VAuH2cb. Messages: camrhG5tHzlY9WUSEVpVZGkU1tyJqKb5,s0OX2g7nYRctJxyFoQCa6clac9IsjHyi, THIAtjllvYNlnTmiMnJEIHd2Ju4gqQIO,jwVk3JYZJNHloW7HO4LeGxFexknqro0v, BFNRGOGmGGJNNPZqgeF1ikTNFfskbyLc,... Were fixed in 4392. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4428 348d0f76-0448-11de-a6fe-93d51630548a	2010-10-05 03:37:13 +00:00
depristo	7880863eb7	Final step in error refactoring. GATK exception is now ReviewedStingException, indicating that this exception is really what one wants. Only use this exception when you have thought about StingException vs. UserException and made a real decision. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4267 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 15:07:38 +00:00
depristo	7ad8fbdd5a	Moved GATKException to exceptions git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4266 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:47:19 +00:00
depristo	40e6179911	Penultimate step in exception system overhaul. UserError is now UserException. This class should be used for all communication with the USER for problems with their inputs. Engine now validates sequence dictionaries for compatibility, detecting not only lack of overlap but now inconsistent headers (b36 ref with v37 BAM, for example) as well as ref / bam order inconsistency. New -U option to allow users to tolerate dangerous seq dict issues. WalkerTest system now supports testing for exceptions (see email and wiki for docs). Tests for vcf and bam vs. ref incompatibility. Waiting on Tribble seq dict improvements to detect b36 VCF with b37 ref (currently cannot tell this is wrong). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4258 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-12 14:02:43 +00:00
depristo	1de713f354	Massive review of maybe 50% of the exceptions in the GATK. GATKException is a tmp. tracker so that I can tell which StingExceptions I've reviewed. Please don't use it. If you are working on new code and are considering throwing exceptions, it's either UserError or StingException, please git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4246 348d0f76-0448-11de-a6fe-93d51630548a	2010-09-09 23:21:17 +00:00
hanna	4995950d04	IndexedFastaSequenceFile is now in Picard; transitioning to that implementation. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3701 348d0f76-0448-11de-a6fe-93d51630548a	2010-07-01 04:40:31 +00:00
hanna	96662d8d1b	Moving from GATK dependencies on isolated classes checked into the GATK codebase to a dependency on a jar file compiled from my private picard branch. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3034 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-18 17:43:42 +00:00
hanna	a7fe07c404	A few stopgap fixes to get the GATK to the point where the old sharding infrastructure can be torn down: 1) New sharding system emulates old MonolithicSharding mechanism. 2) Better awareness of differences between fasta and BAM files when creating shards. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2948 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-07 21:01:25 +00:00
hanna	1ef1091f7c	Cleanup and simplification of read interval sharding. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2944 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-05 23:34:38 +00:00
hanna	023654696e	First pass at handling SAMFileReaders using a SAMReaderID. This allows us to firewall GATK users from the readers, which they could abuse in ways that could destabilize the GATK. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2923 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-04 00:59:32 +00:00
hanna	104f4f7383	Mediocre implementation of reader pooling within the SAM data source. Will fix this week. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2915 348d0f76-0448-11de-a6fe-93d51630548a	2010-03-02 22:35:02 +00:00
hanna	30eb28886b	Basic functionality for intervaled reads in new sharding system. Not currently filtering out cruft, so the mode of operation is currently queryOverlapping rather than queryContained. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2899 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-26 21:41:55 +00:00
hanna	1017a38f38	Initial refactoring of read traversal to make it easier to drop in intervalled reads traversal. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2894 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-26 15:09:09 +00:00
hanna	88d0677379	Misc correctness enhancements: develop the bin selector into a recursive algorithm and return a shard when reads are missing. Also improve the performance of the read filter that clips reads not actually present in the shard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2870 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-22 22:19:06 +00:00
hanna	cc09f48cd8	Correctness fix: index can concat chunks around shard edges, and my code didn't account for that. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2861 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-19 21:44:33 +00:00
hanna	71f18e941f	Significant performance improvements made by subtracting out the contents of the prior highest-level bin. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2859 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-19 16:46:16 +00:00
hanna	232d884578	Got back most of the performance lost when I fixed the dropped reads problem. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2835 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-12 19:59:56 +00:00
hanna	77af5822d4	Correcting my incomplete understanding of how the BAM file index actually works. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2833 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-12 16:15:19 +00:00
hanna	34e566c90d	Fixed bug where new sharding system wasn't grabbing the reads that start at the end of a bin. Caused by what I currently believe to be a bug in Picard -- will verify with Alec. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2826 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-11 17:00:04 +00:00
hanna	dc885ba386	Fix for some correctness bugs found during early performance testing, phase 1. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2822 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-10 22:32:25 +00:00
hanna	0250338ce7	Basic use cases for merging BAM files with the new sharding system work. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2815 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-09 22:14:37 +00:00
hanna	57b8c9a53c	Supporting infrastructure for merging SAM files. Not yet integrated into the datasource. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2810 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-08 23:59:38 +00:00
hanna	e53432d54d	Checkpoint for combining adjacent intervals into the same shard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2782 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-05 02:48:02 +00:00
hanna	3f35e181d5	Add an alternate implementation of the BAM file reader that keeps the entire index in memory. Initial revision of BAMFileStat, a tool to inspect BAM file BGZF blocks and index entries. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2769 348d0f76-0448-11de-a6fe-93d51630548a	2010-02-03 19:48:15 +00:00
hanna	668c7da33d	Bug fix in custom override of queryOverlapping. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2743 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 21:35:59 +00:00
hanna	e7f5c93fe5	Cleaning up the inheritance hierarchy from the previous commit. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2738 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 19:13:36 +00:00
hanna	3d922a019f	Basic support for very simple index-driven locus traversals. Interface has been changed to support batched intervals in a single shard, but intervals are not yet compressed into a single shard. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2730 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-29 03:14:26 +00:00
hanna	b19bb19f3d	First successful test of new sharding system prototype. Can traverse over reads from a single BAM file. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2587 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-15 03:35:55 +00:00
hanna	7893aaefe9	Updates to chunk iteration. Includes the return of the dreaded *2.java files; hopefully I can find a way to kill these off before the Picard patch is ready. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2550 348d0f76-0448-11de-a6fe-93d51630548a	2010-01-08 20:20:56 +00:00
hanna	497ae700c4	A rethink of the existing BAM block extraction code: rather than working in chunk space directly, stream data in block space, converting to chunk space on demand. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2484 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-30 18:19:51 +00:00
hanna	87ff2b15d4	First step in introducing a patch to Picard: create our ideal interface into the BAM file for sharding. This commit can iterate over the BAM file, pulling out information about the blocks in the file without actually loading or decompressing the reads. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2434 348d0f76-0448-11de-a6fe-93d51630548a	2009-12-23 21:35:08 +00:00

38 Commits (3ffc2ccd81087453472c7d0200862cf4e9d5fa91)