Commit Graph

124 Commits (73a14a985b0a9a4ce8aeb304b0e084f91e8aa346)

Author SHA1 Message Date
hanna 46c14ec63f New, much less memory intensive implementation of BAM file sharding. Streams indices together with the expectation
that bins will be present in the bin sparse array, which avoids the problem of having to hold the sparse bin array
stored in every BAM file index in memory at the same time.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3075 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-25 17:41:22 +00:00
hanna 1f451e17e5 Changing preloaded index to only "preload" reference sequences on demand.
Results in drastic lowering of startup cost when multiple BAM files are 
merged.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3066 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-23 22:02:28 +00:00
hanna 884a577013 Phase 2 of Picard patch refactoring: kill off SAMFileReader2/BAMFileReader2, merging the changes back into the base classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3065 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-23 16:48:11 +00:00
asivache 543aefc3d7 Fixing the bug introduced with the earlier commit. When trimming locus to the current bases, we need to take into account expanded boundaries (for windowed reference traversals)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3059 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-22 19:20:34 +00:00
asivache d2944461ef We also have to allow the window to be (partially) outside the bounds and trimming to the contig size is not enough (thanks to shards). Now we trim to the current bounds too (i.e. if the interval is not completely within current bounds, we create reference context that contains only bases from the overlap between the interval and the bounds).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3057 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-22 17:36:29 +00:00
asivache 9053406798 LocusReferenceView: If the locus a view is requested for spans beyond the reference contig ends, create the actual window bounded by contig ends (so that the locus will not be fully contained in the window!!).
ReferenceContext: constructor does not throw an exception anymore when locus is not fully contained inside the window. So now we can have a reference context associated with a locus such that the window/actual bases do not cover the whole locus. Scary. I am not sure I like this...

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3056 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-22 15:59:15 +00:00
hanna b4b4e8d672 For Sarah Calvo: initial implementation of read pair traversal, for BAM files
sorted by read name.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3052 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-21 23:22:25 +00:00
hanna c0eb5c27ea Lower memory support for merged sharding. Merged sharding is still not available.
WARNING: If you update frequently, you might have to rm -rf ~/.ant/cache -- this is an unfortunate side effect of the way we
distribute picard-private.jar.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3050 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-19 22:03:47 +00:00
hanna 849bd1f451 Set the eagerDecode flag in such a way that the binary data block in the BAM will always be considered dirty.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3014 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 22:01:23 +00:00
hanna 59045ccb28 Filter,merge performs much better than merge,filter. Many thanks to Eric for checking in an integration test that so compellingly demonstrates this.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3011 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 16:23:37 +00:00
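
The ordering effect described in the commit above can be illustrated with a minimal Python sketch (not the GATK's Java code). It assumes each reader yields records already sorted by position. The names `keep`, `filter_then_merge`, and `merge_then_filter`, and the mapping-quality threshold, are hypothetical. Both orderings produce the same output, but filtering first means fewer records take part in the merge's comparisons.

```python
import heapq

def keep(record):
    # Hypothetical filter: retain records with mapping quality >= 20.
    position, mapq = record
    return mapq >= 20

def filter_then_merge(streams):
    # Filter each sorted stream lazily, so the merge only sees survivors.
    return list(heapq.merge(*(filter(keep, s) for s in streams)))

def merge_then_filter(streams):
    # Merge every record first, then discard the ones that fail the filter.
    return [r for r in heapq.merge(*streams) if keep(r)]

# Two position-sorted streams of (position, mapping quality) records.
streams = [
    [(1, 30), (4, 5), (9, 40)],
    [(2, 10), (3, 25), (8, 0)],
]

print(filter_then_merge(streams))
print(merge_then_filter(streams))
```

Filtering preserves the order within each stream, so the merged result is still sorted.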
hanna 6dd5f192e7 Performance improvements for RODs in conjunction with new sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3010 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-16 14:54:12 +00:00
hanna 45f70de6df Fixed bug that failed to reset an accumulator when crossing contig boundaries,
meaning that in special cases of shallow coverage, an interval might get dropped.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2999 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-15 04:45:55 +00:00
aaron 88a48821ea removed the dependence on removeRegion() in GenomeLocSortedSet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2993 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 22:35:49 +00:00
hanna 7aa7a5f9b8 Bug fixes for edge cases and filtration in the earlier performance fixes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2989 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-12 04:46:08 +00:00
hanna 5e8654fcdc Oops! Introduced a performance bug in read interval sharding, when the new sharding system is available. Track more state to avoid this problem in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2987 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-11 23:19:42 +00:00
aaron 661a043cef adding methods to get RODs by name or type in read traversals, performance improvements to RODs for Reads in general, and some more Tribble infrastructure.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2984 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-11 21:13:39 +00:00
hanna cbd529d544 Better chopping up of data for ref walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2982 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-11 20:13:26 +00:00
hanna a7ba88e649 Rework the way the MicroScheduler handles locus shards to handle intervals that span shards
with less memory consumption.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2981 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-11 18:40:31 +00:00
aaron dde9fd8a15 some rods-for-reads cleaning and performance improvements.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2979 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-10 22:54:58 +00:00
asivache c638c29eea In reference traversals, this view did not expect the possibility of TWO alignment contexts (base pileup followed by extended event pileup) associated with the same location. As a result, extended event pileups were silently skipped even when enabled in the traversal engine. Fixed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2970 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-09 22:18:44 +00:00
hanna e4360bac6a More comprehensive support when sharding for ref walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2951 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-08 11:25:20 +00:00
hanna eb165ca844 Celebrate the fact that the new sharding system works with integration tests
by removing the scary debug line.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2950 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 23:40:56 +00:00
hanna 9e107513d0 In the new sharding system, if no read group is present, hallucinate one. Added
for test compatibility, but not sure whether we still need this feature.  TODO: Poll the group about this feature.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2949 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 23:01:34 +00:00
hanna a7fe07c404 A few stopgap fixes to get the GATK to the point where the old sharding
infrastructure can be torn down:
1) New sharding system emulates old MonolithicSharding mechanism.
2) Better awareness of differences between fasta and BAM files when creating
   shards.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2948 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 21:01:25 +00:00
hanna dd6122f682 Fixed another bug in the original sharding system. Updated integration tests
as appropriate.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2947 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-07 15:32:18 +00:00
hanna ee2ec7ced9 Fix off-by-one error in original implementation of read sharding. Tested by
awking output of BamToFastq vs. samtools until the outputs matched exactly.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2945 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-06 18:52:53 +00:00
hanna 1ef1091f7c Cleanup and simplification of read interval sharding.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2944 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-05 23:34:38 +00:00
hanna 7a7e85188c Better eagerDecode default.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2938 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-05 16:42:23 +00:00
hanna adea38fd5e Sharding system fixes for corner cases generally related to lack of coverage
in the BAM file.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2928 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 18:59:21 +00:00
hanna 023654696e First pass at handling SAMFileReaders using a SAMReaderID. This allows us to firewall
GATK users from the readers, which they could abuse in ways that could destabilize the GATK.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2923 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-04 00:59:32 +00:00
aaron 790d2a7776 adding the initial ROD for Reads support; more convenience methods in ReadMetaDataTracker to come.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2918 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-03 15:56:44 +00:00
hanna 104f4f7383 Mediocre implementation of reader pooling within the SAM data source. Will fix this week.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2915 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-02 22:35:02 +00:00
hanna 6133d73bf0 Locus (non-intervalled) traversal with new sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2903 348d0f76-0448-11de-a6fe-93d51630548a
2010-03-01 01:58:44 +00:00
hanna 80f5d2829d Support for read interval sharding with proper filtering.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2902 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-27 20:26:34 +00:00
aaron d8fedd59be docs, cleanup, and some improvements to the iterators.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2901 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 22:36:04 +00:00
hanna b69c2d0f70 Cleanup. Remove some unnecessary methods.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2900 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 21:50:48 +00:00
hanna 30eb28886b Basic functionality for intervaled reads in new sharding system. Not currently filtering out cruft, so
the mode of operation is currently queryOverlapping rather than queryContained.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2899 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 21:41:55 +00:00
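
The difference between the two query modes named in the commit above can be sketched in Python. This assumes the usual meaning of the terms: an overlapping query returns every read that shares at least one base with the interval, while a contained query returns only reads lying entirely inside it. The coordinates are closed and 1-based, and the function names and sample reads are hypothetical.

```python
def overlaps(read_start, read_end, start, end):
    # True if the read shares at least one base with [start, end].
    return read_start <= end and read_end >= start

def contained(read_start, read_end, start, end):
    # True only if the whole read lies inside [start, end].
    return read_start >= start and read_end <= end

# Hypothetical reads as (start, end) pairs, and a query interval.
reads = [(90, 110), (100, 150), (180, 220), (300, 350)]
interval = (100, 200)

overlapping = [r for r in reads if overlaps(*r, *interval)]
contained_only = [r for r in reads if contained(*r, *interval)]

print(overlapping)     # reads touching the interval, including edge-spanning ones
print(contained_only)  # reads fully inside the interval
```

Reads that straddle an interval edge appear only in the overlapping result. They are the extra records that a contained-style filter would have to remove.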
aaron 622554d7bd disable a part of the ROD for Reads code until the rest of the system goes live
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2896 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 16:15:42 +00:00
hanna 1017a38f38 Initial refactoring of read traversal to make it easier to drop in intervalled reads traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2894 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-26 15:09:09 +00:00
aaron 246fa28386 RODs for reads phase 2: modified RODRecordList to implement List<ReferenceOrderedDatum> so I could stub it out for testing, added a FlashBackIterator which is needed to prevent the ResourcePool from opening infinity+1 iterators, and some other interfaces to make unit testing much smoother.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2892 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 22:48:55 +00:00
hanna 553d39bb00 Clean up the code a bit following the introduction of reduceByInterval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2887 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 01:20:22 +00:00
hanna 199b43fcf2 Reduce by interval alterations to interface with new sharding system. This checkin will be followed by a
simplification of some of the locus traversal code.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2886 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-25 00:16:50 +00:00
aaron fef1154fc8 starting on RODs for Reads: made RODRecordList implement list<RODatum> (so we can sub in fake lists during testing), and removed unnecessary generic-ness. Removed BrokenRODSimulator, which isn't being used.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2884 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-24 22:11:53 +00:00
hanna 491b30e8de Eliminate a few stray loci that weren't being filtered out.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2875 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-23 18:00:52 +00:00
hanna fff15944fe Bug fix. Stopping condition of recurrence stopped too soon in some cases where an interval *contained* zero reads but *overlapped* with some reads.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2874 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-23 15:58:54 +00:00
hanna a0e8de40cf Bug fix: at one locus in the dataset, two reads were dropped.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2872 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-22 23:54:52 +00:00
hanna 88d0677379 Misc correctness enhancements: develop the bin selector into a recursive algorithm and return a shard when reads are missing. Also improve the performance of the read filter that clips reads not actually present in the shard.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2870 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-22 22:19:06 +00:00
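
For context on the bins these commits refer to: the SAM/BAM specification's binning index assigns each region to the smallest bin that fully contains it, across six levels from 512 Mbp down to 16 kbp. Below is a Python transcription of the specification's `reg2bin` calculation, using zero-based, half-open coordinates. It illustrates the bin hierarchy only and is not the GATK's bin selector.

```python
def reg2bin(beg, end):
    # Return the smallest bin containing the zero-based half-open region [beg, end).
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins, starting at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins, starting at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins, starting at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins, starting at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins, starting at 1
    return 0                                       # the single 512 Mbp bin

print(reg2bin(0, 1 << 14))        # fits in the first 16 kbp bin
print(reg2bin(0, (1 << 14) + 1))  # spills into the enclosing 128 kbp bin
```

A region that crosses a boundary at one level is promoted to the enclosing bin at the next level up, which is why reads near bin edges can end up in higher-level bins.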
hanna cc09f48cd8 Correctness fix: index can concat chunks around shard edges, and my code didn't account for that.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2861 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-19 21:44:33 +00:00
hanna 71f18e941f Significant performance improvements made by subtracting out the contents of the prior highest-level bin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2859 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-19 16:46:16 +00:00
hanna 232d884578 Got back most of the performance lost when I fixed the dropped reads problem.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2835 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-12 19:59:56 +00:00