11592 Commits (c8de9b92d0c26d75b96abe8edfa3a5ca2fd54bdc)

Author SHA1 Message Date
Mark DePristo c8de9b92d0 Updating CountReadsInActiveRegionsIntegrationTest integration tests due to new ART 2013-01-15 15:41:33 -05:00
Mark DePristo 3c37ea014b Retire original TraverseActiveRegion, leaving only the new optimized version
-- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests
2013-01-15 10:24:45 -05:00
Mark DePristo 39bc9e999d Add a test to LocusIteratorByState to ensure that we aren't holding reads anywhere
-- Run an iterator with 100Ks of reads, each carrying MBs of byte[] data, through LIBS, all starting at the same position.  Will crash with an out-of-memory error if we're holding reads anywhere in the system.
-- Is there a better way to test this behavior?
2013-01-14 16:30:16 -05:00
Mark DePristo b8b2b9b2de ManagingReferenceOrderedView optimization: don't allocate a fresh RefMetaDataTracker in the frequent case where there's no reference meta data 2013-01-14 16:30:16 -05:00
Mark DePristo 7eea6b8f92 ReservoirDownsampler optimizations
-- Add an option to not always allocate ArrayLists of targetSampleSize, but rather of the previous size + MARGIN.  This helps for LIBS, as most of the time we don't need nearly as much space as we allow
-- consumeFinalizedItems returns an empty list if the reservoir is empty, which is often true for our BAM files with low coverage
-- Allow empty sample lists for SamplePartitioner as these are used by the RefTraversals and other non-read based traversals

Make the reservoir downsampler use a linked list, rather than a fixed-size array list, in the expectFewOverflows case
2013-01-14 16:30:16 -05:00
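
The reservoir idea behind the commit above can be sketched with textbook reservoir sampling (Algorithm R). This is only an illustration, not the GATK ReservoirDownsampler: the function name `reservoir_sample` and its parameters are hypothetical. It grows its list on demand instead of preallocating `target_size` slots, and returns an empty list for an empty stream, which are the two behaviors the commit message describes.

```python
import random


def reservoir_sample(items, target_size, rng=None):
    """Keep a uniform random sample of at most target_size items from a stream.

    Hypothetical sketch of Algorithm R; not the GATK implementation.
    """
    rng = rng or random.Random()
    reservoir = []  # grows on demand; nothing is preallocated
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < target_size:
            # Reservoir not yet full: always keep the item.
            reservoir.append(item)
        else:
            # Reservoir full: item number `seen` replaces a random slot
            # with probability target_size / seen.
            slot = rng.randrange(seen)
            if slot < target_size:
                reservoir[slot] = item
    return reservoir
```

For a low-coverage stream (fewer items than `target_size`) every item is kept and the list never grows past the number of items seen.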
Mark DePristo c7f0ca8ac5 Optimization for LIBS: PerSampleReadStateManager now uses a simple LinkedList of AlignmentStateMachine
-- Instead of storing a list of lists of alignment starts, which is expensive to manipulate, we instead store a linear list of alignment starts, not grouped as previously.  This enables us to simplify iteration and update operations, making them much faster
-- Critically, the downsampler still requires this list of lists.  We convert back and forth between these two representations as required, which is very rare for normal data sets (WGS NA12878 on chr20 is 0.2%, 4x WGS is even less).
2013-01-14 16:30:16 -05:00
Mark DePristo 5a5422e4f8 Refactor PerSampleReadStates into a separate class
-- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager
-- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager
-- Make LocusIteratorByState final
2013-01-14 16:30:16 -05:00
Mark DePristo 5c2799554a Refactor updateReadStates into PerSampleReadStateManager, add tracking of downsampling rate 2013-01-14 16:30:16 -05:00
Mark DePristo a4334a67e0 SamplePartitioner optimizations and bugfixes
-- Use a linked hash map instead of a hash map since we want to iterate through the map fairly often
-- Ensure that we call doneSubmittingReads before getting reads for samples.  This function call fell out before and since it wasn't enforced I only noticed the problem while writing comments
-- Don't make unnecessary calls to contains for map.  Just use get() and check whether the result is null
-- Use a LinkedList in PassThroughDownsampler, since this is faster for add() than the existing ArrayList, and we weren't using random access to any resulting list
2013-01-14 16:30:16 -05:00
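
Two points from the commit above can be sketched together: enforcing that submission is finished before reads are fetched, and doing a single map lookup instead of a `contains` check followed by a `get`. This is a hypothetical illustration, not the GATK SamplePartitioner. The class `PartitionerSketch`, its method names, and the choice to raise exceptions are all assumptions.

```python
class PartitionerSketch:
    """Hypothetical per-sample read partitioner; not the GATK class."""

    def __init__(self, samples):
        # A dict preserves insertion order, so iteration follows sample order.
        self._reads_by_sample = {sample: [] for sample in samples}
        self._done = False

    def submit_read(self, sample, read):
        if self._done:
            raise RuntimeError("cannot submit reads after done_submitting_reads()")
        # Single lookup: get() and a None check, rather than `in` followed by [].
        reads = self._reads_by_sample.get(sample)
        if reads is None:
            raise KeyError(sample)
        reads.append(read)

    def done_submitting_reads(self):
        self._done = True

    def reads_for_sample(self, sample):
        # Enforce the call order instead of relying on callers to remember it.
        if not self._done:
            raise RuntimeError("call done_submitting_reads() first")
        reads = self._reads_by_sample.get(sample)
        if reads is None:
            raise KeyError(sample)
        return reads
```

Failing loudly when the call order is wrong is one way to catch the kind of missed call the commit message describes.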
Mark DePristo 19288b007d LIBS bugfix: kept reads now only (correctly) includes reads that at least passed the reservoir
-- Added unit tests to ensure this behavior is correct
2013-01-14 16:30:16 -05:00
Mark DePristo 83fcc06e28 LIBS optimizations and performance tools
-- Made LIBSPerformance a full-featured CommandLineProgram that can be used to assess LIBS performance by reading a provided BAM
-- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls
-- Moved updateReadStates to ReadStateManager
-- Removed the unnecessary wrapping of an iterator in ReadStateManager
-- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get when iterating over samples.  Now those are just native map iterations
-- Restructured collectPendingReads for simplicity, removing redundant range checks and consolidating common ones.  The new piece of code is much clearer and avoids several unnecessary function calls
2013-01-14 16:30:15 -05:00
Mark DePristo ec05ecef60 getAdaptorBoundary returns an int, not an Integer, as this was taking 30% of the allocation effort for LIBS 2013-01-14 16:30:15 -05:00
Chris Hartl 682c59ff04 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-14 13:27:34 -05:00
Chris Hartl 61bc334df1 Ensure output table formatting does not contain NaNs. For (0 eval ref calls)/(0 comp ref calls), set the proportion to 0.00.
Added integration tests (checked against manual tabulation)
2013-01-14 09:21:30 -05:00
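
The 0/0 rule from the commit above can be sketched as a small formatting helper. The function names are hypothetical and this is not the GATK code. It only special-cases 0/0, as the commit message states. A nonzero numerator over a zero denominator is left to raise `ZeroDivisionError`, which is an assumption of this sketch.

```python
def proportion(numerator, denominator):
    """Return numerator / denominator, defining 0/0 as 0.0 (hypothetical helper)."""
    if numerator == 0 and denominator == 0:
        # 0 eval ref calls over 0 comp ref calls: report 0.0 rather than NaN.
        return 0.0
    # A nonzero numerator over zero raises ZeroDivisionError (sketch assumption).
    return numerator / denominator


def format_proportion(numerator, denominator):
    """Format the proportion with two decimals for a table cell."""
    return f"{proportion(numerator, denominator):.2f}"
```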
Mark DePristo 3a6b4b43b7 Backporting LIBSPerformance improvements to original commit 2013-01-13 09:53:10 -05:00
Ryan Poplin a7fe334a3f calculating the md5s for the new tests. 2013-01-11 15:43:52 -05:00
Ryan Poplin 65afec2a53 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-11 15:22:52 -05:00
Mark DePristo 85b529cced Updating MD5s in HC and UG that changed due to new LIBS
-- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in the input list.  Not ideal)
-- Trivial read count differences in HC
2013-01-11 15:17:19 -05:00
Mark DePristo f204908a94 Add some todos for future optimization to LIBS 2013-01-11 15:17:18 -05:00
Mark DePristo e88dae2758 LocusIteratorByState operates natively on GATKSAMRecords now
-- Updated code to reflect this new typing
2013-01-11 15:17:18 -05:00
Mark DePristo 94cb50d3d6 Retire LegacyLocusIteratorByState
-- Left in the remaining infrastructure for David to remove, but the legacy downsampler is no longer a functional option in the GATK
2013-01-11 15:17:18 -05:00
Mark DePristo cc0c1b752a Delete old LocusIteratorByState, leaving only new LIBS and legacy 2013-01-11 15:17:18 -05:00
Mark DePristo bd03511e35 Updating AlignmentStateMachinePerformance to include some more useful performance assessments 2013-01-11 15:17:18 -05:00
Mark DePristo 9e23c592e6 ReadBackedPileup cleanup
-- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl
-- ReadBackedPileupImpl was literally a shell class after we removed extended events.  A few bits of code cleanup and we reduced a bunch of class complexity in the gatk
-- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed
-- Created optimized calculation routines to iterate over all of the reads in the pileup in whatever order is most efficient.
-- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups.
-- Added commons-collections for IteratorChain
2013-01-11 15:17:18 -05:00
Mark DePristo e3e3ae29b2 Final documentation for LocusIteratorByState 2013-01-11 15:17:18 -05:00
Mark DePristo 6a91902aa2 Fix final merge conflicts 2013-01-11 15:17:18 -05:00
Mark DePristo b9a33d3c66 Split original and optimized ART into largely independent pieces
-- Allows us to cleanly run old and new art, which now have different traversal behavior (on purpose).  Split unit tests as well.
2013-01-11 15:17:18 -05:00
Mark DePristo 02130dfde7 Cleanup ART
-- Initialize routine captures essential information for running the traversal
2013-01-11 15:17:17 -05:00
Mark DePristo 9b2be795a7 Initial working version of new ActiveRegionTraversal based on the LocusIteratorByState read stream
-- Implemented as a subclass of TraverseActiveRegions
-- Passes all unit tests
-- Will be very slow -- needs logical fixes
2013-01-11 15:17:17 -05:00
Mark DePristo 8b83f4d6c7 Near final cleanup of PileupElement
-- All functions documented and unit tested
-- New constructor interface
-- Cleanup some uses of old / removed functionality
2013-01-11 15:17:17 -05:00
Mark DePristo fb9eb3d4ee PileupElement and LIBS cleanup
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
2013-01-11 15:17:17 -05:00
Mark DePristo 2f2a592c8e Contracts and documentation for AlignmentStateMachine and LocusIteratorByState
-- Add more unit tests for both as well
2013-01-11 15:17:17 -05:00
Mark DePristo cc1d259cac Implement getLengthOfImmediatelyFollowingIndel and getBasesOfImmediatelyFollowingIndel in PileupElement
-- Added unit tests for this behavior.  Updated users of this code
2013-01-11 15:17:17 -05:00
Mark DePristo 2c38310868 Create LIBS using new AlignmentStateMachine infrastructure
-- Optimizations to AlignmentStateMachine
-- Properly count deletions.  Added unit test for counting routines
-- AlignmentStateMachine.java is no longer recursive
-- Traversals now use new LIBS, not the old one
2013-01-11 15:17:17 -05:00
Mark DePristo 80d9b7011c Complete rewrite of low-level machinery of LIBS, not hooked up
-- AlignmentStateMachine does what SAMRecordAlignmentState should really do.  It's correct in that it's more accurate than the LIB_position tests themselves.  This is a non-broken, correct implementation.  Needs cleanup, contracts, etc.
-- This version is like 6x slower than the original implementation (according to the google caliper benchmark here).  Obvious optimizations for future commit
2013-01-11 15:17:16 -05:00
Mark DePristo b53286cc3c HaplotypeCaller mode to skip assembly and genotyping for performance testing
-- Added HCPerformance evaluation Qscript
-- Added some docs about one of the HC integration tests
-- HaplotypeCaller / ART performance evaluation script
2013-01-11 15:17:16 -05:00
Mark DePristo 0ac4352614 LIBS can now (optionally) track the unique reads it uses from the underlying read iterator
-- This capability is essential to provide an ordered set of used reads to downstream users of LIBS, such as ART, who want an efficient way to get the reads used in LIBS
-- Vastly expanded the multi-read, multi-sample LIBS unit tests to make sure this capability is working
-- Added createReadStream to ArtificialSAMUtils that makes it relatively easy to create multi-read, multi-sample read streams for testing
2013-01-11 15:17:16 -05:00
Mark DePristo b3ecfbfce8 Refactor LIBS into component parts, expand unit tests, some code cleanup
-- Split out all of the inner classes of LIBS into separate independent classes
-- Split / add unit tests for many of these components.
-- Radically expand unit tests for SAMRecordAlignmentState (the lowest level piece of code) making sure at least some of it works
-- No need to change unit tests or integration tests.  No change in functionality.
-- Added (currently disabled) code to track all submitted reads to LIBS, but this isn't accessible or tested
2013-01-11 15:17:16 -05:00
Mark DePristo 2e5d38fd0e Updating to latest google caliper code 2013-01-11 15:17:16 -05:00
Mark DePristo b2990497e2 Refactor LIBS into utils.locusiterator before refactoring 2013-01-11 15:17:16 -05:00
Ryan Poplin e952296c10 Adding HC GGA integration test to cover duplicated input alleles. 2013-01-11 15:01:27 -05:00
Ryan Poplin 7f7f40f851 Adding additional HC GGA integration tests to cover more complicated input alleles. 2013-01-11 14:36:21 -05:00
Mauricio Carneiro 9ed922d562 Updating licenses to Eric's last commit
- for now we're still running the script by hand; an automated solution will be in place soon.

GSATDG-5
2013-01-11 14:33:00 -05:00
Mauricio Carneiro 009d2f5705 Removed CMI specific script from GATK repo 2013-01-11 14:33:00 -05:00
Ami Levy-Moonshine e9a8b1a403 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-11 14:18:26 -05:00
Ami Levy-Moonshine 9519c3fd6f (1) add scala script to merge bam files; (2) a few changes in the Queue script that runs the new CoveredByNSamplesSites walker 2013-01-11 13:47:05 -05:00
Mauricio Carneiro bc64d4240f Licensing update -- batch #2
- caught all scala files that didn't have proper package information / class names
   - included all source files in archive as well

GSATDG-5
2013-01-11 13:38:11 -05:00
Mauricio Carneiro 4ea2c5df43 Updating updateAllLicenses scripts to include archived files
GSATDG-5
2013-01-11 13:38:05 -05:00
Mauricio Carneiro 28235f57f2 Adding package information to scala scripts that were missing it. Including archived ones.
GSATDG-5
2013-01-11 13:38:05 -05:00
Mauricio Carneiro cc9a2aaee7 Script to identify code without package info
- package information is critical for the licensing scripts. All java and scala files MUST contain package information.

GSATDG-5
2013-01-11 13:38:05 -05:00