Commit Graph

3297 Commits (a3b98daf1a4cff4d57f93ce1002e2bf6f99ed700)

Author SHA1 Message Date
Mark DePristo ee8039bf25 Fix trivial call in unit test 2013-01-23 13:51:58 -05:00
Mark DePristo 09edc6baeb TraverseActiveRegions now writes out very nice active region and activity profile IGV formatted files 2013-01-23 13:46:01 -05:00
Mark DePristo 8e8126506b Renaming IncrementalActivityProfile to ActivityProfile
-- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV
2013-01-23 13:46:01 -05:00
Mark DePristo e917f56df8 Remove old ActivityProfile and old BandPassActivityProfile 2013-01-23 13:46:01 -05:00
Mark DePristo 7fd27a5167 Add band pass filtering activity profile
-- Based on the new incremental activity profile
-- Unit Tested!  Fixed a few bugs with the old band pass filter
-- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties
-- Add new UnitTest for BandPassIncrementalActivityProfile
-- Added normalizeFromRealSpace to MathUtils
-- Cleanup unused code in new activity profiles
2013-01-23 13:46:01 -05:00
Mark DePristo eb60235dcd Working version of incremental active region traversals
-- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version.  This means that ART walkers will now take much less memory than previously.  On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory.  Over the whole chr20 only 5K reads were ever held in ART at one time.
-- Fixed bug in the way active regions worked with shard boundaries.  The new implementation no longer see shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries.  This behavior was actually encoded in the unit tests, so those needed to be updated as well.
-- Changed the way that preset regions work in ART.  The new contract ensures that you get exactly the regions you requested.  the isActive function is still called, but its result has no impact on the regions.  With this functionality is should be possible to use the HC as a generic assembly by forcing it to operate over very large regions
-- Added a few misc. useful functions to IncrementalActivityProfile
2013-01-23 13:46:00 -05:00
Mark DePristo ce160931d5 Optimize creation of reads in ArtificialBAMBuilder
-- Now caches the reads so subsequent calls to makeReads() don't reallocate the reads from scratch each time
2013-01-23 13:46:00 -05:00
Mark DePristo e050f649fd IncrementalActivityProfile, complete with extensive unit tests
-- This is an activity profile compatible with fetching its implied active regions incrementally, as activity profile states are added
2013-01-23 13:45:21 -05:00
Mark DePristo 8d9b0f1bd5 Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile
-- Required before I jump in an redo the entire activity profile so it's can be run imcrementally
-- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class.
-- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality.  Almost all of the misc. walker changes are due to this name update
-- Code cleanup and docs for TraverseActiveRegions
-- Expanded unit tests for ActivityProfile and ActivityProfileState
2013-01-23 13:45:21 -05:00
Mark DePristo 42b807a5fe Unit tests for ActivityProfileResult 2013-01-23 13:45:20 -05:00
Mauricio Carneiro 7b8b064165 Last manual license update (hopefully)
if everyone updates their git hook accordingly, this will be the last time I have to manually run the script.

GSATDG-5
2013-01-18 16:13:07 -05:00
Ami Levy-Moonshine 0fb7b73107 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-18 15:03:42 -05:00
Ami Levy-Moonshine 826c29827b change the default VCFs gatherer of the GATK (not just the UG) 2013-01-18 15:03:12 -05:00
Eric Banks 6a903f2c23 I finally gave up on trying to get the Haplotype/Allele merging to work in the HaplotypeCaller.
I've resigned myself instead to create a mapping from Allele to Haplotype.  It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
2013-01-18 01:21:08 -05:00
Eric Banks ded659232b Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 22:49:56 -05:00
Eric Banks a623cca89a Bug fix for HaplotypeCaller, as reported on the forum: when reduced reads didn't completely overlap a deletion call,
we were incorrectly trying to find the reference position of a base on the read that didn't exist.
Added integration test to cover this case.
2013-01-16 22:47:58 -05:00
Mark DePristo 738c24a3b1 Add tests to ensure that all insertion reads appear in the active region traversal 2013-01-16 16:25:36 -05:00
Eric Banks 79bc818022 Bug fix for VariantsToVCF: old dbSNP files can have '-' as reference base and those records always need to be padded. 2013-01-16 16:15:58 -05:00
Mark DePristo 2a42b47e4a Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
2013-01-16 15:30:00 -05:00
Mark DePristo ddcb33fcf8 Cache result of getLocation() in Shard so we don't performance expensive calculation over and over 2013-01-16 15:30:00 -05:00
Mark DePristo 4d0e7b50ec ArtificialBAMBuilder utility class for creating streams of GATKSAMRecords with a variety of properties
--  Allows us to make a stream of reads or an index BAM file with read having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start)
-- This stream can be handed back to the caller immediately, or written to an indexed BAM file
-- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)
2013-01-16 15:29:59 -05:00
Eric Banks ec1cfe6732 Oops, forgot to add 1 of my files 2013-01-16 15:05:49 -05:00
Eric Banks e47a389b26 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 14:59:11 -05:00
Eric Banks d18dbcbac1 Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base.
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0?  When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
2013-01-16 14:55:33 -05:00
Khalid Shakir 4ffb43079f Re-committing the following changes from Dec 18:
Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms.
Updated SelectHeaders to print out full interval arguments.
Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.
2013-01-16 12:43:15 -05:00
Eric Banks 445735a4a5 There was no reason to be sharing the Haplotype infrastructure between the HaplotypeCaller and the HaplotypeScore annotation since they were really looking for different things.
Separated them out, adding efficiencies for the HaplotypeScore version.
2013-01-16 11:10:13 -05:00
Eric Banks 392b5cbcdf The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).

As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases.  This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests.  Fortunately that walker is being deprecated soon...
2013-01-16 10:22:43 -05:00
Eric Banks 4fb3e48099 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 00:13:38 -05:00
Eric Banks 0d282a7750 Bam writing from HaplotypeCaller seems to be working on all my test cases. Note that it's a hidden debugging option for now.
Please let me know if you notice any bad behavior with it.
2013-01-16 00:12:02 -05:00
Eric Banks d3baa4b8ca Have Haplotype extend the Allele class.
This way, we don't need to create a new Allele for every read/Haplotype pair to be placed in the PerReadAlleleLikelihoodMap (very inefficient).  Also, now we can easily get the Haplotype associated with the best allele for a given read.
2013-01-15 11:36:20 -05:00
Mark DePristo 3c37ea014b Retire original TraverseActiveRegion, leaving only the new optimized version
-- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests
2013-01-15 10:24:45 -05:00
Eric Banks 94800771e3 1. Initial implementation of bam writing for the HaplotypeCaller with -bam argument; currently only assembled haplotypes are emitted.
2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs.  Until the HC uses meta data though, this won't work.
2013-01-15 10:19:18 -05:00
Mark DePristo 39bc9e999d Add a test to LocusIteratorByState to ensure that we aren't holding reads anywhere
-- Run an iterator with 100Ks of reads, each carrying MBs of byte[] data, through LIBS, all starting at the same position.  Will crash with an out-of-memory error if we're holding reads anywhere in the system.
-- Is there a better way to test this behavior?
2013-01-14 16:30:16 -05:00
Mark DePristo b8b2b9b2de ManagingReferenceOrderedView optimization: don't allow a fresh RefMetaDataTracker in the frequent case where there's no reference meta data 2013-01-14 16:30:16 -05:00
Mark DePristo 7eea6b8f92 ReservoirDownsampler optimizations
-- Add an option to not allocate always ArrayLists of targetSampleSize, but rather the previous size + MARGIN.  This helps for LIBS as most of the time we don't need nearly so much space as we allow
-- consumeFinalizedItems returns an empty list if the reservior is empty, which it often true for our BAM files with low coverage
-- Allow empty sample lists for SamplePartitioner as these are used by the RefTraversals and other non-read based traversals

Make the reservoir downsampler use a linked list, rather than a fixed sized array list, in the expectFewOverflows case
2013-01-14 16:30:16 -05:00
Mark DePristo c7f0ca8ac5 Optimization for LIBS: PerSampleReadStateManager now uses a simple LinkedList of AlignmentStateMachine
-- Instead of storing a list of list of alignment starts, which is expensive to manipulate, we instead store a linear list of alignment starts.  Not grouped as previously.  This enables us to simplify iteration and update operations, making them much faster
-- Critically, the downsampler still requires this list of list.  We convert back and forth between these two representations as required, which is very rarely for normal data sets (WGS NA12878 on chr20 is 0.2%, 4x WGS is even less).
2013-01-14 16:30:16 -05:00
Mark DePristo 5a5422e4f8 Refactor PerSampleReadStates into a separate class
-- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager
-- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager
-- Make LocusIteratorByState final
2013-01-14 16:30:16 -05:00
Mark DePristo 5c2799554a Refactor updateReadStates into PerSampleReadStateManager, add tracking of downsampling rate 2013-01-14 16:30:16 -05:00
Mark DePristo a4334a67e0 SamplePartitioner optimizations and bugfixes
-- Use a linked hash map instead of a hash map since we want to iterate through the map fairly often
-- Ensure that we call doneSubmittingReads before getting reads for samples.  This function call fell out before and since it wasn't enforced I only noticed the problem while writing comments
-- Don't make unnecessary calls to contains for map.  Just use get() and check that the result is null
-- Use a LinkedList in PassThroughDownsampler, since this is faster for add() than the existing ArrayList, and we were's using random access to any resulting
2013-01-14 16:30:16 -05:00
Mark DePristo 19288b007d LIBS bugfix: kept reads now only (correctly) includes reads that at least passed the reservoir
-- Added unit tests to ensure this behavior is correct
2013-01-14 16:30:16 -05:00
Mark DePristo 83fcc06e28 LIBS optimizations and performance tools
-- Made LIBSPerformance a full featured CommandLineProgram, and it can be used to assess the LIBS performance by reading a provided BAM
-- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls
-- Moved updateReadStates to ReadStateManager
-- Removed the unnecessary wrapping of an iterator in ReadStateManager
-- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get iterating over samples.  Now those are just map native iterations
-- Restructured collectPendingReads for simplicity, removing redundant and consolidating common range checks.  The new piece is code is much clearer and avoids several unnecessary function calls
2013-01-14 16:30:15 -05:00
Mark DePristo ec05ecef60 getAdaptorBoundary returns an int, not an Integer, as this was taking 30% of the allocation effort for LIBS 2013-01-14 16:30:15 -05:00
Mark DePristo 3a6b4b43b7 Backporting LIBSPerformance improvements to original commit 2013-01-13 09:53:10 -05:00
Mark DePristo f204908a94 Add some todos for future optimization to LIBS 2013-01-11 15:17:18 -05:00
Mark DePristo e88dae2758 LocusIteratorByState operates natively on GATKSAMRecords now
-- Updated code to reflect this new typing
2013-01-11 15:17:18 -05:00
Mark DePristo 94cb50d3d6 Retire LegacyLocusIteratorByState
-- Left in the remaining infrastructure for David to remove, but the legacy downsampler is no longer a functional option in the GATK
2013-01-11 15:17:18 -05:00
Mark DePristo cc0c1b752a Delete old LocusIteratorByState, leaving only new LIBS and legacy 2013-01-11 15:17:18 -05:00
Mark DePristo bd03511e35 Updating AlignmentStateMachinePerformance to include some more useful performance assessments 2013-01-11 15:17:18 -05:00
Mark DePristo 9e23c592e6 ReadBackedPileup cleanup
-- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl
-- ReadBackedPileupImpl was literally a shell class after we removed extended events.  A few bits of code cleanup and we reduced a bunch of class complexity in the gatk
-- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed
-- Created optimized calculation routines to iterator over all of the reads in the pileup in whatever order is most efficient as well.
-- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups.
-- Added commons-collections for IteratorChain
2013-01-11 15:17:18 -05:00
Mark DePristo e3e3ae29b2 Final documentation for LocusIteratorByState 2013-01-11 15:17:18 -05:00