Commit Graph

1358 Commits (0f5bb706ffd3e02a2fec45f32317f29e4e45cb6d)

Author SHA1 Message Date
David Roazen f63f27aa13 org.broadinstitute.variant refactor, part 2
-removed sting dependencies from test classes
-removed org.apache.log4j dependency
-misc cleanup
2013-01-28 09:03:46 -05:00
David Roazen 3744d1a596 Collapse the downsampling fork in the GATK engine
With LegacyLocusIteratorByState deleted, the legacy downsampling implementation
was already non-functional. This commit removes all remaining code in the
engine belonging to the legacy implementation.
2013-01-28 01:50:30 -05:00
Mark DePristo 63913d516f Add join call to Progress meter unit test so we actually know the daemon thread has finished 2013-01-27 16:52:45 -05:00
Mark DePristo c97a361b5d Added realistic BandPassFilterUnitTest that ensures quality results for 1000G phase I VCF and NA12878 VCF
-- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions.  This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined.  This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile
-- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with
2013-01-27 14:10:08 -05:00
Mauricio Carneiro 705cccaf63 Making SplitReads output FastQ's instead of BAM
- eliminates one step in my pipeline
   - BAM is too finicky and maintaining parameters that wouldn't be useful was becoming a headache, better avoided.
2013-01-27 02:36:31 -05:00
Mauricio Carneiro 6ea7133d95 Updating licenses of latest moved files 2013-01-26 13:46:52 -05:00
Ami Levy-Moonshine 99cb8d68e9 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-25 16:07:38 -05:00
Mark DePristo b8c0b05785 Add contract to ensure that getAdapterBoundary returns the right result
-- Also renamed the function to getAdaptorBoundary for consistency across the codebase
2013-01-25 16:05:17 -05:00
Mark DePristo e445c71161 LIBS optimization for adapter clipping
-- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS
-- Added unittests to cover adapter clipping
2013-01-25 16:05:17 -05:00
Ami Levy-Moonshine b4447cdca2 In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the samples names are unique in VariantContextUtils.simpleMerge for each VCs. It couse to a bug that was reported on the forum (when a VCs had 2 VC from the same sample).
Now we will check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unitTests in CVunitTest.java.
2013-01-25 15:49:51 -05:00
Mark DePristo 592f90aaef ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events.  The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
2013-01-24 13:48:00 -05:00
Mark DePristo 0c94e3d96e Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp
-- Previously we allowed band pass filter size to be specified along with the sigma.  But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel.  This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
2013-01-24 13:47:59 -05:00
Mark DePristo 9e43a2028d Making band pass filter size, sigma, active region max size and extension all accessible from the command line 2013-01-24 13:47:59 -05:00
Mark DePristo ee8039bf25 Fix trivial call in unit test 2013-01-23 13:51:58 -05:00
Mark DePristo 8e8126506b Renaming IncrementalActivityProfile to ActivityProfile
-- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV
2013-01-23 13:46:01 -05:00
Mark DePristo e917f56df8 Remove old ActivityProfile and old BandPassActivityProfile 2013-01-23 13:46:01 -05:00
Mark DePristo 7fd27a5167 Add band pass filtering activity profile
-- Based on the new incremental activity profile
-- Unit Tested!  Fixed a few bugs with the old band pass filter
-- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties
-- Add new UnitTest for BandPassIncrementalActivityProfile
-- Added normalizeFromRealSpace to MathUtils
-- Cleanup unused code in new activity profiles
2013-01-23 13:46:01 -05:00
Mark DePristo eb60235dcd Working version of incremental active region traversals
-- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version.  This means that ART walkers will now take much less memory than previously.  On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory.  Over the whole chr20 only 5K reads were ever held in ART at one time.
-- Fixed bug in the way active regions worked with shard boundaries.  The new implementation no longer see shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries.  This behavior was actually encoded in the unit tests, so those needed to be updated as well.
-- Changed the way that preset regions work in ART.  The new contract ensures that you get exactly the regions you requested.  the isActive function is still called, but its result has no impact on the regions.  With this functionality is should be possible to use the HC as a generic assembly by forcing it to operate over very large regions
-- Added a few misc. useful functions to IncrementalActivityProfile
2013-01-23 13:46:00 -05:00
Mark DePristo e050f649fd IncrementalActivityProfile, complete with extensive unit tests
-- This is an activity profile compatible with fetching its implied active regions incrementally, as activity profile states are added
2013-01-23 13:45:21 -05:00
Mark DePristo 8d9b0f1bd5 Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile
-- Required before I jump in an redo the entire activity profile so it's can be run imcrementally
-- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class.
-- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality.  Almost all of the misc. walker changes are due to this name update
-- Code cleanup and docs for TraverseActiveRegions
-- Expanded unit tests for ActivityProfile and ActivityProfileState
2013-01-23 13:45:21 -05:00
Mark DePristo 42b807a5fe Unit tests for ActivityProfileResult 2013-01-23 13:45:20 -05:00
Mauricio Carneiro 7b8b064165 Last manual license update (hopefully)
if everyone updates their git hook accordingly, this will be the last time I have to manually run the script.

GSATDG-5
2013-01-18 16:13:07 -05:00
Mark DePristo 738c24a3b1 Add tests to ensure that all insertion reads appear in the active region traversal 2013-01-16 16:25:36 -05:00
Mark DePristo 2a42b47e4a Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
2013-01-16 15:30:00 -05:00
Mark DePristo 4d0e7b50ec ArtificialBAMBuilder utility class for creating streams of GATKSAMRecords with a variety of properties
--  Allows us to make a stream of reads or an index BAM file with read having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start)
-- This stream can be handed back to the caller immediately, or written to an indexed BAM file
-- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)
2013-01-16 15:29:59 -05:00
Eric Banks ec1cfe6732 Oops, forgot to add 1 of my files 2013-01-16 15:05:49 -05:00
Eric Banks e47a389b26 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 14:59:11 -05:00
Eric Banks d18dbcbac1 Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base.
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0?  When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
2013-01-16 14:55:33 -05:00
Khalid Shakir 4ffb43079f Re-committing the following changes from Dec 18:
Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms.
Updated SelectHeaders to print out full interval arguments.
Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.
2013-01-16 12:43:15 -05:00
Eric Banks 392b5cbcdf The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).

As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases.  This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests.  Fortunately that walker is being deprecated soon...
2013-01-16 10:22:43 -05:00
Mark DePristo 3c37ea014b Retire original TraverseActiveRegion, leaving only the new optimized version
-- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests
2013-01-15 10:24:45 -05:00
Mark DePristo 39bc9e999d Add a test to LocusIteratorByState to ensure that we aren't holding reads anywhere
-- Run an iterator with 100Ks of reads, each carrying MBs of byte[] data, through LIBS, all starting at the same position.  Will crash with an out-of-memory error if we're holding reads anywhere in the system.
-- Is there a better way to test this behavior?
2013-01-14 16:30:16 -05:00
Mark DePristo 5a5422e4f8 Refactor PerSampleReadStates into a separate class
-- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager
-- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager
-- Make LocusIteratorByState final
2013-01-14 16:30:16 -05:00
Mark DePristo 19288b007d LIBS bugfix: kept reads now only (correctly) includes reads that at least passed the reservoir
-- Added unit tests to ensure this behavior is correct
2013-01-14 16:30:16 -05:00
Mark DePristo 83fcc06e28 LIBS optimizations and performance tools
-- Made LIBSPerformance a full featured CommandLineProgram, and it can be used to assess the LIBS performance by reading a provided BAM
-- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls
-- Moved updateReadStates to ReadStateManager
-- Removed the unnecessary wrapping of an iterator in ReadStateManager
-- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get iterating over samples.  Now those are just map native iterations
-- Restructured collectPendingReads for simplicity, removing redundant and consolidating common range checks.  The new piece is code is much clearer and avoids several unnecessary function calls
2013-01-14 16:30:15 -05:00
Mark DePristo ec05ecef60 getAdaptorBoundary returns an int, not an Integer, as this was taking 30% of the allocation effort for LIBS 2013-01-14 16:30:15 -05:00
Mark DePristo e88dae2758 LocusIteratorByState operates natively on GATKSAMRecords now
-- Updated code to reflect this new typing
2013-01-11 15:17:18 -05:00
Mark DePristo 94cb50d3d6 Retire LegacyLocusIteratorByState
-- Left in the remaining infrastructure for David to remove, but the legacy downsampler is no longer a functional option in the GATK
2013-01-11 15:17:18 -05:00
Mark DePristo cc0c1b752a Delete old LocusIteratorByState, leaving only new LIBS and legacy 2013-01-11 15:17:18 -05:00
Mark DePristo bd03511e35 Updating AlignmentStateMachinePerformance to include some more useful performance assessments 2013-01-11 15:17:18 -05:00
Mark DePristo 9e23c592e6 ReadBackedPileup cleanup
-- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl
-- ReadBackedPileupImpl was literally a shell class after we removed extended events.  A few bits of code cleanup and we reduced a bunch of class complexity in the gatk
-- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed
-- Created optimized calculation routines to iterator over all of the reads in the pileup in whatever order is most efficient as well.
-- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups.
-- Added commons-collections for IteratorChain
2013-01-11 15:17:18 -05:00
Mark DePristo b9a33d3c66 Split original and optimized ART into largely independent pieces
-- Allows us to cleanly run old and new art, which now have different traversal behavior (on purpose).  Split unit tests as well.
2013-01-11 15:17:18 -05:00
Mark DePristo 02130dfde7 Cleanup ART
-- Initialize routine captures essential information for running the traversal
2013-01-11 15:17:17 -05:00
Mark DePristo 9b2be795a7 Initial working version of new ActiveRegionTraversal based on the LocusIteratorByState read stream
-- Implemented as a subclass of TraverseActiveRegions
-- Passes all unit tests
-- Will be very slow -- needs logical fixes
2013-01-11 15:17:17 -05:00
Mark DePristo 8b83f4d6c7 Near final cleanup of PileupElement
-- All functions documented and unit tested
-- New constructor interface
-- Cleanup some uses of old / removed functionality
2013-01-11 15:17:17 -05:00
Mark DePristo fb9eb3d4ee PileupElement and LIBS cleanup
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
2013-01-11 15:17:17 -05:00
Mark DePristo 2f2a592c8e Contracts and documentation for AlignmentStateMachine and LocusIteratorByState
-- Add more unit tests for both as well
2013-01-11 15:17:17 -05:00
Mark DePristo cc1d259cac Implement get Length and Bases of OfImmediatelyFollowingIndel in PileupElement
-- Added unit tests for this behavior.  Updated users of this code
2013-01-11 15:17:17 -05:00
Mark DePristo 2c38310868 Create LIBS using new AlignmentStateMachine infrastructure
-- Optimizations to AlignmentStateMachine
-- Properly count deletions.  Added unit test for counting routines
-- AlignmentStateMachine.java is no longer recursive
-- Traversals now use new LIBS, not the old one
2013-01-11 15:17:17 -05:00
Mark DePristo 80d9b7011c Complete rewrite of low-level machinery of LIBS, not hooked up
-- AlignmentStateMachine does what SAMRecordAlignmentState should really do.  It's correct in that it's more accurate than the LIB_position tests themselves.  This is a non-broken, correct implementation.  Needs cleanup, contracts, etc.
-- This version is like 6x slower than the original implementation (according to the google caliper benchmark here).  Obvious optimizations for future commit
2013-01-11 15:17:16 -05:00
Mark DePristo 0ac4352614 LIBS can now (optionally) track the unique reads it uses from the underlying read iterator
-- This capability is essential to provide an ordered set of used reads to downstream users of LIBS, such as ART, who want an efficient way to get the reads used in LIBS
-- Vastly expanded the multi-read, multi-sample LIBS unit tests to make sure this capability is working
-- Added createReadStream to ArtificialSAMUtils that makes it relatively easy to create multi-read, multi-sample read streams for testing
2013-01-11 15:17:16 -05:00
Mark DePristo b3ecfbfce8 Refactor LIBS into component parts, expand unit tests, some code cleanup
-- Split out all of the inner classes of LIBS into separate independent classes
-- Split / add unit tests for many of these components.
-- Radically expand unit tests for SAMRecordAlignmentState (the lowest level piece of code) making sure at least some of it works
-- No need to change unit tests or integration tests.  No change in functionality.
-- Added (currently disabled) code to track all submitted reads to LIBS, but this isn't accessible or tested
2013-01-11 15:17:16 -05:00
Mark DePristo 2e5d38fd0e Updating to latest google caliper code 2013-01-11 15:17:16 -05:00
Mark DePristo b2990497e2 Refactor LIBS into utils.locusiterator before refactoring 2013-01-11 15:17:16 -05:00
Mauricio Carneiro 2a4ccfe6fd Updated all JAVA file licenses accordingly
GSATDG-5
2013-01-10 17:06:41 -05:00
Eric Banks 4fa439d89e Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependancies now 2013-01-09 11:06:10 -05:00
Eric Banks b099e2b4ae Moving integration tests to protected 2013-01-08 09:34:08 -05:00
Eric Banks 35d9bd377c Moved (nearly) all Walkers from public to protected and removed GATKLite utils 2013-01-07 14:42:40 -05:00
Eric Banks b4e7b3d691 Fixed precision problem in the Bayesian calculation of Qemp: we need to cap below max integer because the MathUtils code add +1.
Added unit tests for handling large number of observations.
2013-01-07 13:07:36 -05:00
Eric Banks ef638489d5 Fixing BQSR gatherer test to keep up to date with latest changes 2013-01-06 14:07:59 -05:00
Eric Banks 52067f0549 Handle merge conflicts 2013-01-06 12:29:12 -05:00
Eric Banks bf25e151ff Handle long->int precision in Bayesian estimate 2013-01-06 12:26:32 -05:00
Mark DePristo b403c269e9 Make multi-threaded progress meter daemon unit test more robust 2013-01-05 12:59:18 -05:00
Mark DePristo 69bf70c42e Cleanup and more unit tests for RecalibrationTables in BQSR
-- Added unit tests for combining RecalibrationTables.  As a side effect now has serious tests for incrementDatumOrPutIfNecessary
-- Removed unnecessary enum.index system from RecalibrationTables.
-- Moved what were really static utility methods out of RecalibrationEngine and into RecalUtils.
2013-01-05 12:50:27 -05:00
Chris Hartl 9df30880cb Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-04 17:15:22 -05:00
Joel Thibault 01738e70c3 Archive the experimental Active Region Traversals 2013-01-04 17:05:31 -05:00
Chris Hartl 41bc416b65 Remove AAL and update MD5s. 2013-01-04 16:46:14 -05:00
Eric Banks bce6fce58d Resolving merge conflicts after Mark's latest push 2013-01-04 14:46:39 -05:00
Eric Banks dd7f5e2be7 Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests. 2013-01-04 14:43:11 -05:00
Joel Thibault ab5526b372 More TODOs 2013-01-04 14:09:02 -05:00
Tad Jordan fe06912a87 Removed sorting by row from walkers 2013-01-04 11:52:33 -05:00
Mark DePristo 810e2da1d4 Cleanup and unit tests for EventType and ReadRecalibrationInfo in BQSR
-- Added unit tests for EventType and ReadRecalibrationInfo
-- Simplified interface of EventType.  Previously this enum carried an index with it, but this is redundant with the enum.ordinal function.  Now just using that function instead.
2013-01-04 11:39:25 -05:00
Mark DePristo 1ba8d47a81 Unit tests for ProgressMeterDaemon 2013-01-04 11:39:24 -05:00
Mark DePristo fbee4c11f1 Unit tests for ProgressMeterData 2013-01-04 11:39:23 -05:00
Joel Thibault 319d651e4a Initial updates for ActiveRegionShard 2013-01-03 17:00:13 -05:00
Joel Thibault e7553545ef Initial updates for ReadShard 2013-01-03 17:00:13 -05:00
Joel Thibault 14a3ac0e3c Enable the use of alternate shards 2013-01-03 17:00:13 -05:00
Joel Thibault 47e620dfbc Create BAM index to test shard boundaries 2013-01-03 17:00:12 -05:00
Tad Jordan c1ba12d71a Added unit test for outputting sorted GATKReport Tables
- Made few small modifications to code
- Replaced the two arguments in GATKReportTable constructor with an enum used to specify way of sorting the table
2013-01-03 16:53:59 -05:00
Joel Thibault dcb7735d3c Active Region extensions must stay on contig 2013-01-02 14:46:24 -05:00
Chris Hartl 09199366b7 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-02 14:44:49 -05:00
Chris Hartl e1d09ab0db QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests *should* all pass. 2013-01-02 14:41:29 -05:00
Joel Thibault a15f368bdc Re-enable testIsActiveRangeLow/High 2013-01-02 11:57:50 -05:00
Mark DePristo 12f4c6307e AutoFormattingTime cleanup and complete unittests
-- Underlying system now uses long nano times to be more consistent with standard java practice
-- Updated a few places in the code that were converting from nanoseconds to double seconds to use the new nanoseconds interface directly
-- Bringing us to 100% test coverage with clover with AutoFormattingTimeUnitTest
2013-01-02 11:29:25 -05:00
Joel Thibault 429567cd3f Rename to TraverseActiveRegionsUnitTest 2013-01-01 19:20:30 -05:00
Joel Thibault 57d38aac8a Temporarily disable due to unknown contracts problem 2013-01-01 19:20:04 -05:00
Joel Thibault 7748b3816f Delete the test BAI file as well as the BAM 2013-01-01 19:20:02 -05:00
Joel Thibault 5afeb465aa TODOs 2013-01-01 19:19:17 -05:00
Eric Banks 75d5b88a3d Enabling the Recal Report unit test (which looks like it was never ever enabled) 2012-12-26 15:35:50 -05:00
Mark DePristo 04cc75aaec Minor cleanup and expansion of the RecalDatum unit tests 2012-12-24 13:35:58 -05:00
Mark DePristo 7bf1f67273 BQSR optimization: read group x quality score calibration table is thread-local
-- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table.  Radically reduces the contention for RecalDatum in this very highly used table
-- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables.  Updated combine in RecalibrationReport to use this table combiner function
-- Made several core functions in RecalDatum into final methods for performance
-- Added RecalibrationTestUtils, a home for recalibration testing utilities
2012-12-24 13:35:58 -05:00
Mark DePristo 7d250a789a ArtificialReadPileupTestProvider now creates GATKSamRecords with good header values 2012-12-24 13:35:57 -05:00
Mark DePristo 295455eee2 NanoScheduler optimizations and simplification
-- The previous model was to enqueue individual map jobs (with a resolution of 1 map job per map call), to track the number of map calls submitted via a counter and a semaphore, and to use this information in each map job and reduce to control the number of map jobs, when reduce was complete, etc.  All hideously complex.
-- This new model is vastly simply.  The reducer basically knows nothing about the control mechanisms in the NanoScheduler.  It just supports multi-threaded reduce.  The NanoScheduler enqueues exactly nThread jobs to be run, which continually loop reading, mapping, and reducing until they run out of material to read, when they shut down.  The master thread of the NS just holds a CountDownLatch, initialized to nThreads, and when each thread exits it reduces the latch by 1.  The master thread gets the final reduce result when its free by the latch reaching 0.  It's all super super simple.
-- Because this model uses vastly fewer synchronization primitives within the NS itself, it's naturally much faster at getting things done, without any of the overhead obvious in profiles of BQSR -nct 2.
2012-12-24 13:35:57 -05:00
Mark DePristo bf81db40f7 NanoScheduler reducer optimizations
-- reduceAsMuchAsPossible no longer blocks threads via synchronization, but instead uses an explicit lock to manage access.  If the lock is already held (because some thread is doing reduce) then the thread attempting to reduce immediately exits the call and continues doing productive work.  They removes one major source of blocking contention in the NanoScheduler
2012-12-24 13:35:57 -05:00
Mark DePristo 161487b4a4 MapResult compareTo() is now unit tested
-- Thanks clover!
2012-12-24 13:35:57 -05:00
Mark DePristo 7796ba7601 Minor optimizations for NanoScheduler
-- Reducer.maybeReleaseLatch is no longer synchronized
-- NanoScheduler only prints progress every 100 or so map calls
2012-12-24 13:35:56 -05:00
Mark DePristo 0f04485c24 NanoScheduler optimization: don't use a PriorityBlockingQueue for the MapResultsQueue
-- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue.
-- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N.  This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step
-- Had to update ReducerUnitTest because the test itself was brittle, and broken when I changed the underlying code.
-- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc)
-- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs
2012-12-24 13:35:56 -05:00
Tad Jordan b491c177ff Added functionality of outputting sorted GATKReport Tables
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
2012-12-20 14:02:21 -05:00
David Roazen 07b369ca7e Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.

-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)

-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)

-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
2012-12-19 10:25:22 -05:00
Mark DePristo 1ca13f9581 Fundamentally better model for the NanoScheduler
-- Now each map job reads a value, performs map, and does as much reducing as possible.  This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc.  All of this is accomplished using exactly NCT% of the CPU of the machine.
-- Has the additional value of actually simplifying the code
-- Resolves a long-standing annoyance with the nano scheduler.
2012-12-19 09:31:31 -05:00
Joel Thibault a29df3e094 oops 2012-12-18 19:03:12 -05:00
Joel Thibault ee22c1bf44 More TODOs 2012-12-18 18:47:43 -05:00
Joel Thibault 2b1db519d7 Add reads which overstep a boundary by a single base 2012-12-18 18:47:43 -05:00
Joel Thibault 9828b2990f Reads off the end of a contig fail SAM validation when using actual BAMs 2012-12-18 18:47:43 -05:00
Joel Thibault 72e2394b26 Create actual BAM 2012-12-18 18:47:43 -05:00
Joel Thibault d69d1f8988 Fun with varargs 2012-12-18 18:47:42 -05:00
Joel Thibault 1158c1529f Refactor region/read comparisons 2012-12-18 18:47:42 -05:00
Yossi Farjoun 19dd2d628a some changes.
some changes.
2012-12-14 17:21:32 -05:00
Eric Banks 696bf95fba Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer). 2012-12-13 23:28:30 +00:00
Ami Levy-Moonshine 2f99569dda change the md5 in one of the CV intergration tests, since it wasn't use the priority list when printing the origin of the annotation (the setValue field) 2012-12-10 22:48:15 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Mark DePristo dbf721968d PrintReads large-scale test to protect against another major low-level performance issue 2012-12-05 21:36:27 -05:00
Mark DePristo 465694078e Major performance improvement to the GATK engine
-- The NanoSchedule timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers.  Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests.
-- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x.  For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement).
-- Took this opportunity to improve the GATK ProgressMeter.  NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress.  This removes all inner loop calls to the GATK timers.
-- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402
2012-12-05 14:49:22 -05:00
Mark DePristo 2b601571e7 Better error handling in NanoScheduler
-- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown.  Errors, like out of memory, would cause the whole system to die.  This bugfix resolves that issue
2012-12-05 14:49:22 -05:00
Eric Banks 5fed9df295 Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh). 2012-12-03 12:18:20 -05:00
Eric Banks b6839b3049 Added checking in the GATK for mis-encoded quality scores.
The check is performed by a Read Transformer that samples (currently set to once
every 1000 reads so that we don't hurt overall GATK performance) from the input
reads and checks to make sure that none of the base quals is too high (> Q60). If
we encounter such a base then we fail with a User Error.

* Can be over-ridden with --allow_potentially_misencoded_quality_scores.
* Also, the user can choose to fix his quals on the fly (presumably using PrintReads
  to write out a fixed bam) with the --fix_misencoded_quality_scores argument.

Added unit tests.
2012-12-03 11:18:41 -05:00
Joel Thibault c76c808268 Reads are required to be sorted
- Remove the extended_only case because it's outside intervals
2012-11-28 13:59:58 -05:00
Joel Thibault 198923b597 Add ActiveRegionReadState handling 2012-11-28 13:59:57 -05:00
Mark DePristo 7e4b9c9e6e Fix failing unit tests for VariantContextUtilsUnitTest
-- Previous version was adding multiple samples with the same name to the variant context
2012-11-27 14:26:23 -05:00
Joel Thibault 9bfe39411e Equal overlap should match right/later region 2012-11-27 13:03:13 -05:00
Joel Thibault d83ad906ef Add profile range contract 2012-11-27 13:03:13 -05:00
Joel Thibault cc550b4145 Add a read and interval on a different contig 2012-11-27 13:03:13 -05:00
Eric Banks 9531e58445 Merged bug fix from Stable into Unstable 2012-11-27 11:00:50 -05:00
Eric Banks 4543ece088 Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case. 2012-11-27 11:00:33 -05:00
Eric Banks a82ec7ad80 Merged bug fix from Stable into Unstable 2012-11-27 10:27:08 -05:00
Eric Banks e199562c25 I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever). 2012-11-27 10:26:17 -05:00
Eric Banks 405f3c675d Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc. 2012-11-27 01:07:00 -05:00
Eric Banks 4f7fa3009a I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA. 2012-11-26 11:34:59 -05:00
Mark DePristo 48f271c5bd Adding 80% support for multi-allelic variants
-- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here.  It's not perfect and needs some discussion on how to handle het/alt variants
-- Adding splitInBiallelic funtion to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)
2012-11-21 17:24:59 -05:00
Joel Thibault c68bc95db6 Initial read mapping tests
- Failing tests are commented out
2012-11-21 17:16:46 -05:00
Joel Thibault 3ad9128800 Add some reads
- Move intervals and reads to init
- Update intervals and reads
2012-11-21 17:16:46 -05:00
Joel Thibault 3fa3b00f4a Add ActiveRegion tests and refactor 2012-11-21 17:16:45 -05:00
Joel Thibault e8defcb20d Test multiple bases and intervals 2012-11-21 17:16:45 -05:00
Joel Thibault c08b782743 Count isActive calls directly 2012-11-21 17:16:45 -05:00
Eric Banks 72e2d569c5 The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality. 2012-11-20 22:41:57 -05:00
Eric Banks ff87642a91 Enable cycle covariate unit tests 2012-11-20 22:29:56 -05:00
Eric Banks 937ac7290f Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results. 2012-11-20 16:13:29 -05:00
Joel Thibault b70fd4a242 Initial testing of the Active Region Traversal contract
- TODO: many more tests and test cases
2012-11-15 10:08:00 -05:00
Eric Banks e9183d9fe0 Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec. 2012-11-08 15:07:47 -05:00
David Roazen 6185e8c432 Allow large-scale tests 5 hours each to run 2012-11-01 17:48:58 -04:00
Mark DePristo 872abddfce Add custom TestNGTestTransformer that adds a maximum test runtime of 10 minutes to all testng tests
-- Closes GSA-494 / Add maximum runtime for integration tests, running them in timeout thread
-- Needed to debug locking issues
-- Needed to debug excessively long running integrationtests
-- Added build.xml maximum runtime for all testng tests of 10 hours.  We will ultimately fail the build if it goes on for more than 10 hours
2012-11-01 15:34:12 -04:00
Mark DePristo 1444cd753b Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome
-- The logic for determining active regions was a bit broken in the HC when intervals were used in the system
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered.  Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determing active / inactive state.  This means that each exon will stand on its own when deciding if it should be active or inactive
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
2012-11-01 15:34:04 -04:00
Mark DePristo 9cd04c335c Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data
-- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases.  This is now done by default, unless requested explicitly to preserve the original bases.
-- This is really the correct place to do this for a variety of reasons.  First, you don't need to work about upper casing bases throughout the code.  Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves.  Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase()
-- Added unit tests to ensure this functionality works correct.
-- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarentees that the reference bases are all upper case.
-- Added contracts to ensure this is the case.
-- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code
2012-11-01 15:34:03 -04:00
Eric Banks 47a0f5859e Don't run these tests if not GAKT lite 2012-10-31 22:56:38 -04:00
Eric Banks f8af8a2355 Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite. 2012-10-31 21:28:07 -04:00
Eric Banks 2aa28abe0a Fixing md5s to reflect the new HapMap file 2012-10-30 14:27:10 -04:00
Eric Banks b6a1967f12 Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it. 2012-10-29 21:47:09 -04:00
Eric Banks 43625f652e Shoot, mixed up the md5s last time. 2012-10-27 19:43:46 -04:00
Eric Banks 682a72faf7 Hmm, thought I got all the md5s last time. Apparently not. 2012-10-26 16:10:12 -04:00
Mark DePristo 251983b8fb Add GATK-wide command line argument to control the maximum runtime allowed for the GATK
-- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded.  By cleanly I mean that the engine simply stops at the next available cycle in the walker as through the end of processing had been reached.  This means that all output files are closed properly, etc.
-- Emits an info message that looks like "INFO  10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s".  Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update
-- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling?
-- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.
2012-10-26 13:18:34 -04:00
Eric Banks ed11b7dab2 Fix UG parallelization test 2012-10-26 12:10:44 -04:00
Eric Banks 7a706ed345 Fix some of the broken integration tests 2012-10-26 11:23:44 -04:00
Eric Banks ebebec7fdb Accidentally left one test disabled 2012-10-26 02:15:32 -04:00
Eric Banks a53e03d525 Do not let reduced reads get removed in the contamination down-sampling 2012-10-26 02:13:04 -04:00
Eric Banks bf3d61ce82 The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it. 2012-10-26 01:04:51 -04:00
Eric Banks 91f2c847a3 Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test. 2012-10-26 00:57:40 -04:00
Eric Banks e93ff3ea6e Let's go back to having the SB/SLOD NOT computed by default. If you recall, it was only enabled by default because we thought we were going to use it when we made VQSR use random forests. But since we decided not to change VQSR, there's no reason to triple the computation for every variant site anymore. 2012-10-25 12:45:23 -04:00
Eric Banks c53c55da12 Re-enable tests 2012-10-25 09:37:08 -04:00
Eric Banks e6652f7777 Added integration test for contamination down-sampling 2012-10-25 09:36:05 -04:00
Mark DePristo 6e421a72d6 Add more exhaustive unit tests for input errors to NanoScheduler
-- Resolves issue GSA-515 / Nanoscheduler GSA-605 / Seems that -nct may deadlock as not reproducible
-- It seems that it's not an input error problem (or at least cannot be provoked with unit tests)
-- I'll keep an eye on this later
2012-10-23 20:11:29 -04:00
Mark DePristo f838815343 Updating MD5s for confidence ref site estimation in IndependentAllelesDiploidExactAFCalc
-- Included logic to only add priors for alleles with sufficient evidence to be called polymorphic.  If no alleles are poly make sure to add priors of first allele
2012-10-23 06:47:53 -04:00
Mark DePristo 15b28e61cd Retiring TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine
-- There's been no report of problems with the nano scheduled version of TraverseLoci and TraverseReads, so I'm removing the old versions since they are no longer needed
-- Removing unnecessary intermediate base classes
-- GSA-515 / Nanoscheduler GSA-549 / https://jira.broadinstitute.org/browse/GSA-549
2012-10-22 16:55:06 -04:00
Mark DePristo 90f59803fd MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles
-- All constructores, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic.  That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele.  The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating.  This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime.  Might be worth it through
2012-10-22 13:47:56 -04:00
Khalid Shakir 97dc3664c9 Fixed yet another NPE related to the ArgumentTypeDescriptor vs. ArgumentMatchValue. Added integration test based on GSA-621. 2012-10-22 12:05:32 -04:00
Mark DePristo eb6c9a1a79 Disable EfficiencyMonitoringThreadFactoryUnitTest
-- This is no longer a core GATK activity, and the tests need to run for so long (2 min each) that it's just too painful to run them.  Should be re-eabled if we come to care about this capability again, or if we can run these tests all in parallel in the future.
2012-10-21 12:43:46 -04:00
Mark DePristo d21e42608a Updating integration tests for minor changes due to switching to EXACT_INDEPENDENT model by default 2012-10-21 12:43:46 -04:00
Mark DePristo 6b6caf8e3a Bugfix for indel DP calculations using reduced reads
-- Adding tests for SNP and indel calling on reduced BAM
2012-10-21 12:42:32 -04:00
Ryan Poplin a647f1e076 Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore. 2012-10-20 16:38:18 -04:00
Ryan Poplin b4e69239dd In order to be considered an informative read in the PerReadAlleleLikelihoodMap it has to be informative compared to all other alleles not just the worst allele. Also, fixing a bug when there is only one allele in the map. 2012-10-18 14:31:15 -04:00
Mauricio Carneiro b57df6cac8 Bringing CMI changes into the main GATK repo.
Merge remote-tracking branch 'cmi/master'
2012-10-17 15:23:19 -04:00
Mark DePristo 9bcefadd4e Refactor ExactCallLogger into a separate class
-- Update minor integration tests with NanoSchedule due to qual accuracy update
2012-10-16 13:30:09 -04:00
Ryan Poplin 31be807664 Updating missed integration test. 2012-10-15 22:31:52 -04:00
Ryan Poplin d27ae67bb6 Updating the multi-step UG integration test. 2012-10-15 22:30:01 -04:00
kshakir 213cc00abe Refactored argument matching to support other plugins in addition to file lists.
Added plugin support for sending Queue status messages.
Argument parsing can store subclasses of java.io.File, for example RemoteFile.
2012-10-15 15:10:45 -04:00
Mauricio Carneiro 80d92e0c63 Allowing the GATK to have non-required outputs
Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation).

This change enables the nWayOut parameter for the IndeRealigner and ReduceReads to operate optionally while maintaining the original single way out.

[#DEV-10 transition:31 resolution:1]
2012-10-15 13:49:08 -04:00
Ryan Poplin 25be94fbb8 Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10. 2012-10-15 13:24:32 -04:00
Mark DePristo dcf8af42a8 Finalizing IndependentAllelesDiploidExactAFCalc
-- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors
-- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef.  Not ideal.  Finally caught by some tests, but good god it almost made it into the code
-- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0
-- Massive new suite of unit tests to ensure that bi-allelic and tri-allele events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly.  ID'd some of the bugs below
-- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests
-- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when no there we no informative GLs.
2012-10-15 08:21:03 -04:00
Christopher Hartl 6b9987cf1b Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2012-10-12 00:48:42 -04:00
Eric Banks 36a26a7da6 md5s failed because I forgot to add --no_cmdline_in_header so it is different depending on where you run from. Fixed. 2012-10-07 08:35:55 -04:00
Eric Banks a5aaa14aaa Fix for GSA-601: Indels dropped during liftover. This was a true bug that was an effect of the switch over to the non-null representation of alleles in the VariantContext. Unfortunately, this tool didn't have integration tests - but it does now. 2012-10-07 01:19:52 -04:00
Eric Banks 5d6aad67e2 Fix for bug reported on forums: VariantsToTable does not handle lists and nested arrays correctly. Added an integration test to cover printing of PLs. 2012-10-07 00:01:27 -04:00
Eric Banks e7798ddd2a Fix for JIRA GSA-598: AD field not handled properly by CombineVariants. It was also not handled by SelectVariants either. We now strip the AD field out whenever combining/selecting makes it invalid due to a changing of the number of ALT alleles. 2012-10-06 23:02:36 -04:00
Eric Banks e8a6460a33 After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too. 2012-10-05 16:37:42 -04:00
Yossi Farjoun dc4dcb4140 fixed AD annotation for a ReducedReads BAM file. Added an integration test for this case with a new reduced BAM in private/testdata 2012-10-05 14:20:07 -04:00
Mark DePristo b6e20e083a Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact
-- Will be removed.  Only commiting now to fix public -> private dependency
2012-10-03 20:16:38 -07:00
Mark DePristo f6a2ca6e7f Fixes / TODOs for meaningful results with AFCalculationResult
-- Right now the state of the AFCaclulationResult can be corrupt (ie, log10 likelihoods can be -Infinity).  Forced me to disable reasonable contracts.  Needs to be thought through
-- exactCallsLog should be optional
-- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.
2012-10-03 19:55:12 -07:00
Mark DePristo 50e4a832ea Generalize framework for evaluating the performance and scaling of the ExactAF models to tri-allelic variants
-- Wow, big performance problems with multi-allelic exact model!
2012-10-03 19:55:11 -07:00
Mark DePristo 17ca543937 More ExactModel cleanup
-- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array.  This is way heavy-weight compared to just making the array each time.
-- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object.  That's the place it should really live
-- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP.  Added TODOs
2012-10-03 19:55:11 -07:00
Mark DePristo f8ef4332de Count the number of evaluations in AFResult; expand unit tests
-- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples
-- Added unittests for priors (flat and human)
-- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)
2012-10-03 19:55:11 -07:00
Mark DePristo 33c7841c4d Add tests for non-informative samples in ExactAFCalculationModel 2012-10-03 19:55:11 -07:00
Mark DePristo de941ddbbe Cleanup Exact model, better unit tests
-- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic).  More assert statements to ensure quality of the result.
-- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts.  Made mutation functions all protected
-- No longer need to call reset on your AlleleFrequencyCalculationResult -- it'd done for you in the calculation function.  reset is a protected method now, so it's all cleaner and nicer this way
-- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef
2012-10-03 19:55:11 -07:00
Mark DePristo 3e01a76590 Clean up AlleleFrequencyCalculation classes
-- Added a true base class that only does truly common tasks (like manage call logging)
   -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract
   -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact
   -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation
   -- All unit tests pass
2012-10-03 19:55:11 -07:00
Christopher Hartl 1be8a88909 Changes:
1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest.
 2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested.
 3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test.
 4) Updates to PlinkBed-handling python scripts and utilities.
 5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.
2012-10-03 16:02:42 -04:00
Christopher Hartl 2508b0f5a7 Merged bug fix from Stable into Unstable 2012-09-29 00:57:43 -04:00
Christopher Hartl 365f1d2429 hmk123's error on the forum came from the reference context occasionally lacking bases needed for validating the reference bases in the variant context. (no @Window for VariantsToBinaryPed). This bugfix adresses this and other minor items:
1) ValidateVariants removed in favor of direct validation VariantContexts. Integration test added to test broken contexts.
 2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases.
 3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.
2012-09-29 00:55:31 -04:00
Christopher Hartl 55cdf4f9b7 Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release. 2012-09-27 00:13:32 -04:00
Eric Banks 74bb4e2739 Fixing the VariantContextUtilsUnitTest 2012-09-22 23:24:55 -04:00
Eric Banks 25e3ea879a Oops, missed this test before when updating md5s 2012-09-22 22:16:35 -04:00
David Roazen f6a22e5f50 ExperimentalReadShardBalancerUnitTest was being skipped; fixed
TestNG skips tests when an exception occurs in a data provider,
which is what was happening here.

This was due to an AWFUL AWFUL use of a non-final static for
ReadShard.MAX_READS. This is fine if you assume only one instance
of SAMDataSource, but with multiple tests creating multiple SAMDataSources,
and each one overwriting ReadShard.MAX_READS, you have a recipe for
problems. As a result of this the test ran fine individually, but not as
part of the unit test suite.

Quick fix for now to get the tests running -- this "mutable static"
interface should really be refactored away though, when I have time.
2012-09-22 01:56:39 -04:00