gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	b8c0b05785	Add contract to ensure that getAdapterBoundary returns the right result -- Also renamed the function to getAdaptorBoundary for consistency across the codebase	2013-01-25 16:05:17 -05:00
Mark DePristo	e445c71161	LIBS optimization for adapter clipping -- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS -- Added unittests to cover adapter clipping	2013-01-25 16:05:17 -05:00
Ami Levy-Moonshine	b4447cdca2	When using VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE, we used to verify that the sample names are unique in VariantContextUtils.simpleMerge for each set of VCs. This caused a bug that was reported on the forum (when the VCs included 2 VCs from the same sample). Now we check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unit tests in CVunitTest.java.	2013-01-25 15:49:51 -05:00
Mark DePristo	592f90aaef	ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region -- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visual inspection of both running on 1000G and NA12878 -- Misc. commenting of the code -- Updated ActiveRegionExtension to include a min active region size -- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now	2013-01-24 13:48:00 -05:00
Mark DePristo	0c94e3d96e	Adaptively compute the band pass filter from the sigma, up to a maximum size of 50 bp -- Previously we allowed band pass filter size to be specified along with the sigma. But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel. This means that if you use a smaller kernel you get a small band size and therefore faster ART -- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp	2013-01-24 13:47:59 -05:00
Mark DePristo	9e43a2028d	Making band pass filter size, sigma, active region max size and extension all accessible from the command line	2013-01-24 13:47:59 -05:00
Mark DePristo	ee8039bf25	Fix trivial call in unit test	2013-01-23 13:51:58 -05:00
Mark DePristo	8e8126506b	Renaming IncrementalActivityProfile to ActivityProfile -- Also adding a work in progress functionality to make it easy to visualize activity profiles and active regions in IGV	2013-01-23 13:46:01 -05:00
Mark DePristo	e917f56df8	Remove old ActivityProfile and old BandPassActivityProfile	2013-01-23 13:46:01 -05:00
Mark DePristo	7fd27a5167	Add band pass filtering activity profile -- Based on the new incremental activity profile -- Unit Tested! Fixed a few bugs with the old band pass filter -- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties -- Add new UnitTest for BandPassIncrementalActivityProfile -- Added normalizeFromRealSpace to MathUtils -- Cleanup unused code in new activity profiles	2013-01-23 13:46:01 -05:00
Mark DePristo	eb60235dcd	Working version of incremental active region traversals -- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version. This means that ART walkers will now take much less memory than previously. On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory. Over the whole chr20 only 5K reads were ever held in ART at one time. -- Fixed bug in the way active regions worked with shard boundaries. The new implementation no longer sees shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries. This behavior was actually encoded in the unit tests, so those needed to be updated as well. -- Changed the way that preset regions work in ART. The new contract ensures that you get exactly the regions you requested. The isActive function is still called, but its result has no impact on the regions. With this functionality it should be possible to use the HC as a generic assembly by forcing it to operate over very large regions -- Added a few misc. useful functions to IncrementalActivityProfile	2013-01-23 13:46:00 -05:00
Mark DePristo	e050f649fd	IncrementalActivityProfile, complete with extensive unit tests -- This is an activity profile compatible with fetching its implied active regions incrementally, as activity profile states are added	2013-01-23 13:45:21 -05:00
Mark DePristo	8d9b0f1bd5	Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile -- Required before I jump in and redo the entire activity profile so it can be run incrementally -- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class. -- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality. Almost all of the misc. walker changes are due to this name update -- Code cleanup and docs for TraverseActiveRegions -- Expanded unit tests for ActivityProfile and ActivityProfileState	2013-01-23 13:45:21 -05:00
Mark DePristo	42b807a5fe	Unit tests for ActivityProfileResult	2013-01-23 13:45:20 -05:00
Mauricio Carneiro	7b8b064165	Last manual license update (hopefully) if everyone updates their git hook accordingly, this will be the last time I have to manually run the script. GSATDG-5	2013-01-18 16:13:07 -05:00
Mark DePristo	738c24a3b1	Add tests to ensure that all insertion reads appear in the active region traversal	2013-01-16 16:25:36 -05:00
Mark DePristo	2a42b47e4a	Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART -- UnitTests now include combinational tiling of reads within and spanning shard boundaries -- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads -- Updating HC and CountReadsInActiveRegions integration tests	2013-01-16 15:30:00 -05:00
Mark DePristo	4d0e7b50ec	ArtificialBAMBuilder utility class for creating streams of GATKSAMRecords with a variety of properties -- Allows us to make a stream of reads or an indexed BAM file with reads having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start) -- This stream can be handed back to the caller immediately, or written to an indexed BAM file -- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)	2013-01-16 15:29:59 -05:00
Eric Banks	ec1cfe6732	Oops, forgot to add 1 of my files	2013-01-16 15:05:49 -05:00
Eric Banks	e47a389b26	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-16 14:59:11 -05:00
Eric Banks	d18dbcbac1	Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base. Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0? When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...	2013-01-16 14:55:33 -05:00
Khalid Shakir	4ffb43079f	Re-committing the following changes from Dec 18: Refactored interval specific arguments out of GATKArgumentCollection into IntervalArgumentCollection such that it can be used in other CommandLinePrograms. Updated SelectHeaders to print out full interval arguments. Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.	2013-01-16 12:43:15 -05:00
Eric Banks	392b5cbcdf	The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases. This way walkers won't see anything except the standard bases plus Ns in the reference. Added option to turn off this feature (to maintain backwards compatibility). As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for each of the bases. This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests. Fortunately that walker is being deprecated soon...	2013-01-16 10:22:43 -05:00
Mark DePristo	3c37ea014b	Retire original TraverseActiveRegion, leaving only the new optimized version -- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests	2013-01-15 10:24:45 -05:00
Mark DePristo	39bc9e999d	Add a test to LocusIteratorByState to ensure that we aren't holding reads anywhere -- Run an iterator with 100Ks of reads, each carrying MBs of byte[] data, through LIBS, all starting at the same position. Will crash with an out-of-memory error if we're holding reads anywhere in the system. -- Is there a better way to test this behavior?	2013-01-14 16:30:16 -05:00
Mark DePristo	5a5422e4f8	Refactor PerSampleReadStates into a separate class -- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager -- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager -- Make LocusIteratorByState final	2013-01-14 16:30:16 -05:00
Mark DePristo	19288b007d	LIBS bugfix: kept reads now only (correctly) includes reads that at least passed the reservoir -- Added unit tests to ensure this behavior is correct	2013-01-14 16:30:16 -05:00
Mark DePristo	83fcc06e28	LIBS optimizations and performance tools -- Made LIBSPerformance a full featured CommandLineProgram, and it can be used to assess the LIBS performance by reading a provided BAM -- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls -- Moved updateReadStates to ReadStateManager -- Removed the unnecessary wrapping of an iterator in ReadStateManager -- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get iterating over samples. Now those are just map native iterations -- Restructured collectPendingReads for simplicity, removing redundant and consolidating common range checks. The new piece of code is much clearer and avoids several unnecessary function calls	2013-01-14 16:30:15 -05:00
Mark DePristo	ec05ecef60	getAdaptorBoundary returns an int, not an Integer, as this was taking 30% of the allocation effort for LIBS	2013-01-14 16:30:15 -05:00
Mark DePristo	e88dae2758	LocusIteratorByState operates natively on GATKSAMRecords now -- Updated code to reflect this new typing	2013-01-11 15:17:18 -05:00
Mark DePristo	94cb50d3d6	Retire LegacyLocusIteratorByState -- Left in the remaining infrastructure for David to remove, but the legacy downsampler is no longer a functional option in the GATK	2013-01-11 15:17:18 -05:00
Mark DePristo	cc0c1b752a	Delete old LocusIteratorByState, leaving only new LIBS and legacy	2013-01-11 15:17:18 -05:00
Mark DePristo	bd03511e35	Updating AlignmentStateMachinePerformance to include some more useful performance assessments	2013-01-11 15:17:18 -05:00
Mark DePristo	9e23c592e6	ReadBackedPileup cleanup -- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl -- ReadBackedPileupImpl was literally a shell class after we removed extended events. A few bits of code cleanup and we reduced a bunch of class complexity in the gatk -- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed -- Created optimized calculation routines to iterate over all of the reads in the pileup in whatever order is most efficient as well. -- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups. -- Added commons-collections for IteratorChain	2013-01-11 15:17:18 -05:00
Mark DePristo	b9a33d3c66	Split original and optimized ART into largely independent pieces -- Allows us to cleanly run old and new art, which now have different traversal behavior (on purpose). Split unit tests as well.	2013-01-11 15:17:18 -05:00
Mark DePristo	02130dfde7	Cleanup ART -- Initialize routine captures essential information for running the traversal	2013-01-11 15:17:17 -05:00
Mark DePristo	9b2be795a7	Initial working version of new ActiveRegionTraversal based on the LocusIteratorByState read stream -- Implemented as a subclass of TraverseActiveRegions -- Passes all unit tests -- Will be very slow -- needs logical fixes	2013-01-11 15:17:17 -05:00
Mark DePristo	8b83f4d6c7	Near final cleanup of PileupElement -- All functions documented and unit tested -- New constructor interface -- Cleanup some uses of old / removed functionality	2013-01-11 15:17:17 -05:00
Mark DePristo	fb9eb3d4ee	PileupElement and LIBS cleanup -- function to create pileup elements in AlignmentStateMachine and LIBS -- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing	2013-01-11 15:17:17 -05:00
Mark DePristo	2f2a592c8e	Contracts and documentation for AlignmentStateMachine and LocusIteratorByState -- Add more unit tests for both as well	2013-01-11 15:17:17 -05:00
Mark DePristo	cc1d259cac	Implement get Length and Bases of OfImmediatelyFollowingIndel in PileupElement -- Added unit tests for this behavior. Updated users of this code	2013-01-11 15:17:17 -05:00
Mark DePristo	2c38310868	Create LIBS using new AlignmentStateMachine infrastructure -- Optimizations to AlignmentStateMachine -- Properly count deletions. Added unit test for counting routines -- AlignmentStateMachine.java is no longer recursive -- Traversals now use new LIBS, not the old one	2013-01-11 15:17:17 -05:00
Mark DePristo	80d9b7011c	Complete rewrite of low-level machinery of LIBS, not hooked up -- AlignmentStateMachine does what SAMRecordAlignmentState should really do. It's correct in that it's more accurate than the LIB_position tests themselves. This is a non-broken, correct implementation. Needs cleanup, contracts, etc. -- This version is like 6x slower than the original implementation (according to the google caliper benchmark here). Obvious optimizations for future commit	2013-01-11 15:17:16 -05:00
Mark DePristo	0ac4352614	LIBS can now (optionally) track the unique reads it uses from the underlying read iterator -- This capability is essential to provide an ordered set of used reads to downstream users of LIBS, such as ART, who want an efficient way to get the reads used in LIBS -- Vastly expanded the multi-read, multi-sample LIBS unit tests to make sure this capability is working -- Added createReadStream to ArtificialSAMUtils that makes it relatively easy to create multi-read, multi-sample read streams for testing	2013-01-11 15:17:16 -05:00
Mark DePristo	b3ecfbfce8	Refactor LIBS into component parts, expand unit tests, some code cleanup -- Split out all of the inner classes of LIBS into separate independent classes -- Split / add unit tests for many of these components. -- Radically expand unit tests for SAMRecordAlignmentState (the lowest level piece of code) making sure at least some of it works -- No need to change unit tests or integration tests. No change in functionality. -- Added (currently disabled) code to track all submitted reads to LIBS, but this isn't accessible or tested	2013-01-11 15:17:16 -05:00
Mark DePristo	2e5d38fd0e	Updating to latest google caliper code	2013-01-11 15:17:16 -05:00
Mark DePristo	b2990497e2	Refactor LIBS into utils.locusiterator before refactoring	2013-01-11 15:17:16 -05:00
Mauricio Carneiro	2a4ccfe6fd	Updated all JAVA file licenses accordingly GSATDG-5	2013-01-10 17:06:41 -05:00
Eric Banks	4fa439d89e	Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependencies now	2013-01-09 11:06:10 -05:00
Eric Banks	b099e2b4ae	Moving integration tests to protected	2013-01-08 09:34:08 -05:00
Eric Banks	35d9bd377c	Moved (nearly) all Walkers from public to protected and removed GATKLite utils	2013-01-07 14:42:40 -05:00
Eric Banks	b4e7b3d691	Fixed precision problem in the Bayesian calculation of Qemp: we need to cap below max integer because the MathUtils code adds +1. Added unit tests for handling large numbers of observations.	2013-01-07 13:07:36 -05:00
Eric Banks	ef638489d5	Fixing BQSR gatherer test to keep up to date with latest changes	2013-01-06 14:07:59 -05:00
Eric Banks	52067f0549	Handle merge conflicts	2013-01-06 12:29:12 -05:00
Eric Banks	bf25e151ff	Handle long->int precision in Bayesian estimate	2013-01-06 12:26:32 -05:00
Mark DePristo	b403c269e9	Make multi-threaded progress meter daemon unit test more robust	2013-01-05 12:59:18 -05:00
Mark DePristo	69bf70c42e	Cleanup and more unit tests for RecalibrationTables in BQSR -- Added unit tests for combining RecalibrationTables. As a side effect now has serious tests for incrementDatumOrPutIfNecessary -- Removed unnecessary enum.index system from RecalibrationTables. -- Moved what were really static utility methods out of RecalibrationEngine and into RecalUtils.	2013-01-05 12:50:27 -05:00
Chris Hartl	9df30880cb	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-04 17:15:22 -05:00
Joel Thibault	01738e70c3	Archive the experimental Active Region Traversals	2013-01-04 17:05:31 -05:00
Chris Hartl	41bc416b65	Remove AAL and update MD5s.	2013-01-04 16:46:14 -05:00
Eric Banks	bce6fce58d	Resolving merge conflicts after Mark's latest push	2013-01-04 14:46:39 -05:00
Eric Banks	dd7f5e2be7	Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests.	2013-01-04 14:43:11 -05:00
Joel Thibault	ab5526b372	More TODOs	2013-01-04 14:09:02 -05:00
Tad Jordan	fe06912a87	Removed sorting by row from walkers	2013-01-04 11:52:33 -05:00
Mark DePristo	810e2da1d4	Cleanup and unit tests for EventType and ReadRecalibrationInfo in BQSR -- Added unit tests for EventType and ReadRecalibrationInfo -- Simplified interface of EventType. Previously this enum carried an index with it, but this is redundant with the enum.ordinal function. Now just using that function instead.	2013-01-04 11:39:25 -05:00
Mark DePristo	1ba8d47a81	Unit tests for ProgressMeterDaemon	2013-01-04 11:39:24 -05:00
Mark DePristo	fbee4c11f1	Unit tests for ProgressMeterData	2013-01-04 11:39:23 -05:00
Joel Thibault	319d651e4a	Initial updates for ActiveRegionShard	2013-01-03 17:00:13 -05:00
Joel Thibault	e7553545ef	Initial updates for ReadShard	2013-01-03 17:00:13 -05:00
Joel Thibault	14a3ac0e3c	Enable the use of alternate shards	2013-01-03 17:00:13 -05:00
Joel Thibault	47e620dfbc	Create BAM index to test shard boundaries	2013-01-03 17:00:12 -05:00
Tad Jordan	c1ba12d71a	Added unit test for outputting sorted GATKReport Tables - Made a few small modifications to code - Replaced the two arguments in GATKReportTable constructor with an enum used to specify the way of sorting the table	2013-01-03 16:53:59 -05:00
Joel Thibault	dcb7735d3c	Active Region extensions must stay on contig	2013-01-02 14:46:24 -05:00
Chris Hartl	09199366b7	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-02 14:44:49 -05:00
Chris Hartl	e1d09ab0db	QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests should all pass.	2013-01-02 14:41:29 -05:00
Joel Thibault	a15f368bdc	Re-enable testIsActiveRangeLow/High	2013-01-02 11:57:50 -05:00
Mark DePristo	12f4c6307e	AutoFormattingTime cleanup and complete unittests -- Underlying system now uses long nano times to be more consistent with standard java practice -- Updated a few places in the code that were converting from nanoseconds to double seconds to use the new nanoseconds interface directly -- Bringing us to 100% test coverage with clover with AutoFormattingTimeUnitTest	2013-01-02 11:29:25 -05:00
Joel Thibault	429567cd3f	Rename to TraverseActiveRegionsUnitTest	2013-01-01 19:20:30 -05:00
Joel Thibault	57d38aac8a	Temporarily disable due to unknown contracts problem	2013-01-01 19:20:04 -05:00
Joel Thibault	7748b3816f	Delete the test BAI file as well as the BAM	2013-01-01 19:20:02 -05:00
Joel Thibault	5afeb465aa	TODOs	2013-01-01 19:19:17 -05:00
Eric Banks	75d5b88a3d	Enabling the Recal Report unit test (which looks like it was never ever enabled)	2012-12-26 15:35:50 -05:00
Mark DePristo	04cc75aaec	Minor cleanup and expansion of the RecalDatum unit tests	2012-12-24 13:35:58 -05:00
Mark DePristo	7bf1f67273	BQSR optimization: read group x quality score calibration table is thread-local -- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table. Radically reduces the contention for RecalDatum in this very highly used table -- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables. Updated combine in RecalibrationReport to use this table combiner function -- Made several core functions in RecalDatum into final methods for performance -- Added RecalibrationTestUtils, a home for recalibration testing utilities	2012-12-24 13:35:58 -05:00
Mark DePristo	7d250a789a	ArtificialReadPileupTestProvider now creates GATKSamRecords with good header values	2012-12-24 13:35:57 -05:00
Mark DePristo	295455eee2	NanoScheduler optimizations and simplification -- The previous model was to enqueue individual map jobs (with a resolution of 1 map job per map call), to track the number of map calls submitted via a counter and a semaphore, and to use this information in each map job and reduce to control the number of map jobs, when reduce was complete, etc. All hideously complex. -- This new model is vastly simpler. The reducer basically knows nothing about the control mechanisms in the NanoScheduler. It just supports multi-threaded reduce. The NanoScheduler enqueues exactly nThread jobs to be run, which continually loop reading, mapping, and reducing until they run out of material to read, when they shut down. The master thread of the NS just holds a CountDownLatch, initialized to nThreads, and when each thread exits it reduces the latch by 1. The master thread gets the final reduce result when it's freed by the latch reaching 0. It's all super super simple. -- Because this model uses vastly fewer synchronization primitives within the NS itself, it's naturally much faster at getting things done, without any of the overhead obvious in profiles of BQSR -nct 2.	2012-12-24 13:35:57 -05:00
Mark DePristo	bf81db40f7	NanoScheduler reducer optimizations -- reduceAsMuchAsPossible no longer blocks threads via synchronization, but instead uses an explicit lock to manage access. If the lock is already held (because some thread is doing reduce) then the thread attempting to reduce immediately exits the call and continues doing productive work. This removes one major source of blocking contention in the NanoScheduler	2012-12-24 13:35:57 -05:00
Mark DePristo	161487b4a4	MapResult compareTo() is now unit tested -- Thanks clover!	2012-12-24 13:35:57 -05:00
Mark DePristo	7796ba7601	Minor optimizations for NanoScheduler -- Reducer.maybeReleaseLatch is no longer synchronized -- NanoScheduler only prints progress every 100 or so map calls	2012-12-24 13:35:56 -05:00
Mark DePristo	0f04485c24	NanoScheduler optimization: don't use a PriorityBlockingQueue for the MapResultsQueue -- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue. -- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N. This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step -- Had to update ReducerUnitTest because the test itself was brittle, and broke when I changed the underlying code. -- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc) -- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs	2012-12-24 13:35:56 -05:00
Tad Jordan	b491c177ff	Added functionality of outputting sorted GATKReport Tables - Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables - Modified BQSR Integration Tests to include the optional argument. Tests now produce sorted tables	2012-12-20 14:02:21 -05:00
David Roazen	07b369ca7e	Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package This is an intermediate commit so that there is a record of these changes in our commit history. Next step is to isolate the test classes as well, and then move the entire package to the Picard repository and replace it with a jar in our repo. -Removed all dependencies on org.broadinstitute.sting (still need to do the test classes, though) -Had to split some of the utility classes into "GATK-specific" vs generic methods (eg., GATKVCFUtils vs. VCFUtils) -Placement of some methods and choice of exception classes to replace the StingExceptions and UserExceptions may need to be tweaked until everyone is happy, but this can be done after the move.	2012-12-19 10:25:22 -05:00
Mark DePristo	1ca13f9581	Fundamentally better model for the NanoScheduler -- Now each map job reads a value, performs map, and does as much reducing as possible. This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc. All of this is accomplished using exactly NCT% of the CPU of the machine. -- Has the additional value of actually simplifying the code -- Resolves a long-standing annoyance with the nano scheduler.	2012-12-19 09:31:31 -05:00
Joel Thibault	a29df3e094	oops	2012-12-18 19:03:12 -05:00
Joel Thibault	ee22c1bf44	More TODOs	2012-12-18 18:47:43 -05:00
Joel Thibault	2b1db519d7	Add reads which overstep a boundary by a single base	2012-12-18 18:47:43 -05:00
Joel Thibault	9828b2990f	Reads off the end of a contig fail SAM validation when using actual BAMs	2012-12-18 18:47:43 -05:00
Joel Thibault	72e2394b26	Create actual BAM	2012-12-18 18:47:43 -05:00
Joel Thibault	d69d1f8988	Fun with varargs	2012-12-18 18:47:42 -05:00
Joel Thibault	1158c1529f	Refactor region/read comparisons	2012-12-18 18:47:42 -05:00
Yossi Farjoun	19dd2d628a	some changes. some changes.	2012-12-14 17:21:32 -05:00
Eric Banks	696bf95fba	Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer).	2012-12-13 23:28:30 +00:00
Ami Levy-Moonshine	2f99569dda	Change the md5 in one of the CV integration tests, since it wasn't using the priority list when printing the origin of the annotation (the setValue field)	2012-12-10 22:48:15 -05:00
David Roazen	46edab6d6a	Use the new downsampling implementation by default -Switch back to the old implementation, if needed, with --use_legacy_downsampler -LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and the original LocusIteratorByState becomes LegacyLocusIteratorByState -Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer, with the old one renamed to LegacyReadShardBalancer -Performance improvements: locus traversals used to be 20% slower in the new downsampling implementation, now they are roughly the same speed. -Tests show a very high level of concordance with UG calls from the previous implementation, with some new calls and edge cases that still require more examination. -With the new implementation, can now use -dcov with ReadWalkers to set a limit on the max # of reads per alignment start position per sample. Appropriate value for ReadWalker dcov may be in the single digits for some tools, but this too requires more investigation.	2012-12-10 09:44:50 -05:00
Eric Banks	574d5b467f	Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype.	2012-12-09 02:09:34 -05:00
Mark DePristo	dbf721968d	PrintReads large-scale test to protect against another major low-level performance issue	2012-12-05 21:36:27 -05:00
Mark DePristo	465694078e	Major performance improvement to the GATK engine -- The NanoScheduler timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers. Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests. -- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x. For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement). -- Took this opportunity to improve the GATK ProgressMeter. NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress. This removes all inner loop calls to the GATK timers. -- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402	2012-12-05 14:49:22 -05:00
Mark DePristo	2b601571e7	Better error handling in NanoScheduler -- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown. Errors, like out of memory, would cause the whole system to die. This bugfix resolves that issue	2012-12-05 14:49:22 -05:00
Eric Banks	5fed9df295	Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh).	2012-12-03 12:18:20 -05:00
Eric Banks	b6839b3049	Added checking in the GATK for mis-encoded quality scores. The check is performed by a Read Transformer that samples (currently set to once every 1000 reads so that we don't hurt overall GATK performance) from the input reads and checks to make sure that none of the base quals is too high (> Q60). If we encounter such a base then we fail with a User Error. * Can be over-ridden with --allow_potentially_misencoded_quality_scores. * Also, the user can choose to fix his quals on the fly (presumably using PrintReads to write out a fixed bam) with the --fix_misencoded_quality_scores argument. Added unit tests.	2012-12-03 11:18:41 -05:00
Joel Thibault	c76c808268	Reads are required to be sorted - Remove the extended_only case because it's outside intervals	2012-11-28 13:59:58 -05:00
Joel Thibault	198923b597	Add ActiveRegionReadState handling	2012-11-28 13:59:57 -05:00
Mark DePristo	7e4b9c9e6e	Fix failing unit tests for VariantContextUtilsUnitTest -- Previous version was adding multiple samples with the same name to the variant context	2012-11-27 14:26:23 -05:00
Joel Thibault	9bfe39411e	Equal overlap should match right/later region	2012-11-27 13:03:13 -05:00
Joel Thibault	d83ad906ef	Add profile range contract	2012-11-27 13:03:13 -05:00
Joel Thibault	cc550b4145	Add a read and interval on a different contig	2012-11-27 13:03:13 -05:00
Eric Banks	9531e58445	Merged bug fix from Stable into Unstable	2012-11-27 11:00:50 -05:00
Eric Banks	4543ece088	Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case.	2012-11-27 11:00:33 -05:00
Eric Banks	a82ec7ad80	Merged bug fix from Stable into Unstable	2012-11-27 10:27:08 -05:00
Eric Banks	e199562c25	I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever).	2012-11-27 10:26:17 -05:00
Eric Banks	405f3c675d	Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc.	2012-11-27 01:07:00 -05:00
Eric Banks	4f7fa3009a	I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA.	2012-11-26 11:34:59 -05:00
Mark DePristo	48f271c5bd	Adding 80% support for multi-allelic variants -- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here. It's not perfect and needs some discussion on how to handle het/alt variants -- Adding splitInBiallelic function to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)	2012-11-21 17:24:59 -05:00
Joel Thibault	c68bc95db6	Initial read mapping tests - Failing tests are commented out	2012-11-21 17:16:46 -05:00
Joel Thibault	3ad9128800	Add some reads - Move intervals and reads to init - Update intervals and reads	2012-11-21 17:16:46 -05:00
Joel Thibault	3fa3b00f4a	Add ActiveRegion tests and refactor	2012-11-21 17:16:45 -05:00
Joel Thibault	e8defcb20d	Test multiple bases and intervals	2012-11-21 17:16:45 -05:00
Joel Thibault	c08b782743	Count isActive calls directly	2012-11-21 17:16:45 -05:00
Eric Banks	72e2d569c5	The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality.	2012-11-20 22:41:57 -05:00
Eric Banks	ff87642a91	Enable cycle covariate unit tests	2012-11-20 22:29:56 -05:00
Eric Banks	937ac7290f	Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results.	2012-11-20 16:13:29 -05:00
Joel Thibault	b70fd4a242	Initial testing of the Active Region Traversal contract - TODO: many more tests and test cases	2012-11-15 10:08:00 -05:00
Eric Banks	e9183d9fe0	Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec.	2012-11-08 15:07:47 -05:00
David Roazen	6185e8c432	Allow large-scale tests 5 hours each to run	2012-11-01 17:48:58 -04:00
Mark DePristo	872abddfce	Add custom TestNGTestTransformer that adds a maximum test runtime of 10 minutes to all testng tests -- Closes GSA-494 / Add maximum runtime for integration tests, running them in timeout thread -- Needed to debug locking issues -- Needed to debug excessively long running integrationtests -- Added build.xml maximum runtime for all testng tests of 10 hours. We will ultimately fail the build if it goes on for more than 10 hours	2012-11-01 15:34:12 -04:00
Mark DePristo	1444cd753b	Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome -- The logic for determining active regions was a bit broken in the HC when intervals were used in the system -- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR -- Non-overlapping intervals are always treated as separate objects for determining active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive -- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future. -- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC -- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions -- Remove excessively expensive calls to check bases are upper cased in ReferenceContext -- Update md5s after a lot of manual review and discussion with Ryan	2012-11-01 15:34:04 -04:00
Mark DePristo	9cd04c335c	Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data -- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases. This is now done by default, unless requested explicitly to preserve the original bases. -- This is really the correct place to do this for a variety of reasons. First, you don't need to worry about upper casing bases throughout the code. Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves. Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase() -- Added unit tests to ensure this functionality works correctly. -- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarantees that the reference bases are all upper case. -- Added contracts to ensure this is the case. -- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code	2012-11-01 15:34:03 -04:00
Eric Banks	47a0f5859e	Don't run these tests if not GATK lite	2012-10-31 22:56:38 -04:00
Eric Banks	f8af8a2355	Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite.	2012-10-31 21:28:07 -04:00
Eric Banks	2aa28abe0a	Fixing md5s to reflect the new HapMap file	2012-10-30 14:27:10 -04:00
Eric Banks	b6a1967f12	Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it.	2012-10-29 21:47:09 -04:00
Eric Banks	43625f652e	Shoot, mixed up the md5s last time.	2012-10-27 19:43:46 -04:00
Eric Banks	682a72faf7	Hmm, thought I got all the md5s last time. Apparently not.	2012-10-26 16:10:12 -04:00
Mark DePristo	251983b8fb	Add GATK-wide command line argument to control the maximum runtime allowed for the GATK -- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker as though the end of processing had been reached. This means that all output files are closed properly, etc. -- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update -- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling? -- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.	2012-10-26 13:18:34 -04:00
Eric Banks	ed11b7dab2	Fix UG parallelization test	2012-10-26 12:10:44 -04:00
Eric Banks	7a706ed345	Fix some of the broken integration tests	2012-10-26 11:23:44 -04:00
Eric Banks	ebebec7fdb	Accidentally left one test disabled	2012-10-26 02:15:32 -04:00
Eric Banks	a53e03d525	Do not let reduced reads get removed in the contamination down-sampling	2012-10-26 02:13:04 -04:00
Eric Banks	bf3d61ce82	The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it.	2012-10-26 01:04:51 -04:00
Eric Banks	91f2c847a3	Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test.	2012-10-26 00:57:40 -04:00

1301 Commits (f11c8d22d47218c7d0dfbdfb5c19cbdd336a5df4)