gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	94cb50d3d6	Retire LegacyLocusIteratorByState -- Left in the remaining infrastructure for David to remove, but the legacy downsampler is no longer a functional option in the GATK	2013-01-11 15:17:18 -05:00
Mark DePristo	cc0c1b752a	Delete old LocusIteratorByState, leaving only new LIBS and legacy	2013-01-11 15:17:18 -05:00
Mark DePristo	bd03511e35	Updating AlignmentStateMachinePerformance to include some more useful performance assessments	2013-01-11 15:17:18 -05:00
Mark DePristo	9e23c592e6	ReadBackedPileup cleanup -- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl -- ReadBackedPileupImpl was literally a shell class after we removed extended events. A few bits of code cleanup and we reduced a bunch of class complexity in the gatk -- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed -- Created optimized calculation routines to iterator over all of the reads in the pileup in whatever order is most efficient as well. -- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups. -- Added commons-collections for IteratorChain	2013-01-11 15:17:18 -05:00
Mark DePristo	b9a33d3c66	Split original and optimized ART into largely independent pieces -- Allows us to cleanly run old and new art, which now have different traversal behavior (on purpose). Split unit tests as well.	2013-01-11 15:17:18 -05:00
Mark DePristo	02130dfde7	Cleanup ART -- Initialize routine captures essential information for running the traversal	2013-01-11 15:17:17 -05:00
Mark DePristo	9b2be795a7	Initial working version of new ActiveRegionTraversal based on the LocusIteratorByState read stream -- Implemented as a subclass of TraverseActiveRegions -- Passes all unit tests -- Will be very slow -- needs logical fixes	2013-01-11 15:17:17 -05:00
Mark DePristo	8b83f4d6c7	Near final cleanup of PileupElement -- All functions documented and unit tested -- New constructor interface -- Cleanup some uses of old / removed functionality	2013-01-11 15:17:17 -05:00
Mark DePristo	fb9eb3d4ee	PileupElement and LIBS cleanup -- function to create pileup elements in AlignmentStateMachine and LIBS -- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing	2013-01-11 15:17:17 -05:00
Mark DePristo	2f2a592c8e	Contracts and documentation for AlignmentStateMachine and LocusIteratorByState -- Add more unit tests for both as well	2013-01-11 15:17:17 -05:00
Mark DePristo	cc1d259cac	Implement get Length and Bases of OfImmediatelyFollowingIndel in PileupElement -- Added unit tests for this behavior. Updated users of this code	2013-01-11 15:17:17 -05:00
Mark DePristo	2c38310868	Create LIBS using new AlignmentStateMachine infrastructure -- Optimizations to AlignmentStateMachine -- Properly count deletions. Added unit test for counting routines -- AlignmentStateMachine.java is no longer recursive -- Traversals now use new LIBS, not the old one	2013-01-11 15:17:17 -05:00
Mark DePristo	80d9b7011c	Complete rewrite of low-level machinery of LIBS, not hooked up -- AlignmentStateMachine does what SAMRecordAlignmentState should really do. It's correct in that it's more accurate than the LIB_position tests themselves. This is a non-broken, correct implementation. Needs cleanup, contracts, etc. -- This version is like 6x slower than the original implementation (according to the google caliper benchmark here). Obvious optimizations for future commit	2013-01-11 15:17:16 -05:00
Mark DePristo	0ac4352614	LIBS can now (optionally) track the unique reads it uses from the underlying read iterator -- This capability is essential to provide an ordered set of used reads to downstream users of LIBS, such as ART, who want an efficient way to get the reads used in LIBS -- Vastly expanded the multi-read, multi-sample LIBS unit tests to make sure this capability is working -- Added createReadStream to ArtificialSAMUtils that makes it relatively easy to create multi-read, multi-sample read streams for testing	2013-01-11 15:17:16 -05:00
Mark DePristo	b3ecfbfce8	Refactor LIBS into component parts, expand unit tests, some code cleanup -- Split out all of the inner classes of LIBS into separate independent classes -- Split / add unit tests for many of these components. -- Radically expand unit tests for SAMRecordAlignmentState (the lowest level piece of code) making sure at least some of it works -- No need to change unit tests or integration tests. No change in functionality. -- Added (currently disabled) code to track all submitted reads to LIBS, but this isn't accessible or tested	2013-01-11 15:17:16 -05:00
Mark DePristo	2e5d38fd0e	Updating to latest google caliper code	2013-01-11 15:17:16 -05:00
Mark DePristo	b2990497e2	Refactor LIBS into utils.locusiterator before refactoring	2013-01-11 15:17:16 -05:00
Mauricio Carneiro	2a4ccfe6fd	Updated all JAVA file licenses accordingly GSATDG-5	2013-01-10 17:06:41 -05:00
Eric Banks	4fa439d89e	Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependancies now	2013-01-09 11:06:10 -05:00
Eric Banks	b099e2b4ae	Moving integration tests to protected	2013-01-08 09:34:08 -05:00
Eric Banks	35d9bd377c	Moved (nearly) all Walkers from public to protected and removed GATKLite utils	2013-01-07 14:42:40 -05:00
Eric Banks	b4e7b3d691	Fixed precision problem in the Bayesian calculation of Qemp: we need to cap below max integer because the MathUtils code add +1. Added unit tests for handling large number of observations.	2013-01-07 13:07:36 -05:00
Eric Banks	ef638489d5	Fixing BQSR gatherer test to keep up to date with latest changes	2013-01-06 14:07:59 -05:00
Eric Banks	52067f0549	Handle merge conflicts	2013-01-06 12:29:12 -05:00
Eric Banks	bf25e151ff	Handle long->int precision in Bayesian estimate	2013-01-06 12:26:32 -05:00
Mark DePristo	b403c269e9	Make multi-threaded progress meter daemon unit test more robust	2013-01-05 12:59:18 -05:00
Mark DePristo	69bf70c42e	Cleanup and more unit tests for RecalibrationTables in BQSR -- Added unit tests for combining RecalibrationTables. As a side effect now has serious tests for incrementDatumOrPutIfNecessary -- Removed unnecessary enum.index system from RecalibrationTables. -- Moved what were really static utility methods out of RecalibrationEngine and into RecalUtils.	2013-01-05 12:50:27 -05:00
Chris Hartl	9df30880cb	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-04 17:15:22 -05:00
Joel Thibault	01738e70c3	Archive the experimental Active Region Traversals	2013-01-04 17:05:31 -05:00
Chris Hartl	41bc416b65	Remove AAL and update MD5s.	2013-01-04 16:46:14 -05:00
Eric Banks	bce6fce58d	Resolving merge conflicts after Mark's latest push	2013-01-04 14:46:39 -05:00
Eric Banks	dd7f5e2be7	Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests.	2013-01-04 14:43:11 -05:00
Joel Thibault	ab5526b372	More TODOs	2013-01-04 14:09:02 -05:00
Tad Jordan	fe06912a87	Removed sorting by row from walkers	2013-01-04 11:52:33 -05:00
Mark DePristo	810e2da1d4	Cleanup and unit tests for EventType and ReadRecalibrationInfo in BQSR -- Added unit tests for EventType and ReadRecalibrationInfo -- Simplified interface of EventType. Previously this enum carried an index with it, but this is redundant with the enum.ordinal function. Now just using that function instead.	2013-01-04 11:39:25 -05:00
Mark DePristo	1ba8d47a81	Unit tests for ProgressMeterDaemon	2013-01-04 11:39:24 -05:00
Mark DePristo	fbee4c11f1	Unit tests for ProgressMeterData	2013-01-04 11:39:23 -05:00
Joel Thibault	319d651e4a	Initial updates for ActiveRegionShard	2013-01-03 17:00:13 -05:00
Joel Thibault	e7553545ef	Initial updates for ReadShard	2013-01-03 17:00:13 -05:00
Joel Thibault	14a3ac0e3c	Enable the use of alternate shards	2013-01-03 17:00:13 -05:00
Joel Thibault	47e620dfbc	Create BAM index to test shard boundaries	2013-01-03 17:00:12 -05:00
Tad Jordan	c1ba12d71a	Added unit test for outputting sorted GATKReport Tables - Made few small modifications to code - Replaced the two arguments in GATKReportTable constructor with an enum used to specify way of sorting the table	2013-01-03 16:53:59 -05:00
Joel Thibault	dcb7735d3c	Active Region extensions must stay on contig	2013-01-02 14:46:24 -05:00
Chris Hartl	09199366b7	Merge branch 'master' of github.com:broadinstitute/gsa-unstable	2013-01-02 14:44:49 -05:00
Chris Hartl	e1d09ab0db	QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests should all pass.	2013-01-02 14:41:29 -05:00
Joel Thibault	a15f368bdc	Re-enable testIsActiveRangeLow/High	2013-01-02 11:57:50 -05:00
Mark DePristo	12f4c6307e	AutoFormattingTime cleanup and complete unittests -- Underlying system now uses long nano times to be more consistent with standard java practice -- Updated a few places in the code that were converting from nanoseconds to double seconds to use the new nanoseconds interface directly -- Bringing us to 100% test coverage with clover with AutoFormattingTimeUnitTest	2013-01-02 11:29:25 -05:00
Joel Thibault	429567cd3f	Rename to TraverseActiveRegionsUnitTest	2013-01-01 19:20:30 -05:00
Joel Thibault	57d38aac8a	Temporarily disable due to unknown contracts problem	2013-01-01 19:20:04 -05:00
Joel Thibault	7748b3816f	Delete the test BAI file as well as the BAM	2013-01-01 19:20:02 -05:00
Joel Thibault	5afeb465aa	TODOs	2013-01-01 19:19:17 -05:00
Eric Banks	75d5b88a3d	Enabling the Recal Report unit test (which looks like it was never ever enabled)	2012-12-26 15:35:50 -05:00
Mark DePristo	04cc75aaec	Minor cleanup and expansion of the RecalDatum unit tests	2012-12-24 13:35:58 -05:00
Mark DePristo	7bf1f67273	BQSR optimization: read group x quality score calibration table is thread-local -- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table. Radically reduces the contention for RecalDatum in this very highly used table -- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables. Updated combine in RecalibrationReport to use this table combiner function -- Made several core functions in RecalDatum into final methods for performance -- Added RecalibrationTestUtils, a home for recalibration testing utilities	2012-12-24 13:35:58 -05:00
Mark DePristo	7d250a789a	ArtificialReadPileupTestProvider now creates GATKSamRecords with good header values	2012-12-24 13:35:57 -05:00
Mark DePristo	295455eee2	NanoScheduler optimizations and simplification -- The previous model was to enqueue individual map jobs (with a resolution of 1 map job per map call), to track the number of map calls submitted via a counter and a semaphore, and to use this information in each map job and reduce to control the number of map jobs, when reduce was complete, etc. All hideously complex. -- This new model is vastly simply. The reducer basically knows nothing about the control mechanisms in the NanoScheduler. It just supports multi-threaded reduce. The NanoScheduler enqueues exactly nThread jobs to be run, which continually loop reading, mapping, and reducing until they run out of material to read, when they shut down. The master thread of the NS just holds a CountDownLatch, initialized to nThreads, and when each thread exits it reduces the latch by 1. The master thread gets the final reduce result when its free by the latch reaching 0. It's all super super simple. -- Because this model uses vastly fewer synchronization primitives within the NS itself, it's naturally much faster at getting things done, without any of the overhead obvious in profiles of BQSR -nct 2.	2012-12-24 13:35:57 -05:00
Mark DePristo	bf81db40f7	NanoScheduler reducer optimizations -- reduceAsMuchAsPossible no longer blocks threads via synchronization, but instead uses an explicit lock to manage access. If the lock is already held (because some thread is doing reduce) then the thread attempting to reduce immediately exits the call and continues doing productive work. They removes one major source of blocking contention in the NanoScheduler	2012-12-24 13:35:57 -05:00
Mark DePristo	161487b4a4	MapResult compareTo() is now unit tested -- Thanks clover!	2012-12-24 13:35:57 -05:00
Mark DePristo	7796ba7601	Minor optimizations for NanoScheduler -- Reducer.maybeReleaseLatch is no longer synchronized -- NanoScheduler only prints progress every 100 or so map calls	2012-12-24 13:35:56 -05:00
Mark DePristo	0f04485c24	NanoScheduler optimization: don't use a PriorityBlockingQueue for the MapResultsQueue -- Created a separate, limited interface MapResultsQueue object that previously was set to the PriorityBlockingQueue. -- The MapResultsQueue is now backed by a synchronized ExpandingArrayList, since job ids are integers incrementing from 0 to N. This means we avoid the n log n sort in the priority queue which was generating a lot of cost in the reduce step -- Had to update ReducerUnitTest because the test itself was brittle, and broken when I changed the underlying code. -- A few bits of minor code cleanup through the system (removing unused constructors, local variables, etc) -- ExpandingArrayList called ensureCapacity so that we increase the size of the arraylist once to accommodate the upcoming size needs	2012-12-24 13:35:56 -05:00
Tad Jordan	b491c177ff	Added functionality of outputting sorted GATKReport Tables - Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables - Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables	2012-12-20 14:02:21 -05:00
David Roazen	07b369ca7e	Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package This is an intermediate commit so that there is a record of these changes in our commit history. Next step is to isolate the test classes as well, and then move the entire package to the Picard repository and replace it with a jar in our repo. -Removed all dependencies on org.broadinstitute.sting (still need to do the test classes, though) -Had to split some of the utility classes into "GATK-specific" vs generic methods (eg., GATKVCFUtils vs. VCFUtils) -Placement of some methods and choice of exception classes to replace the StingExceptions and UserExceptions may need to be tweaked until everyone is happy, but this can be done after the move.	2012-12-19 10:25:22 -05:00
Mark DePristo	1ca13f9581	Fundamentally better model for the NanoScheduler -- Now each map job reads a value, performs map, and does as much reducing as possible. This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc. All of this is accomplished using exactly NCT% of the CPU of the machine. -- Has the additional value of actually simplifying the code -- Resolves a long-standing annoyance with the nano scheduler.	2012-12-19 09:31:31 -05:00
Joel Thibault	a29df3e094	oops	2012-12-18 19:03:12 -05:00
Joel Thibault	ee22c1bf44	More TODOs	2012-12-18 18:47:43 -05:00
Joel Thibault	2b1db519d7	Add reads which overstep a boundary by a single base	2012-12-18 18:47:43 -05:00
Joel Thibault	9828b2990f	Reads off the end of a contig fail SAM validation when using actual BAMs	2012-12-18 18:47:43 -05:00
Joel Thibault	72e2394b26	Create actual BAM	2012-12-18 18:47:43 -05:00
Joel Thibault	d69d1f8988	Fun with varargs	2012-12-18 18:47:42 -05:00
Joel Thibault	1158c1529f	Refactor region/read comparisons	2012-12-18 18:47:42 -05:00
Yossi Farjoun	19dd2d628a	some changes. some changes.	2012-12-14 17:21:32 -05:00
Eric Banks	696bf95fba	Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer).	2012-12-13 23:28:30 +00:00
Ami Levy-Moonshine	2f99569dda	change the md5 in one of the CV intergration tests, since it wasn't use the priority list when printing the origin of the annotation (the setValue field)	2012-12-10 22:48:15 -05:00
David Roazen	46edab6d6a	Use the new downsampling implementation by default -Switch back to the old implementation, if needed, with --use_legacy_downsampler -LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and the original LocusIteratorByState becomes LegacyLocusIteratorByState -Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer, with the old one renamed to LegacyReadShardBalancer -Performance improvements: locus traversals used to be 20% slower in the new downsampling implementation, now they are roughly the same speed. -Tests show a very high level of concordance with UG calls from the previous implementation, with some new calls and edge cases that still require more examination. -With the new implementation, can now use -dcov with ReadWalkers to set a limit on the max # of reads per alignment start position per sample. Appropriate value for ReadWalker dcov may be in the single digits for some tools, but this too requires more investigation.	2012-12-10 09:44:50 -05:00
Eric Banks	574d5b467f	Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype.	2012-12-09 02:09:34 -05:00
Mark DePristo	dbf721968d	PrintReads large-scale test to protect against another major low-level performance issue	2012-12-05 21:36:27 -05:00
Mark DePristo	465694078e	Major performance improvement to the GATK engine -- The NanoSchedule timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers. Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests. -- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x. For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement). -- Took this opportunity to improve the GATK ProgressMeter. NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress. This removes all inner loop calls to the GATK timers. -- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402	2012-12-05 14:49:22 -05:00
Mark DePristo	2b601571e7	Better error handling in NanoScheduler -- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown. Errors, like out of memory, would cause the whole system to die. This bugfix resolves that issue	2012-12-05 14:49:22 -05:00
Eric Banks	5fed9df295	Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh).	2012-12-03 12:18:20 -05:00
Eric Banks	b6839b3049	Added checking in the GATK for mis-encoded quality scores. The check is performed by a Read Transformer that samples (currently set to once every 1000 reads so that we don't hurt overall GATK performance) from the input reads and checks to make sure that none of the base quals is too high (> Q60). If we encounter such a base then we fail with a User Error. * Can be over-ridden with --allow_potentially_misencoded_quality_scores. * Also, the user can choose to fix his quals on the fly (presumably using PrintReads to write out a fixed bam) with the --fix_misencoded_quality_scores argument. Added unit tests.	2012-12-03 11:18:41 -05:00
Joel Thibault	c76c808268	Reads are required to be sorted - Remove the extended_only case because it's outside intervals	2012-11-28 13:59:58 -05:00
Joel Thibault	198923b597	Add ActiveRegionReadState handling	2012-11-28 13:59:57 -05:00
Mark DePristo	7e4b9c9e6e	Fix failing unit tests for VariantContextUtilsUnitTest -- Previous version was adding multiple samples with the same name to the variant context	2012-11-27 14:26:23 -05:00
Joel Thibault	9bfe39411e	Equal overlap should match right/later region	2012-11-27 13:03:13 -05:00
Joel Thibault	d83ad906ef	Add profile range contract	2012-11-27 13:03:13 -05:00
Joel Thibault	cc550b4145	Add a read and interval on a different contig	2012-11-27 13:03:13 -05:00
Eric Banks	9531e58445	Merged bug fix from Stable into Unstable	2012-11-27 11:00:50 -05:00
Eric Banks	4543ece088	Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case.	2012-11-27 11:00:33 -05:00
Eric Banks	a82ec7ad80	Merged bug fix from Stable into Unstable	2012-11-27 10:27:08 -05:00
Eric Banks	e199562c25	I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever).	2012-11-27 10:26:17 -05:00
Eric Banks	405f3c675d	Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc.	2012-11-27 01:07:00 -05:00
Eric Banks	4f7fa3009a	I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA.	2012-11-26 11:34:59 -05:00
Mark DePristo	48f271c5bd	Adding 80% support for multi-allelic variants -- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here. It's not perfect and needs some discussion on how to handle het/alt variants -- Adding splitInBiallelic funtion to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)	2012-11-21 17:24:59 -05:00
Joel Thibault	c68bc95db6	Initial read mapping tests - Failing tests are commented out	2012-11-21 17:16:46 -05:00
Joel Thibault	3ad9128800	Add some reads - Move intervals and reads to init - Update intervals and reads	2012-11-21 17:16:46 -05:00
Joel Thibault	3fa3b00f4a	Add ActiveRegion tests and refactor	2012-11-21 17:16:45 -05:00
Joel Thibault	e8defcb20d	Test multiple bases and intervals	2012-11-21 17:16:45 -05:00
Joel Thibault	c08b782743	Count isActive calls directly	2012-11-21 17:16:45 -05:00
Eric Banks	72e2d569c5	The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality.	2012-11-20 22:41:57 -05:00
Eric Banks	ff87642a91	Enable cycle covariate unit tests	2012-11-20 22:29:56 -05:00
Eric Banks	937ac7290f	Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results.	2012-11-20 16:13:29 -05:00
Joel Thibault	b70fd4a242	Initial testing of the Active Region Traversal contract - TODO: many more tests and test cases	2012-11-15 10:08:00 -05:00
Eric Banks	e9183d9fe0	Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec.	2012-11-08 15:07:47 -05:00
David Roazen	6185e8c432	Allow large-scale tests 5 hours each to run	2012-11-01 17:48:58 -04:00
Mark DePristo	872abddfce	Add custom TestNGTestTransformer that adds a maximum test runtime of 10 minutes to all testng tests -- Closes GSA-494 / Add maximum runtime for integration tests, running them in timeout thread -- Needed to debug locking issues -- Needed to debug excessively long running integrationtests -- Added build.xml maximum runtime for all testng tests of 10 hours. We will ultimately fail the build if it goes on for more than 10 hours	2012-11-01 15:34:12 -04:00
Mark DePristo	1444cd753b	Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome -- The logic for determining active regions was a bit broken in the HC when intervals were used in the system -- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR -- Non-overlapping intervals are always treated as separate objects for determing active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive -- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future. -- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC -- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions -- Remove excessively expensive calls to check bases are upper cased in ReferenceContext -- Update md5s after a lot of manual review and discussion with Ryan	2012-11-01 15:34:04 -04:00
Mark DePristo	9cd04c335c	Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data -- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases. This is now done by default, unless requested explicitly to preserve the original bases. -- This is really the correct place to do this for a variety of reasons. First, you don't need to work about upper casing bases throughout the code. Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves. Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase() -- Added unit tests to ensure this functionality works correct. -- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarentees that the reference bases are all upper case. -- Added contracts to ensure this is the case. -- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code	2012-11-01 15:34:03 -04:00
Eric Banks	47a0f5859e	Don't run these tests if not GAKT lite	2012-10-31 22:56:38 -04:00
Eric Banks	f8af8a2355	Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite.	2012-10-31 21:28:07 -04:00
Eric Banks	2aa28abe0a	Fixing md5s to reflect the new HapMap file	2012-10-30 14:27:10 -04:00
Eric Banks	b6a1967f12	Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it.	2012-10-29 21:47:09 -04:00
Eric Banks	43625f652e	Shoot, mixed up the md5s last time.	2012-10-27 19:43:46 -04:00
Eric Banks	682a72faf7	Hmm, thought I got all the md5s last time. Apparently not.	2012-10-26 16:10:12 -04:00
Mark DePristo	251983b8fb	Add GATK-wide command line argument to control the maximum runtime allowed for the GATK -- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit gracefully when the max. runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker as through the end of processing had been reached. This means that all output files are closed properly, etc. -- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update -- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling? -- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.	2012-10-26 13:18:34 -04:00
Eric Banks	ed11b7dab2	Fix UG parallelization test	2012-10-26 12:10:44 -04:00
Eric Banks	7a706ed345	Fix some of the broken integration tests	2012-10-26 11:23:44 -04:00
Eric Banks	ebebec7fdb	Accidentally left one test disabled	2012-10-26 02:15:32 -04:00
Eric Banks	a53e03d525	Do not let reduced reads get removed in the contamination down-sampling	2012-10-26 02:13:04 -04:00
Eric Banks	bf3d61ce82	The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it.	2012-10-26 01:04:51 -04:00
Eric Banks	91f2c847a3	Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test.	2012-10-26 00:57:40 -04:00
Eric Banks	e93ff3ea6e	Let's go back to having the SB/SLOD NOT computed by default. If you recall, it was only enabled by default because we thought we were going to use it when we made VQSR use random forests. But since we decided not to change VQSR, there's no reason to triple the computation for every variant site anymore.	2012-10-25 12:45:23 -04:00
Eric Banks	c53c55da12	Re-enable tests	2012-10-25 09:37:08 -04:00
Eric Banks	e6652f7777	Added integration test for contamination down-sampling	2012-10-25 09:36:05 -04:00
Mark DePristo	6e421a72d6	Add more exhaustive unit tests for input errors to NanoScheduler -- Resolves issue GSA-515 / Nanoscheduler GSA-605 / Seems that -nct may deadlock as not reproducible -- It seems that it's not an input error problem (or at least cannot be provoked with unit tests) -- I'll keep an eye on this later	2012-10-23 20:11:29 -04:00
Mark DePristo	f838815343	Updating MD5s for confidence ref site estimation in IndependentAllelesDiploidExactAFCalc -- Included logic to only add priors for alleles with sufficient evidence to be called polymorphic. If no alleles are poly make sure to add priors of first allele	2012-10-23 06:47:53 -04:00
Mark DePristo	15b28e61cd	Retiring TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine -- There's been no report of problems with the nano scheduled version of TraverseLoci and TraverseReads, so I'm removing the old versions since they are no longer needed -- Removing unnecessary intermediate base classes -- GSA-515 / Nanoscheduler GSA-549 / https://jira.broadinstitute.org/browse/GSA-549	2012-10-22 16:55:06 -04:00
Mark DePristo	90f59803fd	MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels -- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles -- All constructores, factory methods, and test builders and their users updated to provide just a single argument -- Updating MD5s for integration tests that change due to genotyping more alleles -- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic. That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele. The effect can be large (new qual = old qual / 2 in one case here). -- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating. This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime. Might be worth it through	2012-10-22 13:47:56 -04:00
Khalid Shakir	97dc3664c9	Fixed yet another NPE related to the ArgumentTypeDescriptor vs. ArgumentMatchValue. Added integration test based on GSA-621.	2012-10-22 12:05:32 -04:00
Mark DePristo	eb6c9a1a79	Disable EfficiencyMonitoringThreadFactoryUnitTest -- This is no longer a core GATK activity, and the tests need to run for so long (2 min each) that it's just too painful to run them. Should be re-eabled if we come to care about this capability again, or if we can run these tests all in parallel in the future.	2012-10-21 12:43:46 -04:00
Mark DePristo	d21e42608a	Updating integration tests for minor changes due to switching to EXACT_INDEPENDENT model by default	2012-10-21 12:43:46 -04:00
Mark DePristo	6b6caf8e3a	Bugfix for indel DP calculations using reduced reads -- Adding tests for SNP and indel calling on reduced BAM	2012-10-21 12:42:32 -04:00
Ryan Poplin	a647f1e076	Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore.	2012-10-20 16:38:18 -04:00
Ryan Poplin	b4e69239dd	In order to be considered an informative read in the PerReadAlleleLikelihoodMap it has to be informative compared to all other alleles not just the worst allele. Also, fixing a bug when there is only one allele in the map.	2012-10-18 14:31:15 -04:00
Mauricio Carneiro	b57df6cac8	Bringing CMI changes into the main GATK repo. Merge remote-tracking branch 'cmi/master'	2012-10-17 15:23:19 -04:00
Mark DePristo	9bcefadd4e	Refactor ExactCallLogger into a separate class -- Update minor integration tests with NanoSchedule due to qual accuracy update	2012-10-16 13:30:09 -04:00
Ryan Poplin	31be807664	Updating missed integration test.	2012-10-15 22:31:52 -04:00
Ryan Poplin	d27ae67bb6	Updating the multi-step UG integration test.	2012-10-15 22:30:01 -04:00
kshakir	213cc00abe	Refactored argument matching to support other plugins in addition to file lists. Added plugin support for sending Queue status messages. Argument parsing can store subclasses of java.io.File, for example RemoteFile.	2012-10-15 15:10:45 -04:00
Mauricio Carneiro	80d92e0c63	Allowing the GATK to have non-required outputs Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation). This change enables the nWayOut parameter for the IndeRealigner and ReduceReads to operate optionally while maintaining the original single way out. [#DEV-10 transition:31 resolution:1]	2012-10-15 13:49:08 -04:00
Ryan Poplin	25be94fbb8	Increasing the precision of MathUtils.approximateLog10SumLog10 from 1E-3 to 1E-4. Genotyper integration tests change as a result. Expanding the unit tests of MathUtils.log10sumLog10.	2012-10-15 13:24:32 -04:00
Mark DePristo	dcf8af42a8	Finalizing IndependentAllelesDiploidExactAFCalc -- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors -- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotypeEngine, where isNonRef really meant isRef. Not ideal. Finally caught by some tests, but good god it almost made it into the code -- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0 -- Massive new suite of unit tests to ensure that bi-allelic and tri-allele events are called properly with all models, and that the IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly. ID'd some of the bugs below -- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests -- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult has meaningless values in the likelihoods when no there we no informative GLs.	2012-10-15 08:21:03 -04:00
Christopher Hartl	6b9987cf1b	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-10-12 00:48:42 -04:00
Eric Banks	36a26a7da6	md5s failed because I forgot to add --no_cmdline_in_header so it is different depending on where you run from. Fixed.	2012-10-07 08:35:55 -04:00
Eric Banks	a5aaa14aaa	Fix for GSA-601: Indels dropped during liftover. This was a true bug that was an effect of the switch over to the non-null representation of alleles in the VariantContext. Unfortunately, this tool didn't have integration tests - but it does now.	2012-10-07 01:19:52 -04:00
Eric Banks	5d6aad67e2	Fix for bug reported on forums: VariantsToTable does not handle lists and nested arrays correctly. Added an integration test to cover printing of PLs.	2012-10-07 00:01:27 -04:00
Eric Banks	e7798ddd2a	Fix for JIRA GSA-598: AD field not handled properly by CombineVariants. It was also not handled by SelectVariants either. We now strip the AD field out whenever combining/selecting makes it invalid due to a changing of the number of ALT alleles.	2012-10-06 23:02:36 -04:00
Eric Banks	e8a6460a33	After merging with Yossi's fix I can confirm that the AD is fixed when going through the HC too. Added similar fixes to DP and FS annotations too.	2012-10-05 16:37:42 -04:00
Yossi Farjoun	dc4dcb4140	fixed AD annotation for a ReducedReads BAM file. Added an integration test for this case with a new reduced BAM in private/testdata	2012-10-05 14:20:07 -04:00
Mark DePristo	b6e20e083a	Copied DiploidExactAFCalc to placeholder OptimizedDiploidExact -- Will be removed. Only commiting now to fix public -> private dependency	2012-10-03 20:16:38 -07:00
Mark DePristo	f6a2ca6e7f	Fixes / TODOs for meaningful results with AFCalculationResult -- Right now the state of the AFCaclulationResult can be corrupt (ie, log10 likelihoods can be -Infinity). Forced me to disable reasonable contracts. Needs to be thought through -- exactCallsLog should be optional -- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.	2012-10-03 19:55:12 -07:00
Mark DePristo	50e4a832ea	Generalize framework for evaluating the performance and scaling of the ExactAF models to tri-allelic variants -- Wow, big performance problems with multi-allelic exact model!	2012-10-03 19:55:11 -07:00
Mark DePristo	17ca543937	More ExactModel cleanup -- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time. -- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live -- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs	2012-10-03 19:55:11 -07:00
Mark DePristo	f8ef4332de	Count the number of evaluations in AFResult; expand unit tests -- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples -- Added unittests for priors (flat and human) -- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)	2012-10-03 19:55:11 -07:00
Mark DePristo	33c7841c4d	Add tests for non-informative samples in ExactAFCalculationModel	2012-10-03 19:55:11 -07:00
Mark DePristo	de941ddbbe	Cleanup Exact model, better unit tests -- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic). More assert statements to ensure quality of the result. -- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts. Made mutation functions all protected -- No longer need to call reset on your AlleleFrequencyCalculationResult -- it'd done for you in the calculation function. reset is a protected method now, so it's all cleaner and nicer this way -- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef	2012-10-03 19:55:11 -07:00
Mark DePristo	3e01a76590	Clean up AlleleFrequencyCalculation classes -- Added a true base class that only does truly common tasks (like manage call logging) -- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract -- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact -- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation -- All unit tests pass	2012-10-03 19:55:11 -07:00
Christopher Hartl	1be8a88909	Changes: 1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest. 2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested. 3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test. 4) Updates to PlinkBed-handling python scripts and utilities. 5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.	2012-10-03 16:02:42 -04:00
Christopher Hartl	2508b0f5a7	Merged bug fix from Stable into Unstable	2012-09-29 00:57:43 -04:00
Christopher Hartl	365f1d2429	hmk123's error on the forum came from the reference context occasionally lacking bases needed for validating the reference bases in the variant context. (no @Window for VariantsToBinaryPed). This bugfix adresses this and other minor items: 1) ValidateVariants removed in favor of direct validation VariantContexts. Integration test added to test broken contexts. 2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases. 3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.	2012-09-29 00:55:31 -04:00
Christopher Hartl	55cdf4f9b7	Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release.	2012-09-27 00:13:32 -04:00
Eric Banks	74bb4e2739	Fixing the VariantContextUtilsUnitTest	2012-09-22 23:24:55 -04:00
Eric Banks	25e3ea879a	Oops, missed this test before when updating md5s	2012-09-22 22:16:35 -04:00
David Roazen	f6a22e5f50	ExperimentalReadShardBalancerUnitTest was being skipped; fixed TestNG skips tests when an exception occurs in a data provider, which is what was happening here. This was due to an AWFUL AWFUL use of a non-final static for ReadShard.MAX_READS. This is fine if you assume only one instance of SAMDataSource, but with multiple tests creating multiple SAMDataSources, and each one overwriting ReadShard.MAX_READS, you have a recipe for problems. As a result of this the test ran fine individually, but not as part of the unit test suite. Quick fix for now to get the tests running -- this "mutable static" interface should really be refactored away though, when I have time.	2012-09-22 01:56:39 -04:00
David Roazen	133085469f	Experimental, downsampler-friendly read shard balancer -Only used when experimental downsampling is enabled -Persists read iterators across shards, creating a new set only when we've exhausted the current BAM file region(s). This prevents the engine from revisiting regions discarded by the downsamplers / filters, as could happen in the old implementation. -SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip out all related code when the engine fork is collapsed. -Defensive implementation that assumes BAM file regions coming out of the BAM Schedule can overlap; should be able to improve performance if we can prove they cannot possibly overlap. -Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back once confidence in the code is gained	2012-09-21 22:17:58 -04:00
Mark DePristo	5d758bf97f	Better run a shorter test -- should take 3 minutes total	2012-09-20 18:54:14 -04:00
Mark DePristo	b5fa848255	Fix GSA-515 Nanoscheduler GSA-573 -nt and -nct interact badly w.r.t. output -- See https://jira.broadinstitute.org/browse/GSA-573 -- Uses InheritedThreadLocal storage so that children threads created by the NanoScheduler see the parent stubs in the main thread. -- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.	2012-09-20 18:45:16 -04:00
Mark DePristo	90b7df46cf	Add invocation count and shorter timeout to NanoSchedulerUnitTest	2012-09-20 18:45:16 -04:00
Mark DePristo	ba9e95a8fe	Revert "Reorganized NanoScheduler so that main thread does the reduces" Doesn't actually fix the problem, and adds an unnecessary delay in closing down NanoScheduler, so reverting. This reverts commit 66b820bf94ae755a8a0c71ea16f4cae56fd3e852.	2012-09-20 18:45:15 -04:00
Mark DePristo	7425ab9637	Reorganized NanoScheduler so that main thread does the reduces -- Enables us to run -nt 2 -nct 2 and get meaningful output -- Uses a sleep / poll mechanism. Not ideal -- will look into wait / notify instead.	2012-09-20 18:45:15 -04:00
Eric Banks	747694f7c2	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-20 14:14:58 -04:00
Eric Banks	1316b579f0	Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table.	2012-09-20 14:14:34 -04:00
Christopher Hartl	c492185be6	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-09-20 12:56:07 -04:00
Christopher Hartl	d25579deeb	A couple of minor things. 1) Better documentation on the meta data file for VariantsToBinaryPed with examples of each file type 2) MannWhitneyU can now take an argument on creation to turn off dithering. This pertains to JIRA-GSA-571 but does not fix it, as it isn't hooked up to the command line. Next step is to add an argument to the command line where it's accessible to the annotation classes (e.g. from either UG or the VariantAnnotator). 3) Added some dumb python scripts to deal with Plink files, and a script to convert plink binaries to VCF to help sanity check. Basically if you want to do an analysis on genotype data stored in plink binary format, your choices are: 1) Add a new module to Plink [difficulty rating: Impossible -- code obfuscation] 2) Steal plink parsing code from software (Plink/PlinkSeq/GCTA/Emacks/etc) that readds the files [difficulty rating: Oppressive -- code not modularized at all) 3) Write your own dumb stuff [difficutly rating: Annoying] What's been added is the result of 3. It's a library so nobody else has to do this, so long as they're comfortable with python.	2012-09-20 12:48:13 -04:00
Eric Banks	2e6f533996	Adding both unit and integration tests to cover the previous edge case of mismatched PLs	2012-09-20 11:55:28 -04:00
Mark DePristo	2267b722b2	Proper error handling in NanoScheduler -- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker -- ErrorTracker is now used throughout the NanoScheduler. In order to properly handle errors, the work previously done by main thread (submit jobs, block on reduce) is now handled in a separate thread. The main thread simply wakes up peroidically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately. -- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers -- Added NanoSchedulerUnitTest for input errors -- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option. This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results. -- Build.xml no longer copies the unittest results verbosely	2012-09-19 17:03:13 -04:00
Mark DePristo	773af05980	Intermediate commit for proper error handling in the NanoScheduler -- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler -- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled) -- Added unit tests for failing input iterator in NanoScheduler (fails) -- Made ErrorThrowing NanoScheduable	2012-09-19 17:03:13 -04:00
Mark DePristo	33fabb8180	Final V3 version of NanoScheduler -- Fixed basic bugs in tracking of input -> map -> reduce jobs -- Simplified classes -- Expanded unit tests	2012-09-19 17:03:12 -04:00
Mark DePristo	76027d17e6	Add a few more UnitTests for InputProducer -- Cleaned up function calls for clarity	2012-09-19 17:03:12 -04:00
Mark DePristo	7605c6bcc4	Done GSA-515 Nanoscheduler / GSA-557 V3 nanoScheduler algorithm -- V3 + V4 algorithm for NanoScheduler. The newer version uses 1 dedicated input thread and n - 1 map/reduce threads. These MapReduceJobs perform map and a greedy reduce. The main thread's only job is to shuttle inputs from the input producer thread, enqueueing MapReduce jobs for each one. We manage the number of map jobs now via a Semaphore instead of a BlockingQueue of fixed size. -- This new algorithm should consume N00% CPU power for -nct N value. -- Also a cleaner implementation in general -- Vastly expanded unit tests -- Deleted FutureValue and ReduceThread	2012-09-19 17:03:12 -04:00
Mark DePristo	69e418c3f5	Intermediate commit for v3 NanoScheduling algorithm -- This version works but it blocks much more than I'd expect on input. Merging v2 and v3 to make v4 now	2012-09-19 17:03:12 -04:00
Christopher Hartl	546586b70e	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-12 10:09:42 -04:00
Mark DePristo	91f3204534	VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself -- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere. -- Removed addMissingSamples from VariantcontextUtils, and calls to it -- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples -- Added unit tests for this behavior in VariantContextWritersUnitTest	2012-09-12 06:46:26 -04:00
Christopher Hartl	5d19fca649	A couple of bug-fixy changes. 1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:") ones if the user requested a sample that wasn't present in the VCF. The walker now checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning, and does the appropriate subsetting. Added integration tests for this. 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).	2012-09-11 23:01:00 -04:00
David Roazen	6fad0f25bb	Merge Eric's LocusIteratorByStateUnitTest changes into LocusIteratorByStateExperimentalUnitTest	2012-09-11 10:47:09 -04:00
Mark DePristo	e25e617d1a	Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency -- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use. So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.	2012-09-11 07:38:34 -04:00
Mark DePristo	2e94a0a201	Refactor TraversalEngine to extract the progress meter functions -- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance. The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves. -- Because the progress metering code has horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class. I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time. It should be possible to write some nice tests for it now. By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before -- Cleaned up the start up of the progress meter. It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer -- Simplified and made clear the interface for shutting down the traversal engines. There's no a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversing in over. Nano traversals now properly shut down (was subtle bug I undercovered here). The printing of on traversal done metering is now handled by MicroScheduler -- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N). -- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome. Useful for progress meter but also probably for other infrastructure as well -- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful. The generic bean is just a shell interface with nothing in it. -- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.	2012-09-10 20:14:13 -04:00
David Roazen	d2f3d6d22f	Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)" This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.	2012-09-10 15:52:39 -04:00
Menachem Fromer	0b717e2e2e	Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)	2012-09-10 15:32:41 -04:00
Eric Banks	ac8a4dfc2d	The comprehensive LIBS unit test is now truly comprehensive (or it would be if LIBS wasn't busted). The test can handle a read with any arbitrary legal CIGAR and iterates over the elements/bases in time with the real LIBS, failing if there are any differences. I've left the few hard-coded CIGARs in there for now with a note to move to all possible permutations once we move to fix LIBS (otherwise the tests would fail now).	2012-09-10 15:04:06 -04:00
Mark DePristo	f25bf0f927	EfficiencyMonitoringThreadFactoryUnitTests thing keeps timing out unnecessary	2012-09-07 11:03:00 -04:00
Mark DePristo	bf87de8a25	UnitTests for ReducerThread and InputProducer -- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order	2012-09-07 09:51:32 -04:00
Mark DePristo	8c0e3b1e0c	UnitTests for InputProducer	2012-09-07 09:15:16 -04:00
Mark DePristo	c503884958	GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper -- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry. -- Restructured the NS code for clarity. Docs everywhere. -- This is considered version 1.0	2012-09-07 09:15:16 -04:00
Mark DePristo	9d12935986	Intermediate commit for new hyper parallel NanoScheduler -- There's a logic bug now but I'll go to squash it...	2012-09-07 09:15:16 -04:00
David Roazen	cb84a6473f	Downsampling: experimental engine integration -Off by default; engine fork isolates new code paths from old code paths, so no integration tests change yet -Experimental implementation is currently BROKEN due to a serious issue involving file spans. No one can/should use the experimental features until I've patched this issue. -There are temporarily two independent versions of LocusIteratorByState. Anyone changing one version should port the change to the other (if possible), and anyone adding unit tests for one version should add the same unit tests for the other (again, if possible). This situation will hopefully be extremely temporary, and last only until the experimental implementation is proven.	2012-09-06 15:03:27 -04:00
Mark DePristo	5ab5d8dee8	Give EfficiencyMonitoringThreadFactoryUnitTest longer to complete its tests	2012-09-05 22:08:34 -04:00
Mark DePristo	1b064805ed	Renaming -cnt to -nct for consistency	2012-09-05 21:13:19 -04:00
Mark DePristo	228bac75e4	By default do only NT tests in integration tests	2012-09-05 20:57:49 -04:00
Mark DePristo	225f3a0ebe	Update integration test system to allow us to differentiate between testing data and cpu parallelism	2012-09-05 16:35:00 -04:00
Mark DePristo	a997c99806	Initial NanoScheduler with input producer thread	2012-09-05 15:45:24 -04:00

... 2 3 4 5 6 ...

1321 Commits (8a442d3c9f33dcda2e5ffececd06f0c5359b0df2)