gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Christopher Hartl	55cdf4f9b7	Commit changes in Variants To Binary Ped to the stable repository to be available prior to next release.	2012-09-27 00:13:32 -04:00
Eric Banks	74bb4e2739	Fixing the VariantContextUtilsUnitTest	2012-09-22 23:24:55 -04:00
Eric Banks	25e3ea879a	Oops, missed this test before when updating md5s	2012-09-22 22:16:35 -04:00
David Roazen	f6a22e5f50	ExperimentalReadShardBalancerUnitTest was being skipped; fixed TestNG skips tests when an exception occurs in a data provider, which is what was happening here. This was due to an AWFUL AWFUL use of a non-final static for ReadShard.MAX_READS. This is fine if you assume only one instance of SAMDataSource, but with multiple tests creating multiple SAMDataSources, and each one overwriting ReadShard.MAX_READS, you have a recipe for problems. As a result of this the test ran fine individually, but not as part of the unit test suite. Quick fix for now to get the tests running -- this "mutable static" interface should really be refactored away though, when I have time.	2012-09-22 01:56:39 -04:00
David Roazen	133085469f	Experimental, downsampler-friendly read shard balancer -Only used when experimental downsampling is enabled -Persists read iterators across shards, creating a new set only when we've exhausted the current BAM file region(s). This prevents the engine from revisiting regions discarded by the downsamplers / filters, as could happen in the old implementation. -SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip out all related code when the engine fork is collapsed. -Defensive implementation that assumes BAM file regions coming out of the BAM Schedule can overlap; should be able to improve performance if we can prove they cannot possibly overlap. -Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back once confidence in the code is gained	2012-09-21 22:17:58 -04:00
Mark DePristo	5d758bf97f	Better run a shorter test -- should take 3 minutes total	2012-09-20 18:54:14 -04:00
Mark DePristo	b5fa848255	Fix GSA-515 Nanoscheduler GSA-573 -nt and -nct interact badly w.r.t. output -- See https://jira.broadinstitute.org/browse/GSA-573 -- Uses InheritedThreadLocal storage so that children threads created by the NanoScheduler see the parent stubs in the main thread. -- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.	2012-09-20 18:45:16 -04:00
Mark DePristo	90b7df46cf	Add invocation count and shorter timeout to NanoSchedulerUnitTest	2012-09-20 18:45:16 -04:00
Mark DePristo	ba9e95a8fe	Revert "Reorganized NanoScheduler so that main thread does the reduces" Doesn't actually fix the problem, and adds an unnecessary delay in closing down NanoScheduler, so reverting. This reverts commit 66b820bf94ae755a8a0c71ea16f4cae56fd3e852.	2012-09-20 18:45:15 -04:00
Mark DePristo	7425ab9637	Reorganized NanoScheduler so that main thread does the reduces -- Enables us to run -nt 2 -nct 2 and get meaningful output -- Uses a sleep / poll mechanism. Not ideal -- will look into wait / notify instead.	2012-09-20 18:45:15 -04:00
Eric Banks	747694f7c2	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-20 14:14:58 -04:00
Eric Banks	1316b579f0	Bad news folks: BQSR scatter-gather was totally busted; you absolutely cannot trust any BQSR table that was a product of SG (for any version of BQSR). I fixed BQSR-gathering, rewrote (and enabled) the unit test, and confirmed that outputs are now identical whether or not SG is used to create the table.	2012-09-20 14:14:34 -04:00
Christopher Hartl	c492185be6	Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable	2012-09-20 12:56:07 -04:00
Christopher Hartl	d25579deeb	A couple of minor things. 1) Better documentation on the meta data file for VariantsToBinaryPed with examples of each file type 2) MannWhitneyU can now take an argument on creation to turn off dithering. This pertains to JIRA-GSA-571 but does not fix it, as it isn't hooked up to the command line. Next step is to add an argument to the command line where it's accessible to the annotation classes (e.g. from either UG or the VariantAnnotator). 3) Added some dumb python scripts to deal with Plink files, and a script to convert plink binaries to VCF to help sanity check. Basically if you want to do an analysis on genotype data stored in plink binary format, your choices are: 1) Add a new module to Plink [difficulty rating: Impossible -- code obfuscation] 2) Steal plink parsing code from software (Plink/PlinkSeq/GCTA/Emacks/etc) that readds the files [difficulty rating: Oppressive -- code not modularized at all) 3) Write your own dumb stuff [difficutly rating: Annoying] What's been added is the result of 3. It's a library so nobody else has to do this, so long as they're comfortable with python.	2012-09-20 12:48:13 -04:00
Eric Banks	2e6f533996	Adding both unit and integration tests to cover the previous edge case of mismatched PLs	2012-09-20 11:55:28 -04:00
Mark DePristo	2267b722b2	Proper error handling in NanoScheduler -- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker -- ErrorTracker is now used throughout the NanoScheduler. In order to properly handle errors, the work previously done by main thread (submit jobs, block on reduce) is now handled in a separate thread. The main thread simply wakes up peroidically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately. -- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers -- Added NanoSchedulerUnitTest for input errors -- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option. This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results. -- Build.xml no longer copies the unittest results verbosely	2012-09-19 17:03:13 -04:00
Mark DePristo	773af05980	Intermediate commit for proper error handling in the NanoScheduler -- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler -- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled) -- Added unit tests for failing input iterator in NanoScheduler (fails) -- Made ErrorThrowing NanoScheduable	2012-09-19 17:03:13 -04:00
Mark DePristo	33fabb8180	Final V3 version of NanoScheduler -- Fixed basic bugs in tracking of input -> map -> reduce jobs -- Simplified classes -- Expanded unit tests	2012-09-19 17:03:12 -04:00
Mark DePristo	76027d17e6	Add a few more UnitTests for InputProducer -- Cleaned up function calls for clarity	2012-09-19 17:03:12 -04:00
Mark DePristo	7605c6bcc4	Done GSA-515 Nanoscheduler / GSA-557 V3 nanoScheduler algorithm -- V3 + V4 algorithm for NanoScheduler. The newer version uses 1 dedicated input thread and n - 1 map/reduce threads. These MapReduceJobs perform map and a greedy reduce. The main thread's only job is to shuttle inputs from the input producer thread, enqueueing MapReduce jobs for each one. We manage the number of map jobs now via a Semaphore instead of a BlockingQueue of fixed size. -- This new algorithm should consume N00% CPU power for -nct N value. -- Also a cleaner implementation in general -- Vastly expanded unit tests -- Deleted FutureValue and ReduceThread	2012-09-19 17:03:12 -04:00
Mark DePristo	69e418c3f5	Intermediate commit for v3 NanoScheduling algorithm -- This version works but it blocks much more than I'd expect on input. Merging v2 and v3 to make v4 now	2012-09-19 17:03:12 -04:00
Christopher Hartl	546586b70e	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-12 10:09:42 -04:00
Mark DePristo	91f3204534	VCF/BCF writers once again automatically write out no-call genotypes for samples in the VCFHeader but not in the VC itself -- Turns out this was consuming 30% of the UG runtime, and causing problems elsewhere. -- Removed addMissingSamples from VariantcontextUtils, and calls to it -- Updated VCF / BCF writers to automatically write out a diploid no call for missing samples -- Added unit tests for this behavior in VariantContextWritersUnitTest	2012-09-12 06:46:26 -04:00
Christopher Hartl	5d19fca649	A couple of bug-fixy changes. 1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:") ones if the user requested a sample that wasn't present in the VCF. The walker now checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning, and does the appropriate subsetting. Added integration tests for this. 2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).	2012-09-11 23:01:00 -04:00
David Roazen	6fad0f25bb	Merge Eric's LocusIteratorByStateUnitTest changes into LocusIteratorByStateExperimentalUnitTest	2012-09-11 10:47:09 -04:00
Mark DePristo	e25e617d1a	Fixes GSA-515 Nanoscheduler GSA-560 / Fix display of NanoScheduler and MonitoringEfficiency -- Now prints out a single combined NanoScheduler runtime profile report across all nano schedulers in use. So now if you run with -nt 4 you'll get one combined NanoScheduler profiler across all 4 instances of the NanoScheduler within TraverseXNano.	2012-09-11 07:38:34 -04:00
Mark DePristo	2e94a0a201	Refactor TraversalEngine to extract the progress meter functions -- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance. The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves. -- Because the progress metering code has horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class. I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time. It should be possible to write some nice tests for it now. By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before -- Cleaned up the start up of the progress meter. It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer -- Simplified and made clear the interface for shutting down the traversal engines. There's no a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversing in over. Nano traversals now properly shut down (was subtle bug I undercovered here). The printing of on traversal done metering is now handled by MicroScheduler -- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N). -- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome. Useful for progress meter but also probably for other infrastructure as well -- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful. The generic bean is just a shell interface with nothing in it. -- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.	2012-09-10 20:14:13 -04:00
David Roazen	d2f3d6d22f	Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)" This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.	2012-09-10 15:52:39 -04:00
Menachem Fromer	0b717e2e2e	Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)	2012-09-10 15:32:41 -04:00
Eric Banks	ac8a4dfc2d	The comprehensive LIBS unit test is now truly comprehensive (or it would be if LIBS wasn't busted). The test can handle a read with any arbitrary legal CIGAR and iterates over the elements/bases in time with the real LIBS, failing if there are any differences. I've left the few hard-coded CIGARs in there for now with a note to move to all possible permutations once we move to fix LIBS (otherwise the tests would fail now).	2012-09-10 15:04:06 -04:00
Mark DePristo	f25bf0f927	EfficiencyMonitoringThreadFactoryUnitTests thing keeps timing out unnecessary	2012-09-07 11:03:00 -04:00
Mark DePristo	bf87de8a25	UnitTests for ReducerThread and InputProducer -- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order	2012-09-07 09:51:32 -04:00
Mark DePristo	8c0e3b1e0c	UnitTests for InputProducer	2012-09-07 09:15:16 -04:00
Mark DePristo	c503884958	GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper -- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry. -- Restructured the NS code for clarity. Docs everywhere. -- This is considered version 1.0	2012-09-07 09:15:16 -04:00
Mark DePristo	9d12935986	Intermediate commit for new hyper parallel NanoScheduler -- There's a logic bug now but I'll go to squash it...	2012-09-07 09:15:16 -04:00
David Roazen	cb84a6473f	Downsampling: experimental engine integration -Off by default; engine fork isolates new code paths from old code paths, so no integration tests change yet -Experimental implementation is currently BROKEN due to a serious issue involving file spans. No one can/should use the experimental features until I've patched this issue. -There are temporarily two independent versions of LocusIteratorByState. Anyone changing one version should port the change to the other (if possible), and anyone adding unit tests for one version should add the same unit tests for the other (again, if possible). This situation will hopefully be extremely temporary, and last only until the experimental implementation is proven.	2012-09-06 15:03:27 -04:00
Mark DePristo	5ab5d8dee8	Give EfficiencyMonitoringThreadFactoryUnitTest longer to complete its tests	2012-09-05 22:08:34 -04:00
Mark DePristo	1b064805ed	Renaming -cnt to -nct for consistency	2012-09-05 21:13:19 -04:00
Mark DePristo	228bac75e4	By default do only NT tests in integration tests	2012-09-05 20:57:49 -04:00
Mark DePristo	225f3a0ebe	Update integration test system to allow us to differentiate between testing data and cpu parallelism	2012-09-05 16:35:00 -04:00
Mark DePristo	a997c99806	Initial NanoScheduler with input producer thread	2012-09-05 15:45:24 -04:00
Mark DePristo	03dd470ec1	Test for progressFunction in NanoScheduler; bugfix for single threaded fast path	2012-09-05 15:45:23 -04:00
Mark DePristo	8cdeb51b78	Cleanup printProgress in TraversalEngine -- Separate updating cumulative traversal metrics from printing progress. There's now an updateCumulativeMetrics function and a printProgress() that only takes a current position -- printProgress now soles relies on the time since the last progress to decide if it will print or not. No longer uses the number of cycles, since this isn't reliable in the case of nano scheduling -- GenomeAnalysisEngine now maintains a pointer to the master cumulative metrics. getCumulativeMetrics never returns null, which was handled in some parts of the code but not others. -- Update all of the traversals to use the new updateCumulativeMetrics, printProgress model -- Added progress callback to nano scheduler. Every bufferSize elements this callback is invoked, allowing us to smoothly update the progress meter in the NanoScheduler -- Rename MapFunction to NanoSchedulerMap and the same for reduce.	2012-09-05 15:45:23 -04:00
Mark DePristo	6a5a70cdf1	Done GSA-539: SimpleTimer should use System.nanoTime for nanoSecond resolution	2012-09-05 15:45:23 -04:00
Mark DePristo	6055101df8	NanoScheduler no longer groups inputs, each map() call is interlaced now -- Maximizes the efficiency of the threads -- Simplifies interface (yea!) -- Reduces number of combinatorial tests that need to be performed	2012-09-05 15:45:22 -04:00
Mark DePristo	e3b4cc02aa	Done GSA-282: Unindexed traversals crash if a read goes off the end of a contig -- Already fixed in the codebase. Added unindexed bam and integration tests to ensure this is fine going forward.	2012-09-05 15:45:22 -04:00
Christopher Hartl	d795437202	- New UserExceptions added for when ReadFilters or Walkers specified on the command line are not found. When -rf xxxx cannot find the class corresponding to xxxx, all read filters are printed in a better formatted way, with links to their gatk docs. - VariantAnnotatorEngine changed to call genotype annotations even if pilups and allele -> likelihood mappings are not present. Current genotype annotations altered to check for null pilupes and null mappings.	2012-09-04 16:41:44 -04:00
Mark DePristo	0892f2b8b2	Closing GSA-287:LocusReferenceView doesn't do very well in the case where contigs land off the end of the reference -- Confirmed that reads spanning off the end of the chromosome don't cause an exception by adding integration test for a single read that starts 7 bases from the end of chromosome 1 and spans 90 bases or so off. Added pileup integration test to ensure this behavior continues to work	2012-09-03 20:18:56 -04:00
Mark DePristo	cf91d894e4	Fix build problems with tests	2012-08-31 13:42:41 -04:00
Eric Banks	ac0c44720b	I started to put together a set of unit tests for the PileupElement creation functionality of LocusIteratorByState and found pretty quickly that it's definitely still busted for indels. The data provider is nowhere near comprehensive yet, but I need to sit back and think about how to really test some of the functionality of LIBS. Committing what I have for now because at the very least it'll be helpful going forward (failing tests are commented out with TODO).	2012-08-30 22:49:13 -04:00
Mark DePristo	39400c56a9	Update md5s for VQSR, as VQSLOD is now a double and gets the standard double precision treatment in VCF	2012-08-30 19:41:49 -04:00
Mark DePristo	1212dfd2ef	Reduce the number of test combinations in ReadBasedREferenceOrderedView	2012-08-30 19:41:49 -04:00
Mark DePristo	7d95176539	Bugfix to compareTo and equals in GenomeLoc -- Yes, GenomeLoc.compareTo was broken. The compareTo function only considered the contig and start position, but not the stop, when comparing genome locs. -- Updated GenomeLoc.compareTo function to account for stop. Updated GATK code where necessary to fix resulting problems that depended on this. -- Added unit tests to ensure that hashcode, equals, and compareTo are all correct for GenomeLocs	2012-08-30 19:41:49 -04:00
Mark DePristo	21dd70ed36	Test to ensure that ReadBasedReferenceOrderedView produces stateless objects -- Stateless objects are required for nano-scheduling. This means you can take the RefMetaDataTracker provided by ReadBasedReferenceOrderedView, store it way, get another from the same view, and the original one behaves the same.	2012-08-30 10:15:11 -04:00
Mark DePristo	ce3d1f89ea	ReadShard are no longer allowed to span multiple contigs -- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads. The old implementation simply failed in this case. The new code handles this correctly by forcing shards to have all of their data on a single contig. -- Added a PrintReads integration test to ensure this behavior is correct -- Adding test BAMs that have < 200 reads and span across contig boundaries	2012-08-30 10:15:11 -04:00
Mark DePristo	1200848bbf	Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers -- Deleted ReadMetaDataTracker -- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view	2012-08-30 10:15:10 -04:00
Mark DePristo	972be8b4a4	Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers -- ReadMetaDataTracker is dead! Long live the RefMetaDataTracker. Read walkers will soon just take RefMetaDataTracker objects. In this commit they take a class that trivially extends them -- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class. -- This new implementation produces thread-safe objects (i.e., holds no points to shared state). Suitable for use (to be tested) with nano scheduling -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely. -- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView -- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference. Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface. If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself. -- This commit breaks the new read walker BQSR, but Ryan knows this is coming -- Subsequent commit will be retiring / fixing ValidateRODForReads	2012-08-30 10:15:10 -04:00
Mark DePristo	8fc6a0a68b	Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker	2012-08-30 10:13:06 -04:00
Ryan Poplin	6d6ca090c6	RecalDatums now hold doubles so the test for equality needs an epsilon.	2012-08-28 16:00:52 -04:00
Eric Banks	67d348a31d	Retiring the alignment walkers and related integration test since we don't want to support them anymore.	2012-08-28 10:16:49 -04:00
Mark DePristo	0f4acaae1b	Update MD5s with new FS score	2012-08-28 08:06:47 -04:00
Mark DePristo	63a9ae817a	Ensure thread-safety of CachingIndexedFastaSequenceFile -- Cosmetic cleanup of ReadReferenceView -- TraverseReadsNano provides the reference context, since it's thread-safe -- Cleanup CachingIndexedFastaSequenceFile. Add docs, remove unnecessary setters -- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.	2012-08-27 12:11:54 -04:00
Mark DePristo	e5b1f1c7f4	Add simple main function to unit test so we can run the nano scheduler test from the command line	2012-08-27 12:11:54 -04:00
Mark DePristo	faacacd6c0	Increase runtime of nano scheduler tests to 1 min	2012-08-26 08:42:58 -04:00
Mark DePristo	846e0c11bc	Add TimeOuts to new threading tests, in case there's a underlying deadlock	2012-08-26 08:18:43 -04:00
Mark DePristo	275a5e5439	More tests for NanoScheduler -- Add more contracts -- Test in the UnitTest that the reduce is being called in the correct order	2012-08-25 17:21:11 -04:00
Christopher Hartl	db2e88c7cb	Fix for badIndelLength() throwing NPE at non-indel sites. Added integration test.	2012-08-25 12:38:23 -07:00
Mark DePristo	9de8077eeb	Working (efficient?) implementation of NanoScheduler -- Groups inputs for each thread so that we don't have one thread execution per map() call -- Added shutdown function -- Documentation everywhere -- Code cleanup -- Extensive unittests -- At this point I'm ready to integrate it into the engine for CPU parallel read walkers	2012-08-24 15:34:23 -04:00
Mark DePristo	d6e6b30caf	Initial implementation of GSA-515: Nanoscheduler – Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficiency using resources to do sum of 2 * (sum(1 - X)). Done! CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement. Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of sequent available work units. May be worth measuring expected cost for read read / map / reduce unit and use it to balance the compute As input is single threaded just need one thread to populate inputs, which runs as fast as possible on parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job. Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks	2012-08-24 14:07:44 -04:00
Christopher Hartl	f1166d6d00	Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case.	2012-08-23 11:43:19 -07:00
Mark DePristo	63af0cbcba	Cleanup GATK efficiency monitor classes -- Invert logic in GATKArgumentCollection to disable monitoring, not enable. That means monitoring is on by default -- Fix testing error in unit tests -- Rename variables in ThreadAllocation to be clearer	2012-08-22 16:48:02 -04:00
Mark DePristo	e1293f0ef2	GSA-507: Thread monitoring refactored so it can work without a thread factory -- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory. -- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode -- MicroScheduler now handles management of the efficiency monitor. Includes master thread in monitor, meaning that reduce is now included for both schedulers	2012-08-22 16:48:01 -04:00
Mark DePristo	f876c51277	Separately track time spent doing user and system CPU work -- Allows us to ID (by proxy) time spent doing IO -- Refactor StateMonitoryingThreadFactory to use it's own enum, not Thread.State -- Reliable unit tests across mac and unix	2012-08-22 16:48:01 -04:00
Guillermo del Angel	1aa856e0e3	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-22 15:53:47 -04:00
Guillermo del Angel	e29469eeeb	Forgot to update 2 integration test md5's (in this cases, changes are legit because of the code revamp of AD, it's simpler if AD is not output when a site is not variant, as genotype DP conveys the same information)	2012-08-22 15:53:33 -04:00
Eric Banks	2409aa9bfd	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-22 12:54:43 -04:00
Eric Banks	94540ccc27	Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case.	2012-08-22 12:54:29 -04:00
Guillermo del Angel	901f47d8af	Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still	2012-08-22 11:38:51 -04:00
Eric Banks	c7ce3e1cf5	Merged bug fix from Stable into Unstable	2012-08-22 00:24:40 -04:00
Eric Banks	03017855e4	WTF - why is support for whole-read insertions all messed up in LIBS? I've pushed a temporary patch for now (the right solution should certainly not be implemented in stable; LIBS needs to be better thought out). Added another unit test.	2012-08-22 00:24:01 -04:00
Christopher Hartl	ba8622ff0d	number of stashed changes are lurking in here. In order of importance: - Fix for M_Trieb's error report on the forum, and addition of integration tests to cover the walker. - Addition of StructuralIndel as a class of variation within the VariantContext. These are for variants with a full alt allele that's >150bp in length. - Adaptation of the MVLikelihoodRatio to work for a set of trios (takes the max over the trios of the MVLR) - InsertSizeDistribution changed to use the new gatk report output (it was previously broken) - RetrogeneDiscovery changed to be compatible with the new gatk report - A maxIndelSize argument added to SelectVariants - ByTranscriptEvaluator rewritten for cleanliness - VariantRecalibrator modified to not exclude structural indels from recalibration if the mode is INDEL - Documentation added to DepthOfCoverageIntegrationTest (no, don't yell at chartl ;_; ) Also sorry for the long commit history behind this that is the result of fixing merge conflicts. Because this also fixes a conflict (from git stash apply), for some reason I can't rebase all of them away. I'm pretty sure some of the commit notes say "this note isn't important because I'm going to rebase it anyway".	2012-08-21 07:08:58 -04:00
Eric Banks	40d5efc804	Fix for Adam K's reported bug: we weren't handling reads that were entirely insertions properly in LIBS. Specifically, the event bases were off-by-one (which was disasterous in Adam's case with a 1bp read). Added a unit test to cover this case.	2012-08-20 23:12:41 -04:00
Mark DePristo	9121b98167	CombineVariants outputs the first non-MISSING qual, not the maximum -- When merging multiple VCF records at a site, the combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value. The previous behavior was to take the max QUAL, which resulted in sometime strange downstream confusion.	2012-08-19 10:29:38 -04:00
Mauricio Carneiro	d16cb68539	Updated and more thorough version of the BadCigar read filter * No reads with Hard/Soft clips in the middle of the cigar * No reads starting with deletions (with or without preceding clips) * No reads ending in deletions (with or without follow-up clips) * No reads that are fully hard or soft clipped * No reads that have consecutive indels in the cigar (II, DD, ID or DI) Also added systematic test for good cigars and iterative test for bad cigars.	2012-08-17 17:05:27 -04:00
Eric Banks	611d9b61e2	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-16 13:05:36 -04:00
Eric Banks	2df04dc48a	Fix for performance problem in GGA mode related to previous --regenotype commit. Instead of trying to hack around the determination of the calculation model when it's not needed, just simply overload the calculateGenotypes() method to add one that does simple genotyping. Re-enabling the Pool Caller integration tests.	2012-08-16 13:05:17 -04:00
Mark DePristo	4e42988c66	GSA-485: Remove repairVCFHeader from GATK codebase -- Removed half-a*ssed attempt to automatically repair VCF files with bad headers, which allowed users to provide a replacement header overwriting the file's actually header on the fly. Not a good idea, really. Eric has promised to create a utility that walks through a VCF file and creates a meaningful header field based on the file's contents (if this ever becomes a priority)	2012-08-16 13:03:13 -04:00
Mark DePristo	a9a1c499fd	Update md5 in VariantRecalibrationWalkers test for BCF2 -- only encoding differences	2012-08-16 13:03:13 -04:00
Mark DePristo	c0a31b2e5b	CombineVariants parallel integration tests -- All tests but one (using old bad VCF3 input) run unmodified with parallel code. -- Disabled UNSAFE_VCF_PROCESSING for all but that test, which changes md5s because the output files have fixed headers -- Minor optimizations to simpleMerge	2012-08-15 21:13:16 -04:00
Mark DePristo	669c43031a	BCF2 optimizations; parallel CombineVariants -- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header. Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently) -- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils. Fixed bug in edge case that never occurred in practice -- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping its right) -- More ways to access the data in VCFHeader -- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output -- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary. We just guess that on average we need ~10 elements for the attribute map -- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet -- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement	2012-08-15 21:13:16 -04:00
Mark DePristo	dafa7e3885	Temporarily disable StateMonitoringThreadTests while I get them reliably working across platforms	2012-08-15 21:13:16 -04:00
Mark DePristo	d70fd18900	Minor increase in tolerance to sum of states in UnitTest for StateMonitoringThreadFactory	2012-08-15 21:13:15 -04:00
Mark DePristo	ae4d4482ac	Parallel combine variants! -- CombineVariants is now TreeReducible! -- Integration tests running in parallel all pass except one (will fix) due to incorrect use of db=0 flag on input from old VCF format	2012-08-15 21:13:15 -04:00
Mark DePristo	9459e6203a	Clean, documented implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates -- Expanded unit tests -- Support for clean logging of results to logger -- Refactored MyTime into AutoFormattingTime in Utils, out of TraversalEngine, for cleanliness and reuse -- Added docs and contracts to StateMonitoringThreadFactory	2012-08-15 21:13:15 -04:00
Mark DePristo	be3230a1fd	Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates -- Created makeCombinations utility function (very useful!). Moved template from VariantContextTestProvider -- UnitTests for basic functionality	2012-08-15 21:13:15 -04:00
Eric Banks	87e41c83c5	In AlleleCount stratification, check to make sure the AC (or MLEAC) is valid (i.e. not higher than number of chromosomes) and throw a User Error if it isn't. Added a test for bad AC.	2012-08-14 15:02:30 -04:00
Eric Banks	8e3774fb0e	Fixing behavior of the --regenotype argument in SelectVariants to properly run in GenotypeGivenAlleles mode. Added integration tests to cover recent SV changes.	2012-08-14 14:21:42 -04:00
Eric Banks	34b62fa092	Two changes to SelectVariants: 1) don't add DP INFO annotation if DP wasn't used in the input VCF (it was adding DP=0 previously). 2) If MLEAC or MLEAF is present in the original VCF and the number of samples decreases, remove those annotations from the VC.	2012-08-14 12:54:31 -04:00
Mark DePristo	aab417c94d	Fix missing argument in unittest	2012-08-12 13:58:14 -04:00
Ami Levy Moonshine	6fefdaf428	"update integration tests in CombineVariantsIntegrationTest" Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-10 17:00:35 -04:00

1 2 3 4 5 ...

1062 Commits (0a56fe5bc33f9dfd40e25005f2e037bc36e9ffdc)