gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Ryan Poplin	6d6ca090c6	RecalDatums now hold doubles so the test for equality needs an epsilon.	2012-08-28 16:00:52 -04:00
Mark DePristo	63a9ae817a	Ensure thread-safety of CachingIndexedFastaSequenceFile -- Cosmetic cleanup of ReadReferenceView -- TraverseReadsNano provides the reference context, since it's thread-safe -- Cleanup CachingIndexedFastaSequenceFile. Add docs, remove unnecessary setters -- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.	2012-08-27 12:11:54 -04:00
Mark DePristo	e5b1f1c7f4	Add simple main function to unit test so we can run the nano scheduler test from the command line	2012-08-27 12:11:54 -04:00
Mark DePristo	faacacd6c0	Increase runtime of nano scheduler tests to 1 min	2012-08-26 08:42:58 -04:00
Mark DePristo	846e0c11bc	Add TimeOuts to new threading tests, in case there's an underlying deadlock	2012-08-26 08:18:43 -04:00
Mark DePristo	275a5e5439	More tests for NanoScheduler -- Add more contracts -- Test in the UnitTest that the reduce is being called in the correct order	2012-08-25 17:21:11 -04:00
Mark DePristo	9de8077eeb	Working (efficient?) implementation of NanoScheduler -- Groups inputs for each thread so that we don't have one thread execution per map() call -- Added shutdown function -- Documentation everywhere -- Code cleanup -- Extensive unittests -- At this point I'm ready to integrate it into the engine for CPU parallel read walkers	2012-08-24 15:34:23 -04:00
Mark DePristo	d6e6b30caf	Initial implementation of GSA-515: Nanoscheduler -- Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficient at using resources to do sum of 2 * (sum(1 - X)). Done! CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement. Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator. Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of subsequent available work units. May be worth measuring expected cost for each read / map / reduce unit and using it to balance the compute. As input is single threaded, just need one thread to populate inputs, which runs as fast as possible in parallel, pushing data to a fixed-size queue. Each push creates a map job and links to the upcoming reduce job. Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks	2012-08-24 14:07:44 -04:00
Mark DePristo	63af0cbcba	Cleanup GATK efficiency monitor classes -- Invert logic in GATKArgumentCollection to disable monitoring, not enable. That means monitoring is on by default -- Fix testing error in unit tests -- Rename variables in ThreadAllocation to be clearer	2012-08-22 16:48:02 -04:00
Mark DePristo	e1293f0ef2	GSA-507: Thread monitoring refactored so it can work without a thread factory -- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory. -- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode -- MicroScheduler now handles management of the efficiency monitor. Includes master thread in monitor, meaning that reduce is now included for both schedulers	2012-08-22 16:48:01 -04:00
Mark DePristo	f876c51277	Separately track time spent doing user and system CPU work -- Allows us to ID (by proxy) time spent doing IO -- Refactor StateMonitoringThreadFactory to use its own enum, not Thread.State -- Reliable unit tests across mac and unix	2012-08-22 16:48:01 -04:00
Eric Banks	94540ccc27	Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case.	2012-08-22 12:54:29 -04:00
Mauricio Carneiro	d16cb68539	Updated and more thorough version of the BadCigar read filter * No reads with Hard/Soft clips in the middle of the cigar * No reads starting with deletions (with or without preceding clips) * No reads ending in deletions (with or without follow-up clips) * No reads that are fully hard or soft clipped * No reads that have consecutive indels in the cigar (II, DD, ID or DI) Also added systematic test for good cigars and iterative test for bad cigars.	2012-08-17 17:05:27 -04:00
Mark DePristo	4e42988c66	GSA-485: Remove repairVCFHeader from GATK codebase -- Removed half-a*ssed attempt to automatically repair VCF files with bad headers, which allowed users to provide a replacement header overwriting the file's actual header on the fly. Not a good idea, really. Eric has promised to create a utility that walks through a VCF file and creates a meaningful header field based on the file's contents (if this ever becomes a priority)	2012-08-16 13:03:13 -04:00
Mark DePristo	669c43031a	BCF2 optimizations; parallel CombineVariants -- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header. Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently) -- Cleanup collapseStringList and explodeStringList for new unit tests of BCF2Utils. Fixed bug in edge case that never occurred in practice -- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping it's right) -- More ways to access the data in VCFHeader -- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output -- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary. We just guess that on average we need ~10 elements for the attribute map -- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet -- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement	2012-08-15 21:13:16 -04:00
Mark DePristo	dafa7e3885	Temporarily disable StateMonitoringThreadTests while I get them reliably working across platforms	2012-08-15 21:13:16 -04:00
Mark DePristo	d70fd18900	Minor increase in tolerance to sum of states in UnitTest for StateMonitoringThreadFactory	2012-08-15 21:13:15 -04:00
Mark DePristo	9459e6203a	Clean, documented implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates -- Expanded unit tests -- Support for clean logging of results to logger -- Refactored MyTime into AutoFormattingTime in Utils, out of TraversalEngine, for cleanliness and reuse -- Added docs and contracts to StateMonitoringThreadFactory	2012-08-15 21:13:15 -04:00
Mark DePristo	be3230a1fd	Initial implementation of ThreadFactory that monitors running / blocking / waiting time of threads it creates -- Created makeCombinations utility function (very useful!). Moved template from VariantContextTestProvider -- UnitTests for basic functionality	2012-08-15 21:13:15 -04:00
Mark DePristo	aab417c94d	Fix missing argument in unittest	2012-08-12 13:58:14 -04:00
Mark DePristo	9a0dda71d4	BCF2 optimizations -- All low-level reads throw IOException instead of catching it directly. This allows us to not try/catch in readByte, improving performance by 5% or so -- Optimize encodeTypeDescriptor with final variables. Avoid using Math.min instead do inline comparison -- Inlined willOverflow directly in its single use	2012-08-09 16:36:18 -04:00
Mark DePristo	00858f16a6	Deleting empty unit test for AdaptiveContexts	2012-08-06 12:58:13 -04:00
Ryan Poplin	b8709d8c67	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-06 11:41:28 -04:00
Ryan Poplin	b7eec2fd0e	Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly.	2012-08-05 12:29:10 -04:00
Mark DePristo	e1bba91836	Ready for full-scale evaluation of adaptive BQSR contexts -- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees -- Docs, algorithm descriptions, etc so that it makes sense what's going on -- VisualizeContextTree should really be simplified into a single tool that just visualizes the trees when / if we decide to make adaptive contexts a standard part of BQSR -- Misc. cleaning, organization of the code (recalibration tests were in private but corresponding actual files were public)	2012-08-03 16:02:53 -04:00
Mark DePristo	fb5dabce18	Update BCF2 to include a minor version number so we can rev (and report errors) with BCF2 -- We are now likely to fail with an error when reading old BCF files, rather than just giving bad results -- Added new class BCFVersion that consolidates all of the version management of BCF	2012-08-02 17:30:30 -04:00
Mark DePristo	c3c3d18611	Update BCF2 to put PASS as offset 0 not at the end -- Unfortunately this commit breaks backward compatibility with all existing BCF2 files...	2012-08-01 17:09:22 -04:00
Mark DePristo	ccac77d888	Bugfix for incorrect allele counting in IndelSummary -- Previous version would count all alt alleles as present in a sample, even if only 1 were present, because of the way VariantEval subsetted VCs -- Updated code for subsetting VCs by sample to be clearer about how it handles rederiving alleles -- Update a few pieces of code to get previous correct behavior -- Updated a few MD5s as now ref calls at sites in dbSNP are counted as having a comp site, and therefore show up in known sites when Novelty strat is on (which I think is correct) -- Walkers that used old subsetting function with true are now using clearer version that does rederive alleles by default	2012-08-01 15:45:12 -04:00
Mark DePristo	e00ed8bc5e	Cleanup BQSR classes -- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration. It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers. As code becomes embedded throughout GATK it should be refactored to live in utils -- Removed unnecessary imports of BQSR in VQSR v3 -- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate -- Update PluginManager to sort the plugins and interfaces. This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.	2012-07-31 08:11:03 -04:00
Ryan Poplin	13591b169f	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-30 12:13:24 -04:00
Eric Banks	7630c929a7	Re-enabling the unit tests for reverse allele clipping	2012-07-29 22:24:56 -04:00
Eric Banks	b07bf1950b	Adding an integration test for another feature that I snuck in during a previous commit: we now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them (this had been turned off because the previous version used Strings to do the uppercasing whereas we stick with byte operations now).	2012-07-29 22:19:49 -04:00
Eric Banks	c4ae9c6cfb	With the new Allele representation we can finally handle complex events (because they aren't so complex anymore). One place this manifests itself is with the strict VCF validation (ValidateVariants used to skip these events but doesn't anymore) so I've added a new test with complex events to the VV integration test.	2012-07-29 19:22:02 -04:00
Eric Banks	99b15b2b3a	Final checkpoint: all tests pass. Note that there were bugs in the PoolGenotypeLikelihoodsUnitTest that needed fixing and eventually led to my needing to disable one of the tests (with a note for Guillermo to look into it). Also note that while I have moved over the GATK to use the new non-null representation of Alleles, I didn't remove all of the now-superfluous code throughout to do padding checking on merges; we'll need to do this on a subsequent push.	2012-07-29 01:07:59 -04:00
Eric Banks	2b1b00ade5	All integration tests and VC/Allele unit tests are passing	2012-07-27 17:03:49 -04:00
Eric Banks	27e7e11ec0	Allele refactoring checkpoint #3 : all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this.	2012-07-27 15:48:40 -04:00
Ryan Poplin	a0890126a8	ActiveRegionWalker's isActive function returns a results object now instead of just a double.	2012-07-27 11:01:39 -04:00
Eric Banks	baf3e33730	Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass.	2012-07-26 23:27:11 -04:00
Eric Banks	32516a2f60	Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels will fail at this point.	2012-07-26 01:50:39 -04:00
Mark DePristo	fcefa61bce	Remove reference dependence in BCF2Codec -- Adding BCF2Codec to VCF.jar and associated unit tests Signed-off-by: Mark DePristo <depristo@broadinstitute.org>	2012-07-25 08:56:38 -04:00
Mark DePristo	19a257a5c1	Multiple bugfixes -- VariantFiltration now properly sets passFilters in VC -- BCF2 writer now properly decodes lazy BCF genotype data that it uses. Improper use generated a horrible subtle bug but the good news is that the extra checks I put in (unnecessarily a few days ago) caught the bug! Signed-off-by: Mark DePristo <depristo@broadinstitute.org>	2012-07-25 08:56:38 -04:00
Mark DePristo	2ca5fc62a2	Support for MISSING BCF2 type -- Heng wants to use 0x0? to represent any missing type value, which in our implementation was invalid. Updated our codebase to support this construct. Heng said he'll update the BCF2 quick reference. -- Enabled integration test reading Heng's ex2.bcf file -- GATK now only warns in the case where the END info field isn't the same (or +1 due to padding) as the getEnd() function as determined by the GATK. Turns out there's a single record in the 1000G SV call set that doesn't have the right length -- VariantContextTestProvider now tests that X = Y where X -> writing -> reading -> writing -> reading = Y for a variety of variant context inputs X -- Added integration test reading 1000G SV chr1 calls (from Chris)	2012-07-19 16:14:26 -04:00
Eric Banks	f657b8bda8	Complete overhaul of the BQSRv2 integration tests. Much more comprehensive. Still need to deal with a few tests that need some modifications before I'm done, but I'll take care of that sometime tomorrow.	2012-07-17 00:32:34 -04:00
Mark DePristo	5b0ade67c8	Updates to VCF processing for better BCF processing -- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order]. Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy -- Updating GATK code to use the appropriate header function (this is why so many files have changed) -- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this) -- Bugfix for adding contig lines to BCF2 header dictionary -- VCFHeader metaData no longer sorted internally. The system now maintains the data in header order, and only sorts output as requested in API -- VCFWriter and BCF2Writer now explicitly sort their header lines -- Don't allow filters to be added that are PASS in the contract	2012-07-08 15:44:33 -07:00
Mauricio Carneiro	e93b025b39	Fixing unit test with the new clipping behavior for weird cigars, we no longer can assert the final number of bases in the unit test, so I'm taking this bit off the unit test.	2012-07-06 12:08:09 -04:00
Mauricio Carneiro	17efbbf8b1	Fixed ReadClipperUnitTest. The behavior of the clipping on weird cigar strings such as 1I1S1H and 9S56H has changed, and the test has to change accordingly.	2012-07-03 16:38:51 -04:00
Eric Banks	0b37d44b0d	Optimizations for the RecalDatum to make BQSR (Count Covariates) much faster. Needs some cleanup.	2012-07-03 13:05:11 -04:00
Eric Banks	031322ff00	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-07-03 00:12:59 -04:00
Eric Banks	a4670113bd	Refactored/renamed the nested integer array; cleaned up code a bit.	2012-07-03 00:12:33 -04:00
Mark DePristo	1b0a775773	Disabling bcf2 reading from samtools because it's 1 basis; updating select variants integrationtest	2012-07-02 15:55:42 -04:00
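
The BadCigar commit above (d16cb68539) lists five rules for rejecting a read by its CIGAR. They can be sketched as follows. This is a minimal illustration only, not the GATK filter itself: the class `BadCigarSketch` and its methods are hypothetical names, it works on the CIGAR string directly rather than on GATK or Picard objects, and it assumes the input is a syntactically well-formed CIGAR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the rules listed in commit d16cb68539; not GATK code. */
final class BadCigarSketch {

    /** Returns the operator letters of a CIGAR such as "2S10M1D5M", ignoring the lengths. */
    static List<Character> operators(String cigar) {
        List<Character> ops = new ArrayList<>();
        for (char c : cigar.toCharArray()) {
            if (!Character.isDigit(c)) {
                ops.add(c);
            }
        }
        return ops;
    }

    static boolean isClip(char op) {
        return op == 'S' || op == 'H';
    }

    static boolean isIndel(char op) {
        return op == 'I' || op == 'D';
    }

    /** True if the CIGAR breaks any of the five rules from the commit message. */
    static boolean isBad(String cigar) {
        List<Character> ops = operators(cigar);

        // Skip over the leading and trailing clips to find the unclipped core.
        int first = 0;
        while (first < ops.size() && isClip(ops.get(first))) {
            first++;
        }
        int last = ops.size() - 1;
        while (last >= first && isClip(ops.get(last))) {
            last--;
        }

        // Fully hard or soft clipped (an empty CIGAR is treated the same way here).
        if (first > last) {
            return true;
        }

        // Starts or ends with a deletion, with or without surrounding clips.
        if (ops.get(first) == 'D' || ops.get(last) == 'D') {
            return true;
        }

        for (int i = first; i <= last; i++) {
            char op = ops.get(i);
            // Hard or soft clip in the middle of the CIGAR.
            if (isClip(op)) {
                return true;
            }
            // Consecutive indels: II, DD, ID or DI.
            if (i > first && isIndel(op) && isIndel(ops.get(i - 1))) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, `BadCigarSketch.isBad("5M1I1D5M")` returns `true` (consecutive indels), while `BadCigarSketch.isBad("3H2S10M")` returns `false`.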

374 Commits (5a9610d87591fb9327e6fac552bdf26cba28a6b3)