gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Mark DePristo	d6e42d839c	Fixes GSA-558 GATK ReadShards don't handle unmapped reads correctly.	2012-09-10 20:14:14 -04:00
Mark DePristo	641c6a361e	Fix nasty memory leak in new data thread x cpu thread parallelism -- Basically you cannot safely use instance specific ThreadLocal variables, as these cannot be safely cleaned up. The old implementation kept pointers to old writers, with huge tribble block indexes, and eventually we crashed out of integration tests -- See http://weblogs.java.net/blog/jjviana/archive/2010/06/10/threadlocal-thread-pool-bad-idea-or-dealing-apparent-glassfish-memor for more information -- New implementation uses a borrow/return schedule with a list of N TraversalEngines managed by the MicroScheduler directly.	2012-09-10 20:14:14 -04:00
Mark DePristo	195cf6df7e	Attempting to fix out of memory errors with new traversal engine creator	2012-09-10 20:14:14 -04:00
Mark DePristo	f713d400e2	Fixed GSA-515 Nanoscheduler GSA-555 / Make NT and NCT work together -- Can now say -nt 4 and -nct 4 to get 16 threads running for you! -- TraversalEngines are now ThreadLocal variables in the MicroScheduler. -- Misc. code cleanup, final variables, some contracts.	2012-09-10 20:14:14 -04:00
Mark DePristo	233f70f8ba	Final cleanup of TraversalProgressMeters, moved to utils.progressmeter -- TraversalProgressMeter now completely generalized, named ProgressMeter in utils.progressmeter. Now just takes "nRecordsProcessed" as an argument to print reads. Completely removes dependence on complex data structures from TraversalProgressMeter. Can be used to measure progress on any task with processing units in genomic locations. -- A fairly simple class with no dependency on the GATK engine or other features. -- Currently only used by the TraversalEngine / MicroScheduler but could be used for any purpose now, really.	2012-09-10 20:14:14 -04:00
Mark DePristo	2e94a0a201	Refactor TraversalEngine to extract the progress meter functions -- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance. The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves. -- Because the progress metering code was horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class. I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time. It should be possible to write some nice tests for it now. By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before -- Cleaned up the start up of the progress meter. It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer -- Simplified and made clear the interface for shutting down the traversal engines. There's now a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversal is over. Nano traversals now properly shut down (was a subtle bug I uncovered here). The printing of on traversal done metering is now handled by MicroScheduler -- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in a future commit there will be N). -- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome. Useful for progress meter but also probably for other infrastructure as well -- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful. The generic bean is just a shell interface with nothing in it. -- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.	2012-09-10 20:14:13 -04:00
David Roazen	d2f3d6d22f	Revert "Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)" This reverts commit 075c56060e0ffcce39631693ef39cf5f8c3a4d5a.	2012-09-10 15:52:39 -04:00
Menachem Fromer	0b717e2e2e	Separated out the DoC calculations from the XHMM pipeline, so that CalcDepthOfCoverage can be used for calculating joint coverage on a per-base accounting over multiple samples (e.g., family samples)	2012-09-10 15:32:41 -04:00
Eric Banks	d7499e0642	Updating the rank sum test documentation	2012-09-09 22:17:36 -04:00
Eric Banks	8ca205f1a9	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-07 14:26:06 -04:00
Eric Banks	b1677fc719	Fixed JIRA GSA-520 for Guillermo: when intervals with zero coverage were present, DiagnoseTargets was trying to merge them with the next interval (even if non-overlapping) which would cause problems later on when it checked to make sure that intervals were strictly overlapping.	2012-09-07 14:25:57 -04:00
Geraldine Van der Auwera	3f2a4379af	Added forum API version stub to base URL for posting GATKDocs This will prevent bugs from occurring when Vanilla make changes to the API as described here: http://vanillaforums.com/blog/api#configuration Based on the bug that broke the website Guide section on 9/6/12, the GATKDocs posting system will probably break in the next release if this is not applied as a bug fix.	2012-09-07 11:49:02 -04:00
Eric Banks	ed3d9b050f	Merge branch 'master' of ssh://gsa2/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-09-07 11:45:09 -04:00
Eric Banks	3dc248a49d	Adding another test	2012-09-07 11:41:38 -04:00
Ryan Poplin	81b27f9db2	auto-merging to latest version	2012-09-07 11:36:47 -04:00
Eric Banks	41a8a304a0	Catch masked OutOfMemory errors as User Errors	2012-09-07 11:27:00 -04:00
Mark DePristo	d62eca5d92	Update GATKPerformanceOverTime to measure -nt and -nct	2012-09-07 10:47:29 -04:00
Mark DePristo	bf87de8a25	UnitTests for ReducerThread and InputProducer -- Uncovered bug in ReducerThread in detecting abnormal case where jobs are coming in out of order	2012-09-07 09:51:32 -04:00
Mark DePristo	c503884958	GSA-515 Nanoscheduler GSA-551 / Optimize nanoScheduling performance of UnifiedGenotyper -- I've rewritten the entire NS framework to use a producer / consumer model for input -> map and from map -> reduce. This is allowing us to scale reasonably efficiently up to 4 threads (see figure). Future work on the nano scheduler will be itemized in a separate JIRA entry. -- Restructured the NS code for clarity. Docs everywhere. -- This is considered version 1.0	2012-09-07 09:15:16 -04:00
Mark DePristo	9d12935986	Intermediate commit for new hyper parallel NanoScheduler -- There's a logic bug now but I'll go to squash it...	2012-09-07 09:15:16 -04:00
Eric Banks	576c7280d9	Extensions to the ErrorThrowing framework for testing purposes	2012-09-06 22:03:18 -04:00
David Roazen	cb84a6473f	Downsampling: experimental engine integration -Off by default; engine fork isolates new code paths from old code paths, so no integration tests change yet -Experimental implementation is currently BROKEN due to a serious issue involving file spans. No one can/should use the experimental features until I've patched this issue. -There are temporarily two independent versions of LocusIteratorByState. Anyone changing one version should port the change to the other (if possible), and anyone adding unit tests for one version should add the same unit tests for the other (again, if possible). This situation will hopefully be extremely temporary, and last only until the experimental implementation is proven.	2012-09-06 15:03:27 -04:00
Eric Banks	6df6c1abd5	Fix for PBT to stop NPE when there are no likelihoods present	2012-09-06 13:14:18 -04:00
Mark DePristo	1b064805ed	Renaming -cnt to -nct for consistency	2012-09-05 21:13:19 -04:00
Mark DePristo	574a8f710b	Add static boolean controlled output of individual map call timing to nanoSecond resolution	2012-09-05 17:40:02 -04:00
Mark DePristo	e11915aa0a	GSA-515 Nanoscheduler GSA-550 ThreadSafeMapReduce shouldn't be super interface of TreeReducible	2012-09-05 17:37:56 -04:00
Mark DePristo	c5f1ceaa95	All read and loci traversals go through NanoScheduler now -- The NanoScheduler is doing a good job at tracking important information like time spent in map/reduce/input etc. -- Can be disabled with static boolean in MicroScheduler if we have problems -- See GSA-515 Nanoscheduler GSA-549 Retire TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine	2012-09-05 16:38:21 -04:00
Mark DePristo	dddf148a59	Fixed bug in ThreadAllocation getTotalNumberOfThreads -- It isn't data + cpu, it's data * cpu threads.	2012-09-05 16:35:32 -04:00
Mark DePristo	9bf1d138d9	New GATK argument interface for data and cpu threads -- Closes GSA-515 Nanoscheduler GSA-542 Good interface to nanoScheduler -- Old -nt means dataThreads -- New -cnt (--num_cpu_threads_per_data_thread) gives you n cpu threads for each data thread in the system -- Cleanup logic for handling data and cpu threading in HMS, LMS, and MS -- GATKRunReport reports the total number of threads in use by the GATK, not just the nt value -- Removed the io,cpu tags for nt. Stupid system if you ask me. Cleaned up the GenomeAnalysisEngine and ThreadAllocation handling to be totally straightforward now	2012-09-05 15:45:24 -04:00
Mark DePristo	1e55475adc	NanoScheduler uses ExecutorService to run input reader thread	2012-09-05 15:45:24 -04:00
Mark DePristo	71d9ebcb0d	Fix bug (introduced by me) that didn't include contig in progress meter	2012-09-05 15:45:24 -04:00
Mark DePristo	c822b7c760	Fix long-standing NPE in LMS due to inappropriate timing of initialization	2012-09-05 15:45:24 -04:00
Mark DePristo	a997c99806	Initial NanoScheduler with input producer thread	2012-09-05 15:45:24 -04:00
Mark DePristo	03dd470ec1	Test for progressFunction in NanoScheduler; bugfix for single threaded fast path	2012-09-05 15:45:23 -04:00
Mark DePristo	8cdeb51b78	Cleanup printProgress in TraversalEngine -- Separate updating cumulative traversal metrics from printing progress. There's now an updateCumulativeMetrics function and a printProgress() that only takes a current position -- printProgress now relies solely on the time since the last progress to decide if it will print or not. No longer uses the number of cycles, since this isn't reliable in the case of nano scheduling -- GenomeAnalysisEngine now maintains a pointer to the master cumulative metrics. getCumulativeMetrics never returns null, which was handled in some parts of the code but not others. -- Update all of the traversals to use the new updateCumulativeMetrics, printProgress model -- Added progress callback to nano scheduler. Every bufferSize elements this callback is invoked, allowing us to smoothly update the progress meter in the NanoScheduler -- Rename MapFunction to NanoSchedulerMap and the same for reduce.	2012-09-05 15:45:23 -04:00
Mark DePristo	d503ed97ab	Mark I NanoScheduling TraverseLoci -- Refactored TraverseLoci into old linear version and nano scheduling version -- Temp. GATK argument to say how many nano threads to use -- Can efficiently scale to 3 threads before blocking on input	2012-09-05 15:45:23 -04:00
Mark DePristo	757e6a0160	Making Pileup thread-safe -- Old version relied on out printstream magically sorting output, new version puts the print in reduce	2012-09-05 15:45:23 -04:00
Mark DePristo	d7105223fe	More debugging output for NanoScheduler when debugging is enabled	2012-09-05 15:45:23 -04:00
Mark DePristo	9823102c0c	TraverseReadsNano supports walker.filter and walker.done -- Instead of returning directly the result of map(), returns a MapResult object with the value and a reduceMe flag. -- Reduce function respects the reduceMe flag -- Code cleanup and more documentation	2012-09-05 15:45:23 -04:00
Mark DePristo	1a8f5fc374	Trivial cleanup of NanoScheduler	2012-09-05 15:45:23 -04:00
Mark DePristo	6a5a70cdf1	Done GSA-539: SimpleTimer should use System.nanoTime for nanoSecond resolution	2012-09-05 15:45:23 -04:00
Mark DePristo	59109d5eeb	NanoScheduler tracks time outside of its execute call	2012-09-05 15:45:23 -04:00
Mark DePristo	800a27c3a7	NanoScheduler tracks time within input, map, and reduce -- Helpful for understanding where the time goes to each bit of the code. -- Controlled by a local static boolean, to avoid the potential overhead in general	2012-09-05 15:45:23 -04:00
Mark DePristo	7087b22ea3	No debugging output (even conditional) for ReadTransformers in PrintReads	2012-09-05 15:45:23 -04:00
Mark DePristo	e01258b261	NanoScheduler now supports printProgress. Bugfixes to printProgress -- TraverseReadsNano prints progress at the end of each traversal unit -- Fix bugs in TraversalEngine printProgress -- Synchronize the method so we don't get multiple logged outputs when two or more HMSs call printProgress before initialization at the start! -- Fix the logic for mustPrint, which actually had the logic of mustNotPrint. Now we see the done log line that was always supposed to be there -- Fix output formatting, as the done() line was incorrectly shifting over the % complete by 1 char as 100.0% didn't fit in %4.1f -- Add clearer doc on -PF argument so that people know that the performance log can be generated to standard out if one wants	2012-09-05 15:45:23 -04:00
Mark DePristo	6055101df8	NanoScheduler no longer groups inputs, each map() call is interlaced now -- Maximizes the efficiency of the threads -- Simplifies interface (yea!) -- Reduces number of combinatorial tests that need to be performed	2012-09-05 15:45:22 -04:00
Yossi Farjoun	ad5fa449e7	fixed a typo in the string comment	2012-09-05 14:46:10 -04:00
Ryan Poplin	84a83fd3f3	fixing typo	2012-09-05 10:41:03 -04:00
Eric Banks	fc06f39411	Fixed docs for Pileup walker	2012-09-05 09:55:34 -04:00
Christopher Hartl	d795437202	- New UserExceptions added for when ReadFilters or Walkers specified on the command line are not found. When -rf xxxx cannot find the class corresponding to xxxx, all read filters are printed in a better formatted way, with links to their gatk docs. - VariantAnnotatorEngine changed to call genotype annotations even if pileups and allele -> likelihood mappings are not present. Current genotype annotations altered to check for null pileups and null mappings.	2012-09-04 16:41:44 -04:00
Ryan Poplin	9cc1a9931b	Resolving merge conflicts.	2012-09-04 10:47:38 -04:00
Ryan Poplin	c9944d81ef	Skip array needs to also be used in the updateDataForRead function of the delocalized BQSR.	2012-09-04 10:33:37 -04:00
Mark DePristo	c9ea213c9b	Make BaseRecalibration thread-safe -- In the process uncovered two strange things 1 -- qualityScoreByFullCovariateKey was created but never used. Seems like a cache? 2 -- Discovered nasty bug in BaseRecalibrator: https://jira.broadinstitute.org/browse/GSA-534	2012-08-31 13:42:42 -04:00
Mark DePristo	27ddebee53	Protect PrintReads from strange state from TraverseReadsUnitTests	2012-08-31 13:42:41 -04:00
Mark DePristo	e028901d54	Fixed bad contract in ReadTransformer	2012-08-31 13:42:41 -04:00
Mark DePristo	cf91d894e4	Fix build problems with tests	2012-08-31 13:42:41 -04:00
Mark DePristo	817ece37a2	General infrastructure for ReadTransformers -- These are like read filters but can be applied either on input, on output, or handled by the walker -- Previous example of BAQ now uses the general framework -- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties! Yeah! -- BQSR now uses this framework. We can now do BQSR on input, on output, or within a walker -- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR -- Currently BQSR is excepting in parallel, which a subsequent commit will fix -- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure -- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry -- Many files touched simply due to the refactoring and renaming of classes	2012-08-31 13:42:41 -04:00
Ryan Poplin	ff6ebbf3fd	Resolving merge conflicts.	2012-08-31 11:25:55 -04:00
Mark DePristo	2f749b5e52	Added ThreadSafeMapReduce interface, super of TreeReducible -- A higher level interface to declare parallelism capability of a walker. This interface means that the walker can be multi-threaded, but doesn't necessarily support TreeReducible interface, which forces you to have a combine ReduceType operation that isn't appropriate for parallel read walkers -- Updated ReadWalkers to implement ThreadSafeMapReduce not TreeReducible	2012-08-30 19:41:49 -04:00
Mark DePristo	544740d45d	Asking for n threads should give you n threads in NanoScheduler, not n - 1	2012-08-30 19:41:49 -04:00
Mark DePristo	7a462399ce	Fix GSA-529: Fix RODs for parallel read walkers -- TraverseReadsNano modified to read in all input data before invoking maps, so the input to TraverseReadsNano is a MapData object holding the sam record, the ref context, and the refmetadatatracker. -- Update ValidateRODForReads to be tree reducible, using synchronized map and explicitly sort the output map from locations -> counts in onTraversalDone -- Expanded integration tests to test nt 1, 2, 4.	2012-08-30 19:41:49 -04:00
Mark DePristo	7d95176539	Bugfix to compareTo and equals in GenomeLoc -- Yes, GenomeLoc.compareTo was broken. The compareTo function only considered the contig and start position, but not the stop, when comparing genome locs. -- Updated GenomeLoc.compareTo function to account for stop. Updated GATK code where necessary to fix resulting problems that depended on this. -- Added unit tests to ensure that hashcode, equals, and compareTo are all correct for GenomeLocs	2012-08-30 19:41:49 -04:00
Mark DePristo	5a9610d875	ReadShards now default to 10K (up from 1K) reads per samFile up to 250K -- This should help make the inputs for parallel read walkers a little meatier, and avoid spinning the shard creation infrastructure so often	2012-08-30 19:41:49 -04:00
Christopher Hartl	5a142fe265	After discussion with Ryan/Eric, the Structural_Indel variant type is now gone, and has been entirely replaced with the access pattern .isStructuralIndel(). This makes it a strict subtype of indel. I agree that this method is a bit more sensible. In addition, fix for GSA-310. If supplied -rf argument does not match a known read filter, the list of read filters will be printed, and users directed to the documentation for more information.	2012-08-30 17:57:31 -04:00
Mark DePristo	82b2845b9f	Fix: GSA-531 ApplyRecalibration writing to BCF: java.lang.String cannot be cast to java.lang.Double -- LOD must be added as a double to attributes, not as a string, so that it can be written out as BCF	2012-08-30 16:59:57 -04:00
Ryan Poplin	7b366d4049	misc cleanup in active region traversal.	2012-08-30 11:01:01 -04:00
Mark DePristo	ce3d1f89ea	ReadShard are no longer allowed to span multiple contigs -- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads. The old implementation simply failed in this case. The new code handles this correctly by forcing shards to have all of their data on a single contig. -- Added a PrintReads integration test to ensure this behavior is correct -- Adding test BAMs that have < 200 reads and span across contig boundaries	2012-08-30 10:15:11 -04:00
Mark DePristo	53376b9423	Part III of GSA-462: Consistent RODBinding access across Ref and Read trackers -- shardSpan is only calculated when some ROD is live in the GATK. No sense in paying the cost per read when you don't need it -- Update contract to allow null span or unmapped span (good catch unittests!)	2012-08-30 10:15:10 -04:00
Mark DePristo	1200848bbf	Part II of GSA-462: Consistent RODBinding access across Ref and Read trackers -- Deleted ReadMetaDataTracker -- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view	2012-08-30 10:15:10 -04:00
Mark DePristo	972be8b4a4	Part I of GSA-462: Consistent RODBinding access across Ref and Read trackers -- ReadMetaDataTracker is dead! Long live the RefMetaDataTracker. Read walkers will soon just take RefMetaDataTracker objects. In this commit they take a class that trivially extends them -- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class. -- This new implementation produces thread-safe objects (i.e., holds no pointers to shared state). Suitable for use (to be tested) with nano scheduling -- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely. -- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView -- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference. Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface. If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself. -- This commit breaks the new read walker BQSR, but Ryan knows this is coming -- Subsequent commit will be retiring / fixing ValidateRODForReads	2012-08-30 10:15:10 -04:00
Mark DePristo	8fc6a0a68b	Cleanup RefMetaDataTracker before refactoring ReadMetaDataTracker	2012-08-30 10:13:06 -04:00
Ryan Poplin	b85ded8389	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-30 10:11:48 -04:00
Ryan Poplin	57d997f06f	Fixing bug from when FragmentUtils merging function moved over to the soft clipped start instead of the unclipped start	2012-08-30 10:10:43 -04:00
Ryan Poplin	f9bab37015	Merged bug fix from Stable into Unstable	2012-08-30 09:21:24 -04:00
Ryan Poplin	eb63221875	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable	2012-08-30 09:19:35 -04:00
Ryan Poplin	81d5eca975	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-30 09:10:56 -04:00
Ryan Poplin	35baf0b155	This along with Mauricio's previous commit (thanks!) fixes GSA-522. There are no longer any modifications to reads in the map calls of ActiveRegion walkers. Added the bam which identified this error as a new integration test.	2012-08-30 09:07:36 -04:00
Eric Banks	1acf0f0b2c	Fixing bug in fasta .fai generation: trim the contig names to the first whitespace if one appears. We now generate indexes identical to samtools.	2012-08-29 22:36:27 -04:00
Eric Banks	4d38befe86	Merged bug fix from Stable into Unstable	2012-08-29 15:13:56 -04:00
Eric Banks	150a969279	Be careful with String manipulation when constructing alleles in SomaticIndelDetector	2012-08-29 15:13:28 -04:00
Eric Banks	ce55ba98f4	Don't try to left align indels in unmapped reads (which for some reason can still have CIGARs) because the ref context is null.	2012-08-29 15:01:11 -04:00
Ryan Poplin	4ea38bbfe8	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-29 11:39:30 -04:00
Mauricio Carneiro	69b56e11c8	ReadClipper won't modify the original read Reverting back to the original implementation, but now including write N's and write Q0's due to walkers that look at the same read multiple times in different reference windows	2012-08-29 11:33:19 -04:00
Ryan Poplin	e12ae65d33	Changing the commenting style in the BQSR	2012-08-29 11:27:45 -04:00
Ryan Poplin	18eca3544e	Initial commit of the delocalized BQSR written as a read walker.	2012-08-28 15:24:20 -04:00
Eric Banks	e74c527d47	Register the deprecated walkers as deprecated starting in v2.2 so that users get a helpful error message	2012-08-28 10:19:18 -04:00
Eric Banks	67d348a31d	Retiring the alignment walkers and related integration test since we don't want to support them anymore.	2012-08-28 10:16:49 -04:00
Mark DePristo	2996693c9f	FisherStrand now computed with and without filtering low-qual bases, and least significant pvalue is kept -- Old way (filtering for Q > 17 bases) resulted in biased FS when the site was good but there was a systematic shift in the QUAL of REF and ALT between strands of the reads (sometimes happens) -- New way (taking all bases) was consistent with BaseQualRankSum and other tests, but there can be a lot of low qual reference bases on one strand in some techs (ION/PROTON/PACBIO) because of the preference for introducing an indel vs. a mismatch. -- This implementation allows us to have our cake and eat it too by computing both p-values, and taking the maximum one (i.e., least significant). -- No integration tests updated yet -- still exploring the consequences of this change	2012-08-28 08:06:47 -04:00
Eric Banks	bedcdbdc5f	Fixing merge conflict	2012-08-27 12:16:51 -04:00
Eric Banks	3d476487c6	LIBS is totally busted for deletions. Putting a check in AD for bad pileup event bases so that we don't produce busted alleles. We must fix LIBS ASAP.	2012-08-27 12:13:12 -04:00
Mark DePristo	63a9ae817a	Ensure thread-safety of CachingIndexedFastaSequenceFile -- Cosmetic cleanup of ReadReferenceView -- TraverseReadsNano provides the reference context, since it's thread-safe -- Cleanup CachingIndexedFastaSequenceFile. Add docs, remove unnecessary setters -- Expand CachingIndexedFastaSequenceFileUnitTest to test explicitly multi-threaded safety.	2012-08-27 12:11:54 -04:00
Khalid Shakir	2d1ea7124b	One less Queue command line requirement: -tempDir now defaults to .queue/tmp. Also moved queueScatterGather to .queue/scatterGather.	2012-08-27 12:04:50 -04:00
Mark DePristo	68c5142d2d	numThreads > 1 any time you have -nt > 1 silly	2012-08-26 14:36:13 -04:00
Mark DePristo	fde9824765	Optimizations for parallel read walkers -- TraversalReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone -- Nicer debugging output in NanoScheduler -- ReadShard has a getBufferSize() method now	2012-08-25 17:21:12 -04:00
Mark DePristo	5066b14335	Parallel FlagStat	2012-08-25 17:21:12 -04:00
Mark DePristo	af540888f1	Limited version of parallel read walkers -- Currently doesn't support accessing reference or ROD data -- Parallel versions of PrintReads and CountReads	2012-08-25 17:21:12 -04:00
Mark DePristo	e060b148e2	Minor cleanup of TraverseReads	2012-08-25 17:21:11 -04:00
Mark DePristo	275a5e5439	More tests for NanoScheduler -- Add more contracts -- Test in the UnitTest that the reduce is being called in the correct order	2012-08-25 17:21:11 -04:00
Christopher Hartl	6db0988898	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-25 15:40:32 -04:00
Christopher Hartl	db2e88c7cb	Fix for badIndelLength() throwing NPE at non-indel sites. Added integration test.	2012-08-25 12:38:23 -07:00
Mark DePristo	59b5913b54	Merged bug fix from Stable into Unstable	2012-08-25 14:53:22 -04:00
Mark DePristo	dcc972a557	Usability cleanup for BQSR -- I'm seeing a lot of people trying to use BinaryTagCovariate in the community. They really shouldn't do this, so I moved it to private. -- Throw an exception if its required bintag argument is missing -- Check explicitly if user is requesting DinucCovariate and tell them that it's been retired in favor of ContextCovariate -- Show the type (Required, Experimental, Standard) of the covariates when running --list	2012-08-25 14:53:00 -04:00
Christopher Hartl	b59948709f	Code improvements re: JIRA GSA-510. Trio class migrated into the Samples package - because the trio structure is so ubiquitously used, it makes sense, I think, to have a class which imposes the structure on the samples. Existing functions which slightly duplicated the getTrios() method look like they have bugs. These functions are now deprecated. A number of functions in the sampleDB looked to be assuming that samples could not share IDs (e.g. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption. MVLikelihoodRatio now uses the trio methods from SampleDB.	2012-08-25 08:48:27 -07:00
Mark DePristo	0996bbd548	Comments for Chris on cleanup	2012-08-24 16:04:58 -04:00
Mark DePristo	649b82ce85	Merge branch 'nanoScheduler' Conflicts: private/scala/qscript/org/broadinstitute/sting/queue/qscripts/performance/GATKPerformanceOverTime.scala	2012-08-24 15:59:36 -04:00
Mark DePristo	9de8077eeb	Working (efficient?) implementation of NanoScheduler -- Groups inputs for each thread so that we don't have one thread execution per map() call -- Added shutdown function -- Documentation everywhere -- Code cleanup -- Extensive unittests -- At this point I'm ready to integrate it into the engine for CPU parallel read walkers	2012-08-24 15:34:23 -04:00
Christopher Hartl	752f44c332	Code cleanup in MVLR and SelectVariants. Should fix JIRA GSA-509 and GSA-510	2012-08-24 12:25:11 -07:00
Mark DePristo	d6e6b30caf	Initial implementation of GSA-515: Nanoscheduler -- Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficient at using resources to do sum of 2 * (sum(1 - X)). Done! CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement. Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator. Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of subsequent available work units. May be worth measuring expected cost for each read / map / reduce unit and use it to balance the compute. As input is single threaded just need one thread to populate inputs, which runs as fast as possible in parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job. Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks	2012-08-24 14:07:44 -04:00
Eric Banks	0545664f91	Fix ClassCastException seen in Tableau errors	2012-08-24 13:45:48 -04:00
Eric Banks	740520c23b	Fix BQSR docs	2012-08-24 13:20:10 -04:00
Ryan Poplin	5f8574bd15	Fixing typo in error message.	2012-08-24 10:48:41 -04:00
Mark DePristo	1999b95754	Work around for GSA-513: ClassCastException in VariantEval	2012-08-23 18:14:49 -04:00
Christopher Hartl	f1166d6d00	Spotted a potential bug where sample IDs passed in from the meta data were only checked against the sample IDs in the VCF header if the input file happened to be a meta data file rather than a fam file. Added a check for fam files as well, and added an integration test to cover each case.	2012-08-23 11:43:19 -07:00
Mark DePristo	857b11b26f	Done with GSA-506: Add nt and efficiency information to GATKRunReport -- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO -- See https://jira.broadinstitute.org/browse/GSA-506 for more details	2012-08-23 09:59:53 -04:00
Mark DePristo	0b735884db	Cleanup code in VariantContext	2012-08-23 09:59:53 -04:00
Eric Banks	e5df91aa23	Looks like the @WalkerName annotation doesn't work with the GATK docs, so I'm renaming the walkers.	2012-08-22 20:17:39 -04:00
Mark DePristo	63af0cbcba	Cleanup GATK efficiency monitor classes -- Invert logic in GATKArgumentCollection to disable monitoring, not enable. That means monitoring is on by default -- Fix testing error in unit tests -- Rename variables in ThreadAllocation to be clearer	2012-08-22 16:48:02 -04:00
Mark DePristo	e1293f0ef2	GSA-507: Thread monitoring refactored so it can work without a thread factory -- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory. -- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode -- MicroScheduler now handles management of the efficiency monitor. Includes master thread in monitor, meaning that reduce is now included for both schedulers	2012-08-22 16:48:01 -04:00
Mark DePristo	f876c51277	Separately track time spent doing user and system CPU work -- Allows us to ID (by proxy) time spent doing IO -- Refactor StateMonitoringThreadFactory to use its own enum, not Thread.State -- Reliable unit tests across mac and unix	2012-08-22 16:48:01 -04:00
Mark DePristo	18060f237b	Add thread efficiency monitoring to GATK HMS -- See https://jira.broadinstitute.org/browse/GSA-502 -- New command line argument -mt enables thread monitoring -- If enabled, HMS uses StateMonitoringThreadFactory to create monitored threads, and prints out an efficiency report when HMS exits, telling the user information like: for BQSR – known to be inefficient locking INFO 17:10:33,195 StateMonitoringThreadFactory - Number of activeThreads used: 8 INFO 17:10:33,196 StateMonitoringThreadFactory - Total runtime 90.3 m INFO 17:10:33,196 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.72 ( 64.8 m) INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent running is 0.26 ( 23.7 m) INFO 17:10:33,197 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.02 ( 112.8 s) INFO 17:10:33,197 StateMonitoringThreadFactory - Efficiency of multi-threading: 26.19% of time spent doing productive work for CountLoci INFO 17:06:12,777 StateMonitoringThreadFactory - Number of activeThreads used: 8 INFO 17:06:12,777 StateMonitoringThreadFactory - Total runtime 43.5 m INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent blocked is 0.00 ( 4.2 s) INFO 17:06:12,778 StateMonitoringThreadFactory - Fraction of time spent running is 1.00 ( 43.3 m) INFO 17:06:12,779 StateMonitoringThreadFactory - Fraction of time spent waiting is 0.00 ( 6.0 s) INFO 17:06:12,779 StateMonitoringThreadFactory - Efficiency of multi-threading: 99.61% of time spent doing productive work	2012-08-22 16:48:01 -04:00
Ryan Poplin	fe3069b278	Merged bug fix from Stable into Unstable	2012-08-22 14:40:34 -04:00
Ryan Poplin	e5cfdb4811	Bug fix for popular _Duplicate allele added to VariantContext_ error reported on the forum. It seems to be due to lower case bases in the reference being treated as reference mismatches. We would try to turn these mismatches into SNP events, for example c/C. We now uppercase the result from IndexedFastaSequenceFile.getSubsequenceAt()	2012-08-22 14:39:35 -04:00
Ryan Poplin	63213e8eb5	Expanding the HaplotypeCaller integration tests to cover a wider range of data	2012-08-22 14:18:44 -04:00
Eric Banks	944e1c299d	Docs for --keepOriginalAC were wrong in SelectVariants	2012-08-22 13:07:13 -04:00
Eric Banks	2409aa9bfd	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-22 12:54:43 -04:00
Eric Banks	94540ccc27	Using the simple VCBuilder constructor and then subsequently trying to modify attributes was throwing a NPE. This is easily solved (without a performance hit) by initializing the attributes map to an immutable Collections.emptyMap(). Added unit test to cover this case.	2012-08-22 12:54:29 -04:00
Guillermo del Angel	901f47d8af	Final step (for now) in VA refactoring: update MD5's because, a) since it's not guaranteed that we'll iterate through reads/pileups in the same order, the rank sum dithering will change annotations, b) FS uses new generic threshold to distinguish uninformative reads (it used to use ad-hoc thresholds), c) AD definition changed and throws away uninformative reads, d) shortened general ploidy integration tests for quicker debugging. May have missed some MD5's in the update so there may be lingering test failures still	2012-08-22 11:38:51 -04:00
Guillermo del Angel	7df0abf49b	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-22 11:36:41 -04:00
Eric Banks	9e76e8aa0b	Just noticed that the efficient conversion to uppercase method is redundant since it's already implemented efficiently in Picard; let's just have a single implementation.	2012-08-22 11:26:08 -04:00
Christopher Hartl	20601f034e	Updating the checkType() function to include the new StructuralIndel variant type. Fixes outstanding broken integration test.	2012-08-22 07:33:10 -07:00
Eric Banks	c7ce3e1cf5	Merged bug fix from Stable into Unstable	2012-08-22 00:24:40 -04:00
Eric Banks	03017855e4	WTF - why is support for whole-read insertions all messed up in LIBS? I've pushed a temporary patch for now (the right solution should certainly not be implemented in stable; LIBS needs to be better thought out). Added another unit test.	2012-08-22 00:24:01 -04:00
Mark DePristo	6ce8016ae7	GSA-491: Add hidden tag to GATK that propagates to the GATK logs	2012-08-21 14:44:18 -04:00
Guillermo del Angel	6a8cf1c84a	Enable and adapt HaplotypeScore and MappingQualityZero as active region annotations now that we have per-read likelihoods passed in to annotations	2012-08-21 14:35:40 -04:00
Guillermo del Angel	d0644b3565	Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-21 10:35:23 -04:00
Ryan Poplin	94e7f677ad	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-21 10:21:47 -04:00
Guillermo del Angel	418ace463a	More merge conflict resolution	2012-08-21 10:15:52 -04:00
Ryan Poplin	10961db3ce	Another round of FindBugs fixes. Object returns its internal reference to an externally mutable array. Very dangerous.	2012-08-21 09:35:55 -04:00
Ryan Poplin	605acaae9c	Another round of FindBugs fixes. Object internally stores a reference to an externally mutable array. Very dangerous.	2012-08-21 09:33:58 -04:00
Ryan Poplin	55b7949d68	Another round of FindBugs fixes. Comparator doesn't implement Serializable.	2012-08-21 09:20:55 -04:00
Christopher Hartl	ba8622ff0d	A number of stashed changes are lurking in here. In order of importance: - Fix for M_Trieb's error report on the forum, and addition of integration tests to cover the walker. - Addition of StructuralIndel as a class of variation within the VariantContext. These are for variants with a full alt allele that's >150bp in length. - Adaptation of the MVLikelihoodRatio to work for a set of trios (takes the max over the trios of the MVLR) - InsertSizeDistribution changed to use the new gatk report output (it was previously broken) - RetrogeneDiscovery changed to be compatible with the new gatk report - A maxIndelSize argument added to SelectVariants - ByTranscriptEvaluator rewritten for cleanliness - VariantRecalibrator modified to not exclude structural indels from recalibration if the mode is INDEL - Documentation added to DepthOfCoverageIntegrationTest (no, don't yell at chartl ;_; ) Also sorry for the long commit history behind this that is the result of fixing merge conflicts. Because this also fixes a conflict (from git stash apply), for some reason I can't rebase all of them away. I'm pretty sure some of the commit notes say "this note isn't important because I'm going to rebase it anyway".	2012-08-21 07:08:58 -04:00
Eric Banks	3dfe8df262	Merged bug fix from Stable into Unstable	2012-08-20 23:12:58 -04:00
Eric Banks	40d5efc804	Fix for Adam K's reported bug: we weren't handling reads that were entirely insertions properly in LIBS. Specifically, the event bases were off-by-one (which was disastrous in Adam's case with a 1bp read). Added a unit test to cover this case.	2012-08-20 23:12:41 -04:00
Eric Banks	286b658fab	Re-enabling parallelism in the BaseRecalibrator now that the release is out.	2012-08-20 21:25:14 -04:00
Guillermo del Angel	7bbd2a7a20	Fixing merge conflicts	2012-08-20 20:38:25 -04:00
Guillermo del Angel	2041cb853c	New implementation of AD - now ignore non-informative reads based on per-read likelihoods	2012-08-20 20:31:34 -04:00
Ryan Poplin	77fbaec044	Another round of FindBugs fixes. Class implements its own compareTo() but uses base Object.equals() which can lead to unpredictable behavior.	2012-08-20 16:55:00 -04:00
Ryan Poplin	5e28bca630	Another round of FindBugs fixes. Should be static inner class.	2012-08-20 16:15:48 -04:00
Ryan Poplin	5db3bd6fd2	Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2012-08-20 15:28:57 -04:00
Ryan Poplin	464d49509a	Pulling out common caller arguments into its own StandardCallerArgumentCollection base class so that every caller isn't exposed to the unused arguments from every other caller.	2012-08-20 15:28:39 -04:00

2513 Commits (ee2f12e2ac5c4e04d7e99135ee17f4faf4d731be)
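
Commit d6e6b30caf describes its first NanoScheduler test as reading integers from an iterator, mapping each to int * 2, and reducing by sum, with a single thread feeding input and a pool of threads doing the CPU work. A minimal sketch of that shape follows. It is not the GATK NanoScheduler: the class name MapReduceSketch and the method doubledSum are hypothetical, and a plain java.util.concurrent thread pool stands in for the fixed size queue and dependency graph mentioned in the commit message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; none of these names come from the GATK code base.
public class MapReduceSketch {

    // Reads integers on the calling thread, maps each to value * 2 on a
    // thread pool, then reduces the mapped values by summing them in input order.
    static long doubledSum(Iterator<Integer> input, int nThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<Integer>> mapped = new ArrayList<>();

            // Read step: a single thread pulls from the iterator and
            // submits one map job per element.
            while (input.hasNext()) {
                final int value = input.next();
                mapped.add(pool.submit(() -> value * 2));
            }

            // Reduce step: wait for each map result in order and accumulate.
            long sum = 0;
            for (Future<Integer> result : mapped) {
                sum += result.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // 2 * (1 + 2 + 3 + 4) = 20
        System.out.println(doubledSum(Arrays.asList(1, 2, 3, 4).iterator(), 4));
    }
}
```

Unlike the design in the commit message, this sketch buffers every pending map job rather than bounding the input with a fixed size queue, so it only suits small inputs.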