gatk-3.8

Commit Graph

Author	SHA1	Message	Date
droazen	8d5b4af8ca	Binomial and Multinomial interfaces for probability and coefficients in log and real space. Passed all unit tests. BinomialCumulativeProbability was reformatted to follow the now standard parameter order. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6057 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 22:55:15 +00:00
droazen	4abb7c424b	implementation of the Gamma function and log10 Binomial / Multinomial coefficients. Unit tests for gamma and binomial passed with honors. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6056 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 22:55:09 +00:00
ebanks	745935ffc2	No longer used - instead see the ConstrainedMateFixingManager class git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6030 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-22 19:38:17 +00:00
ebanks	420d8feff6	No one should be calling the createHeader method(s) directly, but instead should be going through the full readHeader method (because it first sets the version); therefore I made them package protected and merged them. Updated the various unit tests that were using createHeader and were dangerously assuming that the header version was defaulting to 4.0 (which it no longer does). git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5934 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-03 02:17:37 +00:00
chartl	84c2c5d7e6	Stop running away from my commits, test modules. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5919 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-02 13:05:53 +00:00
chartl	511cd48d7a	There is an edge case (|Set1| = 5, |Set2| = 4) where the exact p-value exceeds the range of the normal distribution we want to invert. For the edge cases, this happens exactly at the mean, and so this can be safely replaced with a z value of 0.0 git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5915 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 17:30:09 +00:00
chartl	a79967d9af	After extensive testing of MannWhitneyU: - Verified that exact calculations do agree with R's dwilcox() - Verified that exact calculations do not agree with R's wilcox.test + This is because R does a correction, and calculates CDFs rather than PDFs (e.g. sums over dwilcox() values) - Can now specify MWU to calculate cumulative exact tests, rather than point probabilities - Z-scores are now calculated properly for exact tests + Previously, z-values calculated by inverting normal CDF from U-statistic PDF + Now both inversions are done, with a smart heuristic (biased variance) to make the point-calculated Z-value more accurate + Additional tests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5911 348d0f76-0448-11de-a6fe-93d51630548a	2011-06-01 15:51:27 +00:00
depristo	136c8c7900	ClipReads now supports HARDCLIP_BASES, though in fact this turned out to be not necessary for my desired tests. In the process of developing the HARDCLIP mode, I added some proper ReadUtils unit tests, which would ideally be expanded to include other ReadUtil functions, as added git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5890 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-27 11:42:22 +00:00
hanna	5dca1e4d2e	Make IntervalIntegrationTest aware of the new alignments in the MV1994.bam testset. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5852 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 19:59:47 +00:00
chartl	7ff5375493	Removing build-killing dependency on a private package. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5851 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 18:13:15 +00:00
chartl	0b07373909	Incorporating old feedback from eric: @deprecated methods should not be @deprecated, but rather protected, and the test's package moved to where it can access those test methods. Also allows for the slightly more awesome name "MWUnitTest" git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5850 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-23 18:06:05 +00:00
depristo	a18b0152df	Contracts for SimpleTimer, as well as UnitTests git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5841 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-22 19:45:31 +00:00
depristo	f608ed6d5a	Removed old (and unused) reporting system, now that Kiran's VE reporting system is working. Refactors dictionary creation error messages into UserExceptions git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5836 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-22 18:42:52 +00:00
depristo	e234589240	Contracts for GenomeLocParser and GenomeLoc are now fully implemented. GenomeLocs can officially have any start/stop values from -Inf - +Inf. Bounds w.r.t. the reference are enforced, optionally, by GenomeLocParser. General code cleanup throughout the subsystem. All validation code for GLs is now centralized, and all I/O systems now validate their inputs. Because of this, the Picard interval processing code has been changed to examine whether an interval is valid, and only keep the valid intervals. Note that the scatter/gather test was changed, because the original hg18 chr20 interval files were actually malformed (all records for some reason were on chr20). Many interval processing routines were moved to IntervalUtils, as this is their natural home. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5830 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-21 02:01:59 +00:00
depristo	e16bc2cbd9	Contracts for Java now written for GenomeLoc and GenomeLocParser. The semantics of GenomeLoc are now much clearer. It is no longer allowed to create invalid GenomeLocs -- you can only create them with well formed start, end, and contigs, with respect to the master dictionary. Where one previously created an invalid GenomeLoc, and asked is this valid, you must now provide the raw arguments to helper functions to assess this. Providing bad arguments to GenomeLoc generates UserExceptions now. Added utility functions contigIsInDictionary and indexIsInDictionary to help with this. Refactored several Interval utilities from GenomeLocParser to IntervalUtils, where one might expect them to go. Removed GenomeLoc.clone() method, as this was not correctly implemented, and actually unnecessary, as GenomeLocs are immutable. Several iterator classes have changed to remove their use of clone(). Removed misc. unnecessary imports. Disabled, temporarily, the validating pileup integration test, as it uses reads mapped to a different reference sequence for ecoli, and this now does not satisfy the contracts for GenomeLoc git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5827 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-20 15:43:27 +00:00
hanna	f275be6968	A 'fat shard' finder. Cranks through the indices of a BAM file or list of BAM files looking for outliers (outliers right now are defined naively as shards whose sizes are more than 5 stddevs away from the mean). Runs in 13 minutes per chromosome on 707 low pass whole genome BAMs -- not great, but much faster than running UG on the same region to discover anomalies. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5782 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-10 12:56:47 +00:00
kshakir	7d21350a17	Fixed import. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5780 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-09 18:07:40 +00:00
kshakir	28b897d5de	Fixed O(N^2) operation when scattering interval files. Cleaned up intervals contig count function. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5768 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-05 03:32:35 +00:00
kshakir	8ad547e6c2	Fixed another interval bug where dividing up N intervals into N parts wasn't working. Minor updates to the FCPTest to match the changes due to using the old indel caller. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5766 348d0f76-0448-11de-a6fe-93d51630548a	2011-05-04 20:49:35 +00:00
kshakir	f619dd3ca7	Refactored IntervalUtils used to parse and scatter intervals for Queue. Scattering non-contig interval lists by number of loci in the intervals instead of just number of intervals. Queue caches the list of locs and how to split them up instead of reloading them from disk repeatedly. TODO: general purpose function to divide data evenly. Skip over comments when parsing picard analysis files. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5687 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-26 00:06:00 +00:00
kshakir	8619f49d20	Added a utility method to retrieve the contig lengths for WG chunking. Added a rudimentary GATKReportParser for parsing VE3 results. Re-enabled the FCPTest using VE3, the GATKRP, and the PicardAggregationUtils. The tag type for .rod files is DBSNP, not ROD. More explicit return types on implicit methods. Added null checks for implicit string to/from file conversions. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5668 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-20 19:22:21 +00:00
kshakir	4bb573b1f5	Centralizing a bunch of Broad specific utility functions from code scattered in GSA-Firehose, PipelineTest, custom QScripts, etc. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5631 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-13 21:29:02 +00:00
hanna	32d502c122	Enable BAM OTF index writing by default. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5594 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-07 23:44:25 +00:00
kshakir	45ebbf725c	Instead of always merging Picard interval files they are optionally merged by Sting Utils. Disabled the MFCP while the FCP gets an update. Minor updates to email messages for upcoming scala 2.9. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5588 348d0f76-0448-11de-a6fe-93d51630548a	2011-04-06 21:12:05 +00:00
kshakir	fc8acd503e	Enabled the parameterize option for debugging PipelineTest MD5s. Fixed escaping expressions that have more than one space between arguments. Updated example to match the wiki. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5516 348d0f76-0448-11de-a6fe-93d51630548a	2011-03-26 00:41:47 +00:00
ebanks	05fac8583d	Following up Mark's recent commit: hooking up the --maxPositionalMoveAllowed argument into the indel realigner and through to the SAM writer. We now ensure that no read is realigned more than N bases (200 by default, which is nowhere close to realistically possible). If anyone ever sees a warning message about this with the default value then please let me know because I need to see it for myself. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5331 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-28 04:40:54 +00:00
depristo	ce51ffb56e	Oops, old local paths committed by accident. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5200 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-04 23:35:56 +00:00
depristo	29f3ad72f3	SAMFileWriter that allows the user to move reads, but only a bit, in an incoming coordinate-sorted BAM file. Does some local reordering and local mate fixing, under specified constraints. These constraints allow us to make a special -- under testing for Eric, who promised to try this out a bit, expand test cases and integration tests -- but soon to be the default and only model of the realigner that only moves reads with ISIZE < 3000 that directly emits a coordinate sorted, mate fixed validating BAM file without needing FixMates externally. Preliminary testing shows this runs in a totally fine amount of memory and produces equivalent results to the previous version. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5199 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-04 22:27:05 +00:00
hanna	96241c6637	More testng fallout: fixing another seemingly 'random' issue arising from an alternate test ordering. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5160 348d0f76-0448-11de-a6fe-93d51630548a	2011-02-01 15:25:50 +00:00
kshakir	d4f744a4d4	Checking if the interval files exist before using them to calculate the minimum scatter parts. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5143 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-31 18:07:34 +00:00
kshakir	2ef66af903	Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high. Moved the BamListWriter from FCP to ListWriterFunction in the Queue core. Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s. Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 23:33:58 +00:00
depristo	f29bb0639b	Documentation and cleanup of the distributed GATK implementation. Detailed documentation -- given that Matt will be extending the system in the near future -- about how the locking and processing trackers work. Added error trapping to note that distributed, shared-memory parallelism isn't yet implemented, instead of just not working silently. General utility function for the analysis of distributedGATK operation in the analysis directory git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5106 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-28 03:40:09 +00:00
depristo	f522eb2848	Previous tests were just too big... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5095 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-27 13:48:38 +00:00
hanna	4a33cdacde	Some basic integration tests detecting breakage in OTF BAM index generation. Doing it manually for the moment so that there's at least something testing this capability; will followup eventually with Mark to see whether we can shape the VCF index generation code in such a way that it supports BAM index testing as well. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5093 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-26 23:48:04 +00:00
depristo	be697d96f9	An apparently robust implementation of the file locking for distributed computation, using Lucene's file creation locking approach. It is worth trying out for those with large-scale, high-cost data sets. Details and discussion at group meeting on Wednesday. Some cleanup still needed. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5079 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-26 13:45:40 +00:00
kshakir	9923e05e0a	Moved MD5 utils from WalkerTest to BaseTest for use by PipelineTests. Moved VariantEval validation from FCPTest to PipelineTest. Cleaned up some duplicate code for writing temp files during tests. Moved FCPTest to playground namespace to match move for FCP.q. Added a basic HelloWorldPipelineTest for the HelloWorld QScript. Moved duplicated error handling from JobRunners into the FunctionEdge. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5068 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-25 04:11:49 +00:00
depristo	c50f39a147	V3 of the distributed GATK. High-efficiency implementation. Support for status tracking for debugging and display. Still not safe for production use due to NFS filelock problem. V4 will use alternative file locking mechanism git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5063 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-24 16:45:07 +00:00
depristo	a51061fd96	Improved distributed processing analytics. Still not 100% ready for prime-time. More improvements incoming. Iterator claim now supports requests to obtain in a single atomic claim (one lock) multiple sequential shards, which radically reduces overhead. However, deadlocking is still possible... git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5061 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-23 16:17:25 +00:00
depristo	9b1b8d46aa	Performance tracking of GenomeLocProcessingTrackers, as well as a marker for where to put tracker in HierarchicalMicroScheduler git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5051 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-21 22:24:42 +00:00
depristo	85553cf5cb	V2 cleaner, easily testing, shared memory and distributed GATK job management. Serious unit testing. Very much cleaner processing. Some code cleanup remains in removing now unused classes but the system is ready for general testing. Confirmed that one can run the UG 100 ways parallel without error, but edge cases may remain. See documentation at: http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK#Distributed_Parallelism_.28Experimental.29 for examples on how to run this, or the testing Scala script git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5032 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-20 12:58:13 +00:00
depristo	f8ba76d87c	Incremental commit for distributed computation. Appears to work but has potential deadlock situation not yet debugged. Do not use yet. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5010 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-17 21:23:09 +00:00
depristo	a88708ebfa	Moving GLF code to archive git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5006 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-15 22:42:42 +00:00
depristo	af1bce3492	Longer wait time for threading test (5 min now) and an assertion to ensure that all jobs finished. Should probably just remove the longer running test so this becomes a non-issue git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4999 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-14 13:09:31 +00:00
depristo	afbea9ce59	SharedMemory and SharedFile implementations of GenomeLocProcessingTracker, along with serious unit tests that both pass. Slightly inefficient implementation but sufficient for further testing. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4998 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-14 03:14:24 +00:00
depristo	468ef382b7	vastly improved progress meter that estimates % of work done and time until the job finishes and time remaining. Reordered GATK core initialization order -- intervals are created before the scheduler. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4975 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-12 17:32:27 +00:00
hanna	8d2c14b29c	Update Picard / sam-jdk at Tim's request. git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4925 348d0f76-0448-11de-a6fe-93d51630548a	2011-01-03 02:17:25 +00:00
depristo	a3729bd59c	Now I call BeforeMethod correctly git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4872 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-17 22:45:45 +00:00
depristo	b7e4a015c0	static thread cache reset in UnitTest git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4870 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-17 21:53:10 +00:00
depristo	3bbc6a0540	Slightly more thread safe CachingIndexedFastaSequenceFile.java. Likely passes parallel testing git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4869 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-17 21:05:17 +00:00
depristo	4a54f3f230	ThreadLocal version of CachingIndexedFastaSequenceFile. More efficient support for shared memory BAQ calculations git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4865 348d0f76-0448-11de-a6fe-93d51630548a	2010-12-17 15:44:48 +00:00

257 Commits (8d5b4af8ca2511cb4615b4e9419e5f9df9fab930)