gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Khalid Shakir	22b4466cf5	Added setupRetry() to modify jobs when Queue is run with '-retry' and jobs are about to restart after an error. Implemented a mixin called "RetryMemoryLimit" which will by default double the memory. GridEngine memory request parameter can be selected on the command line via '-resMemReqParam mem_free' or '-resMemReqParam virtual_free'. Java optimizations now enabled by default: - Only 4 GC threads instead of each job using java's default O(number of cores) GC threads. Previously on a machine with N cores if you have N jobs running and java allocates N GC threads by default, then the machines are using up to N^2 threads if all jobs are in heavy GC (thanks elauzier). - Exit if GC spends more than 50% of time in GC (thanks ktibbett). - Exit if GC reclaims less than 10% of max heap (thanks ktibbett). Added a -noGCOpt command line option to disable new java optimizations.	2012-08-13 15:43:05 -04:00
Eric Banks	0381fd7c83	Hmm, I thought I used the right md5s last time. Let's try again.	2012-08-02 11:25:10 -04:00
Eric Banks	05bf6e3726	Updating md5s in pipeline tests so that they finally pass	2012-08-01 10:27:00 -04:00
Eric Banks	a9f27e5b02	Updated md5s for DPP test	2012-07-17 21:54:46 -04:00
Eric Banks	4e3780fd4f	Updated md5 for PBPP	2012-07-17 15:47:43 -04:00
Khalid Shakir	746a5e95f3	Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding. Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals. IntervalUtils return unmodifiable lists so that utilities don't mutate the collections. Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus. Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values. JobRunInfo handles the null done times when jobs crash with strange errors.	2012-06-27 01:15:22 -04:00
Mauricio Carneiro	bbd46690e6	fixing conflicts	2012-06-26 17:12:24 -04:00
Mauricio Carneiro	91f02dfd85	fixing pipeline tests (sorry, my bad)	2012-06-26 17:10:58 -04:00
Mark DePristo	567dba0f76	Cleanup of VCF header lines and constants, BCF2 bugfixes -- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller -- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers -- VCF parsers now automatically repair standard VCF header lines when reading the header -- Updating integration tests to reflect header changes -- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test -- SelectHeaders now always updates the header to include the contig lines -- SelectVariants add UG header lines when in regenotype mode -- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY -- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs) -- Throw error when VCF has unbounded non-flag values that don't have = value bindings -- By default we no longer allow writing of BCF2 files without contig lines in the header	2012-06-21 15:16:31 -04:00
Mark DePristo	982192e2e4	MD5DB for integrationtest management now writes out an md5mismatches file for clean analysis -- This file is in integrationtests/md5mismatches.txt, and looks like: expected observed test 7fd0d0c2d1af3b16378339c181e40611 2339d841d3c3c7233ebba9a6ace895fd test BeagleOutputToVCF 43865f3f0d975ee2c5912b31393842f8 1b9c4734274edd3142a05033e520beac testBeagleChangesSitesToRef daead9bfab1a5df72c5e3a239366118e 27be14f9fc951c4e714b4540b045c2df testDiffObjects:master=/local/dev/depristo/itest/public/testdata/diffTestMaster.vcf,test=/local/dev/depristo/itest/public/testdata/diffTestTest.vcf,md5=daead9bfab1a5df72c5e3a239366118e -- Associated cleanup with making md5db an instantiated object, rather than a bunch of static methods	2012-06-14 16:42:27 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Khalid Shakir	7c11dde328	Updated DPP test MD5's due to template length (TLEN) changes when Picard was revved.	2012-05-03 14:47:58 -04:00
Khalid Shakir	91cb654791	AggregateMetrics: - By porting from jython to java now accessible to Queue via automatic extension generation. - Better handling for problematic sample names by using PicardAggregationUtils. GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name. CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering. Added SelectHeaders walker for filtering headers for dbGAP submission. Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter. Latest WholeGenomePipeline. Other minor cleanup to utility methods.	2012-04-17 11:45:32 -04:00
Roger Zurawicki	63cf7ec7ec	Added more primitives to GATK Report Column Type - The Integer column type now accepts byte and shorts - Updated Unit Tests and added a new testParse() test Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>	2012-03-28 09:07:54 -04:00
Khalid Shakir	cda1e1b207	Minor manual merge update for List class to Seq interface usage.	2012-02-08 02:24:54 -05:00
Khalid Shakir	ef74363b1b	Merged bug fix from Stable into Unstable	2012-02-08 02:14:26 -05:00
Khalid Shakir	23e7f1bed9	When an interval list specifies overlapping intervals merge them before scattering.	2012-02-08 02:12:16 -05:00
Khalid Shakir	c18beadbdb	Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc. Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.	2012-01-23 16:17:04 -05:00
Khalid Shakir	5793625592	No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice. QScript accessor to QSettings to specify a default runName and other default function settings. Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs. Gathered log files concatenate all log files together into the stdout. InProcessFunctions now have PrintStreams for stdout and stderr. Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml. During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file. In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope. Added more detailed output when running with -l DEBUG. Simplified graphviz visualization for additional debugging. Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List) Minor cleanup to build including sending ant gsalib to R's default libloc.	2012-01-08 12:11:55 -05:00
Mauricio Carneiro	3358c132a8	Updating the MD5s. Clipping adaptor boundaries changed the results of CountCovariates, which affected the PPP output: a few more loci were visible to locus walkers.	2011-12-21 15:14:05 -05:00
Khalid Shakir	6059ca76e8	Removing cruft that snuck in last commit.	2011-12-16 23:00:16 -05:00
Khalid Shakir	7486696c07	When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter. Creating a single temporary directory per ant test run instead of putting temp files across all runs in the same directory. Updated various tests for above items and other small fixes.	2011-12-16 18:09:25 -05:00
Mauricio Carneiro	663184ee9d	Added test mode to PPP * in test mode, no @PG tags are output to the final bam file * updated pipeline test to use -test mode. * MD5s updated accordingly	2011-12-12 18:29:06 -05:00
Mauricio Carneiro	a3c3d72313	Added test mode to DPP * in test mode, no @PG tags are output to the final bam file * updated pipeline test to use -test mode. * MD5s are now dependent on BWA version	2011-12-12 18:29:06 -05:00
Mauricio Carneiro	52c64b971f	Updating MD5s -- really don't know why it didn't update before	2011-12-12 09:48:58 -05:00
Mauricio Carneiro	ed91461c49	Data Processing Pipeline Test * Added standard pipeline test for the DPP * Added a full BWA pipeline test for the DPP * Included the extra files for the reference needed by BWA (to be used by DPP and PPP tests)	2011-12-12 00:24:51 -05:00
Mauricio Carneiro	cca8a18608	PPP pipeline test * added a pipeline test to the Pacbio Processing Pipeline. * updated exampleBAM with more complete RG information so we can use it in a wider variety of pipeline tests * added exampleDBSNP.vcf file with only chromosome 1 in the range of the exampleFASTA.fasta reference for pipeline tests	2011-12-11 17:32:21 -05:00
David Roazen	d014c7faf9	Queue now properly escapes all shell arguments in generated shell scripts This has implications for both Qscript authors and CommandLineFunction authors. Qscript authors: You no longer need to (and in fact must not) manually escape String values to avoid interpretation by the shell when setting up Walker parameters. Queue will safely escape all of your Strings for you so that they'll be interpreted literally. Eg., Old way: filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"") New way: filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0") CommandLineFunction authors: If you're writing a one-off CommandLineFunction in a Qscript and don't really care about quoting issues, just keep doing things the direct, simple way: def commandLine = "cat %s \| grep -v \"#\" > %s".format(files, out) If you're writing a CommandLineFunction that will become part of Queue and will be used by other QScripts, however, it's advisable to do things the newer, safer way, ie.: When you construct your commandLine, you should do so ONLY using the API methods required(), optional(), conditional(), and repeat(). These will manage quoting and whitespace separation for you, so you shouldn't insert quotes/extraneous whitespace in your Strings. By default you get both (quoting and whitespace separation), but you can disable either of these via parameters. Eg., override def commandLine = super.commandLine + required("eff") + conditional(verbose, "-v") + optional("-c", config) + required("-i", "vcf") + required("-o", "vcf") + required(genomeVersion) + required(inVcf) + required(">", escape=false) + // This will be shell-interpreted required(outVcf) I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new system, so you'll get free shell escaping when you use those in Qscripts just like with walkers.	2011-12-01 18:13:44 -05:00
Khalid Shakir	fac9932938	Embedding gsalib source and queueJobReport R scripts in the dist and package jars. Moved gsalib and queueJobReport.R to embeddable namespaced locations. Updated packager dependencies/dir to add an @includes which filters the embedded fileset. RScriptExecutor can now JIT compile the gsalib. RScriptExecutor uses ProcessController and sends the Rscript output to java's stdout when run under -l DEBUG. Refactored ProcessController and IOUtils from Queue to Sting Utils. Added more unit tests to ProcessController along with a utility class to hard stop OutputStreams at a specified byte count. Replaced uses of some IOUtils with Apache Commons IO. ShellJobRunner refactored to use direct ProcessController and now kills jobs on shutdown. Better QGraph responsiveness on shutdown by using Object.wait() instead of Thread.sleep().	2011-10-24 15:58:34 -04:00
Mark DePristo	73f9d1f217	GATK read group requirement iron hand -- The GATK will now throw a user exception if it opens a SAM/BAM file that doesn't have at least one RG defined -- LIBS again throws an error if the complete list of samples isn't provided -- Updating ExmpleCountLociPipeline test to use the well-formatted versions of the exampleBAM and exampleFASTA files in testdata, instead of the old broken ones in validation_data. -- Convenience constructors for UserExceptions.MalformedBAM	2011-10-06 08:40:35 -07:00
Mark DePristo	06cb20f2a5	Intermediate commit cleaning up scatter intervals -- Adding unit tests to ensure uniformity of intervals	2011-09-09 12:56:45 -04:00
Khalid Shakir	c4c90c8826	Updates to JobRunners from the Queue developer community and from running the WholeGenomePipeline: - Ability to pass a different resident memory reservation and limits. Useful for large pileups of low pass genome data that sometimes need high -Xmx6g but usually don't exceed 2-3g in actual heap size. - Fixed jobPriority to work for all job runners. Now must be an integer between 0 and 100, even for GridEngine, and will be mapped to the correct values. - Passing parallel environment and job resource requests to LSF and GridEngine. Useful for passing tokens like iodine_io=1 and -pe pe_slots 8 - Refactored GridEngine JobRunner to also provide basic support for other job dispatchers with DRMAA implementations such as Torque/PBS. Should work for basic running but advanced users must pass their own jobNativeArgs from the command line or in customized QScripts until someone maps properties like jobQueue, jobPriority, residentRequest, etc. into a Torque/PBS/etc. dispatcher.	2011-08-22 15:13:27 -04:00
Khalid Shakir	5dcac7b064	GATKReport v0.2: - Floating point column widths are measured correctly - Using fixed width columns instead of white space separated which allows spaces embedded in cell values - Legacy support for parsing white space separated v0.1 tables where the columns may not be fixed width - Enforcing that table descriptions do not contain newlines so that tables can be parsed correctly Replaced GATKReportTableParser with existing functionality in GATKReport	2011-08-03 00:24:47 -04:00
Khalid Shakir	59eb1f4663	Memory limits changed from Int to Double. Updated LSF calls to read memory units from config along with tweaks to select hosts. Moved some common code from GridEngine and LSF to super classes.	2011-07-21 22:57:18 -04:00
Mark DePristo	449bf1b539	Testdata for diffObjects. PipelineTest updated to point to MD5DB.java	2011-07-18 10:47:03 -04:00
Khalid Shakir	b6bc64a0c8	Cleanup of the utils.broad package. Using Picard IoUtils on sample names.	2011-07-01 20:47:03 -04:00
David Roazen	546e7777fa	Re-fixing paths in pipeline tests after example qscripts got moved.	2011-07-01 16:39:10 -04:00
David Roazen	11d4af0e75	Path-related fixes to the private queue pipeline tests.	2011-07-01 13:41:34 -04:00
David Roazen	9644f104c4	Fixes to the queue pipeline tests to account for the new directory structure.	2011-07-01 13:13:24 -04:00
David Roazen	3c9497788e	Reorganized the codebase beneath top-level public and private directories, removing the playground and oneoffprojects directories in the process. Updated build.xml accordingly.	2011-06-28 06:55:19 -04:00

40 Commits (bfbf1686cd0f71c94dea59c84b6c74c71f0ae1af)