gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Khalid Shakir	2d1ea7124b	One less Queue command line requirement: -tempDir now defaults to .queue/tmp. Also moved queueScatterGather to .queue/scatterGather.	2012-08-27 12:04:50 -04:00
Mark DePristo	9eec33ec3b	Complete GSA-497: Let Queue write out runInfo on the fly, after each job group finishes running -- Queue will now incrementally write out its jobReport.txt file whenever jobs finish running (FAIL or DONE) -- This makes it far easier to track what's going on, or to incrementally analyze performance results coming out of Queue -- Generally cleaned up the QJobsReporting code, creating a new clean class QJobsReporter that holds all of the information on what to log and where to put it, which was previously scattered in QCommandLine and QJobReport	2012-08-21 14:44:18 -04:00
Khalid Shakir	3514fb6e66	Changed the default memory limit from none to 2GB upon suggestions from delangel, carneiro, and depristo.	2012-08-20 21:41:13 -04:00
Mark DePristo	67ebd65512	Bugfix for potential SEGFAULT with JNA getting execution hosts for LSF with multiple hosts	2012-08-17 11:49:01 -04:00
Khalid Shakir	22b4466cf5	Added setupRetry() to modify jobs when Queue is run with '-retry' and jobs are about to restart after an error. Implemented a mixin called "RetryMemoryLimit" which will by default double the memory. GridEngine memory request parameter can be selected on the command line via '-resMemReqParam mem_free' or '-resMemReqParam virtual_free'. Java optimizations now enabled by default: - Only 4 GC threads instead of each job using java's default O(number of cores) GC threads. Previously on a machine with N cores if you have N jobs running and java allocates N GC threads by default, then the machines are using up to N^2 threads if all jobs are in heavy GC (thanks elauzier). - Exit if GC spends more than 50% of time in GC (thanks ktibbett). - Exit if GC reclaims less than 10% of max heap (thanks ktibbett). Added a -noGCOpt command line option to disable new java optimizations.	2012-08-13 15:43:05 -04:00
Mauricio Carneiro	d446d34227	GATK Error messages now point to the new website instead of GetSatisfaction.	2012-07-20 17:27:11 -04:00
Joel Thibault	9ee58d323a	Pass the original GATK unsafe parameter to the VcfGatherFunction	2012-07-02 16:03:11 -04:00
Khalid Shakir	746a5e95f3	Refactored parsing of Rod/IntervalBinding. Queue S/G now uses all interval arguments passed to CommandLineGATK QFunctions including support for BED/tribble types, XL, ISR, and padding. Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals. IntervalUtils return unmodifiable lists so that utilities don't mutate the collections. Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus. Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values. JobRunInfo handles the null done times when jobs crash with strange errors.	2012-06-27 01:15:22 -04:00
Mark DePristo	96dbd8df63	Fix a nasty script bug in Queue -- If you are using user-defined configurations (configureJobFeatures) and you didn't override the analysisName of your jobs, and there were other jobs using the same name, then you got very strange errors at the end of your script. For example, in my script I was using SelectVariants to prepare VCF files, and SelectVariants to generate a useful performance table. Since I forgot to make a special analysisName for my table commands, the generic SV commands were being included in the analysis group, and these were throwing an error since the special features added for the table weren't added to those SV commands	2012-06-14 16:42:26 -04:00
Mark DePristo	f77d2e6965	Renamed NO_HEADER to the more accurate no_cmdline_in_header -- Also no_cmdline_in_header permits us to write contigs into the header, so that the shadow BCF system can work as well	2012-05-24 10:57:08 -04:00
Eric Banks	03d40272c8	Removed old GATKReport code and moved the new stuff in its place.	2012-05-18 01:44:31 -04:00
Eric Banks	a26b04ba17	Extensive refactoring of the GATKReports. This was a beast. The practical differences between version 1.0 and this one (v1.1) are: * the underlying data structure now uses arrays instead of hashes, which should drastically reduce the memory overhead required to create large tables. * no more primary keys; you can still create arbitrary IDs to index into rows, but there is no special cased primary key column in the table. * no more dangerous/ugly table operations supported except to increment a cell's value (if an int) or to concatenate 2 tables. Integration tests change because table headers are different. Old classes are still lying around. Will clean those up in a subsequent commit.	2012-05-18 01:11:26 -04:00
Khalid Shakir	a9da9598f5	Implemented getSamplesFromVCF.	2012-05-03 21:57:57 -04:00
Mark DePristo	43d97c2e00	Rev Tribble to r97, adding binary feature support From tribble logs: Binary feature support in tribble -- Massive refactoring and cleanup -- Many bug fixes throughout -- FeatureCodec is now general, with decode etc. taking a PositionalBufferedStream as an argument not a String -- See ExampleBinaryCodec for an example binary codec -- AbstractAsciiFeatureCodec provides to its subclass the same String decode, readHeader functionality as before. Old ASCII codecs should inherit from this base class, and will work without additional modifications -- Split AsciiLineReader into a position tracking stream (PositionalBufferedStream). The new AsciiLineReader takes as an argument a PositionalBufferedStream and provides the readLine() functionality of before. Could potentially use optimizations (it's a TODO in the code) -- The Positional interface includes some more functionality that's now necessary to support the more general decoding of binary features -- FeatureReaders now work using the general FeatureCodec interface, so they can index binary features -- Bugfixes to LinearIndexCreator off by 1 error in setting the end block position -- Deleted VariantType, since this wasn't used anywhere and it's a particularly clean way of thinking about the problem -- Moved DiploidGenotype, which is specific to Gelitext, to the gelitext package -- TabixReader requires an AsciiFeatureCodec as it's currently only implemented to handle line oriented records -- Renamed AsciiFeatureReader to TribbleIndexedFeatureReader now that it handles Ascii and binary features -- Removed unused functions here and there as encountered -- Fixed build.xml to be truly headless -- FeatureCodec readHeader returns a FeatureCodecHeader object that contains a value and the position in the file where the header ends (not inclusive). TribbleReaders now skip the header if the position is set, so it's no longer necessary, if one implements the general readHeader(PositionalBufferedStream) version, to see header lines in the decode functions. Necessary for binary codecs but a nice side benefit for ascii codecs as well -- Cleaned up the IndexFactory interface so there's a truly general createIndex function that takes the enumerated index type. Added a writeIndex() function that writes an index to disk. -- Vastly expanded the index unit tests and reader tests to really test linear, interval, and tabix indexed files. Updated test.bed, and created a tabix version of it as well. -- Significant BinaryFeaturesTest suite. -- Some test files have indent changes	2012-05-03 07:31:48 -04:00
Mark DePristo	58c470a6c5	Rev'ing Tribble from 53 to 94 -- Other tribble contributors did major refactoring / simplification of tribble, which required some changes to GATK code -- Integration tests pass without modification, though some very old index files (callable loci beds) were apparently corrupt and no longer tolerated by the newer tribble codebase	2012-05-03 07:31:47 -04:00
Khalid Shakir	91cb654791	AggregateMetrics: - By porting from jython to java now accessible to Queue via automatic extension generation. - Better handling for problematic sample names by using PicardAggregationUtils. GATKReportTable looks up keys using arrays instead of dot-separated strings, which is useful when a sample has a period in the name. CombineVariants has option to suppress the header with the command line, which is now invoked during VCF gathering. Added SelectHeaders walker for filtering headers for dbGAP submission. Generated command line for read filters now correctly prefixes the argument name as --read_filter instead of -read_filter. Latest WholeGenomePipeline. Other minor cleanup to utility methods.	2012-04-17 11:45:32 -04:00
Khalid Shakir	ef74363b1b	Merged bug fix from Stable into Unstable	2012-02-08 02:14:26 -05:00
Khalid Shakir	23e7f1bed9	When an interval list specifies overlapping intervals merge them before scattering.	2012-02-08 02:12:16 -05:00
Mauricio Carneiro	7c7ca0d799	fixing bug with fastq extension * PPP only recognized .fasta and .fq, failing when the user provided a .fastq file. Fixed.	2012-01-24 11:02:15 -05:00
Khalid Shakir	c18beadbdb	Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc. Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.	2012-01-23 16:17:04 -05:00
Christopher Hartl	39e6df5aa9	Fix edge case for very small VCFs	2012-01-19 00:51:28 -05:00
Christopher Hartl	1e037a0ecf	Ensure second-to-last line printed	2012-01-19 00:33:08 -05:00
Christopher Hartl	9946853039	Remove duplicated line	2012-01-19 00:25:22 -05:00
Christopher Hartl	cf9b1d350a	Some minor changes to in-process functions that nobody else uses. CGL now properly ignores no-calls for external VCFs.	2012-01-19 00:20:49 -05:00
David Roazen	b7c65cb089	Merged bug fix from Stable into Unstable	2012-01-18 09:52:47 -05:00
David Roazen	d5199db8ec	Be explicit about setting the snpEff -onlyCoding option in the pipeline When run without an explicit -onlyCoding option, as we've been doing up to now, snpEff automatically sets -onlyCoding to "true" provided that there is at least one transcript marked as "protein_coding", which will always be the case for us in practice (and indeed, all pipeline runs so far with snpEff 2.0.5 have run with -onlyCoding auto-set to "true"). However, given the disastrous effect on annotation quality setting "-onlyCoding false" has, we wish to be explicit with this option rather than relying on snpEff's auto-detection logic.	2012-01-17 20:04:27 -05:00
Khalid Shakir	a9a6516527	Merged bug fix from Stable into Unstable	2012-01-10 16:16:10 -05:00
Khalid Shakir	ef50e77ee2	When running Queue jobs locally, merge the stderr to the stdout log if the error file is NOT specified. Updated VE strats in the HSP for plotting Ka/Ks by AC.	2012-01-10 16:10:25 -05:00
Khalid Shakir	5793625592	No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice. QScript accessor to QSettings to specify a default runName and other default function settings. Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs. Gathered log files concatenate all log files together into the stdout. InProcessFunctions now have PrintStreams for stdout and stderr. Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml. During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file. In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope. Added more detailed output when running with -l DEBUG. Simplified graphviz visualization for additional debugging. Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List) Minor cleanup to build including sending ant gsalib to R's default libloc.	2012-01-08 12:11:55 -05:00
Mark DePristo	0cc5c3d799	General improvements to Queue -- Support for collecting resources info from DRMAA runners -- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4 -- NCoresRequest is a testing queue script for this. -- Added two command line arguments: -- multiCoreJerk: don't request multiple cores for jobs with nt > 1. This was the old behavior but it's really not the best way to run parallel jobs. Now with queue if you run nt = 4 the system requests 4 cores on your host. If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk -- job_parallel_env: parallel environment name used with SGE to request multicore jobs. Equivalent to -pe job_parallel_env NT for NT > 1 jobs	2011-12-20 14:05:09 -05:00
Khalid Shakir	7486696c07	When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter. Creating a single temporary directory per ant test run instead of putting temp files across all runs in the same directory. Updated various tests for above items and other small fixes.	2011-12-16 18:09:25 -05:00
Mark DePristo	550fb498be	Support for NT testing (default up to 4) for CC and UG -- Added convenience function addJobReportBinding to just add a new binding to the map (x -> y) as well	2011-12-14 18:45:00 -05:00
David Roazen	1ba03a5e72	Use optional() instead of required() to construct javaMemoryLimit argument in JavaCommandLineFunction	2011-12-05 14:06:00 -05:00
David Roazen	d014c7faf9	Queue now properly escapes all shell arguments in generated shell scripts This has implications for both Qscript authors and CommandLineFunction authors. Qscript authors: You no longer need to (and in fact must not) manually escape String values to avoid interpretation by the shell when setting up Walker parameters. Queue will safely escape all of your Strings for you so that they'll be interpreted literally. Eg., Old way: filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"") New way: filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0") CommandLineFunction authors: If you're writing a one-off CommandLineFunction in a Qscript and don't really care about quoting issues, just keep doing things the direct, simple way: def commandLine = "cat %s \| grep -v \"#\" > %s".format(files, out) If you're writing a CommandLineFunction that will become part of Queue and will be used by other QScripts, however, it's advisable to do things the newer, safer way, ie.: When you construct your commandLine, you should do so ONLY using the API methods required(), optional(), conditional(), and repeat(). These will manage quoting and whitespace separation for you, so you shouldn't insert quotes/extraneous whitespace in your Strings. By default you get both (quoting and whitespace separation), but you can disable either of these via parameters. Eg., override def commandLine = super.commandLine + required("eff") + conditional(verbose, "-v") + optional("-c", config) + required("-i", "vcf") + required("-o", "vcf") + required(genomeVersion) + required(inVcf) + required(">", escape=false) + // This will be shell-interpreted required(outVcf) I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new system, so you'll get free shell escaping when you use those in Qscripts just like with walkers.	2011-12-01 18:13:44 -05:00
David Roazen	fdd90825a1	Queue now outputs a GATK-like header with version number, build timestamp, etc.	2011-11-23 14:28:35 -05:00
Khalid Shakir	c50274e02e	During flanking interval creation merging overlapping flanks so that on scatter the list doesn't accidentally genotype the same site twice. Moved flanking interval utilities to IntervalUtils with UnitTests.	2011-11-17 13:56:42 -05:00
Mark DePristo	0111e58d4e	Don't generate PDF unless you have -run specified	2011-11-09 14:45:40 -05:00
Mark DePristo	849c0757f2	Bug fix for LocusScatterFunction when no intervals are provided -- Now correctly grabs reference contigs and cuts them all up, rather than NPE as intervalString == null.	2011-11-04 10:55:09 -04:00
Mark DePristo	bd977c2d92	Bug fix to avoid infinite loop in GATKScatterFunction	2011-11-02 16:20:42 -04:00
Mark DePristo	c1da8cd5e7	Final version of bp-resolved locus scatter/gather -- Minor refactoring to allow LocusScatterFunction to have maxIntervals be the original scatter count, rather than capping this by the interval count as Contig and Interval do	2011-11-02 11:26:34 -04:00
Mark DePristo	c2b97030a4	IntervalUtils for completely balanced locus-based scatter/gather -- scatterLocusIntervals master utility -- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc -- Util function for reversing a list (List<T> -> List<T>, unlike Collections version) -- DoC is PartitionType.INTERVAL -- Significant unit tests on new functionality (all passing) -- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work	2011-11-02 10:49:40 -04:00
Mark DePristo	5fc613f972	Better default partition types for walkers -- Added PartitionType.READ, and associated ReadScatterFunction. ReadScatterFunction is literally just ContigScatterFunction until someone wants to implement something better -- LocusWalkers (and subclasses RodWalkers and RefWalkers) are by default PartitionType.LOCUS.	2011-11-01 19:47:10 -04:00
Khalid Shakir	e25d40882a	Swapping Thread.sleep(0) with Object.wait(0) caused Queue to lock up. Thanks to rpoplin for pointing it out.	2011-10-28 15:51:03 -04:00
Khalid Shakir	b80d407dc7	No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path. Other minor cleanup.	2011-10-27 14:17:07 -04:00
Eric Banks	b39fcb1bea	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-26 15:44:25 -04:00
Eric Banks	3273c20c98	Added integration tests for Tribble-based intervals and fixed up some of the other tests based on some method changes.	2011-10-26 15:29:18 -04:00
Khalid Shakir	fac9932938	Embedding gsalib source and queueJobReport R scripts in the dist and package jars. Moved gsalib and queueJobReport.R to embeddable namespaced locations. Updated packager dependencies/dir to add an @includes which filters the embedded fileset. RScriptExecutor can now JIT compile the gsalib. RScriptExecutor uses ProcessController and sends the Rscript output to java's stdout when run under -l DEBUG. Refactored ProcessController and IOUtils from Queue to Sting Utils. Added more unit tests to ProcessController along with a utility class to hard stop OutputStreams at a specified byte count. Replaced uses of some IOUtils with Apache Commons IO. ShellJobRunner refactored to use direct ProcessController and now kills jobs on shutdown. Better QGraph responsiveness on shutdown by using Object.wait() instead of Thread.sleep().	2011-10-24 15:58:34 -04:00
Mauricio Carneiro	9f867d77ca	no sort order subtle bug fixed.	2011-10-20 18:44:09 -04:00
Mauricio Carneiro	c9d8b22092	Added BWASW support to the pipeline Data Processing Pipeline can now use BWASW for realigning the reads. Useful for Ion Torrent data.	2011-10-20 18:36:28 -04:00
Menachem Fromer	e5fc828546	With Khalid's implicit approval, I have removed this line that overrides the memory limit of the VCF-gathering function, so that the inherited limit remains	2011-10-18 14:47:39 -04:00


100 Commits (bfbf1686cd0f71c94dea59c84b6c74c71f0ae1af)
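
Illustrative note on commits 23e7f1bed9 and c50274e02e: both describe merging overlapping intervals before scattering, so that no site is processed twice. The following is only a minimal sketch of that idea in Python, not the GATK IntervalUtils implementation. The tuple representation (contig, start, stop) with 1-based inclusive coordinates, the function name merge_overlapping, and the lexicographic contig ordering are all assumptions made for this example; abutting intervals are deliberately left unmerged here.

```python
from typing import List, Tuple

# Hypothetical representation: (contig, start, stop), 1-based and inclusive.
Interval = Tuple[str, int, int]


def merge_overlapping(intervals: List[Interval]) -> List[Interval]:
    """Return the intervals sorted, with overlapping ones on the same contig merged.

    Assumption: contigs are ordered lexicographically, not in reference order.
    """
    merged: List[Interval] = []
    for contig, start, stop in sorted(intervals):
        # Overlap means: same contig, and this start falls inside the previous interval.
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            prev_contig, prev_start, prev_stop = merged[-1]
            merged[-1] = (prev_contig, prev_start, max(prev_stop, stop))
        else:
            merged.append((contig, start, stop))
    return merged
```

With this sketch, the list [("chr1", 100, 200), ("chr1", 150, 300)] collapses to the single interval ("chr1", 100, 300), so a later scatter step would see each site only once.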