gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Christopher Hartl	b567ed8793	Use the right reference path :(	2012-02-01 12:35:18 -05:00
Christopher Hartl	87a63d54d6	fix the script!	2012-02-01 12:05:29 -05:00
Christopher Hartl	810996cfca	Introducing: VariantsToPed, the world's most annoying walker! And also a busted QScript to run it that I need Khalid's help debugging ( frownie face ). Note that VariantsToPed and PlinkSeq generate the same binary file (up to strand flips...thanks PlinkSeq), so I know it's working properly. Hooray!	2012-02-01 10:39:03 -05:00
Mauricio Carneiro	052a4bdb9c	Turning off PHONE HOME option in the MDCP * MDCP is for internal use and there is no need to report to the Amazon cloud. * Reporting to AWS_S3 is not allowing jobs to finish, this is probably a bug.	2012-01-27 11:13:30 -05:00
Mauricio Carneiro	97499529c7	another small bug with the file extension.	2012-01-24 16:14:35 -05:00
Mauricio Carneiro	7c7ca0d799	fixing bug with fastq extension * PPP only recognized .fasta and .fq, failing when the user provided a .fastq file. Fixed.	2012-01-24 11:02:15 -05:00
Mauricio Carneiro	945cf03889	IntelliJ ate my import!	2012-01-23 21:46:45 -05:00
Mauricio Carneiro	2bb9525e7f	Don't set base qualities if fastQ is provided * Pacbio Processing pipeline now works with the new fastQ files outputted by the Pacbio instrument	2012-01-23 17:57:29 -05:00
Khalid Shakir	c18beadbdb	Device files like /dev/null are now tracked as special by Queue and are not used to generate .out file paths, scattered into a temporary directory, gathered, deleted, etc. Attempted workaround for xdr_resourceInfoReq unsatisfied link during loading of libbat.so.	2012-01-23 16:17:04 -05:00
Christopher Hartl	39e6df5aa9	Fix edge case for very small VCFs	2012-01-19 00:51:28 -05:00
Christopher Hartl	1e037a0ecf	Ensure second-to-last line printed	2012-01-19 00:33:08 -05:00
Christopher Hartl	9946853039	Remove duplicated line	2012-01-19 00:25:22 -05:00
Christopher Hartl	cf9b1d350a	Some minor changes to in-process functions that nobody else uses. CGL now properly ignores no-calls for external VCFs.	2012-01-19 00:20:49 -05:00
David Roazen	b7c65cb089	Merged bug fix from Stable into Unstable	2012-01-18 09:52:47 -05:00
David Roazen	d5199db8ec	Be explicit about setting the snpEff -onlyCoding option in the pipeline When run without an explicit -onlyCoding option, as we've been doing up to now, snpEff automatically sets -onlyCoding to "true" provided that there is at least one transcript marked as "protein_coding", which will always be the case for us in practice (and indeed, all pipeline runs so far with snpEff 2.0.5 have run with -onlyCoding auto-set to "true"). However, given the disastrous effect on annotation quality setting "-onlyCoding false" has, we wish to be explicit with this option rather than relying on snpEff's auto-detection logic.	2012-01-17 20:04:27 -05:00
Ryan Poplin	75f87db468	Replacing Mills file with new gold standard indel set in the resource bundle for release with v1.5	2012-01-17 15:02:45 -05:00
Khalid Shakir	a9a6516527	Merged bug fix from Stable into Unstable	2012-01-10 16:16:10 -05:00
Khalid Shakir	ef50e77ee2	When running Queue jobs locally, merge the stderr to the stdout log if the error file is NOT specified. Updated VE strats in the HSP for plotting Ka/Ks by AC.	2012-01-10 16:10:25 -05:00
Mauricio Carneiro	5bf960deb8	adding dbsnp to indel VQSR	2012-01-10 12:38:49 -05:00
Mauricio Carneiro	6f2abd76df	Updating the MDCP with the new indel gold standard from Ryan.	2012-01-09 15:31:18 -05:00
Khalid Shakir	5793625592	No more "Q-<pid>@<host>". Generated log file names now use the first output + ".out" (ex. my.vcf.out) or the name of the first QScript plus the order the function was added (ex. MyScript-1.out). The same function added twice with the same outputs will now have the same default logs, meaning the 2nd instance of the function won't be added to the graph twice. QScript accessor to QSettings to specify a default runName and other default function settings. Because log files are no longer pseudo-random their presence can be used to tell if a job without other file outputs is "done". For now still using the log's .done file in addition to original outputs. Gathered log files concatenate all log files together into the stdout. InProcessFunctions now have PrintStreams for stdout and stderr. Updated ivy to use commons-io 2.1 for copying logs to the stdout PrintStream. Removed snakeyaml. During graph tracking of outputs the Index files, and now BAM MD5s, are tracked with the gathering of the original file. In Queue generated wrappers for the GATK the Index and MD5s used for tracking are switched to private scope. Added more detailed output when running with -l DEBUG. Simplified graphviz visualization for additional debugging. Switched usage of the scala class 'List' to the trait 'Seq' (think java.util.ArrayList vs. using the interface java.util.List) Minor cleanup to build including sending ant gsalib to R's default libloc.	2012-01-08 12:11:55 -05:00
Mauricio Carneiro	f6a18aea63	Updated MDCP with INDEL best practices * chose 90.0 indel cut target for most datasets (this is arbitrary).	2012-01-06 17:21:59 -05:00
Mauricio Carneiro	3358c132a8	Updating the MD5s Clipping adaptor boundaries changed the results of CountCovariates which affected the PPP output. a few more loci were visible to locus walkers.	2011-12-21 15:14:05 -05:00
Mark DePristo	0cc5c3d799	General improvements to Queue -- Support for collecting resources info from DRMAA runners -- Disabled the non-standard mem_free argument so that we can actually use our own SGE cluster gsa4 -- NCoresRequest is a testing queue script for this. -- Added two command line arguments: -- multiCoreJerk: don't request multiple cores for jobs with nt > 1. This was the old behavior but it's really not the best way to run parallel jobs. Now with queue if you run nt = 4 the system requests 4 cores on your host. If this flag is thrown, though, it will only request 1 and you'll just use 4, like a jerk -- job_parallel_env: parallel environment name used with SGE to request multicore jobs. Equivalent to -pe job_parallel_env NT for NT > 1 jobs	2011-12-20 14:05:09 -05:00
Khalid Shakir	6059ca76e8	Removing cruft that snuck in last commit.	2011-12-16 23:00:16 -05:00
Khalid Shakir	7486696c07	When using bam list mode in HSP deriving VCF name from bam list instead of requiring an additional parameter. Creating a single temporary directory per ant test run instead of a putting temp files across all runs in the same directory. Updated various tests for above items and other small fixes.	2011-12-16 18:09:25 -05:00
Mark DePristo	550fb498be	Support for NT testing (default up to 4) for CC and UG -- Added convenience function addJobReportBinding to just new binding to the map (x -> y) as well	2011-12-14 18:45:00 -05:00
Mauricio Carneiro	663184ee9d	Added test mode to PPP * in test mode, no @PG tags are output to the final bam file * updated pipeline test to use -test mode. * MD5s updated accordingly	2011-12-12 18:29:06 -05:00
Mauricio Carneiro	a3c3d72313	Added test mode to DPP * in test mode, no @PG tags are output to the final bam file * updated pipeline test to use -test mode. * MD5s are now dependent on BWA version	2011-12-12 18:29:06 -05:00
Mauricio Carneiro	52c64b971f	Updating MD5s -- really dont know why it didn't update before	2011-12-12 09:48:58 -05:00
Mauricio Carneiro	ed91461c49	Data Processing Pipeline Test * Added standard pipeline test for the DPP * Added a full BWA pipeline test for the DPP * Included the extra files for the reference needed by BWA (to be used by DPP and PPP tests)	2011-12-12 00:24:51 -05:00
Mauricio Carneiro	cca8a18608	PPP pipeline test * added a pipeline test to the Pacbio Processing Pipeline. * updated exampleBAM with more complete RG information so we can use it in a wider variety of pipeline tests * added exampleDBSNP.vcf file with only chromosome 1 in the range of the exampleFASTA.fasta reference for pipeline tests	2011-12-11 17:32:21 -05:00
Mauricio Carneiro	21ac3b59d7	Merged bug fix from Stable into Unstable	2011-12-09 16:51:46 -05:00
Mauricio Carneiro	13905c00b3	Updating PacbioProcessingPipeline to new Queue standards	2011-12-09 16:51:02 -05:00
David Roazen	1ba03a5e72	Use optional() instead of required() to construct javaMemoryLimit argument in JavaCommandLineFunction	2011-12-05 14:06:00 -05:00
David Roazen	d014c7faf9	Queue now properly escapes all shell arguments in generated shell scripts This has implications for both Qscript authors and CommandLineFunction authors. Qscript authors: You no longer need to (and in fact must not) manually escape String values to avoid interpretation by the shell when setting up Walker parameters. Queue will safely escape all of your Strings for you so that they'll be interpreted literally. Eg., Old way: filterSNPs.filterExpression = List("\"QD<2.0\"", "\"MQ<40.0\"", "\"HaplotypeScore>13.0\"") New way: filterSNPs.filterExpression = List("QD<2.0", "MQ<40.0", "HaplotypeScore>13.0") CommandLineFunction authors: If you're writing a one-off CommandLineFunction in a Qscript and don't really care about quoting issues, just keep doing things the direct, simple way: def commandLine = "cat %s | grep -v \"#\" > %s".format(files, out) If you're writing a CommandLineFunction that will become part of Queue and will be used by other QScripts, however, it's advisable to do things the newer, safer way, ie.: When you construct your commandLine, you should do so ONLY using the API methods required(), optional(), conditional(), and repeat(). These will manage quoting and whitespace separation for you, so you shouldn't insert quotes/extraneous whitespace in your Strings. By default you get both (quoting and whitespace separation), but you can disable either of these via parameters. Eg., override def commandLine = super.commandLine + required("eff") + conditional(verbose, "-v") + optional("-c", config) + required("-i", "vcf") + required("-o", "vcf") + required(genomeVersion) + required(inVcf) + required(">", escape=false) + // This will be shell-interpreted required(outVcf) I've ported the Picard/Samtools/SnpEff CommandLineFunction classes to the new system, so you'll get free shell escaping when you use those in Qscripts just like with walkers.	2011-12-01 18:13:44 -05:00
David Roazen	fdd90825a1	Queue now outputs a GATK-like header with version number, build timestamp, etc.	2011-11-23 14:28:35 -05:00
Khalid Shakir	c50274e02e	During flanking interval creation merging overlapping flanks so that on scatter the list doesn't accidentally genotype the same site twice. Moved flanking interval utilities to IntervalUtils with UnitTests.	2011-11-17 13:56:42 -05:00
Mark DePristo	0111e58d4e	Don't generate PDF unless you have -run specified	2011-11-09 14:45:40 -05:00
Mark DePristo	849c0757f2	Bug fix for LocusScatterFunction when no intervals are provided -- Now correctly grabs reference contigs and cuts them all up, rather than NPE as intervalString == null.	2011-11-04 10:55:09 -04:00
Mark DePristo	bd977c2d92	Bug fix to avoid infinite loop in GATKScatterFunction	2011-11-02 16:20:42 -04:00
Mark DePristo	c1da8cd5e7	Final version of bp-resolved locus scatter/gather -- Minor refactoring to allow LocusScatterFunction to have maxIntervals be the original scatter count, rather than capping this by the interval count as Contig and Interval do	2011-11-02 11:26:34 -04:00
Mark DePristo	c2b97030a4	IntervalUtils for completely balanced locus-based scatter/gather -- scatterLocusIntervals master utility -- Moved around some general functionality from GenomeLocSortedSet to GenomeLoc -- Util function for reversing a list (List<T> -> List<T>, unlike Collections version) -- DoC is PartitionType.INTERVAL -- Significant unit tests on new functionality (all passing) -- Ready for real-world testing, as soon as I can get LocusScatterFunction.scala to actually work	2011-11-02 10:49:40 -04:00
Mark DePristo	5fc613f972	Better default partition types for walkers -- Added PartitionType.READ, and associated ReadScatterFunction. ReadScatterFunction is literally just ContigScatterFunction until someone wants to implement something better -- LocusWalkers (and subclasses RodWalkers and RefWalkers) are by default PartitionType.LOCUS.	2011-11-01 19:47:10 -04:00
Mauricio Carneiro	dbd8c25787	No more R resources in the DPP updating the DPP to conform with Analyze Covariates changes.	2011-10-28 16:57:01 -04:00
Khalid Shakir	e25d40882a	Swapping Thread.sleep(0) with Object.wait(0) caused Queue to lock up. Thanks to rpoplin for pointing it out.	2011-10-28 15:51:03 -04:00
Khalid Shakir	b80d407dc7	No more hunting down R "resources". As a tradeoff Rscript cannot be specified on the commandline and will be found in the environment path. Other minor cleanup.	2011-10-27 14:17:07 -04:00
Eric Banks	b39fcb1bea	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-26 15:44:25 -04:00
Eric Banks	3273c20c98	Added integration tests for Tribble-based intervals and fixed up some of the other tests based on some method changes.	2011-10-26 15:29:18 -04:00
Khalid Shakir	fac9932938	Embedding gsalib source and queueJobReport R scripts in the dist and package jars. Moved gsalib and queueJobReport.R to embeddable namespaced locations. Updated packager dependencies/dir to add an @includes which filters the embedded fileset. RScriptExecutor can now JIT compile the gsalib. RScriptExecutor uses ProcessController and sends the Rscript output to java's stdout when run under -l DEBUG. Refactored ProcessController and IOUtils from Queue to Sting Utils. Added more unit tests to ProcessController along with a utility class to hard stop OutputStreams at a specified byte count. Replaced uses of some IOUtils with Apache Commons IO. ShellJobRunner refactored to use direct ProcessController and now kills jobs on shutdown. Better QGraph responsiveness on shutdown by using Object.wait() instead of Thread.sleep().	2011-10-24 15:58:34 -04:00
Mauricio Carneiro	86305a5dcf	Adjusting the memory limits of the MDCP Indel caller needs more than 3G for large datasets.	2011-10-21 17:41:52 -04:00
Mauricio Carneiro	9f867d77ca	no sort order subtle bug fixed.	2011-10-20 18:44:09 -04:00
Mauricio Carneiro	c9d8b22092	Added BWASW support to the pipeline Data Processing Pipeline can now use BWASW for realigning the reads. Useful for Ion Torrent data.	2011-10-20 18:36:28 -04:00
Mauricio Carneiro	093cd95c5d	Merged bug fix from Stable into Unstable	2011-10-20 17:03:22 -04:00
Mauricio Carneiro	d7367c152a	Fixing 'revert' when not realigning RevertSam was reverting the alignment information and that was screwing up the pipeline if you didn't want to run it with BWA. Fixed.	2011-10-20 17:01:54 -04:00
Mauricio Carneiro	ed402588cc	Adding the "gold standard NA12878" target	2011-10-20 16:19:13 -04:00
Mauricio Carneiro	c27e2fb676	Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-10-18 15:23:05 -04:00
Menachem Fromer	e5fc828546	With Khalid's implicit approval, I have removed this line that overrides the memory limit of the VCF-gathering function, so that the inherited limit remains	2011-10-18 14:47:39 -04:00
Mauricio Carneiro	0939d16a8d	String not empty bug Apparently var X: String = _ is not the same as var X: String = "". :(	2011-10-13 13:22:05 -04:00
Mauricio Carneiro	66b5646f95	Adding hidden options to the DPP controlling the default platform parameter to Count Covariates and the number of scatter gather jobs to generate are now available under hidden parameters	2011-10-11 13:56:00 -04:00
Mark DePristo	73f9d1f217	GATK read group requirement iron hand -- The GATK will now throw a user exception if it opens a SAM/BAM file that doesn't have at least one RG defined -- LIBS again throws an error if the complete list of samples isn't provided -- Updating ExmpleCountLociPipeline test to use the well-formatted versions of the exampleBAM and exampleFASTA files in testdata, instead of the old broken ones in validation_data. -- Convenience constructors for UserExceptions.MalformedBAM	2011-10-06 08:40:35 -07:00
Mark DePristo	a91509e7dd	Shouldn't be public	2011-10-05 15:22:57 -07:00
Khalid Shakir	84bd355690	Merged bug fix from Stable into Unstable	2011-09-27 14:34:39 -04:00
Khalid Shakir	b090751f62	Fixed Ant / PluginManager issue where reflections was picking up all class files under current working directory due to "." in jar manifest classpaths. Updates to HybridSelectionPipeline: - Added annotations back via snpEff - Minor updates to VQSR paths and lowered memory	2011-09-27 14:33:57 -04:00
Khalid Shakir	77ba59e30a	Merge branch 'master' of ssh://gsa3.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-27 00:51:45 -04:00
Khalid Shakir	648b959361	Minor change to log an info message when a signal such as Ctrl-C is caught.	2011-09-27 00:50:19 -04:00
Mauricio Carneiro	d3cc25454c	Updating the MDCP	2011-09-22 11:27:40 -04:00
Mauricio Carneiro	623c49765d	NO BAQ ON EXOMES! says the boss.	2011-09-22 11:13:40 -04:00
Ryan Poplin	5d0f284305	Fixing exome specific arguments to the VQSR in the methods development calling pipeline	2011-09-21 20:26:28 -04:00
Mauricio Carneiro	758ecf2d43	Bringing latest updates of ReduceReads to the master repository	2011-09-20 16:35:09 -04:00
Mauricio Carneiro	08ffb18b96	Renaming datasets in the MDCP Making dataset names and files generated by the MDCP more uniform.	2011-09-20 11:02:51 -04:00
Eric Banks	ba150570f3	Updating to use new rod system syntax plus name change for CountRODs	2011-09-19 13:30:32 -04:00
Eric Banks	095f75ff7d	Merge branch 'master' of ssh://gsa1.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable	2011-09-19 12:24:12 -04:00
Eric Banks	85626e7a5d	We no longer want people to use the August 2010 Dindel calls for indel realignment but instead Guillermo's new whole genome bi-allelic indel calls; updating the bundle accordingly. Also, there was some confusion by the 1000G data processing folks as to exactly what these indel files are, so I've renamed them so that it's clear. Wiki updated too.	2011-09-19 12:24:05 -04:00
Mark DePristo	6ea57bf036	Merge branch 'master' into sgintervals	2011-09-19 09:50:19 -04:00
Khalid Shakir	33967a4e0c	Fixed issue reported by chartl where cloned functions lost tags on @Inputs. Updated ExampleUnifiedGenotyper.scala with new syntax.	2011-09-16 12:46:07 -04:00
Ryan Poplin	981b78ea50	Changing the VQSR command line syntax back to the parsed tags approach. This cleans up the code and makes sure we won't be parsing the same rod file multiple times. I've tried to update the appropriate qscripts.	2011-09-12 12:17:43 -04:00
Mauricio Carneiro	7f9000382e	Making indel calls default in the MDCP You can turn off indel calling by using -noIndels.	2011-09-09 14:09:26 -04:00
Mark DePristo	06cb20f2a5	Intermediate commit cleaning up scatter intervals -- Adding unit tests to ensure uniformity of intervals	2011-09-09 12:56:45 -04:00
Khalid Shakir	510d5e7730	Merged bug fix from Stable into Unstable	2011-09-09 01:34:55 -04:00
Khalid Shakir	367bbee25a	Fixed typo when printing the contents or last N lines of a file. Thanks to larryns.	2011-09-09 01:33:25 -04:00
Mauricio Carneiro	ee9d599558	Just cleaning up old commented code from the data processing pipeline.	2011-09-07 13:32:40 -04:00
Mauricio Carneiro	28d782b4c7	Allowing multiple dbsnp and indel files in the DPP	2011-09-02 13:38:47 -04:00
Mauricio Carneiro	ad4ea0b80b	Merged bug fix from Stable into Unstable	2011-09-01 18:14:45 -04:00
Mauricio Carneiro	e253f6f05d	Fixing typo in DPP platform and library were exchanged when rebuilding the read group information	2011-09-01 18:13:52 -04:00
Mauricio Carneiro	d2a33beff7	Added WGS/WEX b37-decoy CEU trio datasets	2011-09-01 13:14:40 -04:00
Mark DePristo	61633c95a8	Default jobreport is now jobPrefix, so you see logs like Q-2508.jobreport.txt	2011-08-28 19:19:45 -04:00
Mark DePristo	b38de1fa35	Now captures the exechost in the job report -- Works for in process, shell, and LSF runners -- Cleanup of debugging output	2011-08-28 12:05:56 -04:00
Mark DePristo	e37a638e09	Fix for disallowed characters in GATKReportTable -- Illegal characters are automatically replaced with _	2011-08-26 13:24:06 -04:00
Mark DePristo	0cb1605df0	Clean documentation for JobRunInfo	2011-08-26 09:22:58 -04:00
Mark DePristo	415d5d5301	LSF long times are in seconds, convert to milliseconds to meet standard	2011-08-26 09:18:28 -04:00
Mark DePristo	eef1ac415a	Merge branch 'master' into rodTesting Conflicts: public/java/src/org/broadinstitute/sting/gatk/walkers/variantutils/VariantsToTable.java	2011-08-26 00:35:41 -04:00
Mark DePristo	e03dfdb0ab	Automatic iteration field addition works properly.	2011-08-25 16:59:02 -04:00
Mark DePristo	e01273ca7c	Queue now writes out queueJobReport.pdf -- General purpose RScript executor in java (please use when invoking RScripts) -- Removed groupName. This is now analysisName -- Explicitly added capability to enable/disable individual QFunction	2011-08-25 16:57:11 -04:00
Mark DePristo	0f4be2c4a4	Argument to disable queueJobReport entirely -- Minor improvements to RodPerformanceGoals	2011-08-25 13:32:03 -04:00
Mark DePristo	d65faf509c	Default output name for Queue JobReport is queue_jobreport.gatkreport.txt	2011-08-25 13:15:20 -04:00
Mark DePristo	a7d6946b22	Refactored QJobReport and QFunction, which is now automatically tracked -- All QFunctions, including sg ones, are tracked -- Removed memory information	2011-08-25 13:13:55 -04:00
Mauricio Carneiro	16caca0822	BLASR BAMs and new BWA parameters Added the functions to turn a BLASR generated BAM file into a usable BAM file. Modified the bwa parameters according to test results from NA12878 pb2k dataset.	2011-08-24 17:04:07 -04:00
Mauricio Carneiro	e3f5d7067a	Added ReorderSam queue binding	2011-08-24 17:03:11 -04:00
Mark DePristo	08fb21f127	Removing hostname	2011-08-24 16:45:50 -04:00
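
As an illustration of the shell-escaping change described in commit d014c7faf9, here is a minimal Scala sketch of how helpers such as required(), optional(), and conditional() might assemble a safely quoted command line. It is not Queue's actual implementation: the signatures, the single-quote strategy in the hypothetical shellQuote helper, and the example command are all assumptions made for illustration.

```scala
// Hypothetical sketch only. The helper names mirror those mentioned in
// commit d014c7faf9, but the signatures and quoting strategy are assumed.
object ShellEscapeSketch {

  // Wrap a value in single quotes so a POSIX shell reads it literally.
  // An embedded single quote becomes '\'' (close quote, escaped quote, reopen).
  def shellQuote(s: String): String =
    "'" + s.replace("'", "'\\''") + "'"

  // Every helper returns a fragment with a leading space, so fragments
  // can be concatenated without the caller managing whitespace.
  def required(value: String, escape: Boolean = true): String =
    " " + (if (escape) shellQuote(value) else value)

  // Emits "flag value" only when a value is present.
  def optional(flag: String, value: Option[String]): String =
    value.map(v => " " + flag + " " + shellQuote(v)).getOrElse("")

  // Emits the bare flag only when the condition holds.
  def conditional(condition: Boolean, flag: String): String =
    if (condition) " " + flag else ""

  def main(args: Array[String]): Unit = {
    val cmd = "mytool" +
      required("QD<2.0") +                    // quoted, so < is not a redirect
      conditional(true, "-v") +
      optional("-c", Some("my config.txt")) + // quoted, so the space is kept
      required(">", escape = false) +         // left for the shell to interpret
      required("out.vcf")
    println(cmd)
    // prints: mytool 'QD<2.0' -v -c 'my config.txt' > 'out.vcf'
  }
}
```

With helpers of this shape, a script author passes plain strings such as "QD<2.0" and leaves quoting to the library. Only fragments explicitly marked escape = false reach the shell uninterpreted by the quoting step.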


249 Commits (bfbf1686cd0f71c94dea59c84b6c74c71f0ae1af)