Commit Graph

213 Commits (05fac8583d9a1fc00a0c7c31e58fd1d19e91eac7)

Author SHA1 Message Date
carneiro cf15819db5 updated to work with the new VariantEval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5176 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 17:46:07 +00:00
rpoplin 47357b726e Fixing import GenotypeCalculationModel since it doesn't exist anymore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5175 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 15:39:43 +00:00
fromer 7605f0e6c1 Corrected input/output definitions for Queue
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5173 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 07:39:00 +00:00
fromer 3839fd1a25 Updated phasing pipeline to properly read samples from VCF and BAM files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5172 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 07:16:05 +00:00
fromer 798955b006 After discussing with Mark, revert to "Master merging" of phase information from VCFs. This has the advantage of creating minimal phased VCFs from RBP, from which phase info is merged into the original "master VCF". Also, updated Genotype.sameGenotype() to be simpler and NOT REVERSE the ignorePhase flag in comparing Allele lists/sets
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5167 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 19:50:15 +00:00
fromer a89400b20c Simple implementation to retrieve relevant BAM files for each sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5152 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 00:03:03 +00:00
fromer f258363cfc Minor bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5150 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 22:29:28 +00:00
fromer 742bd44728 Changed output file to be user-defined
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5149 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 22:15:26 +00:00
fromer 6c99dc4dab Take (partial) ownership of phasing 1000G chr20 calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5147 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 21:49:41 +00:00
kshakir 23578b7402 Pipeline tests will only start from scratch after "ant clean", making it faster to debug downstream issues when re-running "ant pipelinetest -Dpipeline.run=run".
Updated the FCP, the test, and the ADPR to handle an issue with the ADPR locating the yaml generated by the FCPTest.
Does not solve the ADPR error: Error in dimnames(x) <- dn : length of 'dimnames' [1] not equal to array extent


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5126 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-29 19:44:03 +00:00
kshakir 2ef66af903 Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high.
Moved the BamListWriter from FCP to ListWriterFunction in the Queue core.
Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s.
Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:33:58 +00:00
corin b25d131481 updated to work with the new tearsheet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5113 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 18:49:11 +00:00
carneiro cae4b9b0de quick update with the correct CEU trio bam file and its final location.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5098 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 19:17:19 +00:00
ebanks 68729045ca Always best to use the left-aligned version of the dbsnp vcf
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5091 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 20:21:50 +00:00
delangel fa0c476b82 Script for calling indels in all phase 1 samples - VQSR part still needs work but raw calling is done
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5052 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-22 14:07:10 +00:00
carneiro a0731eaa81 updated NA12878 Trio gold standard data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5048 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 18:48:31 +00:00
depristo 94b64ec54a Moving scala script into analysis directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5047 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 18:42:18 +00:00
depristo b45566760e intermediate checkin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5045 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 18:39:25 +00:00
rpoplin b6497c404f Moving Phase1Calling qscript over to using the cleaned, pre-BAQed bams
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5039 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 02:41:20 +00:00
carneiro fc73569d62 Added NA12878 Trio dataset to the pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5037 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 23:15:33 +00:00
kshakir 8855f080c2 For the fullCallingPipeline.q:
- Reading the refseq table from the YAML if not specified on the command line.
- Removed obsolete -bigMemQueue now that CombineVariants runs in 4g.
- Added a -mountDir /broad/software option to work around adpr automount issues.
- Merged the LSF preexec used for automount into the shell script used to execute tasks.
- Using the LSF C Library to determine when jobs are complete instead of postexec.
- Updated queue.sh to match the changes above.
- Updated the FCPTest to match the changes above.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5036 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 22:34:43 +00:00
depristo 41c8552d0a Added implements HasGenomeLocation to all relevant classes. It's now possible to write generic code for working with objects that support the getLocation() function in HasGenomeLocation. Please, if you have an object that has a location, implement this interface and start using / writing generic functions to sort, compare, etc. these objects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5031 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:54:03 +00:00
kshakir 4d611e53e7 Passing the ADPR R script to FCPTest.
Changed the FCP.q to use an InProcessFunction to work around the -runDir issue GSA-420.
Tested the FCPTest using the following dotkits and "ant clean pipelinetest -Dpipeline.run=run":
  - R-2.11
  - Oracle-full-client
  - .cx-oracle-5.0.2-python-2.6.5-oracle-full-client-11.1


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5029 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 06:08:45 +00:00
corin 50fcebb0c4 Incorporates tearsheet and plot production with database access into standard pipeline. Note that the following dotkit packages must be run before the adpr will be correctly generated:
R-2.10, 
Oracle-full-client, 
cx-oracle-5.0.2-python-2.6.5-oracle-full-client-11.1

This also removes the unused titv argument


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5024 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 20:48:42 +00:00
rpoplin 55eb0387ac Another relevant qscript. I use this one to do thousands of variant recalibration jobs to search for optimal parameters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5019 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 18:17:32 +00:00
chartl a463dbcda1 Refactoring the qscript directory; oneoffs, playground, and core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5017 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 15:23:40 +00:00
rpoplin 7db9601c9d Checking in the 1000G phase1 cleaning and calling scripts for posterity's sake, but also to show everyone what the current best practices for VQSR training look like.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5015 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 14:32:52 +00:00
rpoplin 457c59e737 Use the sites-only HapMap files in the Methods development pipeline
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5013 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 20:50:09 +00:00
carneiro 35a4f1e366 - Added VariantEval as an optional step in the pipeline.
- Lifted to HapMap 3.3
- Lifted to dbSNP 132 where possible.
- Added the CEU-Trio WEx(hg19) dataset
- Added some options to the pipeline

You can now use:

-dataset WEX
-dataset HiSeq
...

to choose which datasets to run through the pipeline.

You can now run without BAQ and indel mask:

-noBAQ 
-noMASK

Choose not to run the gold standard comparison analysis:

-skipGoldStandard

Activate the VariantEval walker analysis on the Recalibrated vcf:

-eval

The default behavior is to run exactly like it used to, so this version shouldn't change the way you used to use the pipeline.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5004 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 21:55:02 +00:00
carneiro c4f9b262e5 removing the tech dev pipeline script from the repository to keep the methods development pipeline as the reference script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4992 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 18:15:55 +00:00
carneiro 9e93091e9a -baqGOP now takes phred scaled scores instead of probabilities in the command line.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4982 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 00:06:38 +00:00
kshakir 8ba3a5a43f Command lines for locally run Queue jobs no longer have to be escaped differently than bsub'ed jobs.
GSA-410 Local job runs now can run command lines longer than 4096 on our linux machines.
When determining if the help text and Queue extensions need to be rebuilt, use the .class files not the .java so that GATK oneoffs are picked up correctly.
Added the most basic of all example QScripts for debugging, Hello World.
Minor updates to copy/pasted LSF code to reduce ant javadoc warnings by a third.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4970 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 21:07:29 +00:00
kshakir b34e2f733f Removed stochasticity from IndelRealigner by random sampling using a seed based on the read list.
Updated the Queue scatter/gather for read walkers to include -L unmapped on the last scatter job when intervals aren't specified, and to map it correctly when it is explicitly set.
Simplified the build.xml/ivy.xml to fix a bug reported with "ant clean dist test" where the scalac target wasn't found.
Now building all scala code at the same time, just like all java code is compiled at the same time.
Sped up the build for everyone by uncommenting a small bit of classes so that javac/scalac will not constantly launch trying to build .class files that will never compile.
Moved some source files to their expected location so that the .java/.scala -> .class is a one-to-one match, again keeping the compilers from wasting cycles.
Used <uptodate> and <touch> to skip extracting the help text and generating the GATK Queue extensions when the source files haven't been modified.
Fixed a couple errors when the <javadoc> task is run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4963 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:03:36 +00:00
chartl 3e7802a3e0 Minor changes to a qscript and the GQ constants on PrivatePermutations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4956 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:26:21 +00:00
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
rpoplin 20f29e4690 In the Methods development pipeline the call confidence threshold must be lowered from the default value for lowpass calling. What a bone-headed mistake!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4941 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 20:30:55 +00:00
corin 6d809321d3 Updating combine variants memory limit and dcov default for the full calling pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4907 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-24 03:06:50 +00:00
depristo 5265f943b0 phasing per sample. tmp checkin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4898 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 20:14:06 +00:00
corin e7569cfe6f Updated dbsnp version usage. Calling with 132, but still using 129 for eval to maintain consistent known/novel eval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4895 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 17:37:27 +00:00
chartl 2235245af0 PrivatePermutations generalized to compute transition counts and average probabilities (and thus was renamed). Changes in some pipelines to reflect the change. Bugfix in the batch merging pipeline (it would halt because the allele VCF for genotyping batches could become off-spec).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4894 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 15:16:15 +00:00
rpoplin 7185fcb47b Committing my notes about the methods development pipeline so we stay synced up while I'm on vacation. Cheers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4891 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 21:14:20 +00:00
chartl 80770dc032 Expanded target pipeline complete. Stop trying to be clever about scatter-gather; wait until functional SG is built-in to Q. Til then, a lazy version of the fullCallingPipeline. Seems to take a long time to generate the graph though...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4888 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 00:56:16 +00:00
kshakir 758d14a261 Checking in scripts used for testing the linear index MAX_FEATURES_PER_BIN.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4887 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 21:25:36 +00:00
chartl fc33901810 Graph structure must be known at compile time. Removing GroupIntervals until a future point where in-process-functions can predict their output based on inputs [though this is probably forever: the inputs may not exist at compile time!]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4886 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 21:22:58 +00:00
chartl 61d5daa65c EXTREME interval processing. Still undergoing testing.
+ GroupIntervals allows user-defined scattering (e.g. take an interval list file, split it into k smaller interval list files by number of lines)
+ ExpandIntervals expands the intervals, either by widening them, or allowing the definition for nearby intervals (e.g. flanks starting 1bp before and after, ending 10bp after that)
+ IntersectIntervals takes n interval lists, writes 1 interval list that is the n-way intersection of all of them



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4885 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 19:42:50 +00:00
rpoplin 4ca1da1d07 Updating the NA12878.HiSeq bam file to be the correct bam file in the methods development qscript.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4879 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 14:53:10 +00:00
rpoplin 8fac346ac1 Misc cleanup in Methods Development Qscript
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4878 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 04:24:25 +00:00
rpoplin 34ab5b4889 Turning on BAQ in Methods Development pipeline. A new dataset is added: 363 EUR samples from the November 1000G release.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4877 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-19 21:13:25 +00:00
chartl 8118a439c0 Commit for Khalid
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4876 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 22:24:18 +00:00
rpoplin 15a33545f4 Updating Methods development pipeline qscript with the bam lists for all the data sets. It is ready for people to start running with it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4875 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 22:19:14 +00:00
corin f0ab7b849a Adding a window size variable to avoid indel genotyper error
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4873 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 04:19:54 +00:00
rpoplin bdef4e775a Initial checkin of methods development pipeline qscript. It allows the methods dev team to run an overnight job which calls and recalibrates a variety of data sets and allows for an end-to-end sanity check of final results for potential changes to the methods. It isn't meant to be used by anybody quite yet, but shows the general structure and flow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4871 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:14:02 +00:00
rpoplin 095fc1922a By popular demand I'm adding the qscript I used to do the 660 bamfile 1000G calling for ASHG. It does cleaning, BAQing, and merging in 3mb chunks genome-wide then calls SNPs on those temporary bams.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4866 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 18:49:03 +00:00
depristo 32d5397c01 Experimental support for sided annotations. Currently not more/less valuable than two-tailed testing. Future experiments are needed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4864 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:08:31 +00:00
chartl 0d18bd1011 Now that addAll() is in the superclass, no longer need this definition (which, without override, prevents the script from compiling anyway)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4862 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 05:36:31 +00:00
chartl 3e75431bc8 Thanks to Mark: VCFInfoToTable removed in favor of a more flexible walker. Slight change to the argument structure of the walker to make it play more nicely with Queue: the field list parsing is pushed into the command line system (e.g. the variable is exposed as a List<String> and not a String, so Queue doesn't have to join a list into a string only to have it broken out again). This also allows the user to specify -F field1 -F field2 -F field3 if he/she so desires.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4842 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 03:33:36 +00:00
chartl 2217837845 Commit for Khalid -- should be a scala version of vcf2table but for some reason the run method isn't getting called.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4841 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 00:44:15 +00:00
chartl f36861eeee One more little bugfix -- the issue was not the grep command, but instead the NFS in the awk; I changed it to ++count in the last commit, which was really responsible for the fix. Then this ultra-escaping semi-broke the grep again.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4831 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 20:36:14 +00:00
chartl d34c5640d2 Bugfix for clf version of extract samples. Due to dynamic shell creation and bsubs and whatnot, the OR pipe for grep ("a|b") needs to be super-escaped ("a\\\\\\\\|b").
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4829 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 19:06:30 +00:00
chartl f795b25c47 In-process versions of sample extraction and interval-list conversion for VCF files. Required an in-process-function branch of the queue library.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4827 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 17:36:53 +00:00
depristo e219f6a4b5 Q script to run VQSR on a whole variety of common data sets. To be used as a basis for general methods development pipeline
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4826 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 16:55:52 +00:00
chartl 7bc2049031 Updates and bug fixes to private mutations qscript and pipeline libraries. Hand filter strings are now not busted (boo to having to escape quotes); convenience method added to VariantCalling to propagate standard trait data to a given GATK command line -- should be made more scala-esque in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4824 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 04:55:13 +00:00
chartl cf75caf653 java changes:
VariantEvalWalker's logger is made public, so that variant eval modules can access it through the parent object.
DesignFileGenerator comment lists how best to bind things to it, and the feature accessor is better refined to grab the genome loc. (old change)

scala changes:

convenience addAll( List[CommandLineFunction] ) added to QScript class (and thus removed from the fCPV2)
useful command line functions added to a new library package for command line functions (these are fast simple VCF command lines)
bug fixed in ProjectManagement for the case where there's only one batch to be batch-merged (not really part of the use-case, but an edge-condition that came up during pipeline testing)
first draft of a private mutations pipeline which will be elaborated in future



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4823 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-12 05:10:45 +00:00
chartl 81290d238d Restructuring my qscripts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4821 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 20:58:45 +00:00
kshakir 56433ebf6b Switched from LSF command line wrappers to JNA wrappers around the C API. Side effects:
- bsub command line is no longer fully printed out.
- extraBsubArgs hack is now a callback function updateJobRun.
Updated FullCallingPipelineTest to reflect latest changes to fullCallingPipeline.q.
Added a pipeline that tests the UGv2 runtimes at different bam counts and memory limits.
Updated VE packages that live in oneoffs to compile to oneoffs.
Added a hack to replace the deprecated symbol environ in Mac OS X 10.5+ which is needed by LSF7 on Mac.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4816 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 04:36:06 +00:00
corin 27acede64d Removing old arguments. We'll now be running with the defaults.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4811 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 18:58:56 +00:00
chartl f8dd59c1d1 Tightening of the batch merging pipeline. Optimized to run on hour queue, so please: if you run this, crush 'hour' with it. Testing is forthcoming, but it merged 700 samples overnight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4805 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 14:36:23 +00:00
chartl f4c43f013f Due to the overhead for reading VCF files (>32g for 700 5MB VCF files), batched merging has to generate likelihoods in batches.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4796 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 18:23:54 +00:00
chartl 0944184832 Major refactoring of library and full calling pipeline (v2) structure.
Arguments to the full calling qscript (and indeed, any qscript that wants them) are now specified via the PipelineArgumentCollection

Libraries require a Pipeline object for instantiation -- eliminating their previous dependence on yaml files

Functions added to PipelineUtils to build out the proper Pipeline object from the PipelineArgumentCollection, which now contains 
additional arguments to specify pipeline properties (name, ref, bams, dbsnp, interval list); which are mutually exclusive with
the yaml file.

Pipeline length reduced to a mere 62 lines.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4790 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 02:33:54 +00:00
corin bdc7516168 Taking out recalibrating for now, since having these files is confusing people and we've not gone to dbsnp 132 yet so cluster generation's broken with these command lines.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4786 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 22:12:09 +00:00
chartl 220fb0c44a Added a pipeline for merging batches. For now takes a file containing a list of VCFs, and a file containing a list of bams. Does not do anything smart (e.g. if you leave out some .bams or add some extra ones, you will not be warned). Heavy lifting done in (the beginnings of) a library for managing multi-batch or multi-project tasks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4771 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 07:31:59 +00:00
chartl 9f03f09cc9 Changes to V2 pipeline and libraries. AB dropped. Cleaning enabled. Project name now properly propagated to intermediate files (instead of the string repr of the object). Indel mask is now expanded prior to filtering at indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4769 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 18:55:48 +00:00
chartl 06a0fb4489 Library-ized pipeline now functions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4759 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 21:34:59 +00:00
ebanks 4413208c45 Removing unnecessary and incorrect includes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4752 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 02:06:48 +00:00
corin 6b70cde0b9 Adding a forgotten quote mark
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4729 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 16:38:27 +00:00
corin e15d18129c Adding by sample metrics. Not sure why we didn't have this in here in the first place
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4723 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 21:36:03 +00:00
corin fe28f8da9c Removing Uniquify from main pipeline indel merge, since the pipeline isn't merging from samples with the same name anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4721 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 17:25:22 +00:00
kiran 28805d17ca Commenting out allele-balance for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4715 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-22 16:48:08 +00:00
corin 8dca5bd861 Putting the annotation back in, both to the filters and to UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4709 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 21:02:15 +00:00
corin da1fe5bb37 Removing the AB filter given that we don't have that in the VCF anymore
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4708 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 20:22:05 +00:00
hanna 302cc13735 Trying out Queue for the first time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4705 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 18:29:12 +00:00
corin 5466365575 Fixing a silly typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4680 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:16:51 +00:00
corin a64f693b20 Updated pipeline script to include dbSnp for UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4679 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:09:47 +00:00
kshakir 801c562909 Now actually checking in the integration test mentioned in the prior commit: compiles the full calling pipeline.
Removed QScript usages of VariantRecalibrator's -reportDatFile, --report_dat_file


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4668 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-14 04:27:10 +00:00
kshakir 673fa841a4 Updated PluginManager so that during testing Queue can dynamically compile and load separately multiple class directories into the same class loader.
Removed obsolete usages of PackageUtils with updated PluginManager.
Ported Queue interval utilities written in scala over to Sting's java IntervalUtils.
Added a very basic integration test to ensure that the fullCallingPipeline.q compiles.
Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test).
While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1".
Upgraded to scala 2.8.1 and updated calls to deprecated functions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 20:14:28 +00:00
chartl c19f567424 Sometimes, inputs are really outputs in disguise.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4631 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-05 19:51:16 +00:00
chartl 0e40321a52 Brütall hack: make the bam list creator job wait for the interval creator job, so that there is an implicit dependency of UG on the interval list, by way of the bam list
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4628 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-04 20:43:11 +00:00
chartl cb0b2f9811 My analysis script for private mutations. I'm committing it because it contains a number of specialized command line functions that could prove useful in the future. (For example: ConcatVCF and ExtractSample)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4626 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-04 19:57:27 +00:00
chartl 42e9987e69 Bug fix to GenotypeConcordance. AC metrics get instantiated based on number of eval samples; if Comp has more samples, we can see AC indices outside the bounds of the array.
Bug fix to LiftoverVariants - no barfing at reference sites.

AlleleFrequencyComparison - local changes added to make sure parsing works properly

Added HammingDistance annotation. Mostly useless. But only mostly.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4622 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-03 19:23:03 +00:00
kshakir 5cdd7a7ba4 There's no such thing as a sam index, so the GATK extension generator doesn't need to add an @Input for them.
Updated a call to swapExt to specify the directory.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4586 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 20:39:03 +00:00
corin 6d7ed5781c Added Dbsnp to Indel Realigner; added known indels rod-binding to realigner.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4576 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 22:22:28 +00:00
kshakir 8211cee0b2 Queue UI Improvements:
- Forcing user to set the temp directory via -Djava.io.tmpdir to avoid filling up /tmp.
- By default deleting job outputs tagged as intermediate.
- Defaulting pipeline to scatter count 1 (no reads deleted).
- Cleaning up temp classes even when scripting fails.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4573 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 19:49:08 +00:00
kshakir 80259b9e20 Changed fullCallingPipeline to output all contigs in the reference if scattering.
When the cleaner interval scatter count is set to one, explicitly setting the intervals to Nil.
TODO: Need to add an option that lets the user choose from the command line to scatter all contigs or just those in the intervals list.  For now can get relatively the same behavior by setting the interval scatter count equal to the number of contigs+1, assuming the random contigs come at the end of the sequence dictionary.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4565 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-24 03:01:06 +00:00
kshakir e9c6f681a4 Instead of the pipeline's cleaner only writing BAMs with the target intervals, now pulling the list of contigs from the target intervals and outputting reads in those contigs.
Added a brute force -retry <count> option to Queue for transient errors.
Waiting up to 2 minutes for the LSF logs to appear before trying to display the errors from the logs.
Updates to the local job runner error logging when a job fails.
Refactored QGraph's settings as duplicate code was getting out of control.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4563 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 22:22:30 +00:00
kshakir b954a5a4d5 - After removing special code for intervals, instead of being of type File they are generated as List[File]. Changed the previous checkin, which appended to this list, to assign a singleton list instead.
- More cleanup including removing the temporary classes and intermediate error files.  Quieting any errors using Apache Commons IO 2.0.
- Counting the contigs during the QScript generation instead of the end user having to pass a separate contig interval list.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4539 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 06:37:28 +00:00
kshakir 88a0d77433 Changed parsing engine to store the order of the argument bindings based on their definition in the class, moving "-T" to the front of Queue command lines.
Queue GATK generated .intervals is now a List(File) again removing special case handling in the generator.
Instead of using @Scatter annotation, using ScatterFunction instance to determine if a job can be scattered.
Implemented special VcfGatherFunction which only uses the header from the first file, even if the other files differ in their headers.
Added a -deleteIntermediates to Queue to delete the outputs from intermediate commands after a successful run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4536 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 21:43:52 +00:00
kshakir 81479229e1 QScript authors can now tag functions as intermediate. Functions tagged as intermediate will be skipped unless another function in the graph needs their output.
Re-logging the failed jobs and the path to their log files at the end of a run.
Added a parameter -bigMemQueue for the fullCallingPipeline.q instead of hardcoding gsa (gsa was backed up and it was actually faster to run on week).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4520 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 22:11:14 +00:00
chartl 2bc5971ca1 Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it.
Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. 

**IMPORTANT** I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However, this appears to clash with what the jexl expression is looking for. For now, the integration test itself needed to be changed; it's unclear what to do when the header specifies AC as being one class but recalculating it casts it to another.

I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 03:18:01 +00:00
kshakir 7157cb9090 While bkill'ing on the shutdown thread Queue will no longer try to submit more jobs on the original thread.
Updated pipeline output structure to current recommendations by Corin.
Directories are now automatically created before the function runs.
Fixed several bugs with scatter gather binding when the script author needs to change the directories.
Fixed bug with tracking of log files for CloneFunctions.
More error handling and logging of exceptions (good test environment while LSF was down this early AM!)
Removed cleanup utility for scatter gather.  SG Output structure has changed significantly.  Will need to discuss and find a better approach for Queue programmatically deleting files.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4504 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 17:01:36 +00:00
corin 5e0c4ecc21 Added DbSnp to VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4497 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 17:02:17 +00:00