Commit Graph

135 Commits (8dca5bd8613a3740d7167c2f92bb5ccc089fd5fc)

Author SHA1 Message Date
corin 8dca5bd861 Putting the annotation back in, both to the filters and to UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4709 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 21:02:15 +00:00
corin da1fe5bb37 Removing the AB filter given that we don't have that in the VCF anymore
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4708 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 20:22:05 +00:00
kshakir 79725f2d9c Excluding the QFunction log files from the set of files to delete on completion.
When a QGraph is empty, displaying a warning instead of crashing with a JGraph internal assertion error.
Cleaned up code using the Log4J root logger and explicitly talking to a logger for Sting.
When integration tests are run, detecting that the logger has already been set up so that messages aren't logged twice.
Updated from Ivy 2.2.0-rc1 to 2.2.0.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4707 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 20:22:01 +00:00
hanna 302cc13735 Trying out Queue for the first time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4705 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 18:29:12 +00:00
corin 5466365575 Fixing a silly typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4680 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:16:51 +00:00
corin a64f693b20 Updated pipeline script to include dbSnp for UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4679 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:09:47 +00:00
kshakir 302e8f0239 Fixed bug where the command directory was not being set to an absolute path, leading LSF to write some .done files to /tmp.
No longer using the command directory for temporary .done files, and instead using the user specified temporary directory.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4678 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 17:59:39 +00:00
kshakir 801c562909 Now actually checking in the integration test mentioned in the prior commit: compiles the full calling pipeline.
Removed QScript usages of VariantRecalibrator's -reportDatFile, --report_dat_file


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4668 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-14 04:27:10 +00:00
kshakir 673fa841a4 Updated PluginManager so that during testing Queue can dynamically compile and load separately multiple class directories into the same class loader.
Removed obsolete usages of PackageUtils with updated PluginManager.
Ported Queue interval utilities written in scala over to Sting's java IntervalUtils.
Added a very basic integration test to ensure that the fullCallingPipeline.q compiles.
Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test).
While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1".
Upgraded to scala 2.8.1 and updated calls to deprecated functions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 20:14:28 +00:00
kshakir f35d1aa43f Moving all file cleanup to IOUtils for easier debugging.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4646 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 21:00:58 +00:00
hanna 8e36a07bea Convert GenomeLocParser into an instance variable. This change is required
for anything that needs to be simultaneously aware of multiple references, eg
Queue's interval sharding code, liftover support, distributed GATK etc.  

GenomeLocParser instances must now be used to create/parse GenomeLocs.
GenomeLocParser instances are available in walkers by calling either

-getToolkit().getGenomeLocParser()
or
-refContext.getGenomeLocParser()

This is an intermediate change; GenomeLocParser will eventually be merged
with the reference, but we're not clear exactly how to do that yet.  This
will become clearer when contig aliasing is implemented.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4642 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-10 17:59:50 +00:00
chartl c19f567424 Sometimes, inputs are really outputs in disguise.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4631 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-05 19:51:16 +00:00
chartl 0e40321a52 Brütall hack: make the bam list creator job wait for the interval creator job, so that there is an implicit dependency of UG on the interval list, by way of the bam list
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4628 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-04 20:43:11 +00:00
chartl cb0b2f9811 My analysis script for private mutations. I'm committing it because it contains a number of specialized command line functions that could prove useful in the future. (For example: ConcatVCF and ExtractSample)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4626 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-04 19:57:27 +00:00
chartl 42e9987e69 Bug fix to GenotypeConcordance. AC metrics get instantiated based on number of eval samples; if Comp has more samples, we can see AC indices outside the bounds of the array.
Bug fix to LiftoverVariants - no barfing at reference sites.

AlleleFrequencyComparison - local changes added to make sure parsing works properly

Added HammingDistance annotation. Mostly useless. But only mostly.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4622 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-03 19:23:03 +00:00
hanna 861ee3e37a Changing testing framework from junit -> testng, for its enhanced configurability.
Initial test to see how Bamboo will respond.  More detailed email to follow.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4609 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-01 21:31:44 +00:00
kshakir d768c6558d Now that the user is required to set the java temp directory, it is safer for the LsfJobRunner to write to the java temp directory instead of the command directory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4593 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-28 15:00:21 +00:00
kshakir 5cdd7a7ba4 There's no such thing as a sam index, so the GATK extension generator doesn't need to add an @Input for them.
Updated a call to swapExt to specify the directory.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4586 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 20:39:03 +00:00
hanna 4c23b1fe9c Get rid of the static cache of ArgumentTypeDescriptors by making them an integral part of the
parsing engine.  Hugely lowers our memory footprint in integrationtests, but not yet enough to 
run Mark's new parallelized VariantEvalIntegrationTests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4585 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 19:44:55 +00:00
corin 6d7ed5781c Added Dbsnp to Indel Realigner; added known indels rod-binding to realigner.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4576 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 22:22:28 +00:00
kshakir 8211cee0b2 Queue UI Improvements:
- Forcing user to set the temp directory via -Djava.io.tmpdir to avoid filling up /tmp.
- By default deleting job outputs tagged as intermediate.
- Defaulting pipeline to scatter count 1 (no reads deleted).
- Cleaning up temp classes even when scripting fails.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4573 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-26 19:49:08 +00:00
kshakir 80259b9e20 Changed fullCallingPipeline to output all contigs in the reference if scattering.
When the cleaner interval scatter count is set to one, explicitly setting the intervals to Nil.
TODO: Need to add an option that lets the user choose from the command line to scatter all contigs or just those in the intervals list.  For now can get relatively the same behavior by setting the interval scatter count equal to the number of contigs+1, assuming the random contigs come at the end of the sequence dictionary.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4565 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-24 03:01:06 +00:00
kshakir e9c6f681a4 Instead of the pipeline's cleaner only writing BAMs with the target intervals, now pulling the list of contigs from the target intervals and outputting reads in those contigs.
Added a brute force -retry <count> option to Queue for transient errors.
Waiting up to 2 minutes for the LSF logs to appear before trying to display the errors from the logs.
Updates to the local job runner error logging when a job fails.
Refactored QGraph's settings as duplicate code was getting out of control.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4563 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-22 22:22:30 +00:00
kshakir b954a5a4d5 - After removing special code for intervals, instead of being of type File they are generated as List[File]. Changed previous checkin that was appending to this list and instead assigning a singleton list.
- More cleanup including removing the temporary classes and intermediate error files.  Quieting any errors using Apache Commons IO 2.0.
- Counting the contigs during the QScript generation instead of the end user having to pass a separate contig interval list.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4539 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-21 06:37:28 +00:00
kshakir 88a0d77433 Changed parsing engine to store the order of the argument bindings based on their definition in the class, moving "-T" to the front of Queue command lines.
Queue GATK generated .intervals is now a List(File) again removing special case handling in the generator.
Instead of using @Scatter annotation, using ScatterFunction instance to determine if a job can be scattered.
Implemented special VcfGatherFunction which only uses the header from the first file, even if the other files differ in their headers.
Added a -deleteIntermediates to Queue to delete the outputs from intermediate commands after a successful run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4536 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-20 21:43:52 +00:00
kshakir 81479229e1 QScript authors can now tag functions as intermediate. Functions tagged as intermediate will be skipped unless another function in the graph needs their output.
Re-logging the failed jobs and the path to their log files at the end of a run.
Added a parameter -bigMemQueue for the fullCallingPipeline.q instead of hardcoding gsa (gsa was backed up and it was actually faster to run on week).


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4520 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-18 22:11:14 +00:00
chartl 2bc5971ca1 Added - a tool to fix reference bases of a VCF. The OMNI had a couple of sites with incorrect reference bases (look to be legacy from other chips), and a few more that had ref and alt flipped. GAP should probably take care of it, but since I need results by monday, I'm doing it.
Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC. 

**IMPORTANT** I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC as being one class, but recalculating it casts to another class, and I'm not sure what to do.

I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-17 03:18:01 +00:00
kshakir 9dc2e931b6 Saving the order in which functions are added in the QScript. Using the order during submission of ready jobs (but not currently dryrun) and during -status.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4508 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 20:00:35 +00:00
kshakir 7157cb9090 While bkill'ing on the shutdown thread Queue will no longer try to submit more jobs on the original thread.
Updated pipeline output structure to current recommendations by Corin.
Directories are now automatically created before the function runs.
Fixed several bugs with scatter gather binding when the script author needs to change the directories.
Fixed bug with tracking of log files for CloneFunctions.
More error handling and logging of exceptions (good test environment while LSF was down this early AM!)
Removed cleanup utility for scatter gather.  SG Output structure has changed significantly.  Will need to discuss and find a better approach for Queue programmatically deleting files.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4504 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-15 17:01:36 +00:00
corin 5e0c4ecc21 Added DbSnp to VariantEval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4497 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 17:02:17 +00:00
kshakir 63e3848187 Added status email support with -statusTo. Will send emails on failure of an individual function or success/failure of the whole pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4496 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-14 15:58:52 +00:00
kshakir 5034ca18dc ...and forgot to sync up the changes to CommandLineFunction with CloneFunction.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4492 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 22:40:02 +00:00
kshakir 5ee12875fb Emergency fix for Ryan:
- Catching errors when LSF fails and retrying.
- When LSF retries fail, catching the error, marking the job as failed, and no longer bkilling everything by exiting Queue.
- Caching function fields by class instead of each instance of a function saving a list of its fields.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4490 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-13 22:22:01 +00:00
chartl 6368a46bab Scala protected is more akin to Java private than Java protected. Not typing these defs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4470 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 19:36:23 +00:00
chartl bffb8bb01f The SVN repository is not for dumb analysis-specific scripts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4460 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 14:04:53 +00:00
chartl 21ec44339d Somewhat major update. Changes:
- ProduceBeagleInputWalker
 + Now takes a validation ROD and a prior to give it, will use those genotypes in place of the variant genotypes if both are present
 + Takes a bootstrap argument -- can use some given %age of the validation sites
 + Optionally takes a bootstrap output argument -- re-prints the validation VCF, filtering those sites used as part of the bootstrap
- BeagleOutputToVCFWalker
 + Now filters sites where the genotypes have been reverted to hom ref
 + Now calls in to the new VCUtils to calculate AC/AN

- Queue
 + New pipeline libraries for easy qscript creation, still a work in progress, but this is a considerable prototype
 + full calling pipeline v2 uses the above libraries
 + minor changes to some of my own scripts
 + no more need for contig interval lists, these will be parsed out of your normal interval list when it is provided



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4459 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 13:30:28 +00:00
kshakir e02f837659 Added the ability for Queue functions like mkdirs to override if they are done or not.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4458 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 06:39:55 +00:00
kshakir 7f25019f37 Inprocess functions by default now log what output files they are running for.
On -run cleaning up .done and .fail files for jobs that will be run.
Added detection to Firehose YAML generator shell script for (g)awk versions that ignore "\n" in patterns.
Removed obsolete mergeText and splitIntervals shell scripts.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4452 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 19:08:02 +00:00
kshakir db47230dd9 Wrapping ScatterGatherableFunctions with a facade instead of using slower clone library. Will require keeping Clone's facade code in sync with CommandLineFunction but runs *much* faster.
Shell invoking scripts so that even really long shell scripts make it through LSF.
Using a truncated version (up to 1000 characters) of the command line as the job name for use with bjobs.
Switched the default from re-running everything to re-running only files that need to be regenerated.  --skip_up_to_date replaced with --start_clean for those who want to regenerate everything.
Updated logging to let users know when the scatter gather generator is running, which still takes a while but is orders of magnitude faster for large lists of functions.  (40s for a 100 function graph exploding to a 2500 function graph)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4448 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 01:19:18 +00:00
kshakir ca5db821ce Added the ability to Queue to run scala functions inside the JVM. NOTE: Extend from InProcessFunction instead of CommandLineFunction to use this functionality.
Queue now submits new LSF jobs only after previous functions have completed successfully.
When the Queue process is shut down (e.g. via Control-C), it sends a bkill command for any running jobs.
Ported commands like creating directories and scatter/gather interval list to scala functions.
Updates to LSF status tracking by porting the python to internally generated bash scripts.
Temporarily disabled job name submission to LSF.  Plus side is that the full command is now available in "bjobs -w".  TODO: Put back jobName passing to LSF based on an option?
Changed BaseTest to allow scala to access paths to references.
Changed the extension generator to default the analysis name to the walker "name".

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4442 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 18:29:56 +00:00
chartl 28ac1d325e Commit for Ryan
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4433 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 19:04:10 +00:00
corin e340be34d8 upping mem limit since something was unhappy with the lower limit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4427 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 02:38:17 +00:00
kshakir bb44044ce0 Fixed re-builds of queue so that previously compiled classes are included. Fixes redundant case of "ant queue test" vs. "ant test".
Refactored temp directory utils.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4426 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 21:12:07 +00:00
chartl 7639692e5b Sigh. Fix the source of even more UserErrors in the phone home directory: make sure to gunzip the beagle files before passing them into the conversion walker...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4399 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 03:28:36 +00:00
chartl f34b4c6b82 Be smarter if the beagle output is set such that getParent() returns null. Up the memory limit.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4389 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 12:48:47 +00:00
chartl 0142047da9 And a bugfix 3 seconds later. Don't tell java to use up to 20g while telling the farm to kill the job if it tries to exceed 4g.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4388 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 02:08:47 +00:00
chartl 06970ae039 A qscript that refines genotypes with beagle and merges them into one vcf (running currently on the recent chr20 production calls).
This will be librarized soon; but if you need to do something like this, feel free to cannibalize.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4387 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-30 02:05:30 +00:00
chartl 2708e83198 For show (Queue works nicely): An analysis script that runs QC for the omni chip
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4380 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 15:04:17 +00:00
kiran 51fdf9d701 Default memory limit is now 4g (apparently necessary when testing on full 100-sample Autism_Daly dataset)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4359 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-27 05:43:08 +00:00
kiran bcc09f5d8c Simplifications: removed command-line arguments to control SNP cluster filter parameters. Infer the number of contigs to scatter indel cleaning from the contig list (which we should get rid of too). Changed the PY argument to just Y for specifying the path to the YAML file. Cleaned up command-line argument documentation. See http://iwww.broadinstitute.org/gsa/wiki/index.php/Queue-based_pipeline for a list of remaining issues.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4356 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-26 22:50:30 +00:00