Commit Graph

136 Commits (92045ecaa6232aad977e21e8752b9ece071cab4f)

Author SHA1 Message Date
kshakir 92045ecaa6 Finally figured out what data in the LSF C API call lsb_readjobinfo is causing JNA to SIGSEGV with a strlen error:
- LSF recycles memory for C arrays, but sets a separate variable setting the size of the array to zero.
- JNA only sees the non-NULL pointer and starts to auto-access it, sometimes causing a SIGSEGV.
- In the short term neutered the jobInfoEnt structure so that this bad array is not autoRead().

QGraph updates:
- Job status is now checked in bulk every 30 seconds instead of one job at a time, even in the middle of dispatching jobs.
- If there is a hiccup (unexpected but not fatal error) during status check then the the error is ignored and status is checked again 30 seconds later.
- Jobs prefer to dispatch depth vs. breadth first.

More refactoring of SG framework separating the reusable code from the implementations.
The DistributedScatterFunction is still a work in progress and is not enabled yet. Still need to think through how Queue handles when a job dies.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5387 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 04:29:06 +00:00
kshakir a0309e7fb0 Want to get this into Ryan's hands asap: First working version of a distributed scatter function.
More refactoring to do so that other new scatter functions can be implemented very easily and annotated on walkers.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5363 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 01:40:08 +00:00
kshakir 1e68259f5c Stage one of refactoring GATK scatter functions. This intermediate stage should only be used by "rhymes with Shmoplin" until the next refactoring.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5360 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-02 18:10:19 +00:00
kshakir 4b08baabe7 Let Queue resubmit the job instead of LSF.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5342 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 03:19:50 +00:00
chartl 0723b0f44c Generalized association is now working. Output is in a horrific format. Implementation of T-testing. Improvements are to look for classes dynamically (a la VariantEval/VariantAnnotator), beautify output, and do optimizations where they exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5341 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 01:23:37 +00:00
chartl abab23350f Fix for expanded intervals in the case where the input file has overlapping intervals in it (seriously?)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5337 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:56:33 +00:00
chartl b089d35b21 Fix expand intervals to do the right thing:
- No more duplicate intervals
 - Truncation at intervals that already exist, e.g.

exists:      |--------|           |-------|
new:               |---------|
fixed:                 |-----|

note that weird instances like:

exists:           |-|        |-|                  |-|
new:           |---------------------|
fixed:                          |----|

e.g. you're truncated to the nearest interval on whatever side. In general many behaviors could happen in this instance, this is the one currently implemented.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5323 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 04:19:01 +00:00
kshakir f1f9bd6dcc Due to recent LSF hiccups put a very brief (.5-2min) retry around getting status. Can't wait too long because statuses are archived an hour after exit.
TODO: Switch to bulk status checks and add status archive lookups.
Sending SIGTERM(15) instead of SIGKILL(9) to allow for graceful termination of child process.
Printing out the name of the QScripts in the compile error text.
Added a pipelineretry -PR pass through for the MFCP and MFCPTest.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5295 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 18:59:08 +00:00
chartl 07d381ec51 BatchMerge now uses the correct UG settings, recently added by Eric
ExpandIntervals now checks that identical intervals are not created by (un)fortunately-spaced targets
VCFExtractIntervals no longer creates duplicate intervals in the case where a VCF has multiple entries at the same site



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5294 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 18:46:15 +00:00
ebanks 318035c147 Fixing up the output system of the Unified Genotyper. Deprecating the -all_bases and -genotype arguments. Adding instead the --output_mode (EMIT_VARIANTS_ONLY, EMIT_ALL_CONFIDENT_SITES, EMIT_ALL_SITES) and --genotyping_mode (DISCOVERY, GENOTYPE_GIVEN_ALLELES) arguments. UG now does the correct thing when passed alleles (bound to the 'alleles' rod) to use for genotyping; added several integration tests to cover this case. This commit will break the batched calls merging script, but Chris knows this and is ready for it...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5288 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-22 06:07:18 +00:00
kshakir dee130ad1b Gather the log files before the actual outputs and mark the log files gatherers as intermediates.
Since the outputs will only be gathered iff the logs were gathered this allows the job name to change without causing SG to re-run.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5285 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-21 22:04:35 +00:00
kshakir 290afae047 GSA-423 Better reporting for errors in QScript.script().
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5276 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 22:21:15 +00:00
kshakir a189454343 FCP only adds the expand intervals QFunction once per script instead of once per QFunction using the ExpandTargets scala trait.
Eval dbSNP's type now based on eval dbSNP instead of genotype dbSNP.
Using an external treemap instead of the JGraphT internal node set to speed up larger graph generation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5261 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 19:09:03 +00:00
kshakir d185c2961f Added pipeline for calling FCP in batches called MultiFullCallingPipeline.
Bug smashes for the MCFP:
  Synchronized access to LSF library and modifications to the QGraph.
  If values are missing from the graph with -run make sure to exit with a non-zero.
  Refactored QGraph to pre-generate a unique Int for each QNode speeding up getHashCode/equals inside the graph.
  Added jobPriority and removed jobLimitSeconds from QFunction.
  All scatter gather is by default in a single sub directory queueScatterGather.
  Moved some FCPTest into BaseTest/PipelineTest for use by MFCPTest.
  Rev'ed the 1000G bams used for validation from v1 to v2 and added code to look for the bams before running other tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5247 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 18:26:14 +00:00
chartl aeee41a755 Fix broken pipeline test (replacing PASS with .)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5242 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 02:19:18 +00:00
chartl 851b3e71f9 Major revision of the batch merge script. All sites are now used, hooks for some UG settings, no longer reliant on the pipeline management library (pipeline libs are probably going to go away -- nobody uses them)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5241 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 23:52:05 +00:00
fromer 947cc44854 Thanks to Matt for walking me through a proper version of VCF_BAM_utilities! Feel free to add to it, or use it to get the samples in a VCF file, a BAM file, or a collection of BAM files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5223 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-09 18:08:27 +00:00
kshakir 4d1cca95bb Removed deprecated getDbsnpFile.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5221 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 21:12:15 +00:00
kshakir a8ab5a5fb9 After code review with APSG, trying a patch for SIGSEGV errors which checks the LSF result codes from lsb_openjobinfo instead of checking for a null return value from lsb_readjobinfo.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5220 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 21:08:22 +00:00
fromer 8d0f1b75d5 Added queue/util/BAMutilities Object [with BAM and VCF parsing utilities], which is now used by my qscripts that robustly split runs by sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5214 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 22:17:29 +00:00
chartl 5398cf620a Bug fixes in the in process function (spoiled by python: was not closing my writers). SortByRef now works somewhat like the perl script does, rather than doing a memory-expensive sort. Adding a QTools qscript which is kinda clunky, and will be used mostly for integration tests of these IPFs, pending some better way to construct argument collections and function accessors at compile-time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5182 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 17:32:46 +00:00
chartl a9d0921529 That variable name could only lead to trouble.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5180 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 05:03:48 +00:00
chartl 9515f94242 Commiting a simple merge IPF for use with qscripts (currently use a long grep, awk, pipe command, which can be unsafe and is hard to extend). Tests for all these functions coming soon. Also, IntelliJ + intermittent VPN connection = botched repository.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5179 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 05:01:21 +00:00
kshakir e74f28ad89 If there's an LSF queue maximum time limit set and the user hasn't specified one for this job, pass on the queue defined maximum limit with the job.
Updated LibBatIntegrationTest to use proper networked temp directory accessible by local machines and nodes.
Disabling the FCPTest until the VE3 is incorporated into the fullCallingPipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5151 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 23:13:09 +00:00
chartl 4d9bc84bd5 Initial commit of in-process helper functions for making the BCM more robust
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5144 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 19:18:31 +00:00
kshakir d4f744a4d4 Checking if the interval files exist before using them to calculate the minimum scatter parts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5143 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 18:07:34 +00:00
kshakir 57353294cc Copying jobLimitSeconds to clones.
Some cleanup and refactoring around copying values to clones.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5128 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-30 06:35:53 +00:00
kshakir 2ef66af903 Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high.
Moved the BamListWriter from FCP to ListWriterFunction in the Queue core.
Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s.
Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:33:58 +00:00
kshakir ce5b11317b Moved some shutdown logic from the LSF job runner into the QGraph.
Because of Java's type erasure JobManagers must provide runtime access to the runner class to shutdown.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5076 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 20:28:54 +00:00
kshakir b3c9b9bfbe +1 file that should have been with the last checkin.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5069 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 05:31:17 +00:00
kshakir 9923e05e0a Moved MD5 utils from WalkerTest to BaseTest for use by PipelineTests.
Moved VariantEval validation from FCPTest to PipelineTest.
Cleaned up some duplicate code for writing temp files during tests.
Moved FCPTest to playground namespace to match move for FCP.q.
Added a basic HelloWorldPipelineTest for the HelloWorld QScript. 
Moved duplicated error handling from JobRunners into the FunctionEdge.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5068 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-25 04:11:49 +00:00
kshakir 6fbd18c759 Cleaning up obsolete code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5044 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 16:27:35 +00:00
kshakir 8855f080c2 For the fullCallingPipeline.q:
- Reading the refseq table from the YAML if not specified on the command line.
 - Removed obsolete -bigMemQueue now that CombineVariants runs in 4g.
 - Added a -mountDir /broad/software option to work around adpr automount issues.
 - Merged the LSF preexec used for automount into the shell script used to execute tasks.
 - Using the LSF C Library to determine when jobs are complete instead of postexec.
 - Updated queue.sh to match the changes above.
 - Updated the FCPTest to match the changes above.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5036 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 22:34:43 +00:00
rpoplin 00453919d2 VQSR now only uses the valid polymorphic sites for training and truth sensitivity calculations. Any number of tracks whose ROD binding begins with the name truth can be used as truth sensitivity tracks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5012 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 20:48:19 +00:00
kshakir 2355a55067 Added a QFunction.jobLimitSeconds for experimentation, currently only used as the equivalent of bsub -W.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5002 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 19:45:03 +00:00
kshakir 8ba3a5a43f Command lines for locally run Queue jobs no longer have to be escaped differently than bsub'ed jobs.
GSA-410 Local job runs now can run command lines longer than than 4096 on our linux machines.
When determining if the help text and Queue extensions need to be rebuilt, use the .class files not the .java so that GATK oneoffs are picked up correctly.
Added the most basic of all example QScripts for debugging, Hello World.
Minor updates to copy/pasted LSF code to reduce ant javadoc warnings by a third.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4970 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 21:07:29 +00:00
kshakir b34e2f733f Removed stochasticity from IndelRealigner by random sampling using and seed based on the read list.
Updated the Queue scatter/gather for read walkers to include -L unmapped on the last scatter job when intervals aren't specified, and to map it correctly when it is explicitly set.
Simplified the build.xml/ivy.xml to fix a bug reported with "ant clean dist test" where the scalac target wasn't found.
Now building all scala code at the same time, just like all java code is compiled at the same time.
Sped up the build for everyone by uncommenting a small bit of classes so that javac/scalac will not constantly launch trying to build .class files that will never compile.
Moved some source files to their expected location so that the .java/.scala -> .class is a one-to-one match, again keeping the compilers from wasting cycles.
Used <uptodate> and <touch> to skip extracting the help text and generating the GATK Queue extensions when the source files haven't been modified.
Fixed a couple errors when the <javadoc> task is run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4963 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:03:36 +00:00
chartl 2235245af0 PrivatePermutations generalized to compute transition counts and average probabilities (and thus was renamed). Changes in some pipelines to reflect the change. Bugfix in the batch merging pipeline (it would halt because the allele VCF for genotyping batches could become off-spec).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4894 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 15:16:15 +00:00
chartl fc33901810 Graph structure must be known at compile time. Removing GroupIntervals until a future point where in-process-functions can predict their output based on inputs [though this is probably forever: the inputs may not exist at compile time!]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4886 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 21:22:58 +00:00
chartl 61d5daa65c EXTREME interval processing. Still undergoing testing.
+ GroupIntervals allows user-defined scattering (e.g. take an interval list file, split it into k smaller interval list files by number of lines)
 + ExpandIntervals expands the intervals, either by widening them, or allowing the definition for nearby intervals (e.g. flanks starting 1bp before and after, ending 10bp after that)
 + IntersectIntervals takes n interval lists, writes 1 interval list that is the n-way intersection of all of them



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4885 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 19:42:50 +00:00
chartl 8118a439c0 Commit for Khalid
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4876 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 22:24:18 +00:00
chartl 6db86ae0c6 PAC sets the refseq file to optional.
Additional ipf for expanding interval lists. Either there's heavy latency on the toro -- or it's not working properly yet (e.g. the system print in the same scope as the file print outputs a line, but no file shows up on the system)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4874 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 19:04:52 +00:00
kshakir 3a6d1dbcef Fixed a class initializer crash on shutdown when the graph has nothing to run.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4860 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-16 18:56:55 +00:00
chartl 3e75431bc8 Thanks to mark: VCFInfoToTable removed in favor of a more flexible walker. Slight change to the argument structure of the walker to make it play more nicely with Queue: the field list parsing is pushed into the command line system (e.g. the variable is exposed as a List<String> and not a String, so Queue doesn't have to join a list into a string only to have it broken out again. This also allows the user to specify -F field1 -F field2 -F field3 if he/she so desires.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4842 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 03:33:36 +00:00
chartl 2217837845 Commit for Khalid -- should be a scala version of vcf2table but for some reason the run method isn't getting called.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4841 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 00:44:15 +00:00
chartl f36861eeee One more little bfix -- the issue was not the grep command, but instead the NFS in the awk; i changed it to ++count in the last commit which was really responsible for the fix. Then this ultra-escaping semi-broke teh grep again.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4831 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 20:36:14 +00:00
chartl d34c5640d2 Bugfix for clf version of extract samples. Due to dynamic shell creation and bsubs and whatnot, the OR pipe for grep ("a|b") needs to be super-escaped ("a\\\\\\\\|b").
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4829 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 19:06:30 +00:00
chartl f795b25c47 In-process versions of sample extraction and interval-list conversion for VCF files. Required an in-process-function branch of the queue library.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4827 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 17:36:53 +00:00
chartl 7bc2049031 Updates and bug fixes to private mutations qscript and pipeline libraries. Hand filter strings are now not busted (boo to having to escape quotes); convenience method added to VariantCalling to propagate standard trait data to a given GATK command line -- should be made more scala-esque in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4824 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 04:55:13 +00:00
chartl cf75caf653 java changes:
VariantEvalWalker's logger is made public, so that variant eval modules can access it through the parent object.
 DesignFileGenerator comment lists how best to bind things to it, and the feature accessor is better refined to grab the genome loc. (old change)

scala changes:

convenience addAll( List[CommandLineFunction] ) added to QScript class (and thus removed from the fCPV2)
useful command line functions added to a new library package for command line functions (these are fast simple VCF command lines)
bug fixed in ProjectManagement for the class where there's only one batch to be batch-merged (not really part of the use-case, but an edge-condition that came up during pipeline testing)
first draft of a private mutations pipeline which will be elaborated in future



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4823 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-12 05:10:45 +00:00