131 Commits (b34e2f733fc206c67c01e725ac0e9ba4f7d99691)

Author SHA1 Message Date
kshakir b34e2f733f Removed stochasticity from IndelRealigner by random sampling using a seed based on the read list.
Updated the Queue scatter/gather for read walkers to include -L unmapped on the last scatter job when intervals aren't specified, and to map it correctly when it is explicitly set.
Simplified the build.xml/ivy.xml to fix a bug reported with "ant clean dist test" where the scalac target wasn't found.
Now building all scala code at the same time, just like all java code is compiled at the same time.
Sped up the build for everyone by uncommenting a small bit of classes so that javac/scalac will not constantly launch trying to build .class files that will never compile.
Moved some source files to their expected location so that the .java/.scala -> .class is a one-to-one match, again keeping the compilers from wasting cycles.
Used <uptodate> and <touch> to skip extracting the help text and generating the GATK Queue extensions when the source files haven't been modified.
Fixed a couple errors when the <javadoc> task is run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4963 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:03:36 +00:00
chartl 3e7802a3e0 Minor changes to a qscript and the GQ constants on PrivatePermutations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4956 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:26:21 +00:00
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. An appropriate exception is thrown if -DQS is not provided and the BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
rpoplin 20f29e4690 In the Methods development pipeline the call confidence threshold must be lowered from the default value for lowpass calling. What a bone-headed mistake!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4941 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 20:30:55 +00:00
corin 6d809321d3 Updating combine variants memory limit and dcov default for the full calling pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4907 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-24 03:06:50 +00:00
depristo 5265f943b0 phasing per sample. tmp checkin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4898 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 20:14:06 +00:00
corin e7569cfe6f Updated dbsnp version usage. Calling with 132, but still using 129 for eval to maintain consistent known/novel eval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4895 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 17:37:27 +00:00
chartl 2235245af0 PrivatePermutations generalized to compute transition counts and average probabilities (and thus was renamed). Changes in some pipelines to reflect the change. Bugfix in the batch merging pipeline (it would halt because the allele VCF for genotyping batches could become off-spec).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4894 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 15:16:15 +00:00
rpoplin 7185fcb47b Committing my notes about the methods development pipeline so we stay synced up while I'm on vacation. Cheers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4891 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 21:14:20 +00:00
chartl 80770dc032 Expanded target pipeline complete. Stop trying to be clever about scatter-gather; wait until functional SG is built-in to Q. Til then, a lazy version of the fullCallingPipeline. Seems to take a long time to generate the graph though...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4888 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 00:56:16 +00:00
kshakir 758d14a261 Checking in scripts used for testing the linear index MAX_FEATURES_PER_BIN.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4887 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 21:25:36 +00:00
chartl fc33901810 Graph structure must be known at compile time. Removing GroupIntervals until a future point where in-process-functions can predict their output based on inputs [though this is probably forever: the inputs may not exist at compile time!]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4886 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 21:22:58 +00:00
chartl 61d5daa65c EXTREME interval processing. Still undergoing testing.
+ GroupIntervals allows user-defined scattering (e.g. take an interval list file, split it into k smaller interval list files by number of lines)
+ ExpandIntervals expands the intervals, either by widening them, or allowing the definition for nearby intervals (e.g. flanks starting 1bp before and after, ending 10bp after that)
+ IntersectIntervals takes n interval lists, writes 1 interval list that is the n-way intersection of all of them



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4885 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 19:42:50 +00:00
rpoplin 4ca1da1d07 Updating the NA12878.HiSeq bam file to be the correct bam file in the methods development qscript.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4879 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 14:53:10 +00:00
rpoplin 8fac346ac1 Misc cleanup in Methods Development Qscript
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4878 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 04:24:25 +00:00
rpoplin 34ab5b4889 Turning on BAQ in Methods Development pipeline. A new dataset is added: 363 EUR samples from the November 1000G release.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4877 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-19 21:13:25 +00:00
chartl 8118a439c0 Commit for Khalid
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4876 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 22:24:18 +00:00
rpoplin 15a33545f4 Updating Methods development pipeline qscript with the bam lists for all the data sets. It is ready for people to start running with it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4875 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 22:19:14 +00:00
corin f0ab7b849a Adding a window size variable to avoid indel genotyper error
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4873 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 04:19:54 +00:00
rpoplin bdef4e775a Initial checkin of methods development pipeline qscript. It allows the methods dev team to run an overnight job which calls and recalibrates a variety of data sets and allows for an end-to-end sanity check of final results for potential changes to the methods. It isn't meant to be used by anybody quite yet, but shows the general structure and flow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4871 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:14:02 +00:00
rpoplin 095fc1922a By popular demand I'm adding the qscript I used to do the 660 bamfile 1000G calling for ASHG. It does cleaning, BAQing, and merging in 3mb chunks genome-wide then calls SNPs on those temporary bams.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4866 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 18:49:03 +00:00
depristo 32d5397c01 Experimental support for sided annotations. Currently not more/less valuable than two-tailed testing. Future experiments are needed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4864 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:08:31 +00:00
chartl 0d18bd1011 Now that addAll() is in the superclass, no longer need this definition (which, without override, prevents the script from compiling anyway)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4862 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 05:36:31 +00:00
chartl 3e75431bc8 Thanks to mark: VCFInfoToTable removed in favor of a more flexible walker. Slight change to the argument structure of the walker to make it play more nicely with Queue: the field list parsing is pushed into the command line system (e.g. the variable is exposed as a List<String> and not a String, so Queue doesn't have to join a list into a string only to have it broken out again). This also allows the user to specify -F field1 -F field2 -F field3 if he/she so desires.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4842 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 03:33:36 +00:00
chartl 2217837845 Commit for Khalid -- should be a scala version of vcf2table but for some reason the run method isn't getting called.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4841 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 00:44:15 +00:00
chartl f36861eeee One more little bugfix -- the issue was not the grep command, but instead the NFS in the awk; I changed it to ++count in the last commit, which was really responsible for the fix. Then this ultra-escaping semi-broke the grep again.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4831 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 20:36:14 +00:00
chartl d34c5640d2 Bugfix for clf version of extract samples. Due to dynamic shell creation and bsubs and whatnot, the OR pipe for grep ("a|b") needs to be super-escaped ("a\\\\\\\\|b").
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4829 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 19:06:30 +00:00
chartl f795b25c47 In-process versions of sample extraction and interval-list conversion for VCF files. Required an in-process-function branch of the queue library.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4827 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 17:36:53 +00:00
depristo e219f6a4b5 Q script to run VQSR on a whole variety of common data sets. To be used as a basis for general methods development pipeline
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4826 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 16:55:52 +00:00
chartl 7bc2049031 Updates and bug fixes to private mutations qscript and pipeline libraries. Hand filter strings are now not busted (boo to having to escape quotes); convenience method added to VariantCalling to propagate standard trait data to a given GATK command line -- should be made more scala-esque in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4824 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 04:55:13 +00:00
chartl cf75caf653 java changes:
VariantEvalWalker's logger is made public, so that variant eval modules can access it through the parent object.
DesignFileGenerator comment lists how best to bind things to it, and the feature accessor is better refined to grab the genome loc. (old change)

scala changes:

convenience addAll( List[CommandLineFunction] ) added to QScript class (and thus removed from the fCPV2)
useful command line functions added to a new library package for command line functions (these are fast simple VCF command lines)
bug fixed in ProjectManagement for the case where there's only one batch to be batch-merged (not really part of the use-case, but an edge-condition that came up during pipeline testing)
first draft of a private mutations pipeline which will be elaborated in future



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4823 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-12 05:10:45 +00:00
chartl 81290d238d Restructuring my qscripts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4821 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 20:58:45 +00:00
kshakir 56433ebf6b Switched from LSF command line wrappers to JNA wrappers around the C API. Side effects:
- bsub command line is no longer fully printed out.
- extraBsubArgs hack is now a callback function updateJobRun.
Updated FullCallingPipelineTest to reflect latest changes to fullCallingPipeline.q.
Added a pipeline that tests the UGv2 runtimes at different bam counts and memory limits.
Updated VE packages that live in oneoffs to compile to oneoffs.
Added a hack to replace the deprecated symbol environ in Mac OS X 10.5+ which is needed by LSF7 on Mac.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4816 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 04:36:06 +00:00
corin 27acede64d Removing old arguments. We'll now be running with the defaults.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4811 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 18:58:56 +00:00
chartl f8dd59c1d1 Tightening of the batch merging pipeline. Optimized to run on hour queue, so please: if you run this, crush 'hour' with it. Testing is forthcoming, but it merged 700 samples overnight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4805 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 14:36:23 +00:00
chartl f4c43f013f Due to the overhead for reading VCF files (>32g for 700 5MB VCF files), batched merging has to generate likelihoods in batches.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4796 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 18:23:54 +00:00
chartl 0944184832 Major refactoring of library and full calling pipeline (v2) structure.
Arguments to the full calling qscript (and indeed, any qscript that wants them) are now specified via the PipelineArgumentCollection

Libraries require a Pipeline object for instantiation -- eliminating their previous dependence on yaml files

Functions added to PipelineUtils to build out the proper Pipeline object from the PipelineArgumentCollection, which now contains additional arguments to specify pipeline properties (name, ref, bams, dbsnp, interval list); these are mutually exclusive with the yaml file.

Pipeline length reduced to a mere 62 lines.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4790 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 02:33:54 +00:00
corin bdc7516168 Taking out recalibrating for now, since having these files is confusing people and we've not gone to dbsnp 132 yet so cluster generation's broken with these command lines.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4786 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 22:12:09 +00:00
chartl 220fb0c44a Added a pipeline for merging batches. For now takes a file containing a list of VCFs, and a file containing a list of bams. Does not do anything smart (e.g. if you leave out some .bams or add some extra ones, you will not be warned). Heavy lifting done in (the beginnings of) a library for managing multi-batch or multi-project tasks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4771 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 07:31:59 +00:00
chartl 9f03f09cc9 Changes to V2 pipeline and libraries. AB dropped. Cleaning enabled. Project name now properly propagated to intermediate files (instead of the string repr of the object). Indel mask is now expanded prior to filtering at indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4769 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 18:55:48 +00:00
chartl 06a0fb4489 Library-ized pipeline now functions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4759 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 21:34:59 +00:00
ebanks 4413208c45 Removing unnecessary and incorrect includes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4752 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 02:06:48 +00:00
corin 6b70cde0b9 Adding a forgotten quote mark
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4729 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 16:38:27 +00:00
corin e15d18129c Adding by-sample metrics. Not sure why we didn't have this in here in the first place.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4723 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 21:36:03 +00:00
corin fe28f8da9c Removing Uniquify from main pipeline indel merge, since the pipeline isn't merging from samples with the same name anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4721 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 17:25:22 +00:00
kiran 28805d17ca Commenting out allele-balance for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4715 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-22 16:48:08 +00:00
corin 8dca5bd861 Putting the annotation back in, both to the filters and to UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4709 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 21:02:15 +00:00
corin da1fe5bb37 Removing the AB filter given that we don't have that in the VCF anymore
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4708 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 20:22:05 +00:00
hanna 302cc13735 Trying out Queue for the first time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4705 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 18:29:12 +00:00
corin 5466365575 Fixing a silly typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4680 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:16:51 +00:00