Commit Graph

328 Commits (36db9bdcd55aa9668d7b5c2bed620a2eb96f21bc)

Author SHA1 Message Date
depristo abc7d1aef9 BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site.
ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation.  Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field.  This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples). 

Added a new VariantsToBeagleUnphased walker that writes out a marker-driven, hard-call, unphased genotypes file suitable for imputing missing genotypes with a reference panel with beagle.  Can optionally hold back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power.  The bootstrap sites can be written to a separate VCF for assessment as well.

Finally, my general Queue script for creating and evaluating reference panels from VCF files.  Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel.  Lots of options for exploring the impact of the VQS likelihoods, multiple VCFs for constructing the reference panel, as well as the fraction of sites left out in assessing the panel's power.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 03:08:38 +00:00
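The hold-back idea in the commit above (mark a fraction of sites as missing, then compare imputed calls against the withheld truth) can be sketched roughly as follows. This is an illustrative Python sketch, not the walker's code: the function names, the dict-based genotype representation, and the fixed seed are all assumptions made for the example.

```python
import random


def hold_back_sites(genotypes, fraction, seed=0):
    """Mask a random fraction of sites as missing.

    genotypes: dict mapping a site key to a hard-call genotype (hypothetical layout).
    Returns (masked, held_out): masked has None at withheld sites,
    held_out keeps the true genotypes of those sites for later assessment.
    """
    rng = random.Random(seed)          # seeded so the split is reproducible
    sites = sorted(genotypes)
    n_hold = int(round(fraction * len(sites)))
    held = set(rng.sample(sites, n_hold))
    masked = {s: (None if s in held else g) for s, g in genotypes.items()}
    held_out = {s: genotypes[s] for s in held}
    return masked, held_out


def concordance(imputed, held_out):
    """Fraction of withheld sites whose imputed genotype matches the truth."""
    if not held_out:
        return None
    matches = sum(1 for s, truth in held_out.items() if imputed.get(s) == truth)
    return matches / len(held_out)
```

Writing `held_out` to its own file mirrors the option of emitting the bootstrap sites as a separate VCF.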
carneiro 55e5971b3b this is a oneoff script to clean the papuans and test TargetCreator and IndelRealigner with scatter gathering.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5457 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 17:09:53 +00:00
rpoplin 9c413fbc9e not useful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5450 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 22:47:55 +00:00
carneiro 42f70d9e07 join all per-lane Bams before doing target realigning and indel cleaning.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5435 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 16:11:03 +00:00
depristo d01d4fdeb5 Optimized version of produce beagle tool, along with experimental (hidden) support for combining likelihoods depending on estimated false positive rate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5430 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-12 02:06:28 +00:00
fromer 0b45de14ed Some minor updates to fully utilize the functionality of reduceByInterval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5411 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-09 20:38:08 +00:00
depristo bf2e02f472 Generic, easy-to-use variant evaluation Queue script that tests indel and SNP call sets against standard evaluation data sets for sensitivity and specificity
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5391 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 18:03:29 +00:00
depristo 5c979633f0 Due to a problem in the way that dynamic type selection works, I've added an explicit (temporary) ability to restrict VE to specific variant types (SNPs, INDELs, etc), so that calculations will work when a site has a SNP in dbSNP but is called as an indel, causing the SNP site to mysteriously disappear from the comp track, a huge problem for validation report. VEU updated to allow both dynamic type (old) and just returning everything in the track.
Also, created a standard Queue script that calculates a suite of standard indel and SNP assessment results.  Will be the basis for a general evaluation Queue script with standardized data files for SNPs and Indels.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5385 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 19:31:12 +00:00
chartl a40a8006b5 Added in unit tests for the statistics calculated by the test runner; and bug-fixes to the calculations; so we have some assurance that the statistics coming out the back-end are correct.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5380 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 16:54:02 +00:00
chartl 9ca1dd5d62 Miscellaneous changes:
- RefMetaDataTracker: grabbing variant contexts given a prefix (not sure where else this was implemented, if someone can show me I'll remove it)
 - VCFUtils: grabbing VCF headers given a prefix 
 - MathUtils: Useful functions for calculating statistics on collections of Numbers
 - VariantAnnotator: Made isUniqueHeaderLine a public static method -- maybe this should go into a different class. Not sure.
 - Associations: PluginManager now used to propagate classes, implementations for Z,T,U tests, slight alteration to format to make the objects stored
      in the window optionally different from those returned by whatever statistic is run across the window
Added:
 - MannWhitneyU. Started to fix up WilcoxonRankSum but there are comments in there questioning the validity of some of the code, and I'm sure that
    it's actually doing a U test. This implementation includes the direct calculation of p-values for small sample sizes, and a uniform approximation
    for when one of the sample sets is small, and the other large. Unit tests to follow.
 - BootstrapCallsMerger: takes n VCFs which have been called on the same samples; merges them together while averaging the annotations
 - BootstrapCalls.q: qscript for testing the effectiveness of bootstrap low-pass calling on the exome
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5372 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 22:43:36 +00:00
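For readers unfamiliar with the statistic behind the MannWhitneyU class mentioned above, the U statistic itself can be computed by direct pair counting. The sketch below is a generic Python illustration under that textbook definition. It is not the committed Java implementation and does not cover the p-value calculations the message describes; the function name is hypothetical.

```python
def mann_whitney_u(xs, ys):
    """Return (U_x, U_y) by counting pairwise wins.

    U_x is the number of (x, y) pairs with x > y, with ties counted as 0.5.
    U_x + U_y always equals len(xs) * len(ys).
    """
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    u_y = len(xs) * len(ys) - u_x
    return u_x, u_y
```

Pair counting is quadratic in the sample sizes; rank-based formulas give the same value more cheaply for large inputs.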
carneiro 0daa65b9ef quick and dirty 'close your eyes' solution to run the papuans over the weekend. Will be properly fixed soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5370 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 21:42:22 +00:00
carneiro 8ab6eee1cf gold standard creates its own tranches and vcf files now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5347 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 17:48:40 +00:00
chartl 0723b0f44c Generalized association is now working. Output is in a horrific format. Implementation of T-testing. Improvements are to look for classes dynamically (a la VariantEval/VariantAnnotator), beautify output, and do optimizations where they exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5341 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 01:23:37 +00:00
rpoplin ce34a8a918 New hidden option in VQSR to not parse the genotypes of the incoming training data. Updated VQSR training in methods development pipeline to be more in line with best practices.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5340 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 23:19:51 +00:00
carneiro c7a51f0de7 fixed 1kg pilot dindel calls vcf file and combined all vcfs into one master dindel file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5335 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:04:58 +00:00
depristo 146756de79 Class name to reflect actual file name. manySampleUGPerformance now operates on 1000 samples!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5326 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 23:36:04 +00:00
chartl b089d35b21 Fix expand intervals to do the right thing:
- No more duplicate intervals
 - Truncation at intervals that already exist, e.g.

exists:      |--------|           |-------|
new:               |---------|
fixed:                 |-----|

note that weird instances like:

exists:           |-|        |-|                  |-|
new:           |---------------------|
fixed:                          |----|

i.e. you're truncated to the nearest interval on whatever side. Many behaviors are possible in this case; this is the one currently implemented.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5323 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 04:19:01 +00:00
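One possible reading of the two diagrams in the commit above is: remove every part of the new interval that an existing interval already covers, and keep the uncovered piece at the new interval's right end. The Python sketch below implements only that reading. The function name, the inclusive `(start, end)` tuples, and the choice of the rightmost piece are assumptions for illustration, not the behavior of the committed code.

```python
def truncate_new_interval(new, existing):
    """Trim `new` against `existing` intervals (all inclusive (start, end) tuples).

    Returns the rightmost piece of `new` not covered by any existing interval,
    or None if `new` is fully covered (e.g. an exact duplicate).
    """
    start, end = new
    pieces = []
    cursor = start  # first position of `new` not yet known to be covered
    for ex_start, ex_end in sorted(existing):
        if ex_end < cursor or ex_start > end:
            continue  # no overlap with the part of `new` still under consideration
        if ex_start > cursor:
            pieces.append((cursor, ex_start - 1))  # gap before this existing interval
        cursor = max(cursor, ex_end + 1)
    if cursor <= end:
        pieces.append((cursor, end))
    return pieces[-1] if pieces else None
```

With coordinates chosen to mimic the first diagram, `truncate_new_interval((15, 30), [(10, 20), (40, 50)])` keeps `(21, 30)`; an exact duplicate of an existing interval yields `None`.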
carneiro fd5d1f9cfc minor cosmetic changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5322 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 21:56:35 +00:00
carneiro 81414a21dd dpp: back to using 4gb memory assuming all is right with IndelRealigner now.
mdcp: Some class structural changes due to the inclusion of indel calls. ApplyCut now chooses the tranche differently for each dataset.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5319 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 19:21:02 +00:00
kshakir 3e0a722672 MFCP waits for other pipelines to finish by using the previous log file of one pipeline as virtual input to the next pipeline.
Using the name of the yaml in the log file name instead of each writing to "queue.out" so that two yamls can run from the same directory without creating cycles in the graph.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5318 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 17:51:01 +00:00
carneiro 6db3210387 the data processing pipeline needs more memory...
directory updates in the methods pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5305 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 17:22:58 +00:00
carneiro 897a333aba Methods Development Pipeline now has the option of calling indels with the -indels parameter. Also updated some databases; the new NA12878 HiSeq hg19 that Tim just funneled to us is updated and called.
Small fixes on the data processing pipeline


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5304 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 17:12:55 +00:00
rpoplin 255cc246a2 Change in Methods development pipeline: dbsnp130 can't be used for anything, changed it to dbsnp129. Optimization for HaplotypeScore and the to-be-committed ReadPosRankSumTest in AlignmentUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5301 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 16:09:03 +00:00
chartl 97e1a5262e -ct x no longer includes coverage in the previous bin
BatchMerge - additional support for indels (can't just test the alternate allele when it's an extended event, must also specify that you want to use the dindel model when you actually test the allele)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5300 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 15:52:04 +00:00
kshakir f1f9bd6dcc Due to recent LSF hiccups put a very brief (.5-2min) retry around getting status. Can't wait too long because statuses are archived an hour after exit.
TODO: Switch to bulk status checks and add status archive lookups.
Sending SIGTERM(15) instead of SIGKILL(9) to allow for graceful termination of child process.
Printing out the name of the QScripts in the compile error text.
Added a pipelineretry -PR pass through for the MFCP and MFCPTest.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5295 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 18:59:08 +00:00
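The brief, bounded retry around status checks described in the commit above might look roughly like this. It is a generic Python sketch, not Queue's Scala code or the LSF API: `get_status` is a hypothetical callable, `OSError` stands in for whatever transient failure the scheduler raises, and the 30-second first delay and 2-minute cap are assumptions loosely based on the ".5-2min" in the message.

```python
import time


def get_status_with_retry(get_status, max_wait_seconds=120.0,
                          first_delay=30.0, sleep=time.sleep):
    """Call get_status(), retrying transient failures for a bounded total time.

    The total time spent sleeping never exceeds max_wait_seconds, so the caller
    does not wait past the point where the scheduler archives old statuses.
    """
    waited = 0.0
    delay = first_delay
    while True:
        try:
            return get_status()
        except OSError:
            if waited >= max_wait_seconds:
                raise  # budget exhausted: surface the original error
            pause = min(delay, max_wait_seconds - waited)
            sleep(pause)
            waited += pause
            delay *= 2  # back off, but stay within the remaining budget
```

Passing `sleep` as a parameter keeps the function testable without real waiting.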
chartl 07d381ec51 BatchMerge now uses the correct UG settings, recently added by Eric
ExpandIntervals now checks that identical intervals are not created by (un)fortunately-spaced targets
VCFExtractIntervals no longer creates duplicate intervals in the case where a VCF has multiple entries at the same site



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5294 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 18:46:15 +00:00
carneiro 2a48ec1307 now only accepts intervals files if the user specifically requests to report bams at interval only.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5291 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 16:49:58 +00:00
carneiro ecfb51bcd8 Few organizational changes, queue output is now categorized and hidden. Also changed NA12878.Wex to dbsnp 129.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5290 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-22 22:49:38 +00:00
carneiro 8ea71fd294 minor dataset changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5289 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-22 20:18:10 +00:00
carneiro c61dd2f09f data processing pipeline now has on the fly bam indexing (powered by Matt) some new parameters, Indel Cleaning with constrain movement and fixMates is gone.
setting up methods development pipeline for some cosmetic changes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5277 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 23:13:54 +00:00
depristo d97ed3e080 Comments for Mauricio
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5275 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 16:58:34 +00:00
carneiro acad3ada06 changed baq to calculate_as_necessary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5270 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 23:50:46 +00:00
carneiro 7f9ca6b28a full data processing pipeline, now deleting intermediate files and performing both phases (per lane and combined) of the processing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5269 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 23:34:00 +00:00
kiran 4f83151c4e Evaluates within standard target and expanded target separately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5268 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 23:04:24 +00:00
kshakir 860b172ef1 Defaulting the MFCP to run without a tear script.
Added a missing virtual output for the inner FCP, so that Queue can tell a run of the FCP is dot-done.
Enabled the MFCPTest for the first time, running without the tear script.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5264 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 21:13:14 +00:00
kshakir a189454343 FCP only adds the expand intervals QFunction once per script instead of once per QFunction using the ExpandTargets scala trait.
Eval dbSNP's type now based on eval dbSNP instead of genotype dbSNP.
Using an external treemap instead of the JGraphT internal node set to speed up larger graph generation.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5261 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 19:09:03 +00:00
carneiro 497e9ab83b too hasty... cleaning up debug messages ;)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5257 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 02:11:03 +00:00
carneiro b4da843c49 now processes either a single bam file or a list of bam files in parallel.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5256 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 02:07:22 +00:00
carneiro 50c870cfce Data Processing Pipeline: local indel realignment, mark duplicates and BQSR. Done.
Pacbio pipeline: now all pacbio bams have baq annotated in so running UG is uber fast.

Methods pipeline: minor cosmetic changes.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5253 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-16 17:22:30 +00:00
kiran c0a4af3809 Expands targets by 50-bp on both sides when the expandIntervals argument is greater than 0.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5251 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-16 14:47:52 +00:00
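A minimal Python sketch of the target expansion described in the commit above, assuming 1-based inclusive coordinates. The function name and the clamp at position 1 are assumptions; clamping at the contig end is deliberately left out because it would need the contig length.

```python
def expand_target(start, end, expand_intervals, flank=50):
    """Pad a target by `flank` bp on each side when expand_intervals > 0.

    Coordinates are assumed 1-based and inclusive; the start is clamped at 1.
    """
    if expand_intervals <= 0:
        return (start, end)  # expansion disabled: leave the target untouched
    return (max(1, start - flank), end + flank)
```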
carneiro 6d3b878dde data processing pipeline script already does:
. Local Indel Realignment 
. Mark Duplicates

will do:
. Base Quality Score Recalibration (soon)

it's working with a single BAM for testing, but will work with a list of bam files.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5250 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 21:49:05 +00:00
corin d2efea6003 This is a draft of the improved and prettified pipeline. It may not yet compile, but Kiran is taking over adding a few more things as I finish up other tasks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5248 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 19:35:00 +00:00
kshakir d185c2961f Added pipeline for calling FCP in batches called MultiFullCallingPipeline.
Bug smashes for the MFCP:
  Synchronized access to LSF library and modifications to the QGraph.
  If values are missing from the graph with -run make sure to exit with a non-zero.
  Refactored QGraph to pre-generate a unique Int for each QNode speeding up getHashCode/equals inside the graph.
  Added jobPriority and removed jobLimitSeconds from QFunction.
  All scatter gather is by default in a single sub directory queueScatterGather.
  Moved some FCPTest into BaseTest/PipelineTest for use by MFCPTest.
  Rev'ed the 1000G bams used for validation from v1 to v2 and added code to look for the bams before running other tests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5247 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 18:26:14 +00:00
carneiro 87e19a17ae small updates to the variant eval part of the pipeline, some updates to the pacbio specific pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5244 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-15 16:19:07 +00:00
chartl 851b3e71f9 Major revision of the batch merge script. All sites are now used, hooks for some UG settings, no longer reliant on the pipeline management library (pipeline libs are probably going to go away -- nobody uses them)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5241 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 23:52:05 +00:00
fromer d6e3f2eba6 Added GC content calculator for CNV data
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5240 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-14 22:29:55 +00:00
carneiro 5f10fffa47 merge intervals now prints a sorted list in the end.
added the ccs datasets to the pbCalling pipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5233 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 20:57:59 +00:00
carneiro 50c2fa3c3a this -1 made ALL the difference in the world. Minor bug fix.
Regular updates to the pbCalling pipeline.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5232 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 19:25:09 +00:00
fromer cdf53188d6 Updated DoC to work with scatter-gather; and, also manually implemented scatter-gather by sample above the scatter-gather by interval. Thanks to Khalid for his support!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5231 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-11 19:14:42 +00:00
carneiro c630701a76 Following Ryan's suggestion, I am moving the Methods Development Calling pipeline to the Core.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5226 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-10 17:36:05 +00:00
carneiro 9c2c5efe35 a modified version of the Methods Development calling pipeline made to work with pacbio data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5225 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-10 16:06:50 +00:00
fromer 947cc44854 Thanks to Matt for walking me through a proper version of VCF_BAM_utilities! Feel free to add to it, or use it to get the samples in a VCF file, a BAM file, or a collection of BAM files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5223 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-09 18:08:27 +00:00
kshakir 4d1cca95bb Removed deprecated getDbsnpFile.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5221 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 21:12:15 +00:00
carneiro e5cfc6ae74 NA12878 hg19 dataset was added to the methods pipeline. (and I am running it)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5217 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-08 16:17:46 +00:00
fromer 8d0f1b75d5 Added queue/util/BAMutilities Object [with BAM and VCF parsing utilities], which is now used by my qscripts that robustly split runs by sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5214 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 22:17:29 +00:00
kshakir 8040998c15 Renamed the pipeline yaml dbsnpFile to genotypeDbsnp, and added an evalDbsnp.
Added a genotypeDbsnpType and evalDbsnpType to check the extensions for .vcf or .rod.
Moved renaming of "recalibrated" bams to "cleaned" from sed to yaml generation template (see diff for more info).
Renamed fCP.q to FCP.q.
Though it's still disabled until VariantEval is updated, added changes above to the FCPTest.
Removed refseq table from the queue.sh wrapper script. Only specified in the yaml.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5213 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 22:01:09 +00:00
fromer 3c1a026c94 Updated script to properly bin DoC values so that down-sampling corresponds to range of DoC values obtainable
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5208 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-07 16:47:55 +00:00
depristo c4707631e2 MethodsDevelopmentPipeline is now the test bed for large scale AWS_S3 logging. Can be disabled from command line if this is necessary
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5203 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-06 17:03:45 +00:00
fromer 8b8b4fced1 Removed explicit memoryLimit, so that memLimit given on the command-line will NOT be ignored...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5202 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-06 01:55:17 +00:00
depristo fe4aa58d35 Removing unused class
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5197 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 22:22:28 +00:00
fromer 4cdc974c5f Preliminary Qscript to run DoC for the purpose of CNV detection
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5194 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 21:25:59 +00:00
corin cd6ace1b47 Includes UG version of indel genotyping rather than IGV2
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5191 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-04 20:25:46 +00:00
carneiro 358a400474 made ApplyVariantCut a default part of the pipeline, added the -noCut option if you don't want to use it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5189 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 19:29:36 +00:00
carneiro 7af003666d added optional argument -cut to apply the variant cut to the ts recalibrated vcf.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5183 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 17:34:40 +00:00
chartl 5398cf620a Bug fixes in the in process function (spoiled by python: was not closing my writers). SortByRef now works somewhat like the perl script does, rather than doing a memory-expensive sort. Adding a QTools qscript which is kinda clunky, and will be used mostly for integration tests of these IPFs, pending some better way to construct argument collections and function accessors at compile-time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5182 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-03 17:32:46 +00:00
carneiro cf15819db5 updated to work with the new VariantEval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5176 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 17:46:07 +00:00
rpoplin 47357b726e Fixing import GenotypeCalculationModel since it doesn't exist anymore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5175 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 15:39:43 +00:00
fromer 7605f0e6c1 Corrected input/output definitions for Queue
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5173 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 07:39:00 +00:00
fromer 3839fd1a25 Updated phasing pipeline to properly read samples from VCF and BAM files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5172 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-02 07:16:05 +00:00
fromer 798955b006 After discussing with Mark, revert to "Master merging" of phase information from VCFs. This has the advantage of creating minimal phased VCFs from RBP, from which phase info is merged into the original "master VCF". Also, updated Genotype.sameGenotype() to be simpler and NOT REVERSE the ignorePhase flag in comparing Allele lists/sets
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5167 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 19:50:15 +00:00
fromer a89400b20c Simple implementation to retrieve relevant BAM files for each sample
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5152 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-01 00:03:03 +00:00
fromer f258363cfc Minor bug fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5150 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 22:29:28 +00:00
fromer 742bd44728 Changed output file to be user-defined
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5149 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 22:15:26 +00:00
fromer 6c99dc4dab Take (partial) ownership of phasing 1000G chr20 calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5147 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-31 21:49:41 +00:00
kshakir 23578b7402 Pipeline tests will only start from scratch after "ant clean", making it faster to debug downstream issues when re-running "ant pipelinetest -Dpipeline.run=run".
Updated the FCP, the test, and the ADPR to handle an issue with the ADPR locating the yaml generated by the FCPTest.
Does not solve the ADPR error: Error in dimnames(x) <- dn : length of 'dimnames' [1] not equal to array extent


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5126 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-29 19:44:03 +00:00
kshakir 2ef66af903 Moved the maximum number of intervals check from FCP to the Queue core so that scatter gather will no longer blow up if you specify a scatter count that is too high.
Moved the BamListWriter from FCP to ListWriterFunction in the Queue core.
Added an ExampleCountLoci QScript along with an example pipeline integration test which checks MD5s.
Added a few more utility methods to PipelineTest including a currentGATK variable that points to the GATK jar.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5121 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 23:33:58 +00:00
corin b25d131481 updated to work with the new tearsheet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5113 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-28 18:49:11 +00:00
carneiro cae4b9b0de quick update with the correct CEU trio bam file and its final location.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5098 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-27 19:17:19 +00:00
ebanks 68729045ca Always best to use the left-aligned version of the dbsnp vcf
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5091 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-26 20:21:50 +00:00
delangel fa0c476b82 Script for calling indels in all phase 1 samples - VQSR part still needs work but raw calling is done
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5052 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-22 14:07:10 +00:00
carneiro a0731eaa81 updated NA12878 Trio gold standard data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5048 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 18:48:31 +00:00
depristo 94b64ec54a Moving scala script into analysis directory
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5047 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 18:42:18 +00:00
depristo b45566760e intermediate checkin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5045 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 18:39:25 +00:00
rpoplin b6497c404f Moving Phase1Calling qscript over to using the cleaned, pre-BAQed bams
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5039 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-21 02:41:20 +00:00
carneiro fc73569d62 Added NA12878 Trio dataset to the pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5037 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 23:15:33 +00:00
kshakir 8855f080c2 For the fullCallingPipeline.q:
- Reading the refseq table from the YAML if not specified on the command line.
 - Removed obsolete -bigMemQueue now that CombineVariants runs in 4g.
 - Added a -mountDir /broad/software option to work around adpr automount issues.
 - Merged the LSF preexec used for automount into the shell script used to execute tasks.
 - Using the LSF C Library to determine when jobs are complete instead of postexec.
 - Updated queue.sh to match the changes above.
 - Updated the FCPTest to match the changes above.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5036 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 22:34:43 +00:00
depristo 41c8552d0a Added implements HasGenomeLocation to all relevant classes. It's now possible to write generic code for working with objects that support the getLocation() function in HasGenomeLocation. Please, if you have an object that has a location, implement this interface and start using / writing generic functions to sort, compare, etc. these objects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5031 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 12:54:03 +00:00
kshakir 4d611e53e7 Passing the ADPR R script to FCPTest.
Changed the FCP.q to use an InProcessFunction to work around the -runDir issue GSA-420.
Tested the FCPTest using the following dotkits and "ant clean pipelinetest -Dpipeline.run=run":
  - R-2.11
  - Oracle-full-client
  - .cx-oracle-5.0.2-python-2.6.5-oracle-full-client-11.1


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5029 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-20 06:08:45 +00:00
corin 50fcebb0c4 Incorporates tearsheet and plot production with database access into standard pipeline. Note that the following dotkit packages must be run before the adpr will be correctly generated:
R-2.10, 
Oracle-full-client, 
cx-oracle-5.0.2-python-2.6.5-oracle-full-client-11.1

This also removes the unused titv argument


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5024 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 20:48:42 +00:00
rpoplin 55eb0387ac Another relevant qscript. I use this one to do thousands of variant recalibration jobs to search for optimal parameters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5019 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 18:17:32 +00:00
chartl a463dbcda1 Refactoring the qscript directory; oneoffs, playground, and core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5017 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 15:23:40 +00:00
rpoplin 7db9601c9d Checking in the 1000G phase1 cleaning and calling scripts for posterity's sake, but also to show everyone what the current best practices for VQSR training look like.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5015 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-19 14:32:52 +00:00
rpoplin 457c59e737 Use the sites-only HapMap files in the Methods development pipeline
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5013 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-18 20:50:09 +00:00
carneiro 35a4f1e366 .Added VariantEval as an optional step in the pipeline.
.Lifted to HapMap 3.3
.Lifted to dbSNP 132 where possible.
.Added the CEU-Trio WEx(hg19) dataset 
.Added some options to the pipeline

You can now use : 

-dataset WEX
-dataset HiSeq
...

to choose which datasets to run through the pipeline.

You can now run without BAQ and indel mask:

-noBAQ 
-noMASK

Choose not to run the gold standard comparison analysis:

-skipGoldStandard

Activate the VariantEval walker analysis on the Recalibrated vcf:

-eval

The default behavior is to run exactly like it used to, so this version shouldn't change the way you used to use the pipeline.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5004 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-14 21:55:02 +00:00
carneiro c4f9b262e5 removing the tech dev pipeline script from the repository to keep the methods development pipeline as the reference script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4992 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 18:15:55 +00:00
carneiro 9e93091e9a -baqGOP now takes phred scaled scores instead of probabilities in the command line.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4982 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-13 00:06:38 +00:00
kshakir 8ba3a5a43f Command lines for locally run Queue jobs no longer have to be escaped differently than bsub'ed jobs.
GSA-410 Local job runs now can run command lines longer than 4096 on our linux machines.
When determining if the help text and Queue extensions need to be rebuilt, use the .class files not the .java so that GATK oneoffs are picked up correctly.
Added the most basic of all example QScripts for debugging, Hello World.
Minor updates to copy/pasted LSF code to reduce ant javadoc warnings by a third.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4970 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-10 21:07:29 +00:00
kshakir b34e2f733f Removed stochasticity from IndelRealigner by random sampling using a seed based on the read list.
Updated the Queue scatter/gather for read walkers to include -L unmapped on the last scatter job when intervals aren't specified, and to map it correctly when it is explicitly set.
Simplified the build.xml/ivy.xml to fix a bug reported with "ant clean dist test" where the scalac target wasn't found.
Now building all scala code at the same time, just like all java code is compiled at the same time.
Sped up the build for everyone by uncommenting a small bit of classes so that javac/scalac will not constantly launch trying to build .class files that will never compile.
Moved some source files to their expected location so that the .java/.scala -> .class is a one-to-one match, again keeping the compilers from wasting cycles.
Used <uptodate> and <touch> to skip extracting the help text and generating the GATK Queue extensions when the source files haven't been modified.
Fixed a couple errors when the <javadoc> task is run.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4963 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 22:03:36 +00:00
chartl 3e7802a3e0 Minor changes to a qscript and the GQ constants on PrivatePermutations
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4956 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-07 18:26:21 +00:00
carneiro 5e9a8f9cb3 Implemented a new argument (-DQS --defaultQualityScore) that allows GATK to deal with BAM files missing quality scores. If a value is specified, all reads are filled with the default quality score. Appropriate exception is thrown if -DQS is not provided and BAM file doesn't have quality scores for every base.
Adding the first version of the techdev pipeline (tdPipeline)




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4943 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 22:25:08 +00:00
rpoplin 20f29e4690 In the Methods development pipeline the call confidence threshold must be lowered from the default value for lowpass calling. What a bone-headed mistake!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4941 348d0f76-0448-11de-a6fe-93d51630548a
2011-01-05 20:30:55 +00:00
corin 6d809321d3 Updating combine variants memory limit and dcov default for the full calling pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4907 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-24 03:06:50 +00:00
depristo 5265f943b0 phasing per sample. tmp checkin
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4898 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 20:14:06 +00:00
corin e7569cfe6f Updated dbsnp version usage. Calling with 132, but still using 129 for eval to maintain consistent known/novel eval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4895 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 17:37:27 +00:00
chartl 2235245af0 PrivatePermutations generalized to compute transition counts and average probabilities (and thus was renamed). Changes in some pipelines to reflect the change. Bugfix in the batch merging pipeline (it would halt because the allele VCF for genotyping batches could become off-spec).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4894 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-22 15:16:15 +00:00
rpoplin 7185fcb47b Committing my notes about the methods development pipeline so we stay synced up while I'm on vacation. Cheers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4891 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 21:14:20 +00:00
chartl 80770dc032 Expanded target pipeline complete. Stop trying to be clever about scatter-gather; wait until functional SG is built-in to Q. Til then, a lazy version of the fullCallingPipeline. Seems to take a long time to generate the graph though...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4888 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-21 00:56:16 +00:00
kshakir 758d14a261 Checking in scripts used for testing the linear index MAX_FEATURES_PER_BIN.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4887 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 21:25:36 +00:00
chartl fc33901810 Graph structure must be known at compile time. Removing GroupIntervals until a future point where in-process-functions can predict their output based on inputs [though this is probably forever: the inputs may not exist at compile time!]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4886 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 21:22:58 +00:00
chartl 61d5daa65c EXTREME interval processing. Still undergoing testing.
+ GroupIntervals allows user-defined scattering (e.g. take an interval list file, split it into k smaller interval list files by number of lines)
+ ExpandIntervals expands the intervals, either by widening them, or allowing the definition for nearby intervals (e.g. flanks starting 1bp before and after, ending 10bp after that)
+ IntersectIntervals takes n interval lists, writes 1 interval list that is the n-way intersection of all of them



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4885 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 19:42:50 +00:00
rpoplin 4ca1da1d07 Updating the NA12878.HiSeq bam file to be the correct bam file in the methods development qscript.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4879 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 14:53:10 +00:00
rpoplin 8fac346ac1 Misc cleanup in Methods Development Qscript
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4878 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-20 04:24:25 +00:00
rpoplin 34ab5b4889 Turning on BAQ in Methods Development pipeline. A new dataset is added: 363 EUR samples from the November 1000G release.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4877 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-19 21:13:25 +00:00
chartl 8118a439c0 Commit for Khalid
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4876 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 22:24:18 +00:00
rpoplin 15a33545f4 Updating Methods development pipeline qscript with the bam lists for all the data sets. It is ready for people to start running with it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4875 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 22:19:14 +00:00
corin f0ab7b849a Adding a window size variable to avoid indel genotyper error
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4873 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-18 04:19:54 +00:00
rpoplin bdef4e775a Initial checkin of methods development pipeline qscript. It allows the methods dev team to run an overnight job which calls and recalibrates a variety of data sets and allows for an end-to-end sanity check of final results for potential changes to the methods. It isn't meant to be used by anybody quite yet, but shows the general structure and flow.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4871 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 22:14:02 +00:00
rpoplin 095fc1922a By popular demand I'm adding the qscript I used to do the 660 bamfile 1000G calling for ASHG. It does cleaning, BAQing, and merging in 3mb chunks genome-wide then calls SNPs on those temporary bams.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4866 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 18:49:03 +00:00
depristo 32d5397c01 Experimental support for sided annotations. Currently not more/less valuable than two-tailed testing. Future experiments are needed
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4864 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 15:08:31 +00:00
chartl 0d18bd1011 Now that addAll() is in the superclass, no longer need this definition (which, without override, prevents the script from compiling anyway)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4862 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-17 05:36:31 +00:00
chartl 3e75431bc8 Thanks to mark: VCFInfoToTable removed in favor of a more flexible walker. Slight change to the argument structure of the walker to make it play more nicely with Queue: the field list parsing is pushed into the command line system (e.g. the variable is exposed as a List<String> and not a String, so Queue doesn't have to join a list into a string only to have it broken out again). This also allows the user to specify -F field1 -F field2 -F field3 if he/she so desires.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4842 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 03:33:36 +00:00
chartl 2217837845 Commit for Khalid -- should be a scala version of vcf2table but for some reason the run method isn't getting called.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4841 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-15 00:44:15 +00:00
chartl f36861eeee One more little bugfix -- the issue was not the grep command, but instead the NFS in the awk; I changed it to ++count in the last commit which was really responsible for the fix. Then this ultra-escaping semi-broke the grep again.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4831 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 20:36:14 +00:00
chartl d34c5640d2 Bugfix for clf version of extract samples. Due to dynamic shell creation and bsubs and whatnot, the OR pipe for grep ("a|b") needs to be super-escaped ("a\\\\\\\\|b").
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4829 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 19:06:30 +00:00
chartl f795b25c47 In-process versions of sample extraction and interval-list conversion for VCF files. Required an in-process-function branch of the queue library.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4827 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 17:36:53 +00:00
depristo e219f6a4b5 Q script to run VQSR on a whole variety of common data sets. To be used as a basis for general methods development pipeline
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4826 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 16:55:52 +00:00
chartl 7bc2049031 Updates and bug fixes to private mutations qscript and pipeline libraries. Hand filter strings are now not busted (boo to having to escape quotes); convenience method added to VariantCalling to propagate standard trait data to a given GATK command line -- should be made more scala-esque in the future.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4824 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-13 04:55:13 +00:00
chartl cf75caf653 java changes:
VariantEvalWalker's logger is made public, so that variant eval modules can access it through the parent object.
 DesignFileGenerator comment lists how best to bind things to it, and the feature accessor is better refined to grab the genome loc. (old change)

scala changes:

convenience addAll( List[CommandLineFunction] ) added to QScript class (and thus removed from the fCPV2)
useful command line functions added to a new library package for command line functions (these are fast simple VCF command lines)
bug fixed in ProjectManagement for the case where there's only one batch to be batch-merged (not really part of the use-case, but an edge-condition that came up during pipeline testing)
first draft of a private mutations pipeline which will be elaborated in future



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4823 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-12 05:10:45 +00:00
chartl 81290d238d Restructuring my qscripts
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4821 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-11 20:58:45 +00:00
kshakir 56433ebf6b Switched from LSF command line wrappers to JNA wrappers around the C API. Side effects:
- bsub command line is no longer fully printed out.
- extraBsubArgs hack is now a callback function updateJobRun.
Updated FullCallingPipelineTest to reflect latest changes to fullCallingPipeline.q.
Added a pipeline that tests the UGv2 runtimes at different bam counts and memory limits.
Updated VE packages that live in oneoffs to compile to oneoffs.
Added a hack to replace the deprecated symbol environ in Mac OS X 10.5+ which is needed by LSF7 on Mac.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4816 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-10 04:36:06 +00:00
corin 27acede64d Removing old arguments. We'll now be running with the defaults.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4811 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-09 18:58:56 +00:00
chartl f8dd59c1d1 Tightening of the batch merging pipeline. Optimized to run on hour queue, so please: if you run this, crush 'hour' with it. Testing is forthcoming, but it merged 700 samples overnight.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4805 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-08 14:36:23 +00:00
chartl f4c43f013f Due to the overhead for reading VCF files (>32g for 700 5MB VCF files), batched merging has to generate likelihoods in batches.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4796 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-06 18:23:54 +00:00
chartl 0944184832 Major refactoring of library and full calling pipeline (v2) structure.
Arguments to the full calling qscript (and indeed, any qscript that wants them) are now specified via the PipelineArgumentCollection

Libraries require a Pipeline object for instantiation -- eliminating their previous dependence on yaml files

Functions added to PipelineUtils to build out the proper Pipeline object from the PipelineArgumentCollection, which now contains 
additional arguments to specify pipeline properties (name, ref, bams, dbsnp, interval list); which are mutually exclusive with
the yaml file.

Pipeline length reduced to a mere 62 lines.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4790 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-05 02:33:54 +00:00
corin bdc7516168 Taking out recalibrating for now, since having these files is confusing people and we've not gone to dbsnp 132 yet so cluster generation's broken with these command lines.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4786 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-03 22:12:09 +00:00
chartl 220fb0c44a Added a pipeline for merging batches. For now takes a file containing a list of VCFs, and a file containing a list of bams. Does not do anything smart (e.g. if you leave out some .bams or add some extra ones, you will not be warned). Heavy lifting done in (the beginnings of) a library for managing multi-batch or multi-project tasks.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4771 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-02 07:31:59 +00:00
chartl 9f03f09cc9 Changes to V2 pipeline and libraries. AB dropped. Cleaning enabled. Project name now properly propagated to intermediate files (instead of the string repr of the object). Indel mask is now expanded prior to filtering at indels.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4769 348d0f76-0448-11de-a6fe-93d51630548a
2010-12-01 18:55:48 +00:00
chartl 06a0fb4489 Library-ized pipeline now functions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4759 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 21:34:59 +00:00
ebanks 4413208c45 Removing unnecessary and incorrect includes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4752 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-30 02:06:48 +00:00
corin 6b70cde0b9 Adding a forgotten quote mark
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4729 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-24 16:38:27 +00:00
corin e15d18129c Adding by sample metrics. Not sure why we didn't have this in here in the first place
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4723 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 21:36:03 +00:00
corin fe28f8da9c Removing Uniquify from main pipeline indel merge, since the pipeline isn't merging from samples with the same name anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4721 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-23 17:25:22 +00:00
kiran 28805d17ca Commenting out allele-balance for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4715 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-22 16:48:08 +00:00
corin 8dca5bd861 Putting the annotation back in, both to the filters and to UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4709 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 21:02:15 +00:00
corin da1fe5bb37 Removing the AB filter given that we don't have that in the VCF anymore
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4708 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 20:22:05 +00:00
hanna 302cc13735 Trying out Queue for the first time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4705 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-18 18:29:12 +00:00
corin 5466365575 Fixing a silly typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4680 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:16:51 +00:00
corin a64f693b20 Updated pipeline script to include dbSnp for UG
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4679 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-15 18:09:47 +00:00
kshakir 801c562909 Now actually checking in the integration test mentioned in the prior commit: compiles the full calling pipeline.
Removed QScript usages of VariantRecalibrator's -reportDatFile, --report_dat_file


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4668 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-14 04:27:10 +00:00
kshakir 673fa841a4 Updated PluginManager so that during testing Queue can dynamically compile and load separately multiple class directories into the same class loader.
Removed obsolete usages of PackageUtils with updated PluginManager.
Ported Queue interval utilities written in scala over to Sting's java IntervalUtils.
Added a very basic integration test to ensure that the fullCallingPipeline.q compiles.
Added options to specify the temporary directories without having to use -Djava.io.tmpdir (useful during the above integration test).
While adding tempDir added options to specify the run directory from the command line, for example "-runDir v1".
Upgraded to scala 2.8.1 and updated calls to deprecated functions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4661 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 20:14:28 +00:00