346 Commits (4288ca1c247cfb0db09194e7ea9d22e809a8f6cb)

Author SHA1 Message Date
rpoplin 1d11e88899 Adding another example call set to GATK resource bundle for use in VQSR wiki tutorial
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5774 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-05 21:16:33 +00:00
fromer 04f156d86b Removed extraneous import
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5772 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-05 18:51:03 +00:00
rpoplin 825682f58c oops, putting the script back into a sensible state
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5765 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 20:17:05 +00:00
rpoplin b5ab2274f6 Committing the base qscript I used to make the Phase1 Project Consensus. Does per-population cleaning and simplifyBAM, and then per-analysis-panel calling with genotype given alleles. Combines info fields using the panel with max AC.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5764 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 20:13:26 +00:00
kshakir 08f0509a5c Disabling the queue/pipeline package by default so that scala code can build. If it's not going to be fixed the package should be removed. If it is going to be fixed this patch to build.xml should be reverted.
Also added the old model of indel calling to the FCP.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5749 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-04 12:17:33 +00:00
carneiro f35d955490 recalibrates a dataset splitting between good and bad regions for comparison (used to be named justRecalibrate)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5747 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:43:09 +00:00
carneiro 9f2a8033ff just recalibrates now recalibrates one sample, fully, not splitting intervals (naming makes more sense)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5746 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:42:23 +00:00
carneiro c2f8536e02 removing old GATK options
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5745 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:40:39 +00:00
carneiro 8bb92160b5 Script to identify mendelian violations in the CEU Trio and follow up with supposedly incorrect SNP calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5744 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:19:42 +00:00
carneiro e2b9227d8d script to test BQSR on good/bad regions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5743 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 21:16:37 +00:00
rpoplin 4bbce42861 Renaming ContrastiveRecalibrator --> VariantRecalibrator in preparation for move to core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5733 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-03 18:12:47 +00:00
rpoplin 3224bbe750 New visualization output for VQSR. It creates the R script file on the fly and then runs Rscript on it. Adding 1000G Project consensus code. First pass of having VQSR work with missing data by marginalizing over the missing dimension for that data point (thanks Chris and Bob for ideas). Updated math functions to use apache math commons instead of approximations from wikipedia. New parameters available for the priors based on further reading in Bishop and looking at the new visualizations. Updated integration test to use more modern files. Updated MDCP to use new best practices w.r.t. annotations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5723 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 19:14:42 +00:00
carneiro a93a9ac663 adding gold standard (full coverage) to the variant eval analysis output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5721 348d0f76-0448-11de-a6fe-93d51630548a
2011-05-02 16:29:11 +00:00
carneiro 2384e23274 Added the capability of running count covariates only on a given interval.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5717 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 21:30:14 +00:00
carneiro 3868a7e778 Oneoff project to downsample, bootstrap and call snps to test sensitivity/specificity of downsampled coverage in WEX projects.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5713 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-29 19:17:30 +00:00
carneiro f04cc4321f fixed a bug when the pipeline was used on a single bam.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5708 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-28 17:19:22 +00:00
depristo 122d5845d3 GATK Resource bundle, latest version (now with b37 -> b36 support). Oneoff scala script that assesses chip coverage of calls
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5703 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 22:01:36 +00:00
kshakir 6b1b4931e7 Added FCP VE stratifications for Filter, FunctionalClass, and Stratification as requested by Corin.
Feeding FCP UG the bam list instead of individual bams to cut scatter gather time from O(m^100) as measured by Chris to O(m^1).
Fixed NPE when eval values aren't found in PipelineTests.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5694 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-27 02:29:56 +00:00
kshakir 00b57c751b Added missing ".0".
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5682 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-22 21:50:07 +00:00
chartl 5b9a8555cd Queue graph time is currently of O(n^m) where n = num jobs, m = num unique base files. This script therefore was running in order 1200^16, which I don't think would finish before the heat death of the universe. For now, push down the number of files to 1 and gather them outside of Queue, once I've fixed up scatter-gather in core, outputs can be uncommented.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5674 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 12:56:25 +00:00
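For scale, the estimate in the commit above can be checked with a line of arithmetic. This is only an illustration of the O(n^m) figure quoted in the message (n = 1200 jobs, m = 16 unique base files); the variable names are made up here and nothing below is taken from the Queue code.

```python
# Size of the O(n^m) estimate quoted above: n = 1200 jobs, m = 16 base files.
n_jobs = 1200
n_base_files = 16

graph_steps = n_jobs ** n_base_files   # exact integer arithmetic in Python
digits = len(str(graph_steps))         # number of decimal digits in the result

print(f"{n_jobs}^{n_base_files} has {digits} decimal digits")
```

The result is on the order of 1.8e49, which is consistent with the message's remark that the script would not finish in any practical time.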
corin 9f006be425 Updates Omni path and removes a typo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5673 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-21 04:17:13 +00:00
kshakir 8619f49d20 Added a utility method to retrieve the contig lengths for WG chunking.
Added a rudimentary GATKReportParser for parsing VE3 results.
Re-enabled the FCPTest using VE3, the GATKRP, and the PicardAggregationUtils.
The tag type for .rod files is DBSNP, not ROD.
More explicit return types on implicit methods.
Added null checks for implicit string to/from file conversions.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5668 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 19:22:21 +00:00
depristo d8b8f857f3 V2 -- now working -- of a core walker that creates the standard GATK resource bundle
See https://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle

The bundle lives locally in /humgen/gsa-hpprojects/GATK/bundle/current

Use the following command to create the bundle:

java -Djava.io.tmpdir=/broad/shptmp/depristo/tmp -jar dist/Queue.jar -S scala/qscript/core/GATKResourcesBundle.scala --gatkjarfile dist/GenomeAnalysisTK.jar -bsub -jobQueue gsa -svn 5660 $* 

Annoyingly, it must be run in the trunk directory, and requires an explicit svn version number to create the directory.  It also must be run in two stages manually.  First, the local bundle is created, and then with the -phase2 argument all of the files in the local bundle are compressed and pushed to the FTP server.  I'm likely going to shift most of my processes over to using this location for data file access, especially for b37 data sets.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5665 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-20 12:48:47 +00:00
carneiro d35c7d1029 - minor changes to the 'justclean' script to handle the Trio Cleaning.
- fixing a bug on single ended BWA option of the data processing pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5662 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-19 16:35:24 +00:00
depristo 541c9109b3 V1 of GATK Resource Bundling system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5659 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-18 19:23:45 +00:00
chartl 23fac043d9 Fix the outputs so the proper files are gathered (not automatic due to multiplexer)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5654 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 23:55:12 +00:00
chartl e5ef8388fc BatchMerge - AlleleVCF --> AllelesVCF, this (combined with Eric's fix) will solve James P.'s forum issue.
After viewing results on real case/control data from RAW -- it's really working quite well. ReadIndels, however, needs to use a T-test rather than a U-test, especially in deep coverage (at indel sites, the reads with indels will have mostly the same number of CIGAR indel elements -- one -- which doesn't really play nicely with the UTest when sample sets are large). Modified ReadsLargeInsertSize to be a two-way test (e.g. ReadsLarge and ReadsSmall). BaseQualityScore also suffers from the same issue as read indels, so switching over to a T-test in that case as well.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5653 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-15 22:03:16 +00:00
chartl 104d5515fe Huh, somehow this change didn't make it through last time
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5639 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 17:09:37 +00:00
chartl 47fa7e2227 + Added override to extractFileEntries
+ UG now doesn't care whether it's given SNPs or indels to genotype, it will do the right thing -- so remove the option to specify which GM user wants

+ Max mismatches argument removed

integration test will follow



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5638 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 15:13:35 +00:00
kshakir 475ad1259d Put a band-aid on the FCP by switching use of DINDEL to INDEL and explicitly running UG the old way with just indels and just snps.
Switched YAML parser to new Broad parser which will additionally update picard cleaned bams to the latest version if the project and sample are specified.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5634 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-14 02:22:31 +00:00
corin 9ee30ce594 Whole genome pipeline script. currently chunks, cleans, calls, merges, selects and filters indels, recalibrates, and evals.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5627 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-13 16:59:48 +00:00
chartl 8125b8b901 Old changes to the exome VQSR search.
SGA updated to include new proportion-based insert size test.

Major fix for dichotomization test: MathUtils now optionally ignores NaN values for sums, averages, and variances. In the future this feature can be pushed back into the AssociationContext object itself (e.g. no data? no entry), but it's kept like this for transparency for now.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5618 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-12 23:00:50 +00:00
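The optional NaN handling described in the commit above can be sketched as follows. This is a minimal illustration, not the MathUtils code: the function names and the `ignore_nan` flag are hypothetical, and the use of the sample (n - 1) denominator for the variance is an assumption.

```python
import math

def _kept(values, ignore_nan):
    """Return the values to aggregate, dropping NaNs only when asked to."""
    return [v for v in values if not (ignore_nan and math.isnan(v))]

def nan_aware_sum(values, ignore_nan=False):
    # With ignore_nan=False a single NaN propagates into the result.
    return sum(_kept(values, ignore_nan))

def nan_aware_average(values, ignore_nan=False):
    kept = _kept(values, ignore_nan)
    # No usable data: report NaN rather than dividing by zero.
    return sum(kept) / len(kept) if kept else math.nan

def nan_aware_variance(values, ignore_nan=False):
    kept = _kept(values, ignore_nan)
    if len(kept) < 2:
        return math.nan
    mean = sum(kept) / len(kept)
    # Sample variance (n - 1 denominator); an assumption of this sketch.
    return sum((v - mean) ** 2 for v in kept) / (len(kept) - 1)
```

Keeping the flag explicit at each call, rather than filtering the data upstream, mirrors the message's point that the behaviour is left visible for now.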
kshakir 4b7c3af763 When /etc/mailname is unreadable fall back to the hostname.
Implicit conversions for String to/from File.
Small updates to the example QScripts.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5614 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-11 20:22:44 +00:00
rpoplin 05ad6ecf72 bug fix in MDCP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5613 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-11 18:27:47 +00:00
chartl b81228fec1 Minor bug fixes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5603 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 17:30:40 +00:00
hanna 437db28937 Incorporating Khalid's feedback.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5602 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-08 16:22:49 +00:00
chartl cc58e19621 This is now running. Expect results in a few weeks when the ~7k jobs have percolated through the week queue. Pray gsa1 doesn't go down.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5593 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 21:12:59 +00:00
chartl 6a26957b65 Bug squashing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5592 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 20:11:28 +00:00
chartl a1b7d28375 Initial VQSR full search script
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5591 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-07 20:03:48 +00:00
rpoplin febb883511 updates to MDCP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5586 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-06 19:44:46 +00:00
hanna 798fb6a7a2 First draft of a script to measure performance of read walkers when merging
dynamically.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5570 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-04 15:35:14 +00:00
carneiro b722ebf244 quick help/comments updates to match the wikipage.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5569 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-04 12:55:55 +00:00
depristo 349661b958 Renamed StratifyAlignmentContext to AlignmentContextUtils, and StatiefyContextType to ReadOrientation. Also, went through the system and deleted all references to second bases. That ship passed long ago.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5563 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-03 15:35:09 +00:00
rpoplin 40a25af58e Bug fixes in MDCP
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5561 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-03 00:04:38 +00:00
depristo f2c4356a40 Minor usability improvements to the standard eval script.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5551 348d0f76-0448-11de-a6fe-93d51630548a
2011-04-01 17:36:50 +00:00
carneiro 0a772688fe implementation of the Gatherer class for CountCovariates, which makes it now scatter/gatherable. Kudos to the @Gather annotation Khalid just introduced!
QuickCCTest is my test script for the gatherer.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5547 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 21:15:21 +00:00
carneiro 20344a27b4 Quick updates to the data processing pipeline after successfully cleaning the papuans. It now scatter gathers everything and runs in the hour queue for low pass data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5546 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-31 21:13:33 +00:00
carneiro 5d26c66769 Count Covariates is almost scatter-gatherable now!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5537 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 22:25:33 +00:00
rpoplin 5ddc0e464a Under guidance from Matt added ability to use key-value tags with ROD binding command line arguments, so now one can say -B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0 hapmap.vcf and get the tags in a walker. Look at ContrastiveRecalibrator for an example of how to use the new ReferenceOrderedDataSource.getTags(). Removed references to FDR in tranches since we are only using truth sensitivity. Finally fixed long standing bug where tranche filters weren't set appropriately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5536 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 21:04:09 +00:00
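The key-value tag syntax in the commit above can be illustrated with a small parser. This is a sketch only: `parse_rod_binding` is a hypothetical helper written for this note, not the GATK argument parser, and it assumes the first two comma-separated fields are the binding name and type, with every remaining field a `key=value` tag.

```python
def parse_rod_binding(flag, path):
    """Split a '-B:name,type,key=value,...' flag into its parts.

    Hypothetical helper for illustration; tag values are kept as strings.
    """
    prefix = "-B:"
    if not flag.startswith(prefix):
        raise ValueError(f"not a ROD binding flag: {flag!r}")
    fields = flag[len(prefix):].split(",")
    if len(fields) < 2:
        raise ValueError("expected at least a name and a type")
    name, rod_type, *tag_fields = fields
    tags = {}
    for field in tag_fields:
        key, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"tag without '=': {field!r}")
        tags[key] = value
    return {"name": name, "type": rod_type, "tags": tags, "path": path}

binding = parse_rod_binding(
    "-B:hapmap,VCF,known=false,training=true,truth=true,prior=12.0",
    "hapmap.vcf",
)
print(binding["tags"])
```

A walker would then read values such as `prior` from the tag map and convert them to the type it needs.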
carneiro c3f70cc5cb DPP: Updated after some tests with BWA. Still needs more testing.
MDP: Removed ApplyVariantCut as it's no longer necessary with VQSR2.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5534 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-29 18:22:09 +00:00
carneiro ccdc021207 Added BWA (option) to the data processing pipeline. Lots of testing still happening...
little fix to the calling pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5528 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-28 20:17:57 +00:00
depristo cdb0bde952 Bringing script up to date
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5526 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 20:49:07 +00:00
depristo bae0b6cba8 A script for playing with BEAGLE refinement parameters. Supports construction of reference panels from NGS data sets with varying niteration and calibration curve parameters, as well as imputing missing genotypes in a VCF with this reference panel, and comparison to a deeply sequenced individual.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5523 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-27 12:44:25 +00:00
chartl fe7f45ee2e First pass at recalibrating associations, with optional data whitening. Modification to the TableCodec so it can natively read bedgraph files (just needed to add an extra header marker: "track").
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5515 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-25 19:35:39 +00:00
kshakir e47513f043 Minor updates to match the wiki documentation.
Upper cased the PartitionType enum values.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5506 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 20:22:23 +00:00
carneiro 1281c842ad quick updates to conform with the new picard bam function structure
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5505 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 16:58:37 +00:00
kshakir f3e94ef2be Walkers can now specify a class extending from Gatherer to merge custom output formats. Add @Gather(MyGatherer.class) to the walker @Output.
JavaCommandLineFunctions can now specify the classpath+mainclass as an alternative to specifying a path to an executable jar.
JCLF by default pass on the current classpath and only require the mainclass be specified by the developer extending the JCLF, relieving the QScript author from having to explicitly specify the jar.
Like the Picard MergeSamFiles, GATK engine by default is now run from the current classpath. The GATK can still be overridden via .jarFile or .javaClasspath.
Walkers from the GATK package are now also embedded into the Queue package.
Updated AnalyzeCovariates to make it easier to guess the main class, AnalyzeCovariates instead of AnalyzeCovariatesCLP.
Removed the GATK jar argument from the example QScripts.
Removed one of the most FAQ when getting started with Scala/Queue, the use of Option[_] in QScripts:
1) Fixed mistaken assumption with java enums. In java enums can be null so they don't need nullable wrappers.
2) Added syntactic sugar for Nullable primitives to the QScript trait. Any variable defined as Option[Int] can just be assigned an Int value or None, ex: myFunc.memoryLimit = 3
Removed other unused code.
Re-fixed dry run function ordering.
Re-ordered the QCommandline companion object so that IntelliJ doesn't complain about missing main methods.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5504 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-24 14:03:51 +00:00
chartl cd90fdeca1 Right. The issue was not setting the scatter/gather classes appropriately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5501 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 20:08:53 +00:00
chartl 3c1bf40a45 QScript for scatter-gathering regional association (not quite as easy as using the built-in extension, due to the multiplexer). Currently does not work due to something I'm missing re: scatter gather class, this commit is an interim one.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5500 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-23 19:42:29 +00:00
carneiro 3414bccb46 documentation changes to agree with the wiki
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5494 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 21:48:49 +00:00
carneiro 28149e5c5e GenotypeAndValidate version 2, ready to be used.
- now it differentiates between confident REF calls and not confident calls.
- you can now use a BAM file as the truth set. 
- output is much clearer now

dataProcessingPipeline version 2, ready to be used.
- All the processing is now done at the sample level
- Reads the input bam file headers to combine all lanes of the same sample.
- Cleaning is now scattered/gathered. Intelligently breaks the data down into as many intervals as possible, given the dataset.
- Outputs one processed bam file per sample (and a .list file with all processed files listed)
- Much faster, low pass (read Papuans) can run in the hour queue.




git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5493 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 20:18:02 +00:00
carneiro 748787c509 helper script to the papuan processing... minor updates
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5489 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-22 14:11:02 +00:00
kshakir f6d4b0aaf5 Using an embedded version of Picard for merging un-indexed bam files after scatter/gather instead of requiring the QScripts to specify the picard JAR. May do this for the GATK jar too.
Fixed initialization of pending counts when using -startFromScratch so the count doesn't start at zero and end at -<#njobs>.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5483 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 18:20:01 +00:00
carneiro 1198a90ac7 cosmetic change.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5481 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-21 15:46:04 +00:00
carneiro 96628457cb pacbio calling pipeline also using VQSR2 now, minor updates on the other pipelines to get the papuans through.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5479 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 22:06:52 +00:00
carneiro 4e449905d1 methods development pipeline now sports VQSR2.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5478 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 22:00:46 +00:00
carneiro c9442e4b21 now merging bam files per sample and processing according to cleaning options.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5477 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 21:31:29 +00:00
carneiro 18fac5112c first step towards the new sample based processing pipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5471 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 19:25:15 +00:00
depristo abc7d1aef9 BeagleOutputToVCF now accepts an option to keep monomorphic sites. This is useful to genotype a single sample, where having AC=0 just means that the sample is hom-ref at the site.
ProduceBeagleInputWalker can optionally emit a beagle markers file, necessary to use the beagled reference panel for imputation.  Also supports the VQSR calibration curve idea that a site can be flagged as a certain FP, based on the VQSLOD field.  This allows us to have both continuous quality in the refinement of sites as well as hard filtering at some threshold so we don't end up with lots of sites with all 1/3 1/3 1/3 likelihoods for all samples (i.e., a definite FP site where we don't know anything about the samples). 

Added a new VariantsToBeagleUnphased walker that writes out a marker-driven, hard-call unphased genotypes file suitable for imputing missing genotypes with a reference panel with beagle.  Can optionally keep back a fraction of sites, marked as missing in the genotypes file, for assessment of imputation accuracy and power.  The bootstrap sites can be written to a separate VCF for assessment as well.

Finally, my general Queue script for creating and evaluating reference panels from VCF files.  Supports explicitly genotyping a BAM file at each panel SNP site, for assessment of imputation accuracy of a reference panel.  Lots of options for exploring the impact of the VQS likelihoods, multiple VCFs for constructing the reference panel, as well as fraction of sites left out in assessing the panel's power.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5467 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-18 03:08:38 +00:00
carneiro 55e5971b3b this is a oneoff script to clean the papuans and test TargetCreator and IndelRealigner with scatter gathering.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5457 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-17 17:09:53 +00:00
rpoplin 9c413fbc9e not useful
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5450 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-15 22:47:55 +00:00
carneiro 42f70d9e07 join all per-lane Bams before doing target realigning and indel cleaning.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5435 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-14 16:11:03 +00:00
depristo d01d4fdeb5 Optimized version of produce beagle tool, along with experimental (hidden) support for combining likelihoods depending on estimated false positive rate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5430 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-12 02:06:28 +00:00
fromer 0b45de14ed Some minor updates to fully utilize the functionality of reduceByInterval
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5411 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-09 20:38:08 +00:00
depristo bf2e02f472 Generic, easy-to-use variant evaluation Queue script that tests indel and SNP call sets against standard evaluation data sets for sensitivity and specificity
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5391 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-07 18:03:29 +00:00
depristo 5c979633f0 Due to a problem in the way that dynamic type selection works, I've added an explicit (temporary) ability to restrict VE to specific variant types (SNPs, INDELs, etc), so that calculations will work when a site has a SNP in dbSNP but is called as an indel, causing the SNP site to mysteriously disappear from the comp track, a huge problem for validation report. VEU updated to allow both dynamic type (old) and just returning everything in the track.
Also, created a standard Queue script that calculates a suite of standard indel and SNP assessment results.  Will be the basis for a general evaluation Queue script with standardized data files for SNPs and Indels.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5385 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 19:31:12 +00:00
chartl a40a8006b5 Added in unit tests for the statistics calculated by the test runner; and bug-fixes to the calculations; so we have some assurance that the statistics coming out the back-end are correct.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5380 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-06 16:54:02 +00:00
chartl 9ca1dd5d62 Miscellaneous changes:
- RefMetaDataTracker: grabbing variant contexts given a prefix (not sure where else this was implemented, if someone can show me I'll remove it)
 - VCFUtils: grabbing VCF headers given a prefix 
 - MathUtils: Useful functions for calculating statistics on collections of Numbers
 - VariantAnnotator: Made isUniqueHeaderLine a public static method -- maybe this should go into a different class. Not sure.
 - Associations: PluginManager now used to propagate classes, implementations for Z,T,U tests, slight alteration to format to make the objects stored
      in the window optionally different from those returned by whatever statistic is run across the window
Added:
 - MannWhitneyU. Started to fix up WilcoxonRankSum but there are comments in there questioning the validity of some of the code, and I'm sure that
    it's actually doing a U test. This implementation includes the direct calculation of p-values for small sample sizes, and a uniform approximation
    for when one of the sample sets is small, and the other large. Unit tests to follow.
 - BootstrapCallsMerger: takes n VCFs which have been called on the same samples; merges them together while averaging the annotations
 - BootstrapCalls.q: qscript for testing the effectiveness of bootstrap low-pass calling on the exome
 


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5372 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 22:43:36 +00:00
carneiro 0daa65b9ef quick and dirty 'close your eyes' solution to run the papuans over the weekend. Will be properly fixed soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5370 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-03 21:42:22 +00:00
carneiro 8ab6eee1cf gold standard creates its own tranches and vcf files now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5347 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 17:48:40 +00:00
chartl 0723b0f44c Generalized association is now working. Output is in a horrific format. Implementation of T-testing. Improvements are to look for classes dynamically (a la VariantEval/VariantAnnotator), beautify output, and do optimizations where they exist.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5341 348d0f76-0448-11de-a6fe-93d51630548a
2011-03-01 01:23:37 +00:00
rpoplin ce34a8a918 New hidden option in VQSR to not parse the genotypes of the incoming training data. Updated VQSR training in methods development pipeline to be more in line with best practices.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5340 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 23:19:51 +00:00
carneiro c7a51f0de7 fixed 1kg pilot dindel calls vcf file and combined all vcfs into one master dindel file.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5335 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-28 19:04:58 +00:00
depristo 146756de79 Class name to reflect actual file name. manySampleUGPerformance now operates on 1000 samples!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5326 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 23:36:04 +00:00
chartl b089d35b21 Fix expand intervals to do the right thing:
- No more duplicate intervals
 - Truncation at intervals that already exist, e.g.

exists:      |--------|           |-------|
new:               |---------|
fixed:                 |-----|

note that weird instances like:

exists:           |-|        |-|                  |-|
new:           |---------------------|
fixed:                          |----|

e.g. you're truncated to the nearest interval on whatever side. In general many behaviors could happen in this instance, this is the one currently implemented.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5323 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-26 04:19:01 +00:00
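The truncation shown in the two diagrams of the commit above can be sketched as follows. This is one reading of those diagrams, not the committed code: the hypothetical `truncate_new_interval` keeps only the part of the new interval that lies to the right of every existing interval it overlaps, using inclusive integer coordinates. Cases the diagrams do not show, such as an existing interval that starts inside the new one and runs past its end, are simply dropped here, which may differ from the actual behaviour.

```python
def truncate_new_interval(existing, new):
    """Trim `new` so that it starts after any existing interval it overlaps.

    `existing` is a list of (start, end) pairs and `new` is a (start, end)
    pair, all inclusive. Returns the trimmed pair, or None if nothing is left.
    Hypothetical helper reproducing only the two diagrams in the message.
    """
    start, end = new
    for ex_start, ex_end in existing:
        overlaps = ex_start <= end and ex_end >= start
        if overlaps:
            # Push the start just past the overlapping existing interval.
            start = max(start, ex_end + 1)
    return (start, end) if start <= end else None

# Coordinates read off the character columns of the two diagrams.
first = truncate_new_interval([(13, 22), (34, 42)], (19, 29))
second = truncate_new_interval([(18, 20), (29, 31), (50, 52)], (15, 37))
print(first, second)
```

In both cases the surviving piece begins one position after the last existing interval that the new interval touches, matching the "fixed" rows above.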
carneiro fd5d1f9cfc minor cosmetic changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5322 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 21:56:35 +00:00
carneiro 81414a21dd dpp: back to using 4gb memory assuming all is right with IndelRealigner now.
mdcp: Some class structural changes due to the inclusion of indel calls. ApplyCut now chooses the tranche differently for each dataset.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5319 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 19:21:02 +00:00
kshakir 3e0a722672 MFCP waits for other pipelines to finish by using the previous log file of one pipeline as virtual input to the next pipeline.
Using the name of the yaml in the log file name instead of each writing to "queue.out" so that two yamls can run from the same directory without creating cycles in the graph.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5318 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-25 17:51:01 +00:00
carneiro 6db3210387 the data processing pipeline needs more memory...
directory updates in the methods pipeline.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5305 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 17:22:58 +00:00
carneiro 897a333aba Methods Development Pipeline now has the option of calling indels with the -indels parameter. Also updated some databases; the new NA12878 HiSeq hg19 that Tim just funneled to us is updated and called.
Small fixes on the data processing pipeline


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5304 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 17:12:55 +00:00
rpoplin 255cc246a2 Change in Methods Development Pipeline: dbsnp130 can't be used for anything, changed it to dbsnp129. Optimization for HaplotypeScore and the to-be-committed ReadPosRankSumTest in AlignmentUtils
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5301 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 16:09:03 +00:00
chartl 97e1a5262e -ct x no longer includes coverage in the previous bin
BatchMerge - additional support for indels (can't just test the alternate allele when it's an extended event, must also specify that you want to use the dindel model when you actually test the allele)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5300 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-24 15:52:04 +00:00
kshakir f1f9bd6dcc Due to recent LSF hiccups, put a very brief (0.5-2 min) retry around getting status. Can't wait too long because statuses are archived an hour after exit.
TODO: Switch to bulk status checks and add status archive lookups.
Sending SIGTERM(15) instead of SIGKILL(9) to allow for graceful termination of child process.
Printing out the name of the QScripts in the compile error text.
Added a pipelineretry -PR pass through for the MFCP and MFCPTest.
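
A brief, bounded retry of the kind described above might look like the following sketch. It is not the Queue implementation: `retry_status`, its parameters, and the doubling delay are hypothetical, and the 30 second first delay and 120 second cap are assumptions that only mirror the 0.5-2 min range mentioned in the message.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_status(
    get_status: Callable[[], T],
    max_wait_seconds: float = 120.0,     # assumed cap, mirrors the "2 min" above
    first_delay_seconds: float = 30.0,   # assumed first pause, mirrors ".5 min"
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `get_status` until it succeeds or the total wait budget is spent."""
    waited = 0.0
    delay = first_delay_seconds
    while True:
        try:
            return get_status()
        except Exception:
            # A real scheduler client would catch a narrower error type.
            if waited >= max_wait_seconds:
                raise  # budget spent: let the caller see the last failure
            pause = min(delay, max_wait_seconds - waited)
            sleep(pause)
            waited += pause
            delay *= 2  # back off, but never beyond the total budget
```

Keeping the total wait small matters here because, as noted above, statuses are archived an hour after exit, so a long retry loop could outlive the record it is polling for.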


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5295 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 18:59:08 +00:00
chartl 07d381ec51 BatchMerge now uses the correct UG settings, recently added by Eric
ExpandIntervals now checks that identical intervals are not created by (un)fortunately-spaced targets
VCFExtractIntervals no longer creates duplicate intervals in the case where a VCF has multiple entries at the same site



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5294 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 18:46:15 +00:00
carneiro 2a48ec1307 now only accepts intervals files if the user specifically requests to report bams at interval only.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5291 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-23 16:49:58 +00:00
carneiro ecfb51bcd8 Few organizational changes, queue output is now categorized and hidden. Also changed NA12878.Wex to dbsnp 129.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5290 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-22 22:49:38 +00:00
carneiro 8ea71fd294 minor dataset changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5289 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-22 20:18:10 +00:00
carneiro c61dd2f09f data processing pipeline now has on-the-fly bam indexing (powered by Matt), some new parameters, Indel Cleaning with constrain movement, and fixMates is gone.
setting up methods development pipeline for some cosmetic changes.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5277 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 23:13:54 +00:00
depristo d97ed3e080 Comments for Mauricio
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5275 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-18 16:58:34 +00:00
carneiro acad3ada06 changed baq to calculate_as_necessary.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5270 348d0f76-0448-11de-a6fe-93d51630548a
2011-02-17 23:50:46 +00:00