Commit Graph

4424 Commits (cece19d4d2f0dcd84f64cef23a9523ecc9d35219)

Author SHA1 Message Date
asivache cece19d4d2 not ready for commit yet
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4465 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:14:54 +00:00
asivache 39e373af6e deleting accidentally committed junk
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4464 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 15:13:01 +00:00
asivache b3d81984aa renaming MergingIterator to RODMergingIterator as it is more appropriate for this specialized implementation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4462 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 14:10:11 +00:00
asivache 77dddd0afa renaming MergingIterator to RODMergingIterator as it is more appropriate for this specialized implementation
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4461 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 14:08:28 +00:00
chartl bffb8bb01f The SVN repository is not for dumb analysis-specific scripts.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4460 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 14:04:53 +00:00
chartl 21ec44339d Somewhat major update. Changes:
- ProduceBeagleInputWalker
 + Now takes a validation ROD and a prior to give it, will use those genotypes in place of the variant genotypes if both are present
 + Takes a bootstrap argument -- can use some given percentage of the validation sites
 + Optionally takes a bootstrap output argument -- re-prints the validation VCF, filtering those sites used as part of the bootstrap
- BeagleOutputToVCFWalker
 + Now filters sites where the genotypes have been reverted to hom ref
 + Now calls in to the new VCUtils to calculate AC/AN

- Queue
 + New pipeline libraries for easy qscript creation, still a work in progress, but this is a considerable prototype
 + full calling pipeline v2 uses the above libraries
 + minor changes to some of my own scripts
 + no more need for contig interval lists, these will be parsed out of your normal interval list when it is provided



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4459 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 13:30:28 +00:00
kshakir e02f837659 Added the ability for Queue functions like mkdirs to override if they are done or not.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4458 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 06:39:55 +00:00
ebanks 97b153f2fa Quick fix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4457 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 06:10:52 +00:00
ebanks acd238f3f2 For Chris: pull out the chromosome counting code into VCUtils so that other tools can make use of it. Transitioned SelectVariants over to use it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4456 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 04:37:54 +00:00
delangel 3838823262 Two ugly hopefully temporary fixes for new genotyping model:
a) In Indel genotyper: we can't yet deal with extended events correctly, and we are still triggering at each extended event, which results in repeated records in a VCF. To avoid this, keep track of the start positions of candidate variants we've visited, and if we've visited a variant before we don't process it again.
b) Avoid infinite terms in QUAL and in genotype likelihoods, which can happen if posterior AF happens to be exactly zero. For now, hard-code a minimum value for each term of the posterior AF likelihood of -300 (i.e. 1e-300 in linear space). This can be solved with better and smarter log-to-linear conversions and some precision fixes in the AF calculation.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4455 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-08 00:53:16 +00:00
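
Fix (b) in the commit above can be illustrated with a minimal sketch. This is not the GATK code; the function and constant names are hypothetical, and only the -300 floor comes from the commit message.

```python
import math

# Floor taken from the commit message (-300 in log10 space, i.e. 1e-300 in
# linear space). The constant and function names are hypothetical.
MIN_LOG10_TERM = -300.0


def safe_log10(p: float) -> float:
    """Return log10(p), floored at MIN_LOG10_TERM so that p == 0 never yields -inf."""
    if p <= 0.0:
        return MIN_LOG10_TERM
    return max(math.log10(p), MIN_LOG10_TERM)


# A posterior AF term of exactly zero no longer produces an infinite value.
posterior_af_terms = [0.0, 0.25, 0.75]
print([safe_log10(p) for p in posterior_af_terms])
```

Flooring each term keeps downstream sums finite; as the commit notes, it is a stopgap rather than a replacement for more careful log-to-linear conversion.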
depristo 0a2e76e9dc 2nd step towards on the fly indexing. Also fixed parsing bug for headers with < symbols
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4454 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 21:38:46 +00:00
rpoplin 7bb9704592 Update the BeagleOutputToVCF integration test because of removing the source header line. Source headers are provided by the engine for all VCF files now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4453 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 19:55:57 +00:00
kshakir 7f25019f37 Inprocess functions by default now log what output files they are running for.
On -run cleaning up .done and .fail files for jobs that will be run.
Added detection to Firehose YAML generator shell script for (g)awk versions that ignore "\n" in patterns.
Removed obsolete mergeText and splitIntervals shell scripts.


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4452 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 19:08:02 +00:00
rpoplin 0de658534d Removed the qScale arguments in VariantRecalibrator. It is smarter about how it tries to find a cut so the arbitrary scale factor hopefully is no longer necessary. Now the recalibrated variant quality score more accurately reflects our believed lod of the call.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4451 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 18:04:57 +00:00
fromer ee00dcb79d 1. Phasing now ignores bases without minimum base quality (BQ) and minimum mapping quality (MQ); 2. The probability of a non-called base is now divided by 3, to evenly split up the error probability over the non-called bases
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4450 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 17:40:59 +00:00
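
Point 2 of the commit above can be sketched as follows, assuming a Phred-scaled base quality Q with error probability 10^(-Q/10). The function name is hypothetical and the code is an illustration, not the phasing implementation itself.

```python
def base_probabilities(called_base: str, base_quality: int) -> dict:
    """Spread the sequencing error probability evenly over the three non-called bases."""
    error = 10 ** (-base_quality / 10.0)  # Phred scale: Q20 -> 0.01
    return {
        base: (1.0 - error) if base == called_base else error / 3.0
        for base in "ACGT"
    }


probs = base_probabilities("A", 20)
print(probs)
```

Dividing by 3 makes the four probabilities sum to 1, which would not hold if each non-called base were assigned the full error probability.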
ebanks 6205910f9f updating integration test for Sarah Calvo
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4449 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 04:03:37 +00:00
kshakir db47230dd9 Wrapping ScatterGatherableFunctions with a facade instead of using slower clone library. Will require keeping Clone's facade code in sync with CommandLineFunction but runs *much* faster.
Shell invoking scripts so that even really long shell scripts make it through LSF.
Using the truncated command line (up to 1000 characters) as the job name for use with bjobs.
Switched the default from re-running everything to re-running only files that need to be regenerated.  --skip_up_to_date replaced with --start_clean for those who want to regenerate everything.
Updated logging to let users know when the scatter gather generator is running, which still takes a while but is orders of magnitude faster for large lists of functions.  (40s for a 100 function graph exploding to a 2500 function graph)


git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4448 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 01:19:18 +00:00
depristo 04b4adafda File reports are now sorted in order
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4447 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 21:54:23 +00:00
fromer 652a3e8de5 Added integration tests for ReadBackedPhasing
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4446 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:50:32 +00:00
fromer f8f1cc45a3 Now ReadBackedPhasing caps Base Quality by Mapping Quality
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4445 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:48:57 +00:00
scalvo bda427f078 Change specification of AnnotationInputTable, and fix 2 bugs.
Previous output spec contained 3 columns:
 haplotypeReference,haplotypeAlternate,haplotypeStrand
where haplotypeReference was always on the + strand, and haplotypeAlternate was on the strand specified by haplotypeStrand.

The new specification contains 3 columns:
 haplotypeReference,haplotypeAlternate,transcriptStrand
where haplotypeRef and haplotypeAlt are required to be on the + strand.  transcriptStrand now specifies the strand of the transcript, which is needed for interpreting the haplotypes.

Bugfix #1: fix incorrect assignment of variantCodon and variantAA
(Previously variantCodon was incorrectly set to referenceCodon)

Bugfix #2: fix incorrect codingCoordStr values for - strands (bug reported by Giulio Genovese), and incorrect usage of "m." for mitochondrial transcripts (bug reported by Steve Hershman)



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4444 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:46:09 +00:00
scalvo b5c127e643 Removed HAPLOTYPE_STRAND_COLUMN; Previously, GenomicAnnotation allowed a user to specify the strand of the haplotypeAlternate, and would reverseComplement the haplotypeAlternate if HAPLOTYPE_STRAND_COLUMN was "-". The new specification does not allow this functionality, and instead requires both the reference and the alternate haplotypes to be on the + strand (as in VCF format).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4443 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 20:37:41 +00:00
kshakir ca5db821ce Added the ability to Queue to run scala functions inside the JVM. NOTE: Extend from InProcessFunction instead of CommandLineFunction to use this functionality.
Queue now submits new LSF jobs only after previous functions have completed successfully.
When the Queue process is shut down (e.g. via Control-C), it sends a bkill command for any running jobs.
Ported commands like creating directories and scatter/gather interval list to scala functions.
Updates to LSF status tracking by porting the python to internally generated bash scripts.
Temporarily disabled job name submission to LSF.  Plus side is that the full command is now available in "bjobs -w".  TODO: Put back jobName passing to LSF based on an option?
Changed BaseTest to allow scala to access paths to references.
Changed the extension generator to default the analysis name to the walker "name".

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4442 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 18:29:56 +00:00
ebanks 3c5dc675ab For Guillermo: only decide that something is a clear reference call if it is at least 10 times as likely as the next best genotype
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4441 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 15:16:41 +00:00
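
The "10 times as likely" rule in the commit above can be sketched in log10 space, where a tenfold likelihood ratio is a difference of 1. The function name, the genotype keys, and the input layout are assumptions made for illustration only.

```python
import math


def is_clear_ref_call(log10_likelihoods: dict,
                      ref_genotype: str = "0/0",
                      min_ratio: float = 10.0) -> bool:
    """True if the reference genotype is at least min_ratio times as likely
    as the best non-reference genotype."""
    ref = log10_likelihoods[ref_genotype]
    next_best = max(v for g, v in log10_likelihoods.items() if g != ref_genotype)
    return ref - next_best >= math.log10(min_ratio)


print(is_clear_ref_call({"0/0": -1.0, "0/1": -2.5, "1/1": -9.0}))
print(is_clear_ref_call({"0/0": -1.0, "0/1": -1.5, "1/1": -9.0}))
```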
depristo effcd26977 Shorter outputs, new summary mode
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4440 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:34:50 +00:00
depristo d841f260eb minor improvements to queue status
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4439 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:34:35 +00:00
depristo 0508dd0c31 Better reporting -- figured out how to drop unused levels in subset
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4438 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:31:51 +00:00
depristo 00491fcd2e The "not writing GATK Run Report" message is now shown only if you are running with debug enabled
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4437 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:09:21 +00:00
rpoplin 69485d6a7a Added command line argument for the max value of the allele count prior in VariantRecalibrator (--max_ac_prior). Default value increased to 0.99 from 0.95.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4436 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:00:53 +00:00
ebanks c56c2641a8 /broad/1KG doesn't exist
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4435 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 12:54:38 +00:00
ebanks 3d564f4a29 reverting an accidental change from the dindel merge
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4434 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 03:08:09 +00:00
chartl 28ac1d325e Commit for Ryan
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4433 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 19:04:10 +00:00
depristo e8af776b99 Fixes for example bam files
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4432 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 18:43:02 +00:00
ebanks b5e148140b Officially fixed the UG priors; updated the default min MQ/BQs to pipeline values of q20 and min calling threshold to Q50
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4431 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 18:35:36 +00:00
fromer c6668bd49c Fixed bug in phasing, where mapping probability was incorrectly raised to the power of number of non-null bases [instead, it is just multiplied into phasing probability once]
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4430 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 17:07:31 +00:00
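
The bug described in the commit above can be pictured with a small sketch. The function name and the representation of a read as a list of per-base probabilities (with None for null bases) are hypothetical; only the "multiply the mapping probability once" behaviour comes from the commit message.

```python
def read_phasing_probability(base_probs, mapping_prob: float) -> float:
    """Combine per-base probabilities for one read, applying the mapping
    probability a single time rather than once per non-null base."""
    prob = 1.0
    for p in base_probs:
        if p is not None:  # null bases contribute nothing
            prob *= p
    return prob * mapping_prob  # applied exactly once


# Two informative bases and one null base on a read with mapping probability 0.5.
print(read_phasing_probability([0.9, None, 0.8], 0.5))
```

Under the buggy behaviour the same read would have been scaled by 0.5 squared, penalising reads more heavily the more informative bases they carried.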
ebanks 7f1e44b764 update the example: /broad/1KG doesn't exist anymore
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4429 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 13:50:24 +00:00
hanna 250c18e679 Error message fixes for the following issues:
nvjpM4yOwQAu3fNGxi4oXLuVpKn6aAlf,1GL0OuXK2xKQfvbu34tWYgbojSVSLo0l,
ehEGBJOfgc4V7qj8W0Homf5ICuVK5Sm3,cZsreLm1CbY3aYKZhV7DOSvQNwur41zp,
GlrlyGEyP9kJDIRCQNFQp7BGJBXSzdDJ,hyz1uiHXr39ANmdZu9K1epOSX8EL3mDw,
q0n4EucZESCI4LZhQik306zD4VAuH2cb.  

Messages:
camrhG5tHzlY9WUSEVpVZGkU1tyJqKb5,s0OX2g7nYRctJxyFoQCa6clac9IsjHyi,
THIAtjllvYNlnTmiMnJEIHd2Ju4gqQIO,jwVk3JYZJNHloW7HO4LeGxFexknqro0v,
BFNRGOGmGGJNNPZqgeF1ikTNFfskbyLc,...

Were fixed in 4392.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4428 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 03:37:13 +00:00
corin e340be34d8 upping mem limit since something was unhappy with the lower limit
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4427 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-05 02:38:17 +00:00
kshakir bb44044ce0 Fixed re-builds of queue so that previously compiled classes are included. Fixes redundant case of "ant queue test" vs. "ant test".
Refactored temp directory utils.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4426 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 21:12:07 +00:00
kshakir 4dfed62e7d Generating the Queue GATK extensions using java, then compiling all the Queue scala code at once to allow circular dependencies between existing and generated scala code.
Will see how this behaves for those using IntelliJ as generated source code will disappear during an ant clean.

git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4425 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 19:38:29 +00:00
kiran 24cf6f9e36 Fix to handle situation where there are no filtered variants.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4424 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 18:34:01 +00:00
ebanks aa00801108 remove reference to -mrl
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4423 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 17:27:01 +00:00
chartl f978c25b9d Perhaps both, Eric. Perhaps both.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4422 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 13:56:04 +00:00
chartl 0eb777612a Swap "." over to VCFConstants.MISSING_DEPTH_v3
Why v3, you ask? Why not? Simply because v2 was a String so old and clunky, the sun would fizzle out and grow cold before any VCF could be successfully parsed.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4421 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 13:41:41 +00:00
chartl 74087c44ae Fixed a bug which caused a parsing exception when there was a variant with a dp field of ".", e.g. "GT:DP 0/1:." -- which can happen when using imputation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4420 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 12:37:36 +00:00
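
The failure mode in the commit above, a DP value of "." in a sample column, can be handled along these lines. This is a minimal sketch with a hypothetical helper, not the GATK VCF codec.

```python
MISSING = "."  # VCF missing-value marker


def parse_depth(sample_field: str, format_field: str = "GT:DP"):
    """Return DP as an int, or None when DP is absent or given as '.'."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    raw = dict(zip(keys, values)).get("DP", MISSING)
    return None if raw == MISSING else int(raw)


print(parse_depth("0/1:."))   # imputed genotype without depth
print(parse_depth("0/1:12"))
```

Returning None keeps a missing depth distinct from a real depth of zero, which matters for imputed genotypes.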
ebanks 6448753cf7 Removed the SequenomValidationConvertor and renamed it VariantValidationAssessor since it no longer handles ped/sequenom files (but instead works on vcfs/variantcontexts). Updated all of the wiki docs, including adding instructions on how to convert ped files to vcf, a la Shaun Purcell. We now officially no longer support ped files, everyone. Other misc cleanup in the code.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4419 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 02:11:38 +00:00
kiran a15757b8e8 Obsoleted by VariantReport.R
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4418 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 01:00:59 +00:00
kshakir cf01f6d58a Renamed conflicting 'package.dir' in build.xml to 'package.xml.dir'.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4417 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 00:46:47 +00:00
kiran 62f5383859 * Added an R package, "gsalib", providing a place to store common, useful, documented R methods. To use this module, you must follow three steps:
1) Build the module with the following command:
$ ant gsalib

2) Add the module path to your ~/.Rprofile file:
.libPaths("/path/to/Sting/trunk/R/")

3) At the top of each R script that will use the library, include the line:
library(gsalib)

You can now use the package like any other R package.  To get high-level documentation, supply the following command to R:
help(gsalib)

The methods contained herein are:

    getargs         : A method to easily provide arguments to interactive and non-interactive scripts.
                        Prints out a help message specifying how the script should be run if no arguments
                        or "-h" is provided.  Very helpful when you're writing an R-script piecemeal in
                        interactive mode, then want to make it a command-line program.
    plot.venn       : Plots a two-way or three-way proportional Venn diagram.
    read.eval       : Reads VariantEval output that's formatted in R style.
    read.gatkreport : Reads GATKReport output.
    gsa.message     : Emits a message with the prefix "[gsalib]" to stdout.
    gsa.warn        : Emits a warning message with the prefix "[gsalib] Warning:" to stdout.
    gsa.error       : Emits an error message with the prefix "[gsalib] Error:" to stdout, calls traceback()
                        and halts execution.

Documentation on each of these methods can be obtained by typing "help(method_name)" at the R prompt.

* Retired GATKReport.R, as that functionality has now been moved to gsalib.
* Retired gsacommons, as that functionality has been split between gsalib and VariantReport.R.
* Modified VariantReport.R to make use of gsalib.  The script now uses the getargs() method to provide the user with some information as to the proper way to run the script.  Documentation on how to prepare output is given at http://www.broadinstitute.org/gsa/wiki/index.php/VariantEval .
* Added 'gsalib' target to build.xml file.  Running "ant gsalib" will compile this module and place the R-ready package in R/gsalib .



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4416 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 00:27:59 +00:00
ebanks d8db48204e Fix typo and tell people not to post user errors
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4415 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-03 18:58:03 +00:00