Commit Graph

98 Commits (d492eb94ad2bb7e5c79a4d9cd051feca3abfd8a7)

Author SHA1 Message Date
kiran d492eb94ad Actually subsets the resulting table now, like it was supposed to all along.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4696 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-17 16:18:23 +00:00
kiran 50dbbdb8ab Retrieves per-sample or per-lane metrics from the SQUID database and populates a dataframe with the results.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4693 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-16 22:46:07 +00:00
depristo 44d0cb6cde New version of cutting routines for VQSR. Old code removed. Working unit tests. Best practice with testng integration test (everyone look at it). Walker test now lets you omit the number of input files if it can infer the input count from the MD5s.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4664 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-13 16:19:56 +00:00
depristo 4f4eec12dd Minor improvement
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4659 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-12 19:30:54 +00:00
depristo 760f06cf8c now prints a nice report, can be invoked from command line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4641 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-09 19:44:10 +00:00
depristo 3c08a1c061 Basic script for assessing simulation sensitivity and specificity
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4638 348d0f76-0448-11de-a6fe-93d51630548a
2010-11-08 21:02:10 +00:00
kiran 1d68b28bbd Takes a list of BAMs, looks up the read group information in the sequencing platform's SQUID database, and computes the tearsheet stats. Also takes the VariantEval output (R format) and outputs the variant stats and some plots for the tearsheet. This script requires the gsalib library to be in the R library path (add the line '.libPaths('/path/to/Sting/R/')' to your ~/.Rprofile).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4584 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-27 19:06:22 +00:00
depristo 0508dd0c31 Better reporting -- figured out how to drop unused levels in subset
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4438 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:31:51 +00:00
kiran 24cf6f9e36 Fix to handle situation where there are no filtered variants.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4424 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 18:34:01 +00:00
kiran 62f5383859 * Added an R package, "gsalib", providing a place to store common, useful, documented R methods. To use this module, you must follow three steps:
1) Build the module with the following command:
$ ant gsalib

2) Add the module path to your ~/.Rprofile file:
.libPaths("/path/to/Sting/trunk/R/")

3) At the top of each R script that will use the library, include the line:
library(gsalib)

You can now use the package like any other R package.  To get high-level documentation, supply the following command to R:
help(gsalib)

The methods contained herein are:

    getargs         : A method to easily provide arguments to interactive and non-interactive scripts.
                        Prints out a help message specifying how the script should be run if no arguments
                        or "-h" is provided.  Very helpful when you're writing an R-script piecemeal in
                        interactive mode, then want to make it a command-line program.
    plot.venn       : Plots a two-way or three-way proportional Venn diagram.
    read.eval       : Reads VariantEval output that's formatted in R style.
    read.gatkreport : Reads GATKReport output.
    gsa.message     : Emits a message with the prefix "[gsalib]" to stdout.
    gsa.warn        : Emits a warning message with the prefix "[gsalib] Warning:" to stdout.
    gsa.error       : Emits an error message with the prefix "[gsalib] Error:" to stdout, calls traceback()
                        and halts execution.

Documentation on each of these methods can be obtained by typing "help(method_name)" at the R prompt.

* Retired GATKReport.R, as that functionality has now been moved to gsalib.
* Retired gsacommons, as that functionality has been split between gsalib and VariantReport.R.
* Modified VariantReport.R to make use of gsalib.  The script now uses the getargs() method to provide the user with some information as to the proper way to run the script.  Documentation on how to prepare output is given at http://www.broadinstitute.org/gsa/wiki/index.php/VariantEval .
* Added 'gsalib' target to build.xml file.  Running "ant gsalib" will compile this module and place the R-ready package in R/gsalib .



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4416 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-04 00:27:59 +00:00
kiran 40b2f62a83 Changed precision on Ti/Tv in venn diagrams
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4413 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-02 05:27:13 +00:00
kiran d0e44b7a8e Lower precision on Ti/Tv in variant summary matrix
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4412 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-02 05:18:48 +00:00
kiran 6deb755164 Ti/Tv plots are restricted to a Ti/Tv range of 0.0-4.0. Added column to variant summary specifying the total variant counts (known+novel). Allele spectrum plots now show neutral expectation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4411 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-02 05:15:34 +00:00
kiran 1d7e48c4b0 Venn diagrams are now oriented properly when a < b. Added a slide with callset summary table. All plots now show the present-in-a, filtered-in-b metrics. Added title page with project name, author, and timestamp.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4407 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 22:17:21 +00:00
kiran fe29c8b09c Placeholder commit: improvements to VariantReport (now shows stats for variants that are called in one set and filtered in another). Better command-line argument support.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4404 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-01 18:46:53 +00:00
corin 9cf079e1bb Ready for integration with queue script
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4346 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-24 19:46:01 +00:00
corin d6bd1debeb This is an updated version of the automated data processing report. Each page in the report is a stand-alone function; these are linked together by a function that pulls all appropriate data (assuming a standard naming convention) and generates the PDF. The script still needs to respond appropriately when it doesn't find the data it needs, and it still needs database access and a way of getting some information from sequencing for the tearsheet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4335 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-23 18:08:16 +00:00
depristo b57a0a0310 improvements to the report code
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4280 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 00:45:13 +00:00
kiran dfdd0b69a9 Removed unused dependency (it was causing a problem by looking for an X11 connection that didn't necessarily exist).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4244 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 19:56:00 +00:00
depristo 594fb4a547 More plots in report
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4225 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-08 02:56:51 +00:00
kiran 19e22cfa87 Fixed a bug where the script looked for the wrong column name. Also, all results are now returned in a single plot.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4216 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-06 14:19:57 +00:00
depristo 0c54bf4195 Better reporting and now with a special mode for listing exceptions
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4183 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-01 16:19:51 +00:00
corin cdad243645 updated version of the DPR. Now produces part of the tearsheet as well as good depth of coverage figures
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4182 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-01 15:38:58 +00:00
depristo fc5caa98a5 Improved reporting now with metrics by day/week/etc.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4180 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-01 02:43:13 +00:00
depristo 8683087756 Suppl. tools for working with and displaying GATK run reports
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4176 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-31 20:32:22 +00:00
kiran e14a347e2e Now prints cluster report to a single PDF, rather than a dozen different PDFs.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4164 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-29 18:58:39 +00:00
kiran fd19c63aaf A data structure that allows data to be collected over the course of a walker's computation, then have that data written to a PrintStream such that it's human-readable, AWK-able, and R-friendly (given that you load it using the GATKReport loader module).
This object is designed to be both the structure that holds data during the execution of the walker and the object that properly formats and emits the data so that it can be easily loaded into R.  In the end, you get a table that looks like this:

##:GATKReport.v0.1 ErrorRatePerCycle : The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7
0      0.007451835696110506      25.474613284804366
1      0.002362777171937477      29.844949954504095
2      9.087604507451836E-4      32.87590975254731
3      5.452562704471102E-4      34.498999090081895
4      9.087604507451836E-4      35.14831665150137
5      5.452562704471102E-4      36.07223435225619
6      5.452562704471102E-4      36.1217248908297
7      5.452562704471102E-4      36.1910480349345
8      5.452562704471102E-4      36.00345705967977
...

A GATKReport object can hold multiple tables, and the write() method will emit all tables in succession.  Each table starts with its own ##:GATKReport.v0.1 table header, so each table can stand alone.  This allows for tables to be mixed and matched in a single file, or for the output from different walkers to be combined into a single file with no ill effect.
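
Because every table carries its own header line, a reader can split a combined file on that header alone. The following is a minimal Python sketch of that idea, not the GATKReport loader module itself. It assumes the header has the form "##:GATKReport.v0.1 NAME : DESCRIPTION" and that rows are whitespace-separated, as in the sample above; the function name parse_gatk_report and the second table in SAMPLE are made up for illustration.

```python
def _to_number(value):
    """Return value as a float when possible, otherwise leave it as a string."""
    try:
        return float(value)
    except ValueError:
        return value


def parse_gatk_report(text):
    """Split report text into {table name: {column name: [values]}}.

    Assumption: each table starts with a '##:GATKReport...' header line,
    followed by one header row and then whitespace-separated data rows.
    """
    tables = {}
    current = None   # table currently being filled
    columns = None   # column names of the current table, once seen
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "...":
            continue                      # skip blanks and elision markers
        if line.startswith("##:GATKReport"):
            name = line.split()[1]        # second token is the table name
            current = tables[name] = {}
            columns = None
        elif current is not None and columns is None:
            columns = line.split()        # first row after the header
            for column in columns:
                current[column] = []
        elif current is not None:
            for column, value in zip(columns, line.split()):
                current[column].append(_to_number(value))
    return tables


SAMPLE = """\
##:GATKReport.v0.1 ErrorRatePerCycle : The error rate per sequenced position in the reads
cycle  errorrate.61PA8.7         qualavg.61PA8.7
0      0.007451835696110506      25.474613284804366
1      0.002362777171937477      29.844949954504095
##:GATKReport.v0.1 Example : A second, made-up table
key  value
a    1
"""

tables = parse_gatk_report(SAMPLE)
```

Converting values with float() where possible keeps numeric columns from being read as strings; non-numeric cells are left untouched.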

The display property of individual columns can be turned off.  This is useful when a column is used to store intermediate results, necessary for the computation of some later value, but the contents of the intermediate column itself are not required in the final output file.

Finally, the GATKReportTable allows for some simple, mathematical, element-wise and column-wise operations.  For instance, two whole columns can be divided, the results of the operation being stored in a third column.  This mimics the most basic of R operations, where whole vectors can be added, subtracted, multiplied or divided without requiring the developer to explicitly write a loop.
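
The hidden-column and column-wise ideas above can be sketched together in a few lines of Python. This is an illustration only, not the GATKReportTable API: the class ReportTable, its methods, and the column names "mismatches" and "bases" are all hypothetical.

```python
class ReportTable:
    """Hypothetical stand-in for the table described above."""

    def __init__(self):
        self.columns = {}   # column name -> list of values
        self.display = {}   # column name -> whether the column is emitted

    def add_column(self, name, values, display=True):
        self.columns[name] = list(values)
        self.display[name] = display

    def divide_columns(self, target, numerator, denominator):
        """Store numerator / denominator, element by element, in target."""
        num = self.columns[numerator]
        den = self.columns[denominator]
        # Assumption: a zero denominator yields NaN rather than an error.
        result = [n / d if d else float("nan") for n, d in zip(num, den)]
        self.add_column(target, result)

    def visible_rows(self):
        """Return the header row plus data rows, displayed columns only."""
        names = [n for n in self.columns if self.display[n]]
        data = zip(*(self.columns[n] for n in names))
        return [names] + [list(row) for row in data]


table = ReportTable()
table.add_column("cycle", [0, 1, 2])
# Intermediate columns: needed for the computation, hidden in the output.
table.add_column("mismatches", [4, 1, 0], display=False)
table.add_column("bases", [400, 500, 500], display=False)
table.divide_columns("errorrate", "mismatches", "bases")
rows = table.visible_rows()
```

Here rows holds only the "cycle" and "errorrate" columns; the two intermediate columns stay in the table but are not emitted.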



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4159 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-29 05:39:24 +00:00
corin 8931a63588 updated a whole bunch of column names to work like i want them to and added more informative figures for DOC
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4131 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-26 18:19:09 +00:00
kiran fba71e3c15 Placeholder commit. Implements a loader for a new multi-part GATK reporting format. See what it looks like at /home/radon01/kiran/scr1/projects/NewVariantEvalOutput/results/v1/tableexample.txt . Still need to address the issue where numeric columns are being interpreted as a vector of strings, not numbers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4115 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-25 18:48:44 +00:00
corin 8054b6b295 Changing a name of a column for variantevals output for easier reading by R--let me know if this needs to be updated elsewhere; it's just a space to an underscore.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4062 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-19 19:18:16 +00:00
depristo ede87a03c2 Nicer plotting routine for tranches. Add a third arg to suppress the legend.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4049 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-17 19:20:58 +00:00
depristo e0abb73fd7 plot now assumes 1 / 1000 is the min error rate, not 1/100
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4010 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 14:48:22 +00:00
kiran 6037443e55 Handle interactive and non-interactive modes more elegantly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4009 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 02:38:53 +00:00
kiran a7409df1a6 Be more robust to missing or empty files in VariantEval output.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4008 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-11 02:22:50 +00:00
depristo 67063deb16 Removed coloring by mixture weight. Each cluster gets a distinct color, and the legend indicates which cluster has which id and its weight
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4001 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 14:28:24 +00:00
depristo 672bee295c now plots tranches separately from optimizer
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4000 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-10 12:02:52 +00:00
depristo 41fee2d75e Publication tranches report is now the default output
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3967 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-07 13:58:59 +00:00
depristo f4ffef4479 Default max variants is now 5000
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3966 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-07 13:58:32 +00:00
depristo b63d64bbbc Beautiful labels, better choice of dimension ranges. Supports fast loading of just first N records for testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3964 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 23:17:32 +00:00
depristo d3bebe0f2c Reasonable comment
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3963 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 22:03:55 +00:00
depristo bb5dfd7e5e Slightly nicer plotting; not yet complete
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3961 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 20:01:31 +00:00
depristo 70f492a6e8 Prints out trivial debugging info
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3957 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-06 13:24:21 +00:00
kiran 1a36cb9296 Can now set the maximum number of variants to see in a cluster plot (useful when you don't need to see a billion points to get an idea of what's going on). Limit applies to known and novel variants separately.
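
One way to read "applies to known and novel variants separately" is sketched below in Python. It is an assumption-laden illustration, not the R script's code: the function cap_per_group and the "known" field are hypothetical, and keeping the first N records of each group (rather than, say, a random sample) is a guess.

```python
def cap_per_group(variants, max_variants):
    """Keep at most max_variants known and at most max_variants novel records.

    Assumption: each record is a dict with a boolean 'known' field, and the
    first max_variants records of each group are the ones kept.
    """
    kept = []
    counts = {True: 0, False: 0}   # records kept so far, known vs. novel
    for variant in variants:
        group = variant["known"]
        if counts[group] < max_variants:
            counts[group] += 1
            kept.append(variant)
    return kept


# Ten made-up records: ids 0, 3, 6, 9 are novel, the rest are known.
variants = [{"id": i, "known": i % 3 != 0} for i in range(10)]
subset = cap_per_group(variants, 2)
```

With a limit of 2, the subset contains up to four records in total: two known and two novel.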
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3937 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 18:45:24 +00:00
kiran bd27287fe7 An R module that takes in a Variant Recalibration cluster file (file with '@!CLUSTER' lines in it), a tabularized VCF, and optionally a set of loci that should be examined more carefully, and emits a tremendous number of plots. For every annotation used in clustering, the distributions and pair-wise comparison (with ellipses denoting the 2-sigma cluster boundaries) are shown. Each cluster is shaded with a color proportional to its mixture coefficient.
To use this module, you'll first have to take your VCF and create an R-readable table out of it with the following command:

python /path/to/Sting/trunk/python/vcf2table.py -f CHROM,POS,ID,AC,AF,AN,DB,DP,HRun,MQ,MQ0,MyHaplotypeScore,QD,SB my.vcf > my.vcf.table

Then, simply invoke this module with the command:

Rscript /path/to/Sting/trunk/R/VariantRecalibratorReport/VariantRecalibratorReport.R /path/to/output/prefix /path/to/my/my.clusters /path/to/my.vcf.table [/path/to/my.suspicious.loci]

This will create a number of plots all with the prefix "/path/to/output/prefix".  For instance, if you used QD, SB, HRun, and MyHaplotypeScore annotations during clustering, you should see output like this:

    /path/to/output/prefix.anndist.HRun.pdf
    /path/to/output/prefix.anndist.MyHaplotypeScore.pdf
    /path/to/output/prefix.anndist.QD.pdf
    /path/to/output/prefix.anndist.SB.pdf
    /path/to/output/prefix.cluster.HRun_vs_MyHaplotypeScore.pdf
    /path/to/output/prefix.cluster.HRun_vs_QD.pdf
    /path/to/output/prefix.cluster.HRun_vs_SB.pdf
    /path/to/output/prefix.cluster.MyHaplotypeScore_vs_HRun.pdf
    /path/to/output/prefix.cluster.MyHaplotypeScore_vs_QD.pdf
    /path/to/output/prefix.cluster.MyHaplotypeScore_vs_SB.pdf
    /path/to/output/prefix.cluster.QD_vs_HRun.pdf
    /path/to/output/prefix.cluster.QD_vs_MyHaplotypeScore.pdf
    /path/to/output/prefix.cluster.QD_vs_SB.pdf
    /path/to/output/prefix.cluster.SB_vs_HRun.pdf
    /path/to/output/prefix.cluster.SB_vs_MyHaplotypeScore.pdf
    /path/to/output/prefix.cluster.SB_vs_QD.pdf
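
The listing above follows a simple pattern: one "anndist" file per annotation and one "cluster" file per ordered pair of distinct annotations. The Python sketch below reproduces that pattern so you can check an output directory against it; the function name expected_plot_files is hypothetical, and the pattern is inferred from the example listing rather than from the script itself.

```python
from itertools import permutations


def expected_plot_files(prefix, annotations):
    """List the PDF names implied by the naming pattern in the listing above."""
    names = sorted(annotations)
    # One distribution plot per annotation.
    files = [f"{prefix}.anndist.{a}.pdf" for a in names]
    # One pair-wise plot per ordered pair (A_vs_B and B_vs_A both appear).
    files += [f"{prefix}.cluster.{a}_vs_{b}.pdf" for a, b in permutations(names, 2)]
    return files


files = expected_plot_files(
    "/path/to/output/prefix", ["QD", "SB", "HRun", "MyHaplotypeScore"]
)
```

For n annotations this gives n + n(n-1) files, so the four annotations in the example yield the 16 names listed.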



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3936 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 18:35:14 +00:00
kiran b990a22bac A very nice way of automatically plotting the results of a VariantEval run. All of the hard work is actually in the common R repository, gsacommons.R, including methods for creating a Venn diagram. It also provides a mechanism for the output of a VariantEval run to be loaded into a single list object.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3828 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-19 12:38:26 +00:00
depristo 6ffcaa0afe Can run R scripts on the command line
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3750 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-09 00:13:18 +00:00
depristo 66931d433c useful routines for R
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3685 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-30 16:38:49 +00:00
corin bcab0eba01 This replaces tearsheet.r, neatens up graphics, and allows the script to be used in R's interactive environment
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3625 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-24 01:02:58 +00:00
corin ae88630d52 This script produces tearsheet and data processing report figures and tables when given Squid and Firehose produced data
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3594 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 21:36:29 +00:00
corin a2c266bda3 This script accepts file paths to analysis metrics tables and produces tearsheet data and data processing report graphs
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3585 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 13:02:25 +00:00