Commit Graph

973 Commits (2520889cb35d09acf8313868c542bdfd98b9ae80)

Author SHA1 Message Date
jmaguire 81313d9452 added class VCFMerge
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2840 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-15 14:41:50 +00:00
jmaguire 0ef50bcae7 - update to match recent changes in the VCF parser
- compute Het Error Rate in VCFConcordance
- changes to the frequency-specific optimizer
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2839 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-15 14:27:01 +00:00
chartl 04a2784bf7 Initial commit of tools under development for data QC through firehose.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2834 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-12 19:13:24 +00:00
rpoplin ecebf0bc62 Bug fix for null pointer exception in AnalyzeAnnotations if -name argument isn't specified
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2828 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-11 18:39:26 +00:00
mmelgar ad608d0e9d Cleaned up documentation on SecondaryBaseTransitionTableWalker and added Read Group and Allele Balance to the info.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2827 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-11 17:20:35 +00:00
andrewk 369cc50802 Added playground walker that does a basic concordance check between two VCF files - an eval and a truth file - across all samples in the eval file. Produces per-sample, per-locus debug info and simple concordance stats. This is not meant to be extended, but rather used for validating the HapMap to VCF conversion in preparation for retiring GFF-based HapMap data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2813 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-09 02:41:18 +00:00
depristo c6d86da4b8 almost managed to move things around perfectly in one go
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2788 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 14:18:26 +00:00
depristo 69132c81aa Documentation. Plus nicer structure to adaptors. Intermediate checkin before move into core
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2783 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-05 13:33:27 +00:00
depristo 1d86dd7fd1 Interface changes following Matt's advice. VariantContexts are now immutable, and there are special mutable versions in case you need to change things. AttributedObject is now an InferredGeneticContext and package protected. VariantContexts are now named, which makes them easier to use with the rod system
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2780 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-04 20:55:49 +00:00
rpoplin 210c4c9913 AnalyzeAnnotations now makes plots for the value in the QUAL column as if it were an annotation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2771 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-03 20:33:15 +00:00
hanna 9dbdfff786 Moved VariantEval to core. Updated integration test md5s to reflect new Analysis class names.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2762 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-02 00:22:15 +00:00
chartl 2c4f709f6f Bunch of oneoff stuff that I don't want to lose. Also:
VCFRecord - "." dbsnp-ID entries now taken into account (thought these were represented as null; but I guess not)
VCFGenotypeRecord - added a replaceFormat option; since intersecting Broad/BC call sets required genotype formats also be intersected (no changing on-the-fly)
VCFCombine - altered doc to instruct user to give complete priority list (was throwing exception if not)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2760 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 21:35:10 +00:00
ebanks 506d39f751 The UG calculations are now driven by an independent engine.
This completely separates the genotyper walker from other walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2758 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 20:57:31 +00:00
ebanks e0808e6c37 Moved old EM model to archive
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2754 348d0f76-0448-11de-a6fe-93d51630548a
2010-02-01 02:55:32 +00:00
ebanks f6da57dc79 1. For Matt: JIRA GSA-270. Other walkers needing to call into the Unified Genotyper now use static methods (e.g. runGenotyper()) instead of calling initialize and map.
2. Set the default confidence cutoff to 50 (instead of 0).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2752 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-31 21:14:57 +00:00
depristo 3d45457595 VariantEval2 test framework implemented; Kiran is experimenting with the system. Not for use by anyone else. VariantContext appears to work well; I'll release it next week for general use following docs of the functions. Removing newvarianteval and other classes to avoid any future confusion. Update to TraverseLoci and RodLocusView to simplify a few functions and to correct some minor errors. All tests pass without modification.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2748 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-30 20:51:24 +00:00
chartl 97f60dbc4b Moving stuff around. ( core;playground ) ----> ( oneoffs ). I've been a bad boy, sullying the core codebase.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2745 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 22:50:03 +00:00
rpoplin 16da5011c0 Added a new option for indicating the mean number of variants on the AnalyzeAnnotations plots. This way one can say, for example, filtering at this point will keep 75 percent of all the variants.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2744 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 21:58:31 +00:00
rpoplin c6cc844e55 Added -name argument to AnalyzeAnnotations that allows one to specify the name of the annotation to be used on the plots. Instead of seeing AB and DP, one can add -name AB,AlleleBalance -name DP,Depth
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2742 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:48:53 +00:00
rpoplin 4f29a1d4f6 AnalyzeAnnotations now plots true positive rate instead of percentage of variants found in the truth set. Committing GCContentCovariate to help people experiment with correcting the pilot3/Kristian base calling error mode in slx.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2740 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 20:01:56 +00:00
depristo 1993472b38 Just like VariantFiltration but lets you match info fields out of the VCF instead of annotating them.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2736 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:38:03 +00:00
depristo 0a7426c29c Computes SNP density over the genome. Doesn't work with intervals
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2735 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:36:49 +00:00
depristo 9decd20f46 Fix to priors to allow lower het values for mouse guys; no integration test changes
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2734 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-29 15:36:12 +00:00
rpoplin 79c4cc1db7 AnalyzeAnnotations now breaks out titv by calls in hapmap and also plots true positive rates. Any ROD passed in whose name starts with 'truth' is considered to be the truth set.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2726 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 21:41:23 +00:00
chartl 8de6a8d246 Lots of changes; all to do something relatively minor.
1) Changed VCF/RodVCF to support querying whether or not a site is novel; isNovel() looks at the ID field and at those members of the info field that indicate membership in dbsnp, hapmap2, or hapmap3, and returns true if none can be found.
2) Changed VariantAnnotator to annotate hapmap2 and hapmap3, if you bind rods to it with those names. Works in the same way as DBSNP does -- if you give it a rod named "hapmap2" it'll annotate membership in it. -- Passes integration tests
3) Changed UnifiedGenotyper to do the same thing (since it uses Annotations as a subroutine) -- Passes integration tests
4) Changed MultiSampleConcordanceWalker to take a flag --ignoreKnownSites (or -novels) to examine concordance only on sites that are not marked as in dbSNP or in Hapmap in the variant VCF
5) Changed VCFConcordanceCalculator (the object MultiSampleConcordanceWalker runs on) to output Concordant_Het_Calls and Concordant_Hom_Calls separately, rather than combined as Concordant_Calls
6) AlleleBalanceHistogramWalker -- I don't know what I did to this thing. I've been jury-rigging System.outs to do stuff it was never really intended to do, so there's probably some dumb System.out.print("HI I AM AT LOCUS:"+loc) stuck somewhere. It compiles at any rate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2724 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 21:06:56 +00:00
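The novelty check described in item 1 of commit 8de6a8d246 can be sketched roughly as below. This is a Python illustration, not the Java code from the repository: the function name `is_novel`, the dictionary representation of the info field, and the use of the VCF reserved INFO keys `DB`, `H2`, and `H3` for dbSNP, HapMap2, and HapMap3 membership are all assumptions made for the example.

```python
# Hypothetical sketch of the isNovel() idea: a site counts as novel only when
# neither the ID field nor the membership flags point to a known database.
KNOWN_SITE_FLAGS = ("DB", "H2", "H3")  # assumed keys: dbSNP, HapMap2, HapMap3


def is_novel(record_id, info):
    """Return True when no evidence of dbSNP/HapMap membership is found.

    record_id: the VCF ID column value ("." or None means no ID).
    info: dict of INFO keys to values; flags are stored as True when present.
    """
    has_id = record_id not in (None, "", ".")
    in_known_set = any(info.get(flag) for flag in KNOWN_SITE_FLAGS)
    return not has_id and not in_known_set


print(is_novel(".", {"DP": 34}))
print(is_novel("rs123", {}))
```

Treating "." as a missing ID follows the note in commit 2c4f709f6f that "." dbsnp-ID entries have to be taken into account explicitly.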
depristo 956b570c8e V5 improvements to VariantContext. Now fully supports genotypes. Filtering enabled. Significant tests throughout system. Support for rebuilding variant contexts from subsets of genotypes. Some code cleanup around repository
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2721 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-28 18:37:17 +00:00
chartl 23fc9737b4 Added the ability to filter out variant (not truth) calls based on read depth. Using -NLD 5 will not update concordant counts for calls with 0, 1, 2, 3, or 4 reads supporting them. Not to be used with VCF files that do not have DP in the format field.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2716 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 23:28:04 +00:00
chartl 1b9184a1c7 Added a multisample concordance walker which takes the place of the VCF python library I've been using. Takes a truth VCF and a variant VCF and outputs a TSV that looks like this:
Sample_ID  Concordant_Refs  Concordant_Vars  Homs_called_het  Het_called_homs  False_Positives  False_Negatives_Due_To_Ref_Call  False_Negatives_Due_To_No_Call
NA19381    491              294              2                0                0                0                                1
NA19451    489              298              1                0                0                0                                0
NA19463    486              289              2                3                1                4                                3
NA19376    488              296              1                0                2                0                                1
NA19317    489              284              5                3                3                3                                1
This walker will be merged with GenotypeConcordance once it's clear how to do so.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2715 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 22:59:17 +00:00
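As a rough illustration of how the per-sample TSV from commit 1b9184a1c7 might be consumed downstream, the Python sketch below reads the columns named in that commit message and derives one summary ratio per sample. The function name `concordance_rates` and the ratio itself (concordant calls divided by all counted calls) are assumptions for the example; the walker is not known to emit such a statistic.

```python
import csv
import io

# Column names are taken from the header shown in the commit message.
CONCORDANT = ("Concordant_Refs", "Concordant_Vars")
DISCORDANT = (
    "Homs_called_het",
    "Het_called_homs",
    "False_Positives",
    "False_Negatives_Due_To_Ref_Call",
    "False_Negatives_Due_To_No_Call",
)


def concordance_rates(tsv_text):
    """Map each Sample_ID to concordant / (concordant + discordant).

    The ratio is an illustrative summary, not something the walker outputs.
    """
    rates = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        concordant = sum(int(row[name]) for name in CONCORDANT)
        discordant = sum(int(row[name]) for name in DISCORDANT)
        total = concordant + discordant
        rates[row["Sample_ID"]] = concordant / total if total else None
    return rates


# Build a one-row example from the NA19451 line in the commit message.
header = "\t".join(("Sample_ID",) + CONCORDANT + DISCORDANT)
example = header + "\n" + "\t".join(
    ["NA19451", "489", "298", "1", "0", "0", "0", "0"]
) + "\n"
print(concordance_rates(example))
```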
rpoplin b8ae083d1b AnalyzeAnnotations creates a plot of dbsnp rate as a function of the annotations.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2711 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 21:08:33 +00:00
rpoplin 3999a8d2c8 IntelliJ no longer complains that my methods are too complex to analyze.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2708 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 20:12:13 +00:00
rpoplin fc4285f9fd AnalyzeAnnotations seems to be popular so I've rewritten the guts to be easier to extend and maintain.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2707 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 19:30:31 +00:00
rpoplin 4bcdab580c --output_dir has been changed to --output_prefix to give the user more control over the names of the resulting mass of files in AnalyzeAnnotations. The fontsize of the axes is increased. Cumulative filtering plots are removed since the binned filtering plots are much more useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2700 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-27 04:50:54 +00:00
rpoplin 0345d9f6a5 Updating the recalibrator to use non-deprecated getPileup() method. Adding documentation to AnalyzeAnnotations so that the walker isn't marked as unclean at compile time.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2688 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-26 14:15:09 +00:00
rpoplin 24d4082925 AnalyzeAnnotations can now process only variants that are found in samples that match the -sampleName argument. X-axes of plots no longer use annoying scientific notation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2684 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 20:52:11 +00:00
rpoplin 2b51cf18f0 AnalyzeAnnotations now outputs plots with log x-axis in addition to standard x-axis so things like DP and MQ0 are easier to see. AnalyzeAnnotations now skips over all annotations that aren't floating point values. Recalibrator now warns users if PL tags are missing and it is therefore reverting to illumina.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2681 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-25 19:39:18 +00:00
jmaguire 588417e17d Don't reference that optimization library I'm not using anyway.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2676 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-24 20:30:50 +00:00
jmaguire d3e3c1c2e0 don't require that optimization lib that I'm not using yet... (doh)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2675 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-24 20:28:21 +00:00
jmaguire 1d6d2b26f7 tools for optimizing calls.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2674 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-24 20:16:55 +00:00
jmaguire 877957761f lots of new stuff, some generally useful, some one-off.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2673 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-24 19:50:48 +00:00
depristo c871a0f221 UG map() now returns a VariantCallContext object. Also has a field for confidentlyCalledBases. UG reduce() emits statistics on the % of bases confidently called
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2664 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 23:06:43 +00:00
chartl fbf82526cb Minor renaming changes.
PlinkRodWithGenomeLoc now supports .bed file parsing (and doesn't require |c#_p# conventions for SNPs -- still requires _g[I/D] for indels)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2663 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 23:06:32 +00:00
rpoplin a11503819a AnalyzeAnnotations now breaks out its TiTv plots into novel SNPs, dbSNP sites, and combined.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2659 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-22 19:00:23 +00:00
rpoplin d9df72e1b5 AnalyzeAnnotations now bins variants per each annotation and outputs plots of TiTv ratio as a function of the annotation's value.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2654 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-21 21:15:11 +00:00
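The idea in commit d9df72e1b5, binning variants by an annotation's value and reporting the transition/transversion (TiTv) ratio per bin, can be sketched as follows. This is a Python illustration only: the function name `titv_by_bin`, the `(ref, alt, value)` tuple layout, and the fixed-width bins are assumptions, and the actual binning scheme used by AnalyzeAnnotations is not stated in the log.

```python
from collections import defaultdict

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_by_bin(variants, bin_width):
    """Return {bin_index: Ti/Tv ratio} for SNPs grouped by annotation value.

    variants: iterable of (ref, alt, annotation_value) tuples (assumed layout).
    bin_width: width of each equal-width bin (an assumption of this sketch).
    A bin with no transversions maps to None instead of dividing by zero.
    """
    counts = defaultdict(lambda: [0, 0])  # bin_index -> [transitions, transversions]
    for ref, alt, value in variants:
        bin_index = int(value // bin_width)
        is_transition = (ref.upper(), alt.upper()) in TRANSITIONS
        counts[bin_index][0 if is_transition else 1] += 1
    return {
        b: (ti / tv if tv else None)
        for b, (ti, tv) in sorted(counts.items())
    }


snps = [("A", "G", 1.0), ("C", "T", 2.0), ("A", "C", 3.0), ("G", "T", 12.0)]
print(titv_by_bin(snps, bin_width=10))
```

Plotting the returned ratios against the bin positions would give the kind of per-annotation curve the commit describes.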
chartl f51cffe220 Alteration of PlinkToVCF to be much more flexible about parsing .ped file headers, which can have one of a number of different standard fields, and be in different orders.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2650 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-21 18:02:28 +00:00
chartl 5b2a1e483e Renamed SequenomToVCF as PlinkToVCF. Wiki will be changed accordingly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2649 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-21 17:35:20 +00:00
depristo ff66023d83 Trivial change to support filter field in VCF
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2636 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 22:56:22 +00:00
depristo 9e0ae993c7 -B 1kg_ceu,VCF,CEU.vcf -B 1kg_yri,VCF,YRI.vcf system supported to allow 1KG % (like dbSNP%)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2632 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 21:33:13 +00:00
rpoplin c98df0a862 Updated solid_recal_modes to work with bfast aligned data. Added an integration test that uses the BFAST file provided by TGen.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2630 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 21:18:02 +00:00
rpoplin a12465b6d5 The recalFile argument is no longer added into the PG tag of a bam produced by TableRecalibration. Based on a request from the Sanger.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2625 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-19 15:25:57 +00:00
rpoplin ba19afd529 Draft version of AnalyzeAnnotations which creates plots of cumulative TiTv ratio versus filter value per each annotation in the input VCF rod. Minor cleanup of recalibration walkers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@2623 348d0f76-0448-11de-a6fe-93d51630548a
2010-01-18 20:47:10 +00:00