Commit Graph

33 Commits (cbce3e3c83c72a8c7dff7b8fec00f00a2b419e83)

Author SHA1 Message Date
depristo 0a2e76e9dc 2nd step towards on the fly indexing. Also fixed parsing bug for headers with < symbols
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4454 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 21:38:46 +00:00
rpoplin 0de658534d Removed the qScale arguments in VariantRecalibrator. It is smarter about how it tries to find a cut so the arbitrary scale factor hopefully is no longer necessary. Now the recalibrated variant quality score more accurately reflects our believed lod of the call.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4451 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-07 18:04:57 +00:00
rpoplin 69485d6a7a Added command line argument for the max value of the allele count prior in VariantRecalibrator (--max_ac_prior). Default value increased to 0.99 from 0.95.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4436 348d0f76-0448-11de-a6fe-93d51630548a
2010-10-06 14:00:53 +00:00
rpoplin a6c7de95c8 By using the AC info field instead of parsing the genotypes we cut 78% off the runtime of VariantRecalibrator. There is a new argument to force the parsing of genotypes if necessary. Various other optimizations throughout.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4383 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-29 18:56:50 +00:00
rpoplin 547763b230 Better error message for Petr's null pointer exception. Also added an exception integration test because I'm certain this used to work.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4319 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-21 13:44:40 +00:00
rpoplin 0a06fbdb94 Adding header lines to output of VR walkers to settle validator warnings. Command lines are added to the VCF header. GATK version numbers will be added to the header lines by Matt.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4288 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 16:45:03 +00:00
ebanks 4e83ba411f We now do lazy loading for the genotype data in VCF. Practically, almost all walkers end of loading the genotype data because we need to be smarter about transfering the unparsed genotype string when modifying VariantContexts; however, this does solve the problem for VR's piece to generate clusters (shaved off 75% of runtime for Ryan's large case). That further optimization will happen later.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4279 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-15 00:18:17 +00:00
rpoplin 7b113a4886 Truncate the floating point numbers coming out of the variant recalibration walkers. Integration tests now work with both 1.6.0_16-b01 and 1.6.0_21-b06
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4253 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 18:37:49 +00:00
rpoplin 61e848c4f0 It's clear from Sendu's calling and my own calling that -qScale 100.0 is a much better default value for low pass data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4248 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-10 01:47:21 +00:00
rpoplin aeb897db7f VR walkers look at by-hapmap validation status by default. Eric will be updating the syntax to allow for more flexibility here.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4242 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 15:40:56 +00:00
rpoplin d625186796 I think the VR integration tests are fine.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4240 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 15:00:41 +00:00
depristo 6a30617a60 Initial implementation of UserError exceptions and error message overhaul. UserErrors and their subclasses UserError.MalFormedBam for example should be used when the GATK detects errors on part of the user. The output for errors is now much clearer and hopefully will reduce GS posts. Please start using UserError and its subclasses in your code. I've replace some, but not all, of the StingExceptions in the GATK with UserError where appropriate.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4239 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-09 11:32:20 +00:00
rpoplin e3962c0d13 VR integration tests are longer but much more useful.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4210 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-03 15:50:19 +00:00
rpoplin 0bb05fb472 Bug fix in VariantRecalibrator. Only add sample names from the input rod bindings, not from all rod bindings.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4194 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-02 21:12:09 +00:00
rpoplin 469bbaa240 Added more integration tests for the variant quality score recalibrator
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4181 348d0f76-0448-11de-a6fe-93d51630548a
2010-09-01 15:31:24 +00:00
rpoplin 9c3f403307 Add the calculated lod value to the info field of each recalibrated VCF record.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4153 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-27 21:33:58 +00:00
rpoplin 5623e01602 GenerateVariantClusters and VariantRecalibrator now uses hapmap and 1kg ROD bindings (in addition to dbsnp) to distinguish between knowns and novels. It no longer looks at by-hapmap validation status so providing hapmap is highly recommended. Example on the wiki. Input variants tracks now must start with input.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4113 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-25 18:33:40 +00:00
hanna bf0b6bd486 Update integration tests to use the new ROD syntax.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4112 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-25 18:13:30 +00:00
rpoplin 85007ffa87 Some clean up for the variant recalibrator. Now uses @Input and @Output so that it can join the Queue party. Users now specify a -o, -clusterFile, -tranchesFile, and -reportDatFile. Example on the wiki. ApplyVariantCuts now has an integration test. Base quality recalibrator now requires a dbsnp rod or vcf file. Now that the base quality recalibrator is using @Output the PrintStream shouldn't be closed in OnTraversalDone.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4101 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-24 20:14:58 +00:00
rpoplin 8f15b2ba72 Memory optimization for the VariantRecalibrator. Only add variants to the list if they pass the novelty and qual filters.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4051 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-17 21:57:28 +00:00
rpoplin 578e7fa36d Don't output -0 as qual value in VariantRecalibrator.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4044 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-16 16:47:58 +00:00
rpoplin 3eee3183fd Checking in the tiger team changes. LOD calculation modified. -qScale is back in case people need it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3990 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-09 20:41:03 +00:00
rpoplin a8d37da10b Checking in everyone's changes to the variant recalibrator. We now calculate the variant quality score as a LOD score between the true and false hypothesis. Allele Count prior is changed to be (1 - 0.5^ac). Known prior breaks out HapMap sites
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3952 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-05 14:12:19 +00:00
ebanks 340bd0e2c1 Removed hard-coded pointers to references
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3934 348d0f76-0448-11de-a6fe-93d51630548a
2010-08-04 17:59:37 +00:00
ebanks c6ad26e04f 1) When quals/GQs are really integers (x.00), strip off the floating points.
2) Keep track of whether vcf records are unfiltered vs. pass filters in the variant context so we can regenerate the records on output.
3) No more "ID" hard-coded all over the code to set the VariantContext ID.  Use a static variable instead.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3840 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-20 18:01:45 +00:00
delangel 55b756f1cc First step in major cleanup/redo of VCF functionality. Specifically, now:
a) VCF track name can work again with 3.3 or 4.0 VCF's when specifying -B name,VCF,file. Code will read header and parse automatically the version. 
b) Old VCF codec is deprecated. Reader goes now direct from parsing VCF lines into producing VariantContext objects, with no intermediate VCF records. If anyone can't resist the urge to still input files using the old method, a new VCF3Codec is in place with the old code, but it will be eventually deleted.
c) VCF headers and VCF info fields no longer keep track of the version. They are parsed into an internal representation and will be output only in VCF4.0 format.
d) As a consequence, the existing GATK bug where files are produced with VCF4 body but VCF3.3 headers is solved.
e) Several VCF 4.0 writer bugs are now solved.
f) Integration test MD5's are changed, mostly because of corrected VCF4.0 headers and because validation data mostly uses now VCF4.0.
g) Several VCF files in the ValidationData/ directory have been converted to VCF 4.0 format. I kept the old versions, and the new versions have a .vcf4 extension.

Pending issues:
a) We are still not dealing with indels consistently or correctly when representing them. This will be a second part of the changes.
b) The VCF writer doesn't use VCFRecord but it does still use a lot of leftovers like VCFGenotypeEncoding, VCFGenotypeRecord, etc. This needs to be simplified and cleaned.



git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3813 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-16 22:49:16 +00:00
ebanks 6b5c88d4d6 The GATK no longer writes vcf3.3; welcome to the world of vcf4.0. Needed to fix a few output bugs to get this to work, but it's looking great. Much more still to come. Guillermo: hopefully this doesn't break your local build too badly.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3786 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-14 04:56:58 +00:00
rpoplin 255b036fb5 Variant Recalibrator MLE EM algorithm is moved over to variational Bayes EM in order to eliminate problems with singularities when clustering in higher than two dimensions. Because of this there is no longer a number of Gaussians parameter. Wiki will be updated shortly with new recommended command.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3704 348d0f76-0448-11de-a6fe-93d51630548a
2010-07-01 18:51:07 +00:00
rpoplin 724affc3cc Major bug fixes for the Variant Recalibrator. Covariance matrix values are now allowed to be negative. When probabilities are multiplied together the calculation is done in log space, normalized, then converted back to real valued probabilities. Clustering weights have been changed to only use HapMap and by-1000genomes sites. The -nI argument was removed and now clustering simply runs until convergence. Test cases seem to work best when using just two annotations (QD and SB). More changes are in the works and are being evaluated. Misc fixes to walkers that use RScript due to CentOS changes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3590 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 17:37:11 +00:00
aaron b978d5946b adding changes for VCF 4, mostly in the way we handle VCF headers. The header fields are now aware of the differences between different VCF formats. There was also a bunch of clean-up of out-of-spec VCF used in the tests (mismatched VCF file format fields, etc), and updates to the associated integration tests. Also some logging statements for BTI.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3584 348d0f76-0448-11de-a6fe-93d51630548a
2010-06-18 08:23:23 +00:00
rpoplin 4e268ef6ac Removing the Variant Recalibration Performance test because it isn't ready yet.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3460 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:27:25 +00:00
rpoplin 522dd7a5b2 Adding the variantrecalibration classes.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3459 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:21:27 +00:00
rpoplin 2014837f8a VariantOptimizer package is moved to core, renamed as VariantRecalibration, and added to the binary release package. VariantOptimizer walker is renamed to GenerateVariantClustersWalker and ApplyVariantClustersWalker renamed to VariantRecalibrator. Integration tests added, performance tests still to be done.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@3458 348d0f76-0448-11de-a6fe-93d51630548a
2010-05-28 18:20:18 +00:00