
8982 Commits (41068b698513ffafa38df40bb9b07ced2fcbd75b)

Author SHA1 Message Date
Eric Banks 41068b6985 This commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try to use this functionality from now on instead of duplicating the code. 2012-03-15 16:08:58 -04:00
Eric Banks f7c2c818fe Exact model memory optimization: instead of having a later matrix column pull in data from earlier ones (requiring us to keep them around until all dependencies are hit), the earlier columns push data into their dependents immediately and then are removed. This does trade off speed a little bit (because we need to call approximateLog10Sum each time we add to a dependent instead of once in an array at the end). Note that this commit would normally not get pushed into stable, but I'm about to make a very disruptive push into stable that would make merging this from unstable a nightmare. 2012-03-14 14:02:36 -04:00
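The push-versus-pull trade-off described in this commit can be sketched as follows. This is a minimal illustration, not the GATK code: the class and method names are hypothetical, and `log10SumLog10` is an exact stand-in for the approximate routine the message mentions.

```java
final class Log10PushSketch {
    /** Stable log10(10^a + 10^b); an exact stand-in for a faster approximate routine. */
    static double log10SumLog10(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) return b;
        if (b == Double.NEGATIVE_INFINITY) return a;
        double max = Math.max(a, b);
        double min = Math.min(a, b);
        return max + Math.log10(1.0 + Math.pow(10.0, min - max));
    }

    /** Pull style: every contribution must stay in memory until one sum at the end. */
    static double pull(double[] log10Contributions) {
        double max = Double.NEGATIVE_INFINITY;
        for (double c : log10Contributions) max = Math.max(max, c);
        if (max == Double.NEGATIVE_INFINITY) return max;
        double sum = 0.0;
        for (double c : log10Contributions) sum += Math.pow(10.0, c - max);
        return max + Math.log10(sum);
    }

    /** Push style: fold each contribution into the dependent at once (one log-sum per add). */
    static double push(double[] log10Contributions) {
        double dependent = Double.NEGATIVE_INFINITY;
        for (double c : log10Contributions) {
            dependent = log10SumLog10(dependent, c); // the source value can be discarded after this
        }
        return dependent;
    }
}
```

Both methods return the same value up to rounding; the push form only ever holds the running total, at the cost of one log-sum call per contribution.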
Mark DePristo bb2c10b785 Capture the class of the exception in GATKRunReport
-- As suggested by David.
2012-03-14 12:16:22 -04:00
Eric Banks 5200f7f919 When creating a synthetic VC based on the passed in alleles, set the reference base for indel. 2012-03-13 10:59:58 -04:00
Eric Banks 1675bd4dd7 When creating a synthetic VC based on the passed in alleles, set the length correctly. 2012-03-13 10:55:52 -04:00
Eric Banks 04cafffaa7 Merge remote-tracking branch 'unstable/master' 2012-03-12 08:43:43 -04:00
Eric Banks b4749757f8 Fixes for SLOD: 1) didn't work properly for multi-allelics (randomly chose an allele, possibly one that wasn't genotyped in the full context); 2) in cases when there were more alt alleles than the max allowed and the user is calculating SB, we would recompute the best alt allele(s); 3) for some reason, we were recomputing the LOD for the full context when we'd already done that. Given that this passes integration tests on my end, this should be the last commit before the release. 2012-03-12 01:07:07 -04:00
Mark DePristo 1ee46e5c06 Collect only the bare essentials in the GATKRunReport
Now looks like:
<GATK-run-report>
   <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
   <start-time>2012/03/10 20.21.19</start-time>
   <end-time>2012/03/10 20.21.19</end-time>
   <run-time>0</run-time>
   <walker-name>CountReads</walker-name>
   <svn-version>1.4-483-g63ecdb2</svn-version>
   <total-memory>85000192</total-memory>
   <max-memory>129957888</max-memory>
   <user-name>depristo</user-name>
   <host-name>10.0.1.10</host-name>
   <java>Apple Inc.-1.6.0_26</java>
   <machine>Mac OS X-x86_64</machine>
   <iterations>105</iterations>
</GATK-run-report>

No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy
2012-03-10 20:27:14 -05:00
Mark DePristo bd883031a4 Final version of QualQuantizer
-- Docs everywhere
-- Contracts everywhere
-- More unit tests
-- Better error checking
-- Marginally nicer interface to QuantizeQualsWalker
2012-03-09 16:00:08 -05:00
Mark DePristo 4b404cae48 Final evaluation script for quantizing quality scores 2012-03-09 16:00:08 -05:00
Mark DePristo e2c62572f9 Further upgrades to CalibrateGenotypeLikelihoods.R
- Uses modified Yates correction of (e + 1) / (n + 2) to estimate error rates
- Now shows ALL and per read group information
- Better limits on diff plots so we can see more information
2012-03-09 16:00:08 -05:00
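As a small worked example of the error-rate correction in the commit above, reading the formula as (e + 1) / (n + 2) with e observed errors out of n observations. The class and method names are hypothetical, and the Phred conversion is the standard Q = -10 * log10(rate), not something taken from the R script.

```java
final class ErrorRateSketch {
    /** Smoothed error rate (e + 1) / (n + 2); stays finite and non-zero even when e == 0 or n == 0. */
    static double smoothedErrorRate(long errors, long observations) {
        if (errors < 0 || observations < errors) {
            throw new IllegalArgumentException("need 0 <= errors <= observations");
        }
        return (errors + 1.0) / (observations + 2.0);
    }

    /** Standard Phred scaling of an error rate. */
    static double phred(double errorRate) {
        return -10.0 * Math.log10(errorRate);
    }
}
```

With no errors in 98 observations the smoothed rate is 1/100, so the empirical quality is capped near Q20 instead of being infinite.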
Mark DePristo fceb2bf25b Updating CalibrateGenotypeLikelihoods.R to display Q93 values rather than filter them out 2012-03-09 16:00:07 -05:00
Mark DePristo 3ba2e5667c CalibrateGenotypeLikelihoods includes pOfDGivenD now 2012-03-09 16:00:07 -05:00
Mark DePristo 1011f3862b CalibrateGenotypeLikelihoods now emits the position of the variant for debugging
-- Refactored some duplicated code (FYI, code duplication = root of all evil) into shared functions
-- Added long-missing integration tests
-- CHRIS/RYAN -- it would be very good to add an integration test covering external VCF files as I believe we rely on this functionality and it's not tested at all
2012-03-09 16:00:07 -05:00
Mark DePristo 8158348e01 Prints xlim = 30 and xlim = 99 in CalibrateGenotypeLikelihoods.R 2012-03-09 16:00:07 -05:00
David Roazen 91d10431d3 BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.

Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
2012-03-09 15:20:16 -05:00
David Roazen bc65f6326f Detect incomplete reads from BAM schedule file in BAMSchedule before they become buffer underflows
This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.

Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
2012-03-09 12:33:48 -05:00
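The bin-header check described in the commit above might look roughly like the sketch below. It is only an illustration: the class name is hypothetical, `IllegalStateException` stands in for ReviewedStingException, and little-endian byte order is an assumption about the schedule file, not something the commit states.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BinHeaderSketch {
    static final int HEADER_INTS = 3; // the commit describes a 3-integer bin header

    /** Reads one 3-integer header, failing loudly if the stream ends early. */
    static int[] readBinHeader(InputStream in) {
        byte[] raw = new byte[HEADER_INTS * Integer.BYTES];
        int filled = 0;
        try {
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (filled < raw.length) {
                int n = in.read(raw, filled, raw.length - filled);
                if (n < 0) break; // premature end of file
                filled += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (filled < raw.length) {
            // Report the truncation explicitly instead of letting a BufferUnderflowException escape.
            throw new IllegalStateException(
                "Truncated bin header: expected " + raw.length + " bytes, got " + filled);
        }
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN); // assumed byte order
        return new int[] { buf.getInt(), buf.getInt(), buf.getInt() };
    }
}
```

Counting the bytes before wrapping them in a buffer is what turns the cryptic underflow into a message that names the expected and actual sizes.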
David Roazen 32dee7ed9b Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.

Added a unit test for this case as well.
2012-03-08 15:30:44 -05:00
Guillermo del Angel c04853eae6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-08 12:30:04 -05:00
Guillermo del Angel 858acf8616 Hidden mode in ValidationAmplicons to support ILMN output format (same as Sequenom, with just shuffled columns) 2012-03-08 12:29:44 -05:00
Andrey Sivachenko 56f074b520 docs updated 2012-03-07 18:47:15 -05:00
Andrey Sivachenko 117ea605ac Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-07 18:35:07 -05:00
Andrey Sivachenko 497a1b059e transition to JEXL completed, old parameters setting individual cutoffs now deprecated 2012-03-07 18:34:11 -05:00
Andrey Sivachenko fbd2f04a04 JEXL support added; intermediate commit, not yet functional 2012-03-07 17:29:42 -05:00
Mark DePristo 20d10dfa35 EvalQuantizedQuals now tests the impact on reduced reads as well 2012-03-07 13:10:08 -05:00
Mark DePristo 0376d73ece Improved, public version of ErrorRateByCycle
-- A cleaner table output (molten).  For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
2012-03-07 13:10:08 -05:00
Christopher Hartl a6a8fc0521 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-07 10:05:43 -05:00
Eric Banks c4824a77f5 Some to-do items for the reduced reads calling script 2012-03-07 10:03:10 -05:00
Christopher Hartl 155839e901 Commit of VQSRV3 with Random Forest Bridge and Decision Tree engines. Lots of code duplication with the variant recalibrator in public, but also some subtle changes (i.e. to the engines and data manager). Code worked when it overwrote the stuff in public, but couldn't commit that. Will push if it works for private as well. 2012-03-07 09:46:43 -05:00
Mark DePristo 26dcec08d5 Bugfix for QualQuantizerUnitTest
-- Enabled failing provider
-- Fixed incorrect expectation in unit test
2012-03-07 09:30:03 -05:00
Mark DePristo 8ef654aa77 Minor improvements to QuantizeQuals
-- Commenting out excessive debugging in the walker
-- Scala script to quantize BAM, run calibrate genotype likelihoods, call snps, and compare them to the full bam call set for 1, 2, 4, 8, 16, 32, and 64 quantization levels
2012-03-06 16:56:59 -05:00
Mark DePristo 569be953b9 Bugfix for VariantEval
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp.  These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL), resulting most conspicuously in incorrect false negatives in the validation report.
-- Updating misc. integration tests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
2012-03-06 16:56:59 -05:00
Mark DePristo 5f35f5d338 QualQuantizer scales the penalty by the log of the two error rates
-- Old equation was |E1 - E*| * N1.  New equation is |log10(E1) - log10(E2)| * N1 which is equivalent to |log10(E1 / E2)| * N1
2012-03-06 16:56:58 -05:00
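A minimal sketch of the new penalty from the commit above, |log10(E1) - log10(E2)| * N1. The commit does not define its symbols, so this reads E1 and E2 as two error rates and N1 as an observation count; the class and method names are hypothetical.

```java
final class QuantizationPenaltySketch {
    /** Penalty that grows with the log-scale gap between two error rates, weighted by a count. */
    static double penalty(double e1, double e2, long n1) {
        if (e1 <= 0.0 || e2 <= 0.0) {
            throw new IllegalArgumentException("error rates must be positive");
        }
        return Math.abs(Math.log10(e1) - Math.log10(e2)) * n1;
    }
}
```

On the log scale, moving 100 observations from an error rate of 0.01 to 0.001 costs one order of magnitude per observation, about 100 in total, whereas the old linear form |E1 - E2| * N1 would give about 0.9 for the same change.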
Mark DePristo 8d2db3f249 Emit and visualize quality histogram in QualQuantizer 2012-03-06 16:56:58 -05:00
Mark DePristo b7089a3b05 Improvements to QualQuantizer; Walker to quantize quals in BAM file
-- QualQuantizer now tracks merge order and level in the QualInterval for debugging / visualization
-- Write out QualIntervals tree for visualization
-- visualizeQuantizedQuals.R R script for basic visualization of the quality score quantization
2012-03-06 16:56:58 -05:00
David Roazen 811f871f78 Do not fail tests that require the GATK private key if the user does not have permission to read it
Several of the unit tests for the new key authorization feature require
read access to the GATK master private key file. Since this file is only
readable by members of the group gsagit, this makes it hard for people
outside the group to run the test suite.

Now, we skip tests that require the master private key if the private
key exists (since not existing would be a true error) but is not readable
by the user running the test suite.

Bamboo, of course, will always be able to run these tests.
2012-03-06 15:57:02 -05:00
Christopher Hartl 67def6acc8 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-06 14:23:14 -05:00
Christopher Hartl 20c1fbaf0f Fixing a merge (turning off downsampling on DoC) 2012-03-06 14:22:45 -05:00
David Roazen 0702ee1587 Public-key authorization scheme to restrict use of NO_ET
-Running the GATK with the -et NO_ET or -et STDOUT options now
 requires a key issued by us. Our reasons for doing this, and the
 procedure for our users to request keys, are documented here:
 http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home

-A GATK user key is an email address plus a cryptographic signature
 signed using our private key, all wrapped in a GZIP container.
 User keys are validated using the public key we now distribute with
 the GATK. Our private key is kept in a secure location.

-Keys are cryptographically secure in that valid keys definitely
 came from us and keys cannot be fabricated; however, keys are not
 "copy-protected" in any way.

-Includes private, standalone utilities to create a new GATK user key
 (GenerateGATKUserKey) and to create a new master public/private key
 pair (GenerateKeyPair). Usage of these tools will be documented on
 the internal wiki shortly.

-Comprehensive unit/integration tests, including tests to ensure the
 continued integrity of the GATK master public/private key pair.

-Generation of new user keys and the new unit/integration tests both
 require access to the GATK private key, which can only be read by
 members of the group "gsagit".
2012-03-06 00:09:43 -05:00
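The key format described in the commit above (an email address plus a signature made with a private key, wrapped in GZIP and checked against a distributed public key) can be sketched with the standard Java security API. Everything specific here is an assumption for illustration: the class name, the SHA256withRSA algorithm, the 2048-bit key size, and the container layout are not taken from the GATK source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class UserKeySketch {
    static final String ALGORITHM = "SHA256withRSA"; // assumed; the commit does not name one

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Issues a user key: GZIP(email, signature length, signature). Layout is hypothetical. */
    static byte[] issueKey(String email, PrivateKey priv) {
        try {
            Signature signer = Signature.getInstance(ALGORITHM);
            signer.initSign(priv);
            signer.update(email.getBytes(StandardCharsets.UTF_8));
            byte[] sig = signer.sign();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
                out.writeUTF(email);
                out.writeInt(sig.length);
                out.write(sig);
            }
            return bytes.toByteArray();
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Validates a user key with the public key only; malformed input is simply invalid. */
    static boolean isValid(byte[] key, PublicKey pub) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(key)))) {
            String email = in.readUTF();
            int len = in.readInt();
            if (len <= 0 || len > 4096) return false; // reject implausible signature sizes
            byte[] sig = new byte[len];
            in.readFully(sig);
            Signature verifier = Signature.getInstance(ALGORITHM);
            verifier.initVerify(pub);
            verifier.update(email.getBytes(StandardCharsets.UTF_8));
            return verifier.verify(sig);
        } catch (GeneralSecurityException | IOException e) {
            return false;
        }
    }
}
```

A key issued with one private key validates only against the matching public key, which is the "cannot be fabricated" property; a byte-for-byte copy of a valid key also validates, which is why such keys are not copy-protected.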
Lechu 027843d791 I've simply added a "library(grid)" call at the beginning of the R script generation since R 2.14.2 doesn't seem to load the "grid" package by default. I haven't tested it on previous R versions (you may edit the R version comment to be more precise if desired), but I'm almost certain that this library call shouldn't do any harm on them.
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2012-03-05 21:27:03 -05:00
Ryan Poplin f6905630bb Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode. 2012-03-05 21:08:07 -05:00
Ryan Poplin 9b53250bef Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode. 2012-03-05 21:07:36 -05:00
Ryan Poplin b37461587d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-05 17:54:59 -05:00
Ryan Poplin c6ded4d23c Bug fix for hard clipping reads when base insertion and base deletion qualities are present in the read. Updating HaplotypeCaller integration tests to reflect all the recent changes. 2012-03-05 17:54:42 -05:00
Ryan Poplin 14a77b1e71 Getting rid of redundant methods in MathUtils. Adding unit tests for approximateLog10SumLog10 and normalizeFromLog10. Increasing the precision of the Jacobian approximation used by approximateLog10SumLog10, which changes the UG+HC integration tests ever so slightly. 2012-03-05 12:28:32 -05:00
Mauricio Carneiro e9ad382e74 unifying the BQSR argument collection 2012-03-05 10:48:26 -05:00
Mauricio Carneiro a1d6b3818c don't include deletions in the pileup 2012-03-05 10:48:26 -05:00
Mauricio Carneiro dfbffc95a3 getting rid of the old Indel BQSR 2012-03-05 10:48:26 -05:00
Ryan Poplin f879daa7d0 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-05 08:29:08 -05:00
Ryan Poplin d6871967ae Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype, there is no reason to ensure all haplotypes have the same length, which simplifies the code considerably. 2012-03-05 08:28:42 -05:00