8971 Commits (fceb2bf25bf00f7a6acbd00ad38bdf93e6c3dd7c)

Author SHA1 Message Date
Mark DePristo fceb2bf25b Updating CalibrateGenotypeLikelihoods.R to display Q93 not filter them out 2012-03-09 16:00:07 -05:00
Mark DePristo 3ba2e5667c CalibrateGenotypeLikelihoods now includes pOfDGivenD 2012-03-09 16:00:07 -05:00
Mark DePristo 1011f3862b CalibrateGenotypeLikelihoods now emits the position of the variant for debugging
-- Refactored some duplicated code (FYI, code duplication = root of all evil) into shared functions
-- Added long-missing integration tests
-- CHRIS/RYAN -- it would be very good to add an integration test covering external VCF files as I believe we rely on this functionality and it's not tested at all
2012-03-09 16:00:07 -05:00
Mark DePristo 8158348e01 Prints xlim = 30 and xlim = 99 in CalibrateGenotypeLikelihoods.R 2012-03-09 16:00:07 -05:00
David Roazen 91d10431d3 BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.

Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
2012-03-09 15:20:16 -05:00
David Roazen bc65f6326f Detect incomplete reads from BAM schedule file in BAMSchedule before they become buffer underflows
This fix is similar to, but distinct from, the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.

Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the BAM schedule). It will
just make it easier for us to identify what's going wrong in the future.
2012-03-09 12:33:48 -05:00
David Roazen 32dee7ed9b Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.

Added a unit test for this case as well.
2012-03-08 15:30:44 -05:00
Guillermo del Angel c04853eae6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-08 12:30:04 -05:00
Guillermo del Angel 858acf8616 Hidden mode in ValidationAmplicons to support ILMN output format (same as Sequenom, with just shuffled columns) 2012-03-08 12:29:44 -05:00
Andrey Sivachenko 56f074b520 docs updated 2012-03-07 18:47:15 -05:00
Andrey Sivachenko 117ea605ac Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-07 18:35:07 -05:00
Andrey Sivachenko 497a1b059e transition to JEXL completed, old parameters setting individual cutoffs now deprecated 2012-03-07 18:34:11 -05:00
Andrey Sivachenko fbd2f04a04 JEXL support added; intermediate commit, not yet functional 2012-03-07 17:29:42 -05:00
Mark DePristo 20d10dfa35 EvalQuantizedQuals now tests the impact on reduced reads as well 2012-03-07 13:10:08 -05:00
Mark DePristo 0376d73ece Improved, public version of ErrorRateByCycle
-- A cleaner table output (molten).  For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
2012-03-07 13:10:08 -05:00
Christopher Hartl a6a8fc0521 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-07 10:05:43 -05:00
Eric Banks c4824a77f5 Some to-do items for the reduced reads calling script 2012-03-07 10:03:10 -05:00
Christopher Hartl 155839e901 Commit of VQSRV3 with Random Forest Bridge and Decision Tree engines. Lots of code duplication with the variant recalibrator in public, but also some subtle changes (i.e. to the engines and data manager). Code worked when it overwrote the stuff in public, but couldn't commit that. Will push if it works for private as well. 2012-03-07 09:46:43 -05:00
Mark DePristo 26dcec08d5 Bugfix for QualQuantizerUnitTest
-- Enabled failing provider
-- Fixed incorrect expectation in unit test
2012-03-07 09:30:03 -05:00
Mark DePristo 8ef654aa77 Minor improvements to QuantizeQuals
-- Commenting out excessive debugging in the walker
-- Scala script to quantize a BAM, run CalibrateGenotypeLikelihoods, call SNPs, and compare them to the full BAM call set for 1, 2, 4, 8, 16, 32, and 64 quantization levels
2012-03-06 16:56:59 -05:00
Mark DePristo 569be953b9 Bugfix for VariantEval
-- We weren't properly handling the case where a site had both a SNP and an indel in both eval and comp.  These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL), resulting, most conspicuously, in incorrect false negatives in the validation report.
-- Updating misc. integration tests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
2012-03-06 16:56:59 -05:00
Mark DePristo 5f35f5d338 QualQuantizer scales the penalty by the log of the two error rates
-- Old equation was |E1 - E*| * N1.  New equation is |log10(E1) - log10(E2)| * N1, which is equivalent to |log10(E1/E2)| * N1
2012-03-06 16:56:58 -05:00
Mark DePristo 8d2db3f249 Emit and visualize quality histogram in QualQuantizer 2012-03-06 16:56:58 -05:00
Mark DePristo b7089a3b05 Improvements to QualQuantizer; Walker to quantize quals in BAM file
-- QualQuantizer now tracks merge order and level in the QualInterval for debugging / visualization
-- Write out QualIntervals tree for visualization
-- visualizeQuantizedQuals.R, an R script for basic visualization of the quality score quantization
2012-03-06 16:56:58 -05:00
David Roazen 811f871f78 Do not fail tests that require the GATK private key if the user does not have permission to read it
Several of the unit tests for the new key authorization feature require
read access to the GATK master private key file. Since this file is only
readable by members of the group gsagit, this makes it hard for people
outside the group to run the test suite.

Now, we skip tests that require the master private key if the private
key exists (since not existing would be a true error) but is not readable
by the user running the test suite.

Bamboo, of course, will always be able to run these tests.
2012-03-06 15:57:02 -05:00
Christopher Hartl 67def6acc8 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-06 14:23:14 -05:00
Christopher Hartl 20c1fbaf0f Fixing a merge (turning off downsampling on DoC) 2012-03-06 14:22:45 -05:00
David Roazen 0702ee1587 Public-key authorization scheme to restrict use of NO_ET
-Running the GATK with the -et NO_ET or -et STDOUT options now
 requires a key issued by us. Our reasons for doing this, and the
 procedure for our users to request keys, are documented here:
 http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home

-A GATK user key is an email address plus a cryptographic signature
 signed using our private key, all wrapped in a GZIP container.
 User keys are validated using the public key we now distribute with
 the GATK. Our private key is kept in a secure location.

-Keys are cryptographically secure in that valid keys definitely
 came from us and keys cannot be fabricated; however, keys are not
 "copy-protected" in any way.

-Includes private, standalone utilities to create a new GATK user key
 (GenerateGATKUserKey) and to create a new master public/private key
 pair (GenerateKeyPair). Usage of these tools will be documented on
 the internal wiki shortly.

-Comprehensive unit/integration tests, including tests to ensure the
 continued integrity of the GATK master public/private key pair.

-Generation of new user keys and the new unit/integration tests both
 require access to the GATK private key, which can only be read by
 members of the group "gsagit".
2012-03-06 00:09:43 -05:00
Lechu 027843d791 I've simply added a "library(grid)" call at the beginning of the R script generation, since R 2.14.2 doesn't seem to load the "grid" package by default. I haven't tested it on previous R versions (you may edit the R version comment to be more precise if desired), but I'm almost certain that this library call shouldn't do any harm on them.
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2012-03-05 21:27:03 -05:00
Ryan Poplin f6905630bb Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode. 2012-03-05 21:08:07 -05:00
Ryan Poplin 9b53250bef Adding Unit test for Haplotype class. Used in HC's genotype given alleles mode. 2012-03-05 21:07:36 -05:00
Ryan Poplin b37461587d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-05 17:54:59 -05:00
Ryan Poplin c6ded4d23c Bug fix for hard clipping reads when base insertion and base deletion qualities are present in the read. Updating HaplotypeCaller integration tests to reflect all the recent changes. 2012-03-05 17:54:42 -05:00
Ryan Poplin 14a77b1e71 Getting rid of redundant methods in MathUtils. Adding unit tests for approximateLog10SumLog10 and normalizeFromLog10. Increasing the precision of the Jacobian approximation used by approximateLog10SumLog10, which changes the UG+HC integration tests ever so slightly. 2012-03-05 12:28:32 -05:00
Mauricio Carneiro e9ad382e74 unifying the BQSR argument collection 2012-03-05 10:48:26 -05:00
Mauricio Carneiro a1d6b3818c don't include deletions in the pileup 2012-03-05 10:48:26 -05:00
Mauricio Carneiro dfbffc95a3 getting rid of the old Indel BQSR 2012-03-05 10:48:26 -05:00
Ryan Poplin f879daa7d0 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-05 08:29:08 -05:00
Ryan Poplin d6871967ae Adding more unit tests and contracts to PairHMM util class. Updating HaplotypeCaller to use the new PairHMM util class. Now that the HMM result isn't dependent on the length of the haplotype, there is no reason to ensure all haplotypes have the same length, which simplifies the code considerably. 2012-03-05 08:28:42 -05:00
Guillermo del Angel 3b5a7c34d7 Added argument to ValidationAmplicons to only output valid sequences - useful for not having to post-filter or grep resulting files before delivering downstream 2012-03-04 10:24:29 -05:00
Mark DePristo 69611af7d3 Workaround for bug in Picard in ReadGroupProperties
-- NPE caused when you call getRunDate on a read group without a date.
2012-03-02 18:53:45 -05:00
Mark DePristo 914c23da51 Generic infrastructure for quantizing quality scores
-- Just infrastructure at this point (but with UnitTests!).
-- Capable of taking a histogram of quality scores and a target number of levels (8 for example), and mapping the full range of input quality scores down to only 8.
-- The selected quality scores are chosen to minimize the miscalibration rate of the resulting bins.  I believe this adaptive approach is vastly better than the current systems being developed by EBI and NCBI
-- This infrastructure is designed to work with BQSRv2.  I envision a system where we feed in the projected empirical quality score distribution from the BQSRv2 table, compute the required deleveling for each of the B, I, and D qualities, and on the fly emit calibrated, compressed quality scores.
-- Note the algorithm right now for determining the best intervals is both greedy (i.e., will miss the best overall choice) and potentially extremely slow.  But it is enough for me to play with.
2012-03-02 16:12:42 -05:00
Mark DePristo ba71b0aee4 ReadGroupProperties mk3
-- Includes sequencing date
2012-03-02 16:12:42 -05:00
Khalid Shakir fc1c0a9d8f Minor change: switched HSP default fasta from bundle/g1k to Picard since in all oneoff runs of the HSP the BAMs were aligned by Picard to Picard's reference. 2012-03-02 14:20:54 -05:00
Eric Banks 1e07e97b58 Optimization: create allele list just once, not for each genotype 2012-03-02 13:30:17 -05:00
Mark DePristo 0a7137616c Now converts gatkreports to properly typed R data types in gsa.read.gatkreport
-- use the general function type.convert from read.table to automagically convert the string data to booleans, factors, and numeric types as appropriate.  Vastly better than the previous behavior which only worked for numerics, in some cases.
2012-03-02 09:11:59 -05:00
Ryan Poplin 0ad7d5fbc1 Standalone common Pair HMM utility class with associated unit tests. 2012-03-01 22:41:13 -05:00
Mark DePristo 2f334a57c2 ReadGroupProperties mk2
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
2012-03-01 18:43:53 -05:00
Mauricio Carneiro 486712bfc2 ugly RG encoding 2012-03-01 17:56:45 -05:00
Mauricio Carneiro 4409293b5d get rid of the sorting parameter 2012-03-01 17:56:45 -05:00