Commit Graph

1714 Commits (e86ce8f3d66f7cffc9e349e2b207d77ccab9f9c3)

Author SHA1 Message Date
Ryan Poplin 0c6b34e9df Fixing a bug identified by the ActivityProfile unit tests 2012-03-15 14:24:30 -04:00
Ryan Poplin 252b830aa8 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-15 11:56:04 -04:00
Ryan Poplin 0fa5a7af05 Adding contracts and unit tests for HaplotypeCaller GenotypingEngine 2012-03-15 11:55:48 -04:00
Ryan Poplin 1429ddcf55 Adding contracts and unit tests for HaplotypeCaller LikelihoodCalculationEngine 2012-03-14 21:25:43 -04:00
Mark DePristo 7c5cdb51c2 UnitTests for ActivityProfile and minor ART cleanup
-- TODO for ryan -- there are bugs in ActivityProfile code that I cannot fix right now :-(
-- UnitTesting framework for ActivityProfile -- needs to be expanded
-- Minor helper functions for ActiveRegion to help with unit tests
2012-03-14 17:26:37 -04:00
Mark DePristo e440c9be98 Clean up logic for adding reads to ART cache
-- No longer has duplicate code
2012-03-14 17:26:37 -04:00
Mark DePristo 5bcb5c7433 Preliminary refactoring of ART
-- Refactored ART into clearer, simpler procedures.  Attempted to merge shared code into utility classes.
-- Added some docs
-- Created a new, testable ActivityProfile that represents as a class the probability of a base being active or inactive
-- Separated band-pass filtering from creation of active regions.  Now you can band pass filter a profile to make another profile, and then that is explicitly converted to active regions
-- Misc. utility functions in ActiveRegionWalker such as hasPresetActiveRegions()
-- Many TODOs in ActivityProfile.
2012-03-14 17:26:37 -04:00
Mark DePristo e73406b9b5 CountReadsInActiveRegions now emits a detailed GATK report
-- This report details which intervals are coming in and how many reads they contain
-- Added integration test to verify that the intervals aren't changing, before heading into the ART refactor
2012-03-14 17:26:37 -04:00
Ryan Poplin 1da8928407 HC GenotypingEngine marginalizes over haplotypes when outputing events that were found on a subset of the called haplotypes. 2012-03-14 15:22:21 -04:00
Guillermo del Angel eca055ccad Add option in ValidationAmplicons to only output SNPs and INDELs, ignoring complex variants (or SVs, etc.) 2012-03-14 14:26:40 -04:00
Mark DePristo 6a40ca6bec Merged bug fix from Stable into Unstable 2012-03-14 12:19:33 -04:00
Mark DePristo bb2c10b785 Capture the class of the exception in GATKRunReport
-- As suggested by David.
2012-03-14 12:16:22 -04:00
Ryan Poplin 78a4e7e45e Major restructuring of HaplotypeCaller's LikelihoodCalculationEngine and GenotypingEngine. We no longer create an ugly event dictionary and genotype events found on haplotypes independently by finding the haplotype with the max likelihood. Lots of code has been rewritten to be much cleaner. 2012-03-14 12:05:05 -04:00
Eric Banks 77243d0df1 Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me 2012-03-13 16:31:51 -04:00
Eric Banks f76da1efd2 Updating md5s because MultiallelicSummary is now standard 2012-03-13 16:31:13 -04:00
Eric Banks 568a1362f5 Splitting up the MultiallelicSummary module into the standard part for use by all and the dev piece used just by me 2012-03-13 16:19:15 -04:00
Eric Banks 6e18ecfc9a Adding integration test to cover errors from my previous commit (GENOTYPE_GIVEN_ALLELE bugs reported by Sara Pulit and Chris Hartl) 2012-03-13 12:43:40 -04:00
Eric Banks 5d7c761784 Merged bug fix from Stable into Unstable 2012-03-13 11:01:03 -04:00
Eric Banks 5200f7f919 When creating a synthetic VC based on the passed in alleles, set the reference base for indel. 2012-03-13 10:59:58 -04:00
Eric Banks 1675bd4dd7 When creating a synthetic VC based on the passed in alleles, set the length correctly. 2012-03-13 10:55:52 -04:00
David Roazen 5d6a686474 Restoring key-related unit/integration tests
The recent GATKReport commit accidentally clobbered a few tests -- this
restores them.
2012-03-13 00:58:24 -04:00
Roger Zurawicki 7887a06703 GATKReport v1.0
GATKReport format changes:

 - All non-data header lines are preceeded with a single pound ( #:)
 - Every report now has a report header containing the version number and number of tables
 - Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description.
 - This new format will allow reports in the future to be gatherable.
 - Changed the header format to include an end-of-line string ":;"

Added features:

 - Simplified GATK Reports:

	The constructor for a simplified GATK Report. Simplified GATK report are designed for reports that do not need the advanced functionality of a full GATK Report.

	A simple GATK Report consists of:
		- A single table
		- No primary key ( it is hidden )
	    Optional:
		- Only untyped columns. As long as the data is an Object, it will be accepted.
		- Default column values being empty strings.
	Limitations:
		- A simple GATK report cannot contain multiple tables.
		- It cannot contain typed columns, which prevents arithmetic gathering.

       - Added a constructor to generate simplified GATK reports.
       - Added a method to easily add data to simple GATK reports.

 - Upgraded the input parser take advantage of the new file format (v1).
 - Added the GATKReportGatherer, more usability cmoing in next versionof GATK Report. Curently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports, It is very conservative and will only gather if the table columns, as well as everything else matches. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
 - Made some GATKReport methods public, and added more setters and getters.
 - Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
 - The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.*)
 - Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map.
 - Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered.

Test changes:

 - Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1.
 - Updated the MD5 hashes in integration tests throughout the GATK.

Other changes:

 - Added the gatherer functions to CoverageByRG
 - Also added the scatterCount parameter in the Interval Coverage script
 - Dropped support for reading in legacy GATKReport formats ( v0.*)
 - Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints.
 - Rewrote the read file method for GATK report files.
 - Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods.

Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-12 23:09:19 -04:00
Eric Banks 10995d349e Fix old error message 2012-03-12 22:56:08 -04:00
Eric Banks 2314787767 Generalizing to avoid JDK 1.7 incompatibilities 2012-03-12 22:50:59 -04:00
Ryan Poplin 03223029e3 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-12 09:42:37 -04:00
Eric Banks b4749757f8 Fixes for SLOD: 1) didn't work properly for multi-allelics (randomly chose an allele, possibly one that wasn't genotyped in the full context); 2) in cases when there were more alt alleles than the max allowed and the user is calculating SB, we would recompute the best alt alleles(s); 3) for some reason, we were recomputing the LOD for the full context when we'd already done that. Given that this passes integration tests on my end, this should be the last commit before the release. 2012-03-12 01:07:07 -04:00
Ryan Poplin 2836c161ee Moving trimToVariableRegion out of reduced reads and into a public static ReadClipper function. HaplotypeCaller clips reads to the active region boundries before passing to the HMM. The philosophy of the HC is moving towards genotyping the entire haplotype sequence contained within the active region as a single allele. 2012-03-11 14:45:59 -04:00
Ryan Poplin 8db11eb781 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-10 21:00:55 -05:00
Mark DePristo 1ee46e5c06 Collect only the bare essentials in the GATKRunReport
Now looks like:
<GATK-run-report>
   <id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
   <start-time>2012/03/10 20.21.19</start-time>
   <end-time>2012/03/10 20.21.19</end-time>
   <run-time>0</run-time>
   <walker-name>CountReads</walker-name>
   <svn-version>1.4-483-g63ecdb2</svn-version>
   <total-memory>85000192</total-memory>
   <max-memory>129957888</max-memory>
   <user-name>depristo</user-name>
   <host-name>10.0.1.10</host-name>
   <java>Apple Inc.-1.6.0_26</java>
   <machine>Mac OS X-x86_64</machine>
   <iterations>105</iterations>
</GATK-run-report>

No longer capturing command line or directory information, to minimize people's concerns with phone home and privacy
2012-03-10 20:27:14 -05:00
Ryan Poplin 92bbb9bbdd Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-10 10:09:57 -05:00
Mark DePristo 3ba2e5667c CalibrateGenotypesLikelihoods include pOfDGivenD now 2012-03-09 16:00:07 -05:00
Mark DePristo 1011f3862b CalibrateGenotypeLikelihoods now emits the position of the variant for debugging
-- Refactored some duplicated code (FYI, code duplication = root of all evil) into shared functions
-- Added long-missing integrationtests
-- CHRIS/RYAN -- it would be very good to add an integration test covering external VCF files as I believe we rely on this functionality and it's not tested at all
2012-03-09 16:00:07 -05:00
David Roazen 91d10431d3 BAMScheduler: detect contigs from the interval list that are not in the merged BAM header's sequence dictionary
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.

Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
2012-03-09 15:20:16 -05:00
David Roazen bc65f6326f Detect incomplete reads from BAM schedule file in BAMSchedule before they become buffer underflows
This fix is similar, but distinct from the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.

Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
2012-03-09 12:33:48 -05:00
David Roazen 32dee7ed9b Avoid buffer underflow in GATKBAMIndex by detecting premature EOF in BAM indices
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.

Added a unit test for this case as well.
2012-03-08 15:30:44 -05:00
Guillermo del Angel c04853eae6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-08 12:30:04 -05:00
Guillermo del Angel 858acf8616 Hidden mode in ValidationAmplicons to support ILMN output format (same as Sequenom, with just shuffled columns) 2012-03-08 12:29:44 -05:00
Andrey Sivachenko 56f074b520 docs updated 2012-03-07 18:47:15 -05:00
Andrey Sivachenko 117ea605ac Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-03-07 18:35:07 -05:00
Andrey Sivachenko 497a1b059e transition to JEXL completed, old parameters setting individual cutoffs now deprecated 2012-03-07 18:34:11 -05:00
Andrey Sivachenko fbd2f04a04 JEXL support added; intermediate commit, not yet functional 2012-03-07 17:29:42 -05:00
Mark DePristo 0376d73ece Improved, public version of ErrorRateByCycle
-- A cleaner table output (molten).  For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
2012-03-07 13:10:08 -05:00
Christopher Hartl a6a8fc0521 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-07 10:05:43 -05:00
Mark DePristo 569be953b9 Bugfix for VariantEval
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp.  These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicously as incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
2012-03-06 16:56:59 -05:00
David Roazen 811f871f78 Do not fail tests that require the GATK private key if the user does not have permission to read it
Several of the unit tests for the new key authorization feature require
read access to the GATK master private key file. Since this file is only
readable by members of the group gsagit, this makes it hard for people
outside the group to run the test suite.

Now, we skip tests that require the master private key if the private
key exists (since not existing would be a true error) but is not readable
by the user running the test suite

Bamboo, of course, will always be able to run these tests.
2012-03-06 15:57:02 -05:00
Christopher Hartl 67def6acc8 Merge branch 'master' of ssh://ni.broadinstitute.org/humgen/gsa-scr1/chartl/dev/unstable 2012-03-06 14:23:14 -05:00
Christopher Hartl 20c1fbaf0f Fixing a merge (turning off downsampling on DoC) 2012-03-06 14:22:45 -05:00
Ryan Poplin 46b470cc69 Minor misc updates 2012-03-06 10:14:45 -05:00
David Roazen 0702ee1587 Public-key authorization scheme to restrict use of NO_ET
-Running the GATK with the -et NO_ET or -et STDOUT options now
 requires a key issued by us. Our reasons for doing this, and the
 procedure for our users to request keys, are documented here:
 http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home

-A GATK user key is an email address plus a cryptographic signature
 signed using our private key, all wrapped in a GZIP container.
 User keys are validated using the public key we now distribute with
 the GATK. Our private key is kept in a secure location.

-Keys are cryptographically secure in that valid keys definitely
 came from us and keys cannot be fabricated, however keys are not
 "copy-protected" in any way.

-Includes private, standalone utilities to create a new GATK user key
 (GenerateGATKUserKey) and to create a new master public/private key
 pair (GenerateKeyPair). Usage of these tools will be documented on
 the internal wiki shortly.

-Comprehensive unit/integration tests, including tests to ensure the
 continued integrity of the GATK master public/private key pair.

-Generation of new user keys and the new unit/integration tests both
 require access to the GATK private key, which can only be read by
 members of the group "gsagit".
2012-03-06 00:09:43 -05:00
Lechu 027843d791 I've simply added a "library(grid)" call at the beginning of the R script generation since R 2.14.2 doesn't seem to load the "grid" package as default. I haven't tested it on previous R versions (you may edit the R version comment to be more precise if desired), but I'm almost certain that this library call shouldn't do any harm on them.
Signed-off-by: Ryan Poplin <rpoplin@broadinstitute.org>
2012-03-05 21:27:03 -05:00