GATKReport format changes:
- All non-data header lines are preceded by a single pound sign (#:)
- Every report now has a report header containing the version number and number of tables
- Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description.
- This new format will allow reports in the future to be gatherable.
- Changed the header format to include an end-of-line string ":;"
Added features:
- Simplified GATK Reports:
A simplified GATK Report is created through its own constructor. Simplified GATK Reports are designed for reports that do not need the advanced functionality of a full GATK Report.
A simple GATK Report consists of:
- A single table
- No primary key (it is hidden)
Optional:
- Only untyped columns. As long as the data is an Object, it will be accepted.
- Default column values being empty strings.
Limitations:
- A simple GATK report cannot contain multiple tables.
- It cannot contain typed columns, which prevents arithmetic gathering.
- Added a constructor to generate simplified GATK reports.
- Added a method to easily add data to simple GATK reports.
- Upgraded the input parser to take advantage of the new file format (v1).
- Added the GATKReportGatherer; more usability is coming in the next version of GATK Report. Currently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports. Gathering is very conservative and will only happen if the table columns, as well as everything else, match. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it would overwrite data.
- Made some GATKReport methods public, and added more setters and getters.
- Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
- The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.*)
- Added a GATKReportDataType enum to give a column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map.
- Added a data type field in GATKColumn. When a type is not specified, the unknown type is used. Unknown types should not be gathered.
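
The typed-column and conservative-gathering behavior described above can be sketched roughly as follows. This is an illustration only, not the GATKReport code: the GatherSketch, DataType and Table classes, the format codes, and the exception types are all hypothetical stand-ins chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GatherSketch {
    /** Hypothetical stand-in for GATKReportDataType; the format codes are invented. */
    enum DataType {
        UNKNOWN("Unknown"), INTEGER("%d"), DECIMAL("%f"), STRING("%s");

        // Reverse lookup map from a format code back to its enum constant.
        private static final Map<String, DataType> BY_CODE = new HashMap<>();
        static {
            for (DataType type : values()) BY_CODE.put(type.code, type);
        }

        final String code;
        DataType(String code) { this.code = code; }

        /** Unrecognized codes fall back to UNKNOWN, which is never gathered. */
        static DataType fromCode(String code) {
            return BY_CODE.getOrDefault(code, UNKNOWN);
        }
    }

    /** Hypothetical minimal table: ordered typed columns plus rows keyed by row id. */
    static class Table {
        final Map<String, DataType> columns = new LinkedHashMap<>();
        final Map<String, List<Object>> rows = new LinkedHashMap<>();

        void addRow(String id, Object... values) {
            rows.put(id, new ArrayList<>(Arrays.asList(values)));
        }
    }

    /** Formats match only if column names, types and order are all identical. */
    static boolean sameFormat(Table a, Table b) {
        return new ArrayList<>(a.columns.entrySet())
                .equals(new ArrayList<>(b.columns.entrySet()));
    }

    /** Adds the rows of source to target, refusing anything ambiguous. */
    static void gather(Table target, Table source) {
        if (!sameFormat(target, source))
            throw new IllegalArgumentException("Table formats differ; refusing to gather");
        if (target.columns.containsValue(DataType.UNKNOWN))
            throw new IllegalArgumentException("Columns of unknown type cannot be gathered");
        // Check every row id first so target is left untouched on failure.
        for (String id : source.rows.keySet())
            if (target.rows.containsKey(id))
                throw new IllegalStateException("Row " + id + " would be overwritten");
        for (Map.Entry<String, List<Object>> row : source.rows.entrySet())
            target.rows.put(row.getKey(), new ArrayList<>(row.getValue()));
    }

    public static void main(String[] args) {
        Table first = new Table();
        first.columns.put("reads", DataType.fromCode("%d"));
        first.addRow("rg1", 105);

        Table second = new Table();
        second.columns.put("reads", DataType.INTEGER);
        second.addRow("rg2", 42);

        gather(first, second);
        System.out.println(first.rows.keySet());
    }
}
```

Checking all row ids before copying any row is a design choice of this sketch, so that a rejected gather leaves the target table unchanged.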
Test changes:
- Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1.
- Updated the MD5 hashes in integration tests throughout the GATK.
Other changes:
- Added the gatherer functions to CoverageByRG
- Also added the scatterCount parameter in the Interval Coverage script
- Dropped support for reading in legacy GATKReport formats (v0.*)
- Updated VariantEvalWalker to work with GATK Report v1, and added a format String to all applicable DataPoints.
- Rewrote the read file method for GATK report files.
- Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
Now looks like:
<GATK-run-report>
<id>D7D31ULwTSxlAwnEOSmW6Z4PawXwMxEz</id>
<start-time>2012/03/10 20.21.19</start-time>
<end-time>2012/03/10 20.21.19</end-time>
<run-time>0</run-time>
<walker-name>CountReads</walker-name>
<svn-version>1.4-483-g63ecdb2</svn-version>
<total-memory>85000192</total-memory>
<max-memory>129957888</max-memory>
<user-name>depristo</user-name>
<host-name>10.0.1.10</host-name>
<java>Apple Inc.-1.6.0_26</java>
<machine>Mac OS X-x86_64</machine>
<iterations>105</iterations>
</GATK-run-report>
No longer capturing command line or directory information, to minimize people's concerns about phone-home and privacy.
- Uses a modified Yates correction of (e + 1) / (n + 2) to estimate error rates
- Now shows ALL and per read group information
- Better limits on diff plots so we can see more information
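
As a rough illustration of the corrected error-rate estimate mentioned above, assuming e is the number of observed errors and n the number of observations (the class and method names here are hypothetical, not taken from the GATK code):

```java
class ErrorRateSketch {
    /** Smoothed error-rate estimate (e + 1) / (n + 2). */
    static double smoothedErrorRate(long errors, long observations) {
        if (errors < 0 || observations < errors)
            throw new IllegalArgumentException("Need 0 <= errors <= observations");
        // The +1 and +2 keep the estimate away from exactly 0 or 1,
        // even when there are no observations at all.
        return (errors + 1.0) / (observations + 2.0);
    }

    public static void main(String[] args) {
        System.out.println(smoothedErrorRate(0, 0));   // prints 0.5
        System.out.println(smoothedErrorRate(0, 98));  // prints 0.01
    }
}
```

With no data the estimate is 0.5, and it moves toward the raw ratio e / n as the number of observations grows.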
-- Refactored some duplicated code (FYI, code duplication = root of all evil) into shared functions
-- Added long-missing integration tests
-- CHRIS/RYAN -- it would be very good to add an integration test covering external VCF files as I believe we rely on this functionality and it's not tested at all
This is a quick-and-dirty patch for the null pointer error Mauricio reported earlier.
Later on we might want to address in a more general way the fact that we validate user intervals
against the reference but not against the merged BAM header produced by the engine at runtime.
This fix is similar to, but distinct from, the earlier fix to GATKBAMIndex. If we fail to read in
a complete 3-integer bin header from the BAM schedule file that the engine has written, throw a
ReviewedStingException (since this is our problem, not the user's) rather than allowing a
cryptic buffer underflow error to occur.
Note that this change does not fix the underlying problem in the engine, if there is one
(there may be an as-yet-undetected bug in the code that writes the bam schedule). It will
just make it easier for us to identify what's going wrong in the future.
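
A minimal sketch of the kind of guard described above, assuming the bin header is read from a java.nio.ByteBuffer. The class and method names are hypothetical, IllegalStateException stands in for ReviewedStingException, and the byte order is left to whatever the caller has configured on the buffer:

```java
import java.nio.ByteBuffer;

class BinHeaderSketch {
    /** A bin header is three 4-byte integers. */
    static final int HEADER_BYTES = 3 * Integer.BYTES;

    /** Reads a 3-integer header, failing with a descriptive error if it is incomplete. */
    static int[] readBinHeader(ByteBuffer buffer) {
        // Check up front instead of letting getInt() throw BufferUnderflowException.
        if (buffer.remaining() < HEADER_BYTES)
            throw new IllegalStateException("Truncated bin header: expected "
                    + HEADER_BYTES + " bytes but only " + buffer.remaining() + " remain");
        return new int[] { buffer.getInt(), buffer.getInt(), buffer.getInt() };
    }

    public static void main(String[] args) {
        ByteBuffer complete = ByteBuffer.allocate(HEADER_BYTES);
        complete.putInt(1).putInt(2).putInt(3);
        complete.flip();
        int[] header = readBinHeader(complete);
        System.out.println(header[0] + " " + header[1] + " " + header[2]);
    }
}
```

The point of the explicit check is only to replace a generic underflow with a message that names what was being read and how much was missing.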
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.
Added a unit test for this case as well.