Infrastructure:
* Added static interface to all different clipping algorithms of low quality tail clipping
* Added reverse direction pileup element event lookup (indels) to the PileupElement and LocusIteratorByState
* Complete refactor of the KeyManager. Much cleaner implementation that handles keys with no optional covariates (necessary for on-the-fly recalibration)
* EventType is now an independent enum with added capabilities. All functionality is now centralized.
BQSR and RecalibrateBases:
* On-the-fly recalibration is now generic and uses the same bit set structure as BQSR for a reduced memory footprint
* Refactored the object creation to take advantage of the compact key structure
* Replaced nested hash maps with single hash maps indexed by bitsets
* Eliminated low quality tails from the context covariate (using ReadClipper's write N's algorithm).
* Excluded contexts with N's from the output file.
* Fixed cycle covariate for discrete platforms (need to check flow cycle platforms now!)
* Redfined error for indels to look at the previous base in negative strand reads (using new PE functionality)
* Added the covariate ID (for optional covariates) to the output for disambiguation purposes
* Refactored CovariateKeySet -- eventType functionality is now handled by the EventType enum.
* Reduced memory usage of the BQSR script to 4
Tests:
* Refactored BQSRKeyManagerUnitTest to handle the new implementation of the key manager
* Added tests for keys without optional covariates
* Added tests for on-the-fly recalibration (but more tests are necessary)
Infrastructure:
* Generic BitSet implementation with any precision (up to long)
* Two's complement implementation of the bit set handles negative numbers (cycle covariate)
* Memoized implementation of the BitSet utils for better performance.
* All exponents are now calculated with bit shifts, fixing numerical precision issues with the double Math.pow.
* Replace log/sqrt with bitwise logic to get rid of numerical issues
BQSR:
* All covariates output BitSets and have the functionality to decode them back into Object values.
* Covariates are responsible for determining the size of the key they will use (number of bits).
* Generalized KeyManager implementation combines any arbitrary number of covariates into one bitset key with event type
* No more NestedHashMaps. Single key system now fits in one hash to reduce hash table objects overhead
Tests:
* Unit tests added to every method of BitSetUtils
* Unit tests added to the generalized key system infrastructure of BQSRv2 (KeyManager)
* Unit tests added to the cycle and context covariates (will add unit tests to all covariates)
-- This report details which intervals are coming in and how many reads they contain
-- Added integration test to verify that the intervals aren't changing, before heading into the ART refactor
GATKReport format changes:
- All non-data header lines are preceeded with a single pound ( #:)
- Every report now has a report header containing the version number and number of tables
- Every table has two lines of table header: The first explains the size of the table and the data types of each column, the second contains the table name and description.
- This new format will allow reports in the future to be gatherable.
- Changed the header format to include an end-of-line string ":;"
Added features:
- Simplified GATK Reports:
The constructor for a simplified GATK Report. Simplified GATK report are designed for reports that do not need the advanced functionality of a full GATK Report.
A simple GATK Report consists of:
- A single table
- No primary key ( it is hidden )
Optional:
- Only untyped columns. As long as the data is an Object, it will be accepted.
- Default column values being empty strings.
Limitations:
- A simple GATK report cannot contain multiple tables.
- It cannot contain typed columns, which prevents arithmetic gathering.
- Added a constructor to generate simplified GATK reports.
- Added a method to easily add data to simple GATK reports.
- Upgraded the input parser take advantage of the new file format (v1).
- Added the GATKReportGatherer, more usability cmoing in next versionof GATK Report. Curently, it can only add rows from one table to another. Added private methods in GATKReport to combine Tables and Reports, It is very conservative and will only gather if the table columns, as well as everything else matches. At the column level, it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
- Made some GATKReport methods public, and added more setters and getters.
- Added method that compares formats of two GATKReports, and added an equals method to verify all data inside.
- The gsalib for R now supports reading GATKReport v1 files in addition to legacy formats (v0.*)
- Added a GATKReportDataType enum to give column a certain data type. This must be specified when making a gatherable report. This enum contains several methods including a reverse lookup map.
- Added a data type field in GATKColumn, when a type is not specified, the unknown type is used. Unknown types should not be gathered.
Test changes:
- Updated Unit Tests for GATK Report v1. Added a test for the gatherer. Left one test disabled while we transition from v0 to v1.
- Updated the MD5 hashes in integration tests throughout the GATK.
Other changes:
- Added the gatherer functions to CoverageByRG
- Also added the scatterCount parameter in the Interval Coverage script
- Dropped support for reading in legacy GATKReport formats ( v0.*)
- Updated VariantEvalWalker to work with GATK Report v1, added a format String to all applicable DataPoints.
- Rewrote the read file method for GATK report files.
- Optimized the equals methods within GATKReport. The protected functions should only be called by the GATKReport methods.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
GATKBAMIndex would allow an extremely confusing BufferUnderflowException to be
thrown when a BAM index file was truncated or corrupt. Now, a UserException is
thrown in this situation instructing the user to re-index the BAM.
Added a unit test for this case as well.
-- A cleaner table output (molten). For those interested in seeing how this can be done with GATKReports look here for a nice clean example
-- Integration tests
-- Minor improvements to GATKReportTable with methods to getPrimaryKeys
-- We weren't properly handling the case where a site had both a SNP and indel in both eval and comp. These would naturally pair off as SNP x SNP and INDEL x INDEL in eval, but we'd still invoke update2 with (null, SNP) and (null, INDEL) resulting most conspicously as incorrect false negatives in the validation report.
-- Updating misc. integrationtests, as the counting of comps (in particular for dbSNP) was inflated because of this effect.
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size, read length when there's no data, or the data isn't paired end by emitting NA not 0
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses and override name more easily
-- DoC now by default ignores bases with reference Ns, so these are not included in the coverage calculations at any stage.
-- Added option --includeRefNSites that will include them in the calculation
-- Added integration tests that ensures the per base tables (and so all subsequent calculations) work with and without reference N bases included
-- Reorganized command line options, tagging advanced options with @Advanced
premature push from my part. Roger is still working on the new format and we need to update the other tools to operate correctly with the new GATKReport.
This reverts commit aea0de314220810c2666055dc75f04f9010436ad.
- Added the GATKReportGatherer
- Added private methods in GATKReport to combine Tables and Reports
- It is very conservative and it will only gather if the table columns, match.
- At the column level it uses the (redundant) row ids to add new rows. It will throw an exception if it is overwriting data.
Added the gatherer functions to CoverageByRG
Also added the scatterCount parameter in the Interval Coverage script
Made some more GATKReport methods public
The UnitTest included shows that the merging methods work
Added a getter for the PrimaryKeyName
Fixed bugs that prevented the gatherer form working
Working GATKReportGatherer
Has only the functional to addLines
The input file parser assumes that the first column is the primary key
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
Eric reported this bug due to the reduced reads failing with an index out of bounds on what we thought was a deletion, but turned out to be a read starting with insertion.
* Refactored PileupElement to distinguish clearly between deletions and read starting with insertion
* Modified ExtendedEventPileup to correctly distinguish elements with deletion when creating new pileups
* Refactored most of the lazyLoadNextAlignment() function of the LocusIteratorByState for clarity and to create clear separation between what is a pileup with a deletion and what's not one. Got rid of many useless if statements.
* Changed the way LocusIteratorByState creates extended event pileups to differentiate between insertions in the beginning of the read and deletions.
* Every deletion now has an offset (start of the event)
* Fixed bug when LocusITeratorByState found a read starting with insertion that happened to be a reduced read.
* Separated the definitions of deletion/insertion (in the beginning of the read) in all UG annotations (and the annotator engine).
* Pileup depth of coverage for a deleted base will now return the average coverage around the deletion.
* Indel ReadPositionRankSum test now uses the deletion true offset from the read, changed all appropriate md5's
* The extra pileup elements now properly read by the Indel mode of the UG made any subsequent call have a different random number and therefore all RankSum tests have slightly different values (in the 10^-3 range). Updated all appropriate md5s after extremely careful inspection -- Thanks Ryan!
phew!
-- Missing BAMs were appearing as StingExceptions
-- Missing VCFs were showing up as CommandLineErrors, but it's clearer for them to be CouldNotReadInputFile exceptions
-- Added integration tests to ensure missing BAMs, VCFs, and -L files are properly thrown as CouldNotReadInputFile exceptions
-- Added path to standard b37 BAM to BaseTest
-- Cleaned up code in SAMDataSource, removing my parallel loading code as this just didn't prove to be useful.
This is somewhat of an arbitrary decision, and is negotiable. I could see treating
GT:PL ./.:.
differently from
GT:PL .:0,3,6
but am not sure the worth of doing so.
* using the filter() instead of map() makes for a cleaner walker.
* renaming the unit tests to make more sense with the other unit and integration tests
* if the adaptor boundary is more than MAXIMUM_ADAPTOR_SIZE bases away from the read, then let's not clip anything and consider the fragment to be undetermined for this read pair.
* updated md5's accordingly