-- QualQuantizer now tracks merge order and level in the QualInterval for debugging / visualization
-- Write out QualIntervals tree for visualization
-- visualizeQuantizedQuals.R script for basic visualization of the quality score quantization
Several of the unit tests for the new key authorization feature require
read access to the GATK master private key file. Since this file is only
readable by members of the group gsagit, this makes it hard for people
outside the group to run the test suite.
We now skip tests that require the master private key if the key
exists (a missing key would be a true error) but is not readable by the
user running the test suite, as sketched below.
Bamboo, of course, will always be able to run these tests.
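A minimal sketch of the skip logic, assuming a JUnit-style test and a purely
hypothetical key-file path (the real location is internal):

    import java.io.File;
    import org.junit.Assert;
    import org.junit.Assume;
    import org.junit.Test;

    public class KeyAuthorizationTestSketch {
        // Hypothetical path; the actual master private key location is internal.
        private static final File MASTER_PRIVATE_KEY = new File("/path/to/GATK_master_private.key");

        @Test
        public void testRequiringMasterKey() {
            // A missing key file is a true error, so fail outright:
            Assert.assertTrue("GATK master private key file is missing",
                              MASTER_PRIVATE_KEY.exists());
            // An unreadable key just means this user isn't in gsagit; skip instead of failing:
            Assume.assumeTrue(MASTER_PRIVATE_KEY.canRead());
            // ... test body that reads the master private key ...
        }
    }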
-Running the GATK with the -et NO_ET or -et STDOUT options now
requires a key issued by us. Our reasons for doing this, and the
procedure for our users to request keys, are documented here:
http://www.broadinstitute.org/gsa/wiki/index.php/Phone_home
-A GATK user key is an email address plus a cryptographic signature of
that address created with our private key, all wrapped in a GZIP
container. User keys are validated using the public key we now
distribute with the GATK; our private key is kept in a secure location.
(See the validation sketch after this list.)
-Keys are cryptographically secure in that valid keys definitely
came from us and keys cannot be fabricated; however, keys are not
"copy-protected" in any way.
-Includes private, standalone utilities to create a new GATK user key
(GenerateGATKUserKey) and to create a new master public/private key
pair (GenerateKeyPair). Usage of these tools will be documented on
the internal wiki shortly.
-Comprehensive unit/integration tests, including tests to ensure the
continued integrity of the GATK master public/private key pair.
-Generation of new user keys and the new unit/integration tests both
require access to the GATK private key, which can only be read by
members of the group "gsagit".
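For illustration, a rough sketch of how such a key could be validated with
the standard java.security APIs. The actual container layout and signature
algorithm are internal details; this sketch assumes, hypothetically, that the
GZIP payload is the UTF-8 email, a zero byte, then the raw signature bytes:

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.PublicKey;
    import java.security.Signature;
    import java.util.zip.GZIPInputStream;

    public class UserKeyValidationSketch {
        public static boolean isValidKey(File keyFile, PublicKey gatkPublicKey) throws Exception {
            byte[] payload = readFully(new GZIPInputStream(new FileInputStream(keyFile)));

            // Split the assumed "email NUL signature" layout:
            int sep = 0;
            while (sep < payload.length && payload[sep] != 0)
                sep++;
            if (sep >= payload.length)
                return false; // malformed key: no separator found

            byte[] emailBytes = new byte[sep];
            byte[] signatureBytes = new byte[payload.length - sep - 1];
            System.arraycopy(payload, 0, emailBytes, 0, sep);
            System.arraycopy(payload, sep + 1, signatureBytes, 0, signatureBytes.length);

            // The signature algorithm here is an assumption for the sketch.
            Signature verifier = Signature.getInstance("SHA1withDSA");
            verifier.initVerify(gatkPublicKey);
            verifier.update(emailBytes);
            return verifier.verify(signatureBytes);
        }

        private static byte[] readFully(InputStream in) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            for (int n; (n = in.read(buffer)) > 0; )
                out.write(buffer, 0, n);
            in.close();
            return out.toByteArray();
        }
    }

Because only the email is signed, anyone holding a valid key file can copy
it, which is why keys are not "copy-protected".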
-- Just infrastructure at this point (but with UnitTests!).
-- Capable of taking a histogram of quality scores and a target number of levels (8, for example), and mapping the full range of input quality scores down to just those levels.
-- The selected quality scores are chosen to minimize the miscalibration rate of the resulting bins. I believe this adaptive approach is vastly better than the current systems being developed by EBI and NCBI.
-- This infrastructure is designed to work with BQSRv2. I envision a system where we feed in the projected empirical quality score distribution from the BQSRv2 table, compute the required deleveling for each of the B, I, and D qualities, and emit calibrated, compressed quality scores on the fly.
-- Note that the algorithm for determining the best intervals is currently both greedy (i.e., it may miss the best overall choice) and potentially extremely slow, but it is enough for me to play with (see the sketch below).
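For illustration, a minimal sketch of the greedy merging idea: start with one
interval per observed quality score, then repeatedly merge the adjacent pair
with the smallest penalty until only the target number of levels remains. The
penalty function below is a hypothetical stand-in for the real
miscalibration-rate calculation:

    import java.util.ArrayList;
    import java.util.List;

    public class GreedyQualQuantizerSketch {
        static class Interval {
            long observations, errors;
            Interval(long observations, long errors) {
                this.observations = observations;
                this.errors = errors;
            }
            double errorRate() {
                return observations == 0 ? 0.0 : (double) errors / observations;
            }
        }

        static List<Interval> quantize(long[] obsByQual, long[] errByQual, int nLevels) {
            List<Interval> intervals = new ArrayList<Interval>();
            for (int q = 0; q < obsByQual.length; q++)
                intervals.add(new Interval(obsByQual[q], errByQual[q]));

            // Greedy: each pass rescans all adjacent pairs, so this is
            // quadratic overall -- consistent with "potentially extremely slow".
            while (intervals.size() > nLevels) {
                int best = 0;
                double bestPenalty = Double.MAX_VALUE;
                for (int i = 0; i + 1 < intervals.size(); i++) {
                    double p = mergePenalty(intervals.get(i), intervals.get(i + 1));
                    if (p < bestPenalty) { bestPenalty = p; best = i; }
                }
                Interval merged = intervals.get(best);
                Interval removed = intervals.remove(best + 1);
                merged.observations += removed.observations;
                merged.errors += removed.errors;
            }
            return intervals;
        }

        // Hypothetical penalty: error-rate mismatch weighted by interval size.
        static double mergePenalty(Interval a, Interval b) {
            return Math.abs(a.errorRate() - b.errorRate()) * (a.observations + b.observations);
        }
    }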
-- Use the general function type.convert from read.table to automagically convert the string data to booleans, factors, and numeric types as appropriate. Vastly better than the previous behavior, which in some cases only worked for numerics.
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size and read length when there's no data or the data isn't paired-end, by emitting NA instead of 0
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size, and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements (sketched after this list).
-- Unit and integration tests for everything.
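A sketch of the capped median collector; silently ignoring elements beyond
the cap is an assumption about the overflow policy:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class MedianSketch<T extends Comparable<T>> {
        private final int maxElements;
        private final List<T> values = new ArrayList<T>();

        public MedianSketch(int maxElements) {
            this.maxElements = maxElements;
        }

        /** Returns true if the element was kept, false once the cap is reached. */
        public boolean add(T value) {
            if (values.size() >= maxElements)
                return false;
            values.add(value);
            return true;
        }

        public T median() {
            if (values.isEmpty())
                throw new IllegalStateException("no values collected");
            Collections.sort(values);
            return values.get(values.size() / 2); // upper median for even counts
        }
    }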
-- Making the name of TestProvider protected so subclasses can override the name more easily
* All contexts with 'N' bases are now collapsed as uninformative
* Context size is now represented internally as a BitSet but output as a DNA string
* Temporarily disabled sorted outputs because of null objects
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
* Allows variable context size representation while guaranteeing uniqueness (see the sketch below).
* Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigInteger precision if necessary).
* Unit Tests added
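To see how a variable-length context can be packed into a long while
guaranteeing uniqueness, here is one illustrative scheme: 2 bits per base
plus a leading sentinel bit, so that, e.g., "A" and "AA" can never collide.
With 1 sentinel bit and 2 bits per base in a 64-bit long, the context is
capped at 31 bases, matching the limit above. The sentinel trick is an
assumption for illustration; the actual encoding may differ:

    public class ContextEncodingSketch {
        public static long encode(String context) {
            if (context.length() > 31)
                throw new IllegalArgumentException("context longer than 31 bases");
            long key = 1L; // sentinel bit marking the start of the context
            for (int i = 0; i < context.length(); i++) {
                key <<= 2;
                switch (context.charAt(i)) {
                    case 'A': key |= 0; break;
                    case 'C': key |= 1; break;
                    case 'G': key |= 2; break;
                    case 'T': key |= 3; break;
                    default:  return -1; // contexts containing 'N' are uninformative
                }
            }
            return key;
        }

        public static String decode(long key) {
            StringBuilder sb = new StringBuilder();
            while (key > 1) { // unwind until only the sentinel bit remains
                sb.append("ACGT".charAt((int) (key & 3)));
                key >>= 2;
            }
            return sb.reverse().toString();
        }
    }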
-- Now include combinatorial testing for all input parameters: base quality, indel quality, continuation penalty, base identity, and indel length (see the sketch after this list)
-- Disabled by default, as the results coming back are not correct
-- Currently disabled as the likelihood function doesn't pass basic unit tests
-- Also make low-level function in LikelihoodCalculationEngine protected
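A sketch of how such a combinatorial grid might be generated with a TestNG
DataProvider; the specific parameter values here are hypothetical:

    import java.util.ArrayList;
    import java.util.List;
    import org.testng.annotations.DataProvider;

    public class LikelihoodCombinatorialTestSketch {
        // Hypothetical parameter grids; the real tests may use different values.
        private static final byte[] BASE_QUALS = {10, 20, 30, 40};
        private static final byte[] INDEL_QUALS = {30, 40, 50};
        private static final byte[] CONTINUATION_PENALTIES = {10, 20};
        private static final char[] BASES = {'A', 'C', 'G', 'T'};
        private static final int[] INDEL_LENGTHS = {1, 2, 5, 10};

        @DataProvider(name = "likelihoodParameters")
        public Object[][] likelihoodParameters() {
            List<Object[]> tests = new ArrayList<Object[]>();
            for (byte bq : BASE_QUALS)
                for (byte iq : INDEL_QUALS)
                    for (byte cp : CONTINUATION_PENALTIES)
                        for (char base : BASES)
                            for (int len : INDEL_LENGTHS)
                                tests.add(new Object[]{bq, iq, cp, base, len});
            return tests.toArray(new Object[tests.size()][]);
        }
    }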
-- These represent the bulk of the StingExceptions coming from BAMSchedule, and they are caused by simple problems like the user providing bad input tmp directories, etc.