-- Just infrastructure at this point (but with UnitTests!).
-- Capable of taking a histogram of quality scores and a target number of levels (8, for example) and mapping the full range of input quality scores down to only that many.
-- The selected quality scores are chosen to minimize the miscalibration rate of the resulting bins. I believe this adaptive approach is vastly better than the current systems being developed by EBI and NCBI.
-- This infrastructure is designed to work with BQSRv2. I envision a system where we feed in the projected empirical quality score distribution from the BQSRv2 table, compute the required deleveling for each of the B, I, and D qualities, and emit calibrated, compressed quality scores on the fly.
-- Note that the current algorithm for determining the best intervals is both greedy (i.e., it can miss the best overall choice) and potentially extremely slow. But it is enough for me to play with.
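The greedy interval search described above might look like the following sketch. All names here are hypothetical, and the count-weighted distance from each bin's empirical quality is an assumed stand-in for the real miscalibration measure, not the actual BQSRv2 code: start with one bin per observed quality score and repeatedly merge the adjacent pair whose merge adds the least penalty, until only the target number of levels remains.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of greedy quality-score binning: merge adjacent
 *  bins until only targetLevels remain, always taking the merge that
 *  adds the least miscalibration penalty. Greedy, so it can miss the
 *  globally best partition. */
class GreedyQualityBinner {
    /** observations[q] / errors[q] are counts for reported quality q.
     *  Returns a list of [loQ, hiQ] bins covering the observed range. */
    static List<int[]> bin(long[] observations, long[] errors, int targetLevels) {
        List<int[]> bins = new ArrayList<>();
        for (int q = 0; q < observations.length; q++)
            if (observations[q] > 0) bins.add(new int[]{q, q});
        while (bins.size() > targetLevels) {
            int best = -1;
            double bestIncrease = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < bins.size(); i++) {
                double merged = penalty(bins.get(i)[0], bins.get(i + 1)[1], observations, errors);
                double separate = penalty(bins.get(i)[0], bins.get(i)[1], observations, errors)
                                + penalty(bins.get(i + 1)[0], bins.get(i + 1)[1], observations, errors);
                double increase = merged - separate;  // greedy: smallest local increase wins
                if (increase < bestIncrease) { bestIncrease = increase; best = i; }
            }
            bins.set(best, new int[]{bins.get(best)[0], bins.get(best + 1)[1]});
            bins.remove(best + 1);
        }
        return bins;
    }

    /** Assumed penalty: count-weighted distance of each reported Q from
     *  the bin's empirical quality, empQ = -10 * log10(errors / observations). */
    static double penalty(int lo, int hi, long[] obs, long[] err) {
        long totObs = 0, totErr = 0;
        for (int q = lo; q <= hi; q++) { totObs += obs[q]; totErr += err[q]; }
        if (totObs == 0) return 0;
        double empQ = -10.0 * Math.log10(Math.max(totErr, 1) / (double) totObs);
        double p = 0;
        for (int q = lo; q <= hi; q++) p += obs[q] * Math.abs(q - empQ);
        return p;
    }
}
```

An exhaustive or dynamic-programming search over bin boundaries would find the true optimum; the sketch trades that for simplicity, mirroring the caveat above.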
-- Use the general function type.convert from read.table to automagically convert the string data to booleans, factors, and numeric types as appropriate. Vastly better than the previous behavior, which only worked for numerics, and only in some cases.
-- Includes paired end status (T/F)
-- Includes count of reads used in calculation
-- Includes simple read type (2x76 for example)
-- Better handling of insert size and read length when there's no data, or the data isn't paired end, by emitting NA instead of 0
-- ReadGroupProperties: Emits a GATKReport containing read group, sample, library, platform, center, median insert size and median read length for each read group in every BAM file.
-- Median tool that collects up to a given maximum number of elements and returns the median of the elements.
-- Unit and integration tests for everything.
-- Making name of TestProvider protected so subclasses can override name more easily
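The capped Median tool described above might be sketched like this (a hypothetical reconstruction from the description, not the actual class): it accepts values only until a fixed capacity is reached, then reports the median of whatever was collected.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the Median utility: collects up to a given
 *  maximum number of elements, silently ignoring further additions,
 *  and returns the median of the collected values. */
class Median<T extends Comparable<T>> {
    private final int maxValues;
    private final List<T> values = new ArrayList<>();

    Median(int maxValues) { this.maxValues = maxValues; }

    /** Returns true if the value was accepted (capacity not yet reached). */
    boolean add(T value) {
        if (values.size() >= maxValues) return false;
        values.add(value);
        return true;
    }

    T getMedian() {
        if (values.isEmpty()) throw new IllegalStateException("no values collected");
        List<T> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);  // upper median for even sizes
    }
}
```

Capping the number of collected elements keeps memory bounded when summarizing very large BAM files, at the cost of the median being computed over a prefix sample of the reads.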
* All contexts with 'N' bases are now collapsed as uninformative
* Context size is now represented internally as a BitSet but output as a DNA string
* Temporarily disabled sorted outputs because of null objects
* Turns DNA sequences (for context covariates) into bit sets for maximum compression
* Allows variable context sizes in the representation while guaranteeing uniqueness.
* Works with long precision, so it is limited to a context size of 31 bases (can be extended with BigNumber precision if necessary).
* Unit Tests added
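One way to realize the long-precision scheme in the notes above is the following sketch (hypothetical names; not the actual covariate code): pack 2 bits per base behind a leading sentinel 1-bit, so contexts of different lengths never collide ("A" becomes 0b100, "AA" becomes 0b10000). A signed 64-bit long leaves 63 usable bits, i.e. 1 sentinel bit plus 2 × 31 base bits, which matches the 31-base limit, and any 'N' collapses the context to an uninformative value.

```java
/** Hypothetical sketch: encode a DNA context as a long with 2 bits per
 *  base (A=0, C=1, G=2, T=3) behind a sentinel 1-bit that makes
 *  variable-length contexts unique. Limited to 31 bases in a long. */
class ContextEncoder {
    static final long UNINFORMATIVE = -1L;  // any 'N' collapses the context

    static long encode(String context) {
        if (context.length() > 31)
            throw new IllegalArgumentException("context longer than 31 bases");
        long key = 1L;  // sentinel bit: distinguishes "A" from "AA", etc.
        for (int i = 0; i < context.length(); i++) {
            int code;
            switch (context.charAt(i)) {
                case 'A': code = 0; break;
                case 'C': code = 1; break;
                case 'G': code = 2; break;
                case 'T': code = 3; break;
                default:  return UNINFORMATIVE;  // 'N' or anything else
            }
            key = (key << 2) | code;
        }
        return key;
    }

    /** Recover the DNA string for output, reading 2 bits at a time
     *  until only the sentinel bit remains. */
    static String decode(long key) {
        StringBuilder sb = new StringBuilder();
        while (key > 1) {
            sb.append("ACGT".charAt((int) (key & 3)));
            key >>= 2;
        }
        return sb.reverse().toString();
    }
}
```

Extending past 31 bases would mean swapping the long for an arbitrary-precision integer or a true BitSet, as the note suggests.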
-- Now include combinatorial testing for all input parameters: base quality, indel quality, continuation penalty, base identity, and indel length
-- Disabled by default, as the results coming back are not correct
-- Currently disabled as the likelihood function doesn't pass basic unit tests
-- Also make low-level function in LikelihoodCalculationEngine protected
-- As these represent the bulk of the StingExceptions coming from BAMSchedule, and are caused by simple problems such as the user providing bad input tmp directories.