-- Passes significant unit tests
-- Implicit sample creation for mom / dad when you create single samples
-- Continuing cleanup of Sample and SampleDataSource
-- These could be simplified in their downstream uses
-- Or they could be replaced with a generic getSAMFileHeaders() function, with getSamples(header) applied as desired downstream
-- A nearly identical piece of code already lived in SampleUtils. Now there are two functions, one taking a regular header and another grabbing the merged header from the GATK engine itself. Much cleaner
If both ends of the interval fall within a deletion in the read, hardClipBothEnds would cut the right tail first, including the entire deletion, and then fail to cut the left tail because there would not be any bases left there. Fixed.
The base qualities of a consensus read are now the average quality of the bases forming the consensus base (the most common base), and the consensus quality tag now carries an array with the counts of each base in the consensus. This should increase file size but improve calling sensitivity/specificity.
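A minimal sketch of the consensus quality rule described above, for a single position. This is an illustration in Python, not the GATK code: the function name consensus_column, the A/C/G/T ordering of the count array, the tie-breaking rule, and the use of integer (floor) averaging are all assumptions.

```python
BASES = "ACGT"  # assumed ordering of the per-base count array


def consensus_column(bases, quals):
    """Return (consensus_base, consensus_quality, counts) for one position.

    bases: string of base calls from the reads covering this position
    quals: list of integer base qualities, one per base call
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must have the same length")

    # Count how often each base was observed (non-ACGT calls are ignored).
    counts = [bases.count(b) for b in BASES]
    if max(counts) == 0:
        raise ValueError("no A/C/G/T calls at this position")

    # Consensus base = most common base; ties go to the earlier base in
    # BASES (an assumption, the commit does not say how ties are broken).
    best = BASES[counts.index(max(counts))]

    # Consensus quality = average quality of the bases that agree with
    # the consensus base (floor average, also an assumption).
    supporting = [q for b, q in zip(bases, quals) if b == best]
    quality = sum(supporting) // len(supporting)

    return best, quality, counts
```

For example, consensus_column("AAAC", [30, 20, 40, 10]) averages only the three A qualities and reports the counts for all four bases.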
preQC:
- For R 2.13, explicitly coercing the fingerprint text before parsing it
- Added LOD geom_line() at +/-3 based on Tim's presentation at PM meeting (ppt to go to pipeline wiki asap)
- PF_INDEL_RATE of zero replaced with NA
- NAs are not counted as "violations" but still auto filter samples, since 0+NA = NA and the subset test only looks for 0 violations
- Restored plots for MEAN_READ_LENGTH, BAD_CYCLES, and MEDIAN_INSERT_SIZE by explicitly print()'ing the created plots
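
A rough sketch of the NA behaviour noted in the preQC list above. It is written in Python rather than R, with NaN standing in for R's NA, and the names violation_count and passing_samples are hypothetical. The point is only that a missing value propagates through the sum, so a test for exactly 0 violations no longer matches.

```python
import math

NA = math.nan  # stand-in for R's NA (an assumption for this sketch)


def violation_count(flags):
    """Sum per-metric violation flags (0, 1, or NA) for one sample."""
    total = 0
    for flag in flags:
        total = total + flag  # 0 + NA stays NA, as in R
    return total


def passing_samples(samples):
    """Keep only samples whose violation count is exactly 0.

    A sample with any NA flag has an NA total, which never equals 0,
    so it drops out here without being counted as a violation.
    """
    return [name for name, flags in samples.items()
            if violation_count(flags) == 0]
```

With samples = {"s1": [0, 0], "s2": [0, NA], "s3": [1, 0]}, only "s1" is kept: "s3" has a real violation and "s2" is dropped because its total is NA.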
postQC:
- Fixed R 2.13 font scaling by moving size out of aes, except when using highlighting
- TODO: Don't know how to scale by aes for highlighting *and* use a smaller overall font size outside aes
* Includes tests that cover HardClip by Read and Reference Coords.
* Changed ReadUtils.HardClipByReferenceCoordinates from private to protected to allow for testing
- full 8.5x11
- concatenating multiple initiatives / bait_sets
- Using NA instead of Python None when WR dates are unavailable
- In new aggregations where the sample may have per library metrics, only using the sample level metrics, i.e. library is null
Updated postQC:
- Renamed some variables to assist with traceback()
- Fixed crashes on batches with two alleles or two samples such as Seminara_MC_1_09222011 or Engle_MC_2_09222011
- Added dependency tracking to PostCallingQC.scala so that the R script does not try to run before the evals are complete
Other minor cleanup.
Tried to use R 2.13 compactPDF, but there are a few issues to work out with fingerprint boxplots in preQC and geom_text font size in postQC.