-- You can create (and drop the old) GATK_LOG table with the setupDB command
-- You can load data into the database with the loadToDB command
Currently I'm pushing up all of the GATK logs into the new MySQL server setup for the gsa group. Details of the server are in the code, for those interested. All of this is part of my experimentation with Tableau for visualizing GATK run logs.
To support this, refactored the code that computes consensus alleles. To ease merging of multiple alt alleles, we create a single vc for each alt allele and then use VariantContextUtils.simpleMerge to carry out the merging, which already handles all the corner conditions. To use it, the interface to GenotypeLikelihoodsCalculationModel changed to pass in a GenomeLocParser object (why are these objects so hard to handle??).
More testing is required, and the feature is turned off by default.
-- Call sets with indels > 50 bp in length are tagged as CNVs (following the 1000 Genomes convention), and we were unconditionally checking whether the CNV is already known by looking at the known CNVs file, which is optional. Fixed. This has the annoying side effect that indels > 50 bp in size are not counted as indels, and so are subtracted from both the novel and known counts for indels. C'est la vie.
-- Added integration test to check for this case, using Mauricio's most recent VCF file for NA12878, which has many large indels. Using this more recent and representative file is probably a good idea for future tests in VE and other tools. The file is NA12878.HiSeq.WGS.b37_decoy.indel.recalibrated.vcf in Validation_Data.
This error was due to the change in the ReadClipper contract. Previously the read utils would return null if a read was entirely clipped; now they return an empty (safe) GATKSAMRecord.
Some tests in this class were intermittently not being executed due
to being randomly scheduled before tests whose results they depend on.
Now the serial dependencies are enforced to avoid problematic orderings.
Previously, the initial release of a new GATK version had a version
number with only one part (e.g., "1.4"). This could potentially mislead
people into thinking it's the most recent revision of a release, instead
of the least recent.
Now, initial releases will have full, three-part version numbers
(e.g., "1.4-0-g472fc94") like everything else.
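
A minimal sketch of how a three-part version string could be taken apart. It assumes the format follows the `git describe` convention (release, number of revisions since the release, then "g" plus an abbreviated commit hash); the class `GatkVersion` and its members are hypothetical names for illustration, not part of GATK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not GATK code): splits a version such as
// "1.4-0-g472fc94" into its three parts.
final class GatkVersion {
    // Assumed layout: <release>-<revisions since release>-g<abbreviated hash>
    private static final Pattern FULL =
            Pattern.compile("(\\d+(?:\\.\\d+)*)-(\\d+)-g([0-9a-f]+)");

    final String release;
    final int revisionsSinceRelease;
    final String commitHash;

    private GatkVersion(String release, int revisionsSinceRelease, String commitHash) {
        this.release = release;
        this.revisionsSinceRelease = revisionsSinceRelease;
        this.commitHash = commitHash;
    }

    // Rejects one-part strings such as "1.4" instead of guessing the missing parts.
    static GatkVersion parse(String version) {
        Matcher m = FULL.matcher(version);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a three-part version: " + version);
        }
        return new GatkVersion(m.group(1), Integer.parseInt(m.group(2)), m.group(3));
    }

    // Under the assumed layout, the initial release is the one with zero revisions after it.
    boolean isInitialRelease() {
        return revisionsSinceRelease == 0;
    }
}
```

With this layout, an initial release is recognizable by its middle part being 0, so it can no longer be mistaken for the latest revision.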
* outputs only the groups of read groups necessary, avoiding multiple pileup creations every call to map
* now also counts the number of variants associated with a given ROD (dbSNP) that exist in the interval
* new column: interval size
* -g takes a string of read groups separated by space " "
* multiple -g creates multiple sum columns in the table
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* Downsampling is now a parameter to the walker with default value of 0 (no downsampling)
* Downsampling selects reads at random at the variant region window and strives to achieve uniform coverage if possible around the desired downsampling value.
* Added integration test
* Knuth-shuffle is a simple, yet effective array permutator (hope this is good english).
* added a simple randomSubset that returns a random subset without repeats of any given array with the same probability for every permutation.
* added unit tests to both functions
* Modified cleanCigarShift to allow insertions in the beginning and end of the read
* Allowed cigars starting/ending in insertions in the systematic ReadClipper tests
* Updated all ReadClipper unit tests
* ReduceReads does not hard clip leading insertions by default anymore
* SlidingWindow adjusts start location if read starts with insertion
* SlidingWindow creates an empty element with insertions to the right
* Fixed all potential divisions by zero with totalCount() (from BaseCounts)
* Updated all Integration tests
* Added new integration test for multiple interval reducing
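
The Knuth shuffle and randomSubset utilities mentioned in the list above can be sketched roughly as follows. This is an illustration on `int[]` under assumed names and signatures (`ShuffleSketch`, `knuthShuffle`, `randomSubset`), not the code that was committed.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch (not the committed GATK code).
final class ShuffleSketch {
    // Knuth (Fisher-Yates) shuffle: permutes the array in place so that
    // every permutation is equally likely, given a uniform rng.
    static void knuthShuffle(int[] array, Random rng) {
        for (int i = array.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1); // uniform index in [0, i]
            int tmp = array[i];
            array[i] = array[j];
            array[j] = tmp;
        }
    }

    // Returns n elements drawn without repeats by shuffling a copy and
    // keeping the first n entries; the input array is left untouched.
    static int[] randomSubset(int[] array, int n, Random rng) {
        if (n < 0 || n > array.length) {
            throw new IllegalArgumentException("n must be between 0 and " + array.length);
        }
        int[] copy = array.clone();
        knuthShuffle(copy, rng);
        return Arrays.copyOf(copy, n);
    }
}
```

Shuffling a full copy is simple but costs time proportional to the array length; stopping the shuffle after n swaps would give the same distribution with less work.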
-- Don't try to do nt 16, it's just too painful as the threading doesn't work well and it consumes a large chunk of our available slots on gsa4
-- bugfix: only do multi-threaded test for each iteration, not expanding by subiterations, so we no longer try to do 3x3 nt 16 runs
-- Automatic detection of most recent version of GATK release (just tell the script now to use 1.2, 1.3, and 1.4)
-- Uses 1.4 now
-- By default we do 9 runs of each non-parallel test
-- In PathUtils added convenience utility to find most recent release GATK jar with a specific release number
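
As a rough illustration of picking the most recent build for a given release number, the sketch below works on version strings rather than jar files. It assumes versions look like "1.4-0-g472fc94" and that a larger middle number means a more recent build; `LatestReleaseSketch` and `mostRecent` are hypothetical names and do not reflect the actual PathUtils method.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch (not the PathUtils code).
final class LatestReleaseSketch {
    // Among versions of the form <release>-<count>-g<hash>, returns the one
    // with the highest count for the requested release, if any exists.
    static Optional<String> mostRecent(List<String> versions, String release) {
        String prefix = release + "-";
        String best = null;
        int bestCount = -1;
        for (String version : versions) {
            if (!version.startsWith(prefix)) {
                continue; // different release, e.g. "1.3-..." when asking for "1.4"
            }
            String[] parts = version.substring(prefix.length()).split("-");
            if (parts.length != 2) {
                continue; // not in the assumed three-part layout
            }
            int count;
            try {
                count = Integer.parseInt(parts[0]);
            } catch (NumberFormatException e) {
                continue; // middle part is not a number
            }
            if (count > bestCount) {
                bestCount = count;
                best = version;
            }
        }
        return Optional.ofNullable(best);
    }
}
```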