-- Closes GSA-494 / Add maximum runtime for integration tests, running them in a timeout thread
-- Needed to debug locking issues
-- Needed to debug excessively long-running integration tests
-- Added a 10-hour maximum runtime in build.xml for all TestNG tests. We will ultimately fail the build if it runs for more than 10 hours (see the sketch below)
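A minimal sketch of the timeout-thread idea (illustrative only -- the real limit is wired through the TestNG configuration in build.xml, and runAllTests() is a hypothetical stand-in for the suite entry point):

    import java.util.concurrent.*;

    // Run the test suite on a worker thread and fail hard if it exceeds the limit.
    public class TimeoutRunner {
        public static void main(String[] args) throws Exception {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            Future<?> suite = executor.submit(() -> runAllTests());
            try {
                suite.get(10, TimeUnit.HOURS);     // the 10-hour ceiling
            } catch (TimeoutException e) {
                suite.cancel(true);                // interrupt the runaway tests
                throw new RuntimeException("Integration tests exceeded the 10 hour maximum runtime", e);
            } finally {
                executor.shutdownNow();
            }
        }

        private static void runAllTests() { /* hypothetical: invoke the TestNG suite */ }
    }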
-- The logic for determining active regions in the HC was a bit broken when intervals were used
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. This simplifies the logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determining active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive (see the sketch after this list)
-- Misc. cleanup and docs for some TAR infrastructure, to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem and will continue to use to evaluate single-exome calling with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that don't themselves overlap the engine intervals but may have reads that overlap those regions
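A toy sketch of the per-interval rule (hypothetical types, not the real TraverseActiveRegions classes):

    import java.util.ArrayList;
    import java.util.List;

    // Each disjoint engine interval is classified active/inactive on its own,
    // so state never bleeds across interval boundaries -- each exon stands alone.
    public class PerIntervalActivity {
        static class Interval { final int start, stop; Interval(int s, int e) { start = s; stop = e; } }

        // Returns, per interval, whether any site in it exceeds the activity threshold.
        static List<Boolean> classify(List<Interval> intervals, double[] activityByPosition, double threshold) {
            List<Boolean> states = new ArrayList<>();
            for (Interval iv : intervals) {
                boolean active = false;
                for (int pos = iv.start; pos <= iv.stop && !active; pos++)
                    active = activityByPosition[pos] > threshold;
                states.add(active);   // decided independently of neighboring intervals
            }
            return states;
        }
    }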
-- Remove excessively expensive calls in ReferenceContext that check that bases are upper cased
-- Update md5s after a lot of manual review and discussion with Ryan
-- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases. This is done by default, unless preserving the original bases is explicitly requested.
-- This is really the correct place to do it, for a variety of reasons. First, you don't need to worry about upper casing bases throughout the code. Second, the cache is upper cased only once, no matter how often the bases are accessed, an optimization that walkers cannot perform themselves. Finally, it uses the fastest function for the job -- Picard's toUpperCase(byte[]), which is way better than String.toUpperCase() (see the sketch after this list)
-- Added unit tests to ensure this functionality works correctly.
-- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarantees that the reference bases are all upper case.
-- Added contracts to ensure this is the case.
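For reference, a standalone sketch of the byte-level upper casing that Picard's toUpperCase(byte[]) performs (an illustration of the approach, not the Picard source):

    // In-place, single-pass upper casing of raw base bytes: no String or char[]
    // allocation and no Unicode machinery, which is why this style of routine
    // is so much faster than String.toUpperCase().
    public class UpperCaseBases {
        static void toUpperCaseInPlace(final byte[] bases) {
            for (int i = 0; i < bases.length; i++) {
                if (bases[i] >= 'a' && bases[i] <= 'z')
                    bases[i] -= ('a' - 'A');
            }
        }

        public static void main(String[] args) {
            byte[] ref = "acgtACGTn".getBytes();
            toUpperCaseInPlace(ref);              // done once, when the cache is filled
            System.out.println(new String(ref));  // prints ACGTACGTN
        }
    }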
-- Remove a ton of sh*t from BaseUtils that was so old I no longer knew what it was doing, had no unit tests to ensure it was correct, and wasn't used anywhere in our code
The MD5s for these tests were changed in commit 87435f1074615b2cd016f042980109fd53962c8d
to match the output of a broken version of BaseRecalibration. With the patch in
commit c397102ecc1fd1d2cd8f209a8f358ab4a60b50a7, the output once again matches the
*original* MD5s for these tests, and does not vary as you increase -nct.
Final resolution to GSA-632
-- Providing the optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit cleanly when the maximum runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker, as though the end of processing had been reached. This means that all output files are closed properly, etc. (see the sketch after this list)
-- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from one that exceeded its time limit; making that distinction explicit may be a useful future update
-- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling?
-- Added a new JIRA entry for Ami to restart the chr1 macarthur run with -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.
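A hypothetical sketch of the clean-abort mechanism (names illustrative, not the actual MicroScheduler code):

    import java.util.concurrent.TimeUnit;

    // Check elapsed time at each cycle boundary; once the limit is passed, stop
    // exactly as though the traversal had finished, so the normal shutdown path
    // (closing output files, etc.) still runs.
    public class MaxRuntimeLoop {
        public static void main(String[] args) throws InterruptedException {
            final long maxRuntimeNanos = TimeUnit.SECONDS.toNanos(10);  // e.g. -maxRuntime 10 -maxRuntimeUnits SECONDS
            final long startTime = System.nanoTime();

            while (true) {
                if (System.nanoTime() - startTime > maxRuntimeNanos) {
                    System.out.println("Aborting execution (cleanly) because the runtime has exceeded the requested maximum");
                    break;                 // fall through to the normal end-of-run path
                }
                Thread.sleep(100);         // stand-in for one unit of walker work
            }

            // normal end-of-traversal path: flush and close outputs, print summary, ...
        }
    }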
-- NCT wasn't previously recognized by Queue as needing more processors per machine. This commit fixes that. It was also a potential cause of poor GATKPerformanceOverTime results, in that runs with -nct could flood a node and leave hundreds of cores in contention.
Caching and reusing ReadCovariates instances across reads sounds good in theory, but:
- it doesn't work unless you zero out the internal arrays before each read
- the internal arrays must be sized proportionally to the maximum POSSIBLE
  recalibrated read length (5000!!!), instead of the ACTUAL read lengths
By contrast, creating a new instance per read is basically equivalent to doing an
efficient low-level memset-style clear on a much smaller array (since we use the actual
rather than the maximum read length to create it). So this should be faster than caching
instances and calling clear() but slower than caching instances and not calling clear().
Credit to Ryan for proposing this approach.
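A sketch of the trade-off with toy types (not the real ReadCovariates):

    import java.util.Arrays;

    // A fresh per-read array is zero-filled by the JVM at the ACTUAL read length,
    // whereas a cached instance must be sized to the maximum POSSIBLE length and
    // explicitly wiped between reads to avoid stale data from the previous read.
    public class CovariateBufferSketch {
        static final int MAX_READ_LENGTH = 5000;
        static final int NUM_COVARIATES = 4;

        // Option A -- allocate per read: an implicit, cheap memset over a small array.
        static int[][] perRead(final int readLength) {
            return new int[readLength][NUM_COVARIATES];  // JVM zero-fills on allocation
        }

        // Option B -- reuse a max-sized cached instance: correctness requires clear().
        static final int[][] cached = new int[MAX_READ_LENGTH][NUM_COVARIATES];
        static int[][] reuseCached() {
            for (final int[] row : cached)
                Arrays.fill(row, 0);                     // the clear() step that was being skipped
            return cached;
        }
    }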
It turns out that pre-allocating the entire tree was too expensive in
terms of memory when using large values for the -mcs and -ics parameters.
Pre-allocating the first two dimensions prevents us from ever locking the
root node during a put(). Contention between threads over lower levels
of the tree should be minimal given that puts are rare compared to gets.
Also output the dimensions and pre-allocation info at startup. If pre-allocation
takes longer than usual, this gives the user a sense of what is causing the
delay.
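A toy sketch of the pre-allocation idea (hypothetical structure, not the actual GATK tree):

    import java.util.concurrent.ConcurrentHashMap;

    // Eagerly build the first two levels so a put() never has to create -- and
    // therefore never has to lock -- the root or a first-level node. Deeper
    // entries stay lazy, where contention is rare since puts are much less
    // common than gets.
    public class PreallocatedTree {
        private final ConcurrentHashMap<Integer, ConcurrentHashMap<Integer, ConcurrentHashMap<Integer, Object>>> root =
                new ConcurrentHashMap<>();

        PreallocatedTree(final int dim1, final int dim2) {  // e.g. sized from -mcs / -ics
            for (int i = 0; i < dim1; i++) {
                final ConcurrentHashMap<Integer, ConcurrentHashMap<Integer, Object>> level2 = new ConcurrentHashMap<>();
                for (int j = 0; j < dim2; j++)
                    level2.put(j, new ConcurrentHashMap<>());  // third dimension left empty, filled lazily
                root.put(i, level2);
            }
        }

        void put(final int i, final int j, final int k, final Object value) {
            root.get(i).get(j).put(k, value);  // only the pre-built leaf map is ever mutated
        }

        Object get(final int i, final int j, final int k) {
            return root.get(i).get(j).get(k);
        }
    }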
-- I'm committing this because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather have it fail for a while than leave it sitting in JIRA.
-- The integration tests run PrintReads with -nct 1, 2, and 4, and the -nct 4 case fails. But that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation