TestNG skips tests when an exception occurs in a data provider,
which is what was happening here.
This was due to an AWFUL AWFUL use of a non-final static for
ReadShard.MAX_READS. This is fine if you assume only one instance
of SAMDataSource, but with multiple tests creating multiple SAMDataSources,
and each one overwriting ReadShard.MAX_READS, you have a recipe for
problems. As a result of this the test ran fine individually, but not as
part of the unit test suite.
Quick fix for now to get the tests running -- this "mutable static"
interface should really be refactored away though, when I have time.
It's now possible to run with experimental downsampling enabled
using the --enable_experimental_downsampling engine argument.
This is scheduled to become the GATK-wide default next week after
diff engine output for failing tests has been examined.
Notify all downsamplers in our pool of the current global genomic position every
DOWNSAMPLER_POSITIONAL_UPDATE_INTERVAL position changes, not every single
positional change after that threshold is first reached.
-Only used when experimental downsampling is enabled
-Persists read iterators across shards, creating a new set only when we've exhausted
the current BAM file region(s). This prevents the engine from revisiting regions discarded
by the downsamplers / filters, as could happen in the old implementation.
-SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip
out all related code when the engine fork is collapsed.
-Defensive implementation that assumes BAM file regions coming out of the BAM Schedule
can overlap; should be able to improve performance if we can prove they cannot possibly
overlap.
-Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back
once confidence in the code is gained
-- See https://jira.broadinstitute.org/browse/GSA-573
-- Uses InheritedThreadLocal storage so that children threads created by the NanoScheduler see the parent stubs in the main thread.
-- Added explicit integration test that checks that -nt 1, 2 and -nct 1, 2 give the same results for GLM BOTH with the UG over 1 MB.
Doesn't actually fix the problem, and adds an unnecessary delay in closing down NanoScheduler, so reverting.
This reverts commit 66b820bf94ae755a8a0c71ea16f4cae56fd3e852.
1) Better documentation on the meta data file for VariantsToBinaryPed with examples of each file type
2) MannWhitneyU can now take an argument on creation to turn off dithering. This pertains to JIRA-GSA-571 but does not fix it,
as it isn't hooked up to the command line. Next step is to add an argument to the command line where it's accessible to the
annotation classes (e.g. from either UG or the VariantAnnotator).
3) Added some dumb python scripts to deal with Plink files, and a script to convert plink binaries to VCF to help sanity check. Basically if you want to do an analysis on genotype data stored in plink binary format, your choices are:
1) Add a new module to Plink [difficulty rating: Impossible -- code obfuscation]
2) Steal plink parsing code from software (Plink/PlinkSeq/GCTA/Emacks/etc) that readds the files [difficulty rating: Oppressive -- code not modularized at all)
3) Write your own dumb stuff [difficutly rating: Annoying]
What's been added is the result of 3. It's a library so nobody else has to do this, so long as they're comfortable with python.
-- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker
-- ErrorTracker is now used throughout the NanoScheduler. In order to properly handle errors, the work previously done by main thread (submit jobs, block on reduce) is now handled in a separate thread. The main thread simply wakes up peroidically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately.
-- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers
-- Added NanoSchedulerUnitTest for input errors
-- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option. This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results.
-- Build.xml no longer copies the unittest results verbosely
-- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler
-- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled)
-- Added unit tests for failing input iterator in NanoScheduler (fails)
-- Made ErrorThrowing NanoScheduable
-- V3 + V4 algorithm for NanoScheduler. The newer version uses 1 dedicated input thread and n - 1 map/reduce threads. These MapReduceJobs perform map and a greedy reduce. The main thread's only job is to shuttle inputs from the input producer thread, enqueueing MapReduce jobs for each one. We manage the number of map jobs now via a Semaphore instead of a BlockingQueue of fixed size.
-- This new algorithm should consume N00% CPU power for -nct N value.
-- Also a cleaner implementation in general
-- Vastly expanded unit tests
-- Deleted FutureValue and ReduceThread