-- Right now the state of the AFCaclulationResult can be corrupt (ie, log10 likelihoods can be -Infinity). Forced me to disable reasonable contracts. Needs to be thought through
-- exactCallsLog should be optional
-- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.
-- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time.
-- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live
-- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs
-- AFResult now tracks the number of evaluations (turns through the model calculation) so we can now compute the scaling of exact model itself as a function of n samples
-- Added unittests for priors (flat and human)
-- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)
-- Added combinatorial unit tests for both Diploid and General (in diploid-case) for 2 and 3 alleles in all combinations of sample types (i.e., AA, AB, BB and equiv. for tri-allelic). More assert statements to ensure quality of the result.
-- Added docs (DOCUMENT YOUR CODE!) to AlleleFrequencyCalculationResult, with proper input error handling and contracts. Made mutation functions all protected
-- No longer need to call reset on your AlleleFrequencyCalculationResult -- it'd done for you in the calculation function. reset is a protected method now, so it's all cleaner and nicer this way
-- TODO still -- need to add edge-case tests for non-informative samples (0,0,0), for the impact of priors, and I need to add some way to test the result of the pNonRef
-- Added a true base class that only does truly common tasks (like manage call logging)
-- This base class provides the only public method (getLog10PNonRef) and calls into a protected compute function that's abstract
-- Split ExactAF into superclass ExactAF with common data structures and two subclasses: DiploidExact and GeneralPloidyExact
-- Added an abstract reduceScope function that manages the simplification of the input VariantContext in the case where there are too many alleles or other constraints require us to only attempt a smaller computation
-- All unit tests pass
1) GATKArgumentCollection has a command to turn off randomization if setting the seed isn't enough. Right now it's only hooked into RankSumTest.
2) RankSumTest now can be passed a boolean telling it whether to use a dithering or non-randomizing comparator. Unit tested.
3) VariantsToBinaryPed can now output in both individual-major and SNP-major mode. Integration test.
4) Updates to PlinkBed-handling python scripts and utilities.
5) Tool for calculating (LD-corrected) GRMs put under version control. This is analysis for T2D, but I don't want to lose it should something happen to my computer.
1) ValidateVariants removed in favor of direct validation VariantContexts. Integration test added to test broken contexts.
2) Enabling indel and SV output. Still bi-allelic sites only. Integration tests added for these cases.
3) Found a bug where GQ recalculation (if a genotype has PLs but no GQ) would only happen for flipped encoding. Fixed. Integration test added.
TestNG skips tests when an exception occurs in a data provider,
which is what was happening here.
This was due to an AWFUL AWFUL use of a non-final static for
ReadShard.MAX_READS. This is fine if you assume only one instance
of SAMDataSource, but with multiple tests creating multiple SAMDataSources,
and each one overwriting ReadShard.MAX_READS, you have a recipe for
problems. As a result of this the test ran fine individually, but not as
part of the unit test suite.
Quick fix for now to get the tests running -- this "mutable static"
interface should really be refactored away though, when I have time.
-Only used when experimental downsampling is enabled
-Persists read iterators across shards, creating a new set only when we've exhausted
the current BAM file region(s). This prevents the engine from revisiting regions discarded
by the downsamplers / filters, as could happen in the old implementation.
-SAMDataSource no longer tracks low-level file positions in experimental mode. Can strip
out all related code when the engine fork is collapsed.
-Defensive implementation that assumes BAM file regions coming out of the BAM Schedule
can overlap; should be able to improve performance if we can prove they cannot possibly
overlap.
-Tests a bit on the extreme side (~8 minute runtime) for now; will scale these back
once confidence in the code is gained
-- Renamed TraversalErrorManager to the more general MultiThreadedErrorTracker
-- ErrorTracker is now used throughout the NanoScheduler. In order to properly handle errors, the work previously done by main thread (submit jobs, block on reduce) is now handled in a separate thread. The main thread simply wakes up peroidically and checks whether the reduce result is available or if an error has occurred, and handles each appropriately.
-- EngineFeaturesIntegrationTest checks that -nt and -nct properly throw errors in Walkers
-- Added NanoSchedulerUnitTest for input errors
-- ThreadEfficiencyMonitoring is now disabled by default, and can be enabled with a GATK command line option. This is because the monitoring doesn't differentiate between threads that are supposed to do work, and those that are supposed to wait, and therefore gives misleading results.
-- Build.xml no longer copies the unittest results verbosely
-- Refactored error handling from HMS into utils.TraversalErrorManager, which is now used by HMS and will be usable by NanoScheduler
-- Generalized EngineFeaturesIntegrationTest to test map / reduce error throwing for nt 1, nt 2 and nct 2 (disabled)
-- Added unit tests for failing input iterator in NanoScheduler (fails)
-- Made ErrorThrowing NanoScheduable
1) SelectVariants could throw a ReviewedStingException (one of the nasty "Bug:") ones if the user requested a sample that wasn't present in the VCF. The walker now
checks for this in the initialize() phase, and throws a more informative error if the situation is detected. If the user simply wants to subset the VCF to
all the samples requested that are actually present in the VCF, the --ALLOW_NONOVERLAPPING_COMMAND_LINE_SAMPLES flag changes this UserException to a Warning,
and does the appropriate subsetting. Added integration tests for this.
2) GenotypeLikelihoods has an unsafe method getLog10GQ(GenotypeType), which is completely broken for multi-allelic sites. I marked that method
as deprecated, and added methods that use the context of the allele ordering (either directly specified or as a VC) to retrieve the appropriate GQ, and
added a unit test to cover this case. VariantsToBinaryPed needs to dynamically calculate the GQ field sometimes (because I have some VCFs with PLs but no GQ).
-- Previously these core progress metering functions were all in TraversalEngine, and available to subclasses like TraverseLoci via inheritance. The problem here is that the upcoming data threads x cpu threads parallelism requires one master copy of the progress metering shared among all traversals, but multiple instantiations of traverse engines themselves.
-- Because the progress metering code has horrible anyway, I've refactored and vastly cleaned up and simplified all of these capabilities into TraversalProgressMeter class. I've simplified down the classes it uses to work (STILL SOME TODOs in there) so that it doesn't reach into the core GATK engine all the time. It should be possible to write some nice tests for it now. By making it its own class, it can protect itself from multi-threaded access with a single synchronized printProgress function instead of carrying around multiple lock objects as before
-- Cleaned up the start up of the progress meter. It's now handled when the meter is created, so each micro scheduler doesn't have to deal with proper initialization timing any longer
-- Simplified and made clear the interface for shutting down the traversal engines. There's no a shutdown method in TraversalEngine that's called once by the MicroScheduler when the entire traversing in over. Nano traversals now properly shut down (was subtle bug I undercovered here). The printing of on traversal done metering is now handled by MicroScheduler
-- The MicroScheduler holds the single master copy of the progress meter, and doles it out to the TraversalEngines (currently 1 but in future commit there will be N).
-- Added a nice function to GenomeAnalysisEngine that returns the regions we will be processing, either the intervals requested or the whole genome. Useful for progress meter but also probably for other infrastructure as well
-- Remove a lot of the sh*ting Bean interface getting and setting in MicroScheduler that's no longer useful. The generic bean is just a shell interface with nothing in it.
-- By removing a lot of these bean accessors and setters many things are now final that used to be dynamic.
-Off by default; engine fork isolates new code paths from old code paths,
so no integration tests change yet
-Experimental implementation is currently BROKEN due to a serious issue
involving file spans. No one can/should use the experimental features
until I've patched this issue.
-There are temporarily two independent versions of LocusIteratorByState.
Anyone changing one version should port the change to the other (if possible),
and anyone adding unit tests for one version should add the same unit tests
for the other (again, if possible). This situation will hopefully be extremely
temporary, and last only until the experimental implementation is proven.
-- Confirmed that reads spanning off the end of the chromosome don't cause an exception by adding integration test for a single read that starts 7 bases from the end of chromosome 1 and spans 90 bases or so off. Added pileup integration test to ensure this behavior continues to work
-- Stateless objects are required for nano-scheduling. This means you can take the RefMetaDataTracker provided by ReadBasedReferenceOrderedView, store it way, get another from the same view, and the original one behaves the same.
-- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads. The old implementation simply failed in this case. The new code handles this correctly by forcing shards to have all of their data on a single contig.
-- Added a PrintReads integration test to ensure this behavior is correct
-- Adding test BAMs that have < 200 reads and span across contig boundaries
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
-- ReadMetaDataTracker is dead! Long live the RefMetaDataTracker. Read walkers will soon just take RefMetaDataTracker objects. In this commit they take a class that trivially extends them
-- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class.
-- This new implementation produces thread-safe objects (i.e., holds no points to shared state). Suitable for use (to be tested) with nano scheduling
-- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely.
-- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView
-- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference. Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface. If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself.
-- This commit breaks the new read walker BQSR, but Ryan knows this is coming
-- Subsequent commit will be retiring / fixing ValidateRODForReads