-- Separate updating cumulative traversal metrics from printing progress. There's now an updateCumulativeMetrics function and a printProgress() that only takes a current position
-- printProgress now soles relies on the time since the last progress to decide if it will print or not. No longer uses the number of cycles, since this isn't reliable in the case of nano scheduling
-- GenomeAnalysisEngine now maintains a pointer to the master cumulative metrics. getCumulativeMetrics never returns null, which was handled in some parts of the code but not others.
-- Update all of the traversals to use the new updateCumulativeMetrics, printProgress model
-- Added progress callback to nano scheduler. Every bufferSize elements this callback is invoked, allowing us to smoothly update the progress meter in the NanoScheduler
-- Rename MapFunction to NanoSchedulerMap and the same for reduce.
-- Refactored TraverseLoci into old linear version and nano scheduling version
-- Temp. GATK argument to say how many nano threads to use
-- Can efficiently scale to 3 threads before blocking on input
-- Instead of returning directly the result of map(), returns a MapResult object with the value and a reduceMe flag.
-- Reduce function respects the reduceMe flag
-- Code cleanup and more documentation
-- Helpful for understanding where the time goes to each bit of the code.
-- Controlled by a local static boolean, to avoid the potential overhead in general
-- TraverseReadsNano prints progress at the end of each traversal unit
-- Fix bugs in TraversalEngine printProgress
-- Synchronize the method so we don't get multiple logged outputs when two or more HMSs call printProgress before initialization at the start!
-- Fix the logic for mustPrint, which actually had the logic of mustNotPrint. Now we see the done log line that was always supposed to be there
-- Fix output formatting, as the done() line was incorrectly shifting over the % complete by 1 char as 100.0% didn't fit in %4.1f
-- Add clearer doc on -PF argument so that people know that the performance log can be generated to standard out if one wants
- VariantAnnotatorEngine changed to call genotype annotations even if pilups and allele -> likelihood mappings are not present. Current genotype annotations altered to check for null pilupes and null mappings.
-- Confirmed that reads spanning off the end of the chromosome don't cause an exception by adding integration test for a single read that starts 7 bases from the end of chromosome 1 and spans 90 bases or so off. Added pileup integration test to ensure this behavior continues to work
-- In the process uncovered two strange things
1 -- qualityScoreByFullCovariateKey was created but never used. Seems like a cache?
2 -- Discovered nasty bug in BaseRecalibrator: https://jira.broadinstitute.org/browse/GSA-534
-- These are like read filters but can be applied either on input, on output, of handled by the walker
-- Previous example of BAQ now uses the general framework
-- Resulted in massive conceptual cleanup of SAMDataSource and ReadProperties! Yeah!
-- BQSR now uses this framework. We can now do BQSR on input, on output, or within a walker
-- PrintReads now handles all read transformers in the walker in map, enabling us to parallelize PrintReads with BAQ and BQSR
-- Currently BQSR is excepting in parallel, which subsequent commit with fix
-- Removed global variable setting in GenomeAnalysisEngine for BAQ, as command line parameters are cleanly handled by ReadTransformer infrastructure
-- In principle ReadFilters are just a special kind of ReadTransformer, but this refactoring is larger than I can do. It's a JIRA entry
-- Many files touched simply due to the refactoring and renaming of classes
-- A higher level interface to declare parallelism capability of a walker. This interface means that the walker can be multi-threaded, but doesn't necessarily support TreeReducible interface, which forces you to have a combine ReduceType operation that isn't appropriate for parallel read walkers
-- Updated ReadWalkers to implement ThreadSafeMapReduce not TreeReducible
-- TraverseReadsNano modified to read in all input data before invoking maps, so the input to TraverseReadsNano is a MapData object holding the sam record, the ref context, and the refmetadatatracker.
-- Update ValidateRODForReads to be tree reducible, using synchronized map and explicitly sort the output map from locations -> counts in onTraversalDone
-- Expanded integration tests to test nt 1, 2, 4.
-- Yes, GenomeLoc.compareTo was broken. The compareTo function only considered the contig and start position, but not the stop, when comparing genome locs.
-- Updated GenomeLoc.compareTo function to account for stop. Updated GATK code where necessary to fix resulting problems that depended on this.
-- Added unit tests to ensure that hashcode, equals, and compareTo are all correct for GenomeLocs
In addition, fix for GSA-310. If supplied -rf argument does not match a known read filter, the list of read filters will be printed, and users directed to the documentation for more information.
-- Stateless objects are required for nano-scheduling. This means you can take the RefMetaDataTracker provided by ReadBasedReferenceOrderedView, store it way, get another from the same view, and the original one behaves the same.
-- Previous behavior was unnecessary and causes all sorts of problems with RODs for reads. The old implementation simply failed in this case. The new code handles this correctly by forcing shards to have all of their data on a single contig.
-- Added a PrintReads integration test to ensure this behavior is correct
-- Adding test BAMs that have < 200 reads and span across contig boundaries
-- shardSpan is only calculated when there some ROD is live in the GATK. No sense in paying the cost per read when you don't need it
-- Update contract to allow null span or unmapped span (good catch unittests!)
-- Deleted ReadMetaDataTracker
-- Added function to ReadShard to give us the span from the left most position of the reads in the shard to the right most, which is needed for the new view
-- ReadMetaDataTracker is dead! Long live the RefMetaDataTracker. Read walkers will soon just take RefMetaDataTracker objects. In this commit they take a class that trivially extends them
-- Rewrote ReadBasedReferenceOrderedView to produce RefMetaDataTrackers not the old class.
-- This new implementation produces thread-safe objects (i.e., holds no points to shared state). Suitable for use (to be tested) with nano scheduling
-- Simplified interfaces to use the simplest data structures (PeekableIterator) not the LocusAwareSeekableIterator, since I both hate those classes and this is on the long term trajectory to remove those from the GATK entirely.
-- Massively expanded DataProvider unit tests for ReadBasedReferenceOrderedView
-- Note that the old implementation of offset -> ROD in ReadRefMetaDataTracker was broken for any read not completely matching the reference. Rather than provide broken code the ReadMetaDataTracker only provides a "bag of RODs" interface. If you want to work with the relationship between the read and the RODs in your tool you need to manage the CIGAR element itself.
-- This commit breaks the new read walker BQSR, but Ryan knows this is coming
-- Subsequent commit will be retiring / fixing ValidateRODForReads
Reverting back to the original implementation, but now including write N's and write Q0's due to walkers that look at the same read multiple times in different reference windows
-- Old way (filtering for Q > 17 bases) resulted in biased FS when the site was good but there was a
systematic shift in the QUAL of REF and ALT between strands of the reads (sometimes happens)
-- New way (taking all bases) was consistent with BaseQualRankSum and other tests, but there can be
a lot of low qual reference bases on one strand in some techs (ION/PROTON/PACBIO) because of the
preference for introducing an indel vs. a mismatch.
-- This implementation allows us to have our cake and eat it to by computing both p-values, and
taking the maximum one (i.e., least significant).
-- No integration tests updated yet -- still exploring the consequences of this change
-- TraversalReadsNano only creates the NanoScheduler once, and shuts it down onTraversalDone
-- Nicer debugging output in NanoScheduler
-- ReadShard has a getBufferSize() method now
-- I'm seeing a lot of people trying to use BinaryTagCovariate in the community. They really shouldn't do this, so I moved it to private.
-- Throw an exception if its required bintag argument is missing
-- Check explicitly if user is requesting DinucCovariate and tell them that its been retired in favor of ContextCovariate
-- Show the type (Required, Experimental, Standard) of the covariates when running --list
A number of functions int he sampleDB looked to be assuming that samples could not share IDs (e.g. sample IDs are unique, so a sample present in two families could not be represented by multiple Sample objects). Added an assertion in the SampleDBBuilder to document/test this assumption.
MVLikelihoodRatio now uses the trio methods from SampleDB.
-- Groups inputs for each thread so that we don't have one thread execution per map() call
-- Added shutdown function
-- Documentation everywhere
-- Code cleanup
-- Extensive unittests
-- At this point I'm ready to integrate it into the engine for CPU parallel read walkers
– Write general NanoScheduler framework in utils.threading. Test with reading via iterator from list of integers, map is int * 2, reduce is sum. Should be efficiency using resources to do sum of 2 * (sum(1 - X)).
Done!
CPU parallelism is nano threads. Pfor across read / map / reduce. Use work queue to implement.
Create general read map reduce framework in utils. Test parallelism independently before hooking up to Locus iterator
Represent explicitly the dependency graph. Scheduler should choose the work units that are ready for computation, that are marked as "completing a computation", and then finally that maximize the number of sequent available work units. May be worth measuring expected cost for read read / map / reduce unit and use it to balance the compute
As input is single threaded just need one thread to populate inputs, which runs as fast as possible on parallel pushing data to fixed size queue. Each push creates map job and links to upcoming reduce job.
Note that there's at most one thread for IO tasks, and all of the threads can contribute to CPU tasks
-- GATKRunReports contain itemized information about the numThreads used to execute the GATK, as well as the efficiency of the use of those threads to get real work done, including time spent running, waiting, blocking, and waiting for IO
-- See https://jira.broadinstitute.org/browse/GSA-506 for more details
-- Invert logic in GATKArgumentCollection to disable monitoring, not enable. That means monitoring is on by default
-- Fix testing error in unit tests
-- Rename variables in ThreadAllocation to be clearer
-- Old version StateMonitoringThreadFactory refactored into base class ThreadEfficiencyMonitor and subclass EfficiencyMonitoringThreadFactory.
-- Base class is used by LinearMicroScheduler to monitor performance of GATK in single threaded mode
-- MicroScheduler now handles management of the efficiency monitor. Includes master thread in monitor, meaning that reduce is now included for both schedulers