-- The idea is simply to create a persistent database of all TP/FP sites on chr20 in NA12878. Individual callsets can be imported, and a consensus algorithm is run over all callsets in the database to create a consensus collection, which can be used to assess NA12878 callsets for GATK and methods development
-- Framework for representing simple VariantContexts and Genotypes in MongoDB, querying for records, and iterating over them in the GATK
-- Not hooked up to Tribble, but could be done reasonably easily now (future TODO)
-- Tools to import callsets, create consensus callsets, import and export reviews
-- Scripts to reset the knowledge base and repopulate it with the standard data files (Eric will expand)
-- Actually scales to all of chr20, and includes AssessNA12878, which reads a VCF and itemizes it against the truth data set
-- ImportCallset can load OMNI, HM3, CEU best practices, mills/devine sites and genotypes, properly marking sites as poly/mono/unk as well as TP/FP/UNK based on command line parameters
-- Added shell scripts that start up a local mongo db, connect to a local or BI-hosted mongo for NA12878.db for debugging, and a setupNA12878db script that can load OMNI, HM3, CEU best practices, and Mills/Devine into the db and then update the consensus.
-- Reviewed sites can be exported to a VCF, and imported again, as a mechanism to safely store the only non-recoverable data from the Mongo DB.
-- Created a NA12878DBWalker that manages the outer DB interaction and that all MongoDB-interacting walkers inherit from. Added a NA12878DBArgumentCollection.java consolidating all of the common command line arguments (though this is strictly not necessary, as all of it occurs in the root walker)
UnitTests
-- Can connect to a test knowledge base for development and unit testing
-- Unit tests for PolymorphicStatus, TruthStatus, and SiteIterator
-- NA12878KBUnitTestBase provides simple utilities for connecting to the test mongo db, getting calls, etc
-- MongoVariantContext tests creation, matching, and encoding -> writing -> reading -> decoding from the mongodb
AssessNA12878
-- Generic tool for comparing a NA12878 callset against the knowledge base. See http://gatkforums.broadinstitute.org/discussion/1848/using-the-na12878-knowledge-base for detailed documentation
-- Performs trivial filtering on FS, MQ, QD for SNPs and non-SNPs to separate out variants likely to be filtered from those that are honest-to-goodness FPs
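A minimal sketch of this kind of trivial hard filter; the thresholds and names below are illustrative assumptions, not the actual AssessNA12878 cutoffs:

    // Illustrative hard filter separating variants likely to be filtered from
    // honest-to-goodness FPs. Thresholds here are hypothetical.
    public final class TrivialFilterSketch {
        // FS: Fisher strand bias (phred-scaled), MQ: RMS mapping quality, QD: quality by depth
        public static boolean likelyFiltered(final boolean isSNP, final double fs, final double mq, final double qd) {
            if (isSNP)
                return fs > 60.0 || mq < 40.0 || qd < 2.0;
            else
                return fs > 200.0 || mq < 40.0 || qd < 2.0; // non-SNPs get a looser strand-bias cutoff
        }

        public static void main(final String[] args) {
            System.out.println(likelyFiltered(true, 75.0, 59.0, 10.0)); // true: extreme strand bias
            System.out.println(likelyFiltered(true,  5.0, 59.0, 10.0)); // false: an honest FP candidate
        }
    }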
Misc
-- Ability to provide Description for Simplified GATK report
ReduceReads now co-reduces bams if they're passed in together with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples.
Also:
* Added integration test for co-reduction
* Fixed bug in the new no-recalculation implementation of the marksites object, where the last object wasn't being removed after finalizing a variant region (updated MD5s accordingly)
DEV-200 #resolve #time 8m
Removed some generics from PluginManager for now, until we can figure out the syntax for requesting an explicit subclass.
QStatusMessenger uses a slightly more primitive Map[String, Seq[RemoteFile]] instead of Map[ArgumentSource, Seq[RemoteFile]].
Added a QCommandPlugin.initScript utility method for handling specialized script types.
Numbers larger than 999 in the Errors column were printed out with commas (which look like a separate column).
This wasn't caught earlier because there are no integration tests covering the csv. I'll add one into unstable in a sec.
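For reference, the bug boils down to locale-style digit grouping in formatted output; a hedged sketch of the fix (illustrative, not the actual report-writing code):

    // "%,d" adds grouping separators, so 1234 renders as "1,234" and reads as
    // two CSV columns; plain "%d" keeps the value in a single column.
    public final class CsvNumberFormatSketch {
        public static void main(final String[] args) {
            final long errors = 1234;
            System.out.println(String.format("%,d", errors)); // "1,234" -- the bug
            System.out.println(String.format("%d", errors));  // "1234"  -- CSV-safe
        }
    }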
-- The new tribble library now uses 64-bit sizes. The 26K VCF has so much data that low-level tribble block indices were overflowing their int size values. This includes a to-be-committed tribble jar that fixes this problem
-- See https://jira.broadinstitute.org/browse/GSA-652
-- Minor cleanup of error messages that were useful on the way to solving this monster problem
-- Closes GSA-494 / Add maximum runtime for integration tests, running them in a timeout thread
-- Needed to debug locking issues
-- Needed to debug excessively long-running integration tests
-- Added a maximum runtime of 10 hours in build.xml for all TestNG tests. We will ultimately fail the build if it runs for more than 10 hours
-- The logic for determining active regions was a bit broken in the HC when intervals were used in the system
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determining active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that don't themselves overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
-- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases. This is now done by default, unless requested explicitly to preserve the original bases.
-- This is really the correct place to do this for a variety of reasons. First, you don't need to worry about upper casing bases throughout the code. Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves. Finally, this uses the fastest function for the job -- Picard's toUpperCase(byte[]), which is way better than String.toUpperCase(). (See the sketch after this list.)
-- Added unit tests to ensure this functionality works correctly.
-- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarantees that the reference bases are all upper case.
-- Added contracts to ensure this is the case.
-- Removed a ton of sh*t from BaseUtils that was so old I no longer had any idea what it was doing, that didn't have any unit tests to ensure it was correct, and that wasn't used anywhere in our code
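A minimal sketch of the one-time upper-casing idea, assuming Picard's in-place StringUtil.toUpperCase(byte[]) overload named above; the class shape is an illustration, not the actual CachingIndexedFastaSequenceFile:

    import net.sf.samtools.util.StringUtil;

    // Upper-case the cached reference bases exactly once, when the cache is
    // (re)filled, so no downstream code ever pays for it per access.
    final class ReferenceCacheSketch {
        private final boolean preserveCase;
        private byte[] cachedBases;

        ReferenceCacheSketch(final boolean preserveCase) { this.preserveCase = preserveCase; }

        void fill(final byte[] basesFromFasta) {
            cachedBases = basesFromFasta;
            if (!preserveCase)
                StringUtil.toUpperCase(cachedBases); // in-place on byte[], much faster than String.toUpperCase()
        }

        byte[] getBases() { return cachedBases; }
    }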
-- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit cleanly when the max runtime has been exceeded. By cleanly I mean that the engine simply stops at the next available cycle in the walker, as though the end of processing had been reached. This means that all output files are closed properly, etc. (See the sketch after this list.)
-- Emits an info message that looks like "INFO 10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s". Otherwise there's currently no way to differentiate a truly completed run from a time-limit-exceeded run, which may be a useful thing for a future update
-- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling?
-- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.
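A hedged sketch of how such a clean shutdown check can work; the field and method names are illustrative, not the actual MicroScheduler code:

    import java.util.concurrent.TimeUnit;

    // Convert -maxRuntime / -maxRuntimeUnits to nanoseconds once, then poll
    // cheaply at each traversal cycle; a true result lets the engine stop as
    // though the end of processing had been reached, closing outputs normally.
    final class RuntimeLimitSketch {
        private final long startNanos = System.nanoTime();
        private final long maxNanos; // -1 means no limit

        RuntimeLimitSketch(final long maxRuntime, final TimeUnit units) {
            this.maxNanos = maxRuntime < 0 ? -1 : units.toNanos(maxRuntime);
        }

        boolean exceeded() {
            return maxNanos != -1 && System.nanoTime() - startNanos > maxNanos;
        }
    }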
-- NCT wasn't previously recognized by Queue as needing more processors per machine. This commit fixes that. It was also a potential cause of poor GATKPerformanceOverTime results, in that runs with -nct could flood a node and cause it to have hundreds of cores in contention.
Caching and reusing ReadCovariates instances across reads sounds good in theory, but:
-it doesn't work unless you zero out the internal arrays before each read
-the internal arrays must be sized proportionally to the maximum POSSIBLE
recalibrated read length (5000!!!), instead of the ACTUAL read lengths
By contrast, creating a new instance per read is basically equivalent to doing an
efficient low-level memset-style clear on a much smaller array (since we use the actual
rather than the maximum read length to create it). So this should be faster than caching
instances and calling clear() but slower than caching instances and not calling clear().
Credit to Ryan for proposing this approach.
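A sketch of the allocation tradeoff described above (sizes and names are illustrative):

    // Per-read allocation: arrays sized to the read we actually have. The JVM
    // zero-fills fresh arrays with an efficient memset-style clear, which is why
    // this beats caching maximum-length (5000-element) buffers that must be
    // manually zeroed before every read.
    final class CovariateBuffersSketch {
        final int[][] keys;

        CovariateBuffersSketch(final int actualReadLength, final int numCovariates) {
            keys = new int[actualReadLength][numCovariates]; // zeroed by the JVM on allocation
        }
        // vs. the cached alternative, which pays for the maximum POSSIBLE length:
        //   int[][] cached = new int[5000][numCovariates]; // allocated once...
        //   for (int[] row : cached) java.util.Arrays.fill(row, 0); // ...but cleared before EVERY read
    }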
It turns out that pre-allocating the entire tree was too expensive in
terms of memory when using large values for the -mcs and -ics parameters.
Pre-allocating the first two dimensions prevents us from ever locking the
root node during a put(). Contention between threads over lower levels
of the tree should be minimal given that puts are rare compared to gets.
Also output dimensions and pre-allocation info at startup. If pre-allocation
takes longer than usual this gives the user a sense of what is causing the
delay.
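A minimal sketch of the pre-allocation idea, assuming a structure where the first two dimensions are fixed up front and leaves are created on the fly; this is an illustration, not the actual NestedIntegerArray:

    // The first two dimensions are allocated up front, so no put() ever mutates
    // (or locks) the root; contention is confined to the lazily-created leaves.
    final class NestedArraySketch<T> {
        private final Object[][] tree; // levels 1 and 2 are pre-allocated and never change
        private final int leafDim;

        NestedArraySketch(final int dim1, final int dim2, final int leafDim) {
            this.leafDim = leafDim;
            tree = new Object[dim1][dim2];
            System.out.printf("Pre-allocated %d x %d nodes (leaves of size %d made on the fly)%n",
                              dim1, dim2, leafDim);
        }

        void put(final int i, final int j, final int k, final T value) {
            synchronized (tree[i]) { // narrow lock: one second-level row, never the root
                Object[] leaf = (Object[]) tree[i][j];
                if (leaf == null)
                    tree[i][j] = leaf = new Object[leafDim]; // puts are rare relative to gets
                leaf[k] = value;
            }
        }

        @SuppressWarnings("unchecked")
        T get(final int i, final int j, final int k) { // lock-free traversal
            final Object[] leaf = (Object[]) tree[i][j];
            return leaf == null ? null : (T) leaf[k];
        }
    }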
-- I'm committing this because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests run PrintReads with -nct 1, 2, and 4; the 4-thread run fails, but that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
Also, fixes to the large-scale validation script: lower the -minIndelFrac threshold, or else we'll kill most indels since the default 0.25 is too high for pools; also fix the VE stratifications and add one VE run where eval=1KG and comp=pool data, with AC stratification based on the 1KG annotation
The old BaseRecalibrator walker is not and never will be thread-safe, since it's a
LocusWalker that uses read attributes to track state.
ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe
at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1
for testing purposes, but wait for further testing to be done before using it
for production purposes in multithreaded mode.
The ReadGroupCovariate class was not thread-safe. This led to horrible race conditions
in multithreaded runs of the BQSR where (for example) the same read group could get
inserted into the reverse lookup table twice with different IDs.
Should fix the intermittent crash reported in GSA-492.
-With this change, BQSR performance scales properly by thread rather
than gaining nothing from additional threads.
-Benefits are seen when using either -nt (HierarchicalMicroScheduler) or -nct
(NanoScheduler)
-Removes high-level locks in the recalibration engines and NestedIntegerArray
in favor of maximally-granular locks on and around manipulation of the leaf
nodes of the NestedIntegerArray.
-NestedIntegerArray now creates all interior nodes upfront rather than on
the fly to avoid the need for locking during tree traversals. This uses
more memory in the initial part of BQSR runs, but the BQSR would eventually
converge to use this memory anyway over the course of a typical run.
IMPORTANT NOTE: This does not mean it's safe to run the old BaseRecalibrator
walker with multiple threads. The BaseRecalibrator walker is and will never be
thread-safe, as it's a LocusWalker that uses read attributes to track
state information. ONLY the newer DelocalizedBaseRecalibrator can be made
thread-safe (and will hopefully be made so in my subsequent commits). This
commit addresses performance, not correctness.
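A hedged sketch of the maximally-granular locking idea: synchronize only around the leaf datum being updated, never on the whole table (names illustrative):

    // Before: one coarse lock serialized every update, e.g.
    //   synchronized (recalTable) { datum.increment(isError); }
    // After: lock only the leaf being touched, so threads updating different
    // (read group, quality, covariate) cells never contend with each other.
    final class RecalDatumSketch {
        private long observations;
        private long errors;

        void increment(final boolean isError) {
            synchronized (this) { // maximally-granular: the leaf itself
                observations++;
                if (isError) errors++;
            }
        }
    }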
-- Resolves GSA-515 / Nanoscheduler and GSA-605 / Seems that -nct may deadlock, as not reproducible
-- It seems that it's not an input error problem (or at least cannot be provoked with unit tests)
-- I'll keep an eye on this later
Bringing in the following relevant changes:
* Fixes the indel realigner N-Way out null pointer exception DEV-10
* Optimizations to ReduceReads that cut the run time to a third.
Conflicts:
protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java
DEV-10 #resolve #time 2m
-- Included logic to only add priors for alleles with sufficient evidence to be called polymorphic. If no alleles are polymorphic, make sure to add priors for the first allele
-- There have been no reports of problems with the nano-scheduled versions of TraverseLoci and TraverseReads, so I'm removing the old versions since they are no longer needed
-- Removing unnecessary intermediate base classes
-- GSA-515 / Nanoscheduler GSA-549 / https://jira.broadinstitute.org/browse/GSA-549
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. The previous argument is deprecated with a meaningful doc message pointing people to maxAltAlleles
-- All constructors, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic. That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele. The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually do a separate discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating. This would have the value of keeping the QUAL constant as more alleles are genotyped, at the cost of some increase in code complexity and runtime. Might be worth it though
-- This is no longer a core GATK activity, and the tests take so long to run (2 min each) that it's just too painful to run them. Should be re-enabled if we come to care about this capability again, or if we can run these tests all in parallel in the future.
-- The old way of overloading constructors and calling super didn't work (this might have been a consequence of the merge). This is the right way to do the copy constructor, with the call to super()
-- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case
-- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations
-- Added systematic unit tests covering this implementation, and comparing it to others
-- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0, even after summing up multiple likelihoods, which is just not safe to do and was causing us to lose likelihood in some cases
-- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right
-- Changed UG / HC to use this one via the StandardCallerArgumentCollection
-- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums
-- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests
-- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel: return a null VC when the only allele we get from our final alleles-to-use method is the reference base
-- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF > 0) = X for some meaningful value of X. Added unit test to ensure this behavior is correct
-- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order. Add contract to ensure this doesn't ever happen again
-- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model
-- These two classes were really the same, and now they are actually the same!
-- Cleaned up the interfaces, removed duplicate data
-- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed)
-- Moved goodProbability and goodProbabilityVector utilities to MathUtils. Very useful for contracts!
Changed the order in which ReflectionUtils.getGenericTypes looks for @ClassType, since the primitive generic wasn't completely erased, only changed to Object, which is incorrect.
More fixes to @Arguments labeled as java.io.File via an incorrect @Input annotation.
Put in a default undocumented implementation of @Argument doc() to match the one added to @Input.
The CompressionStash is now responsible for keeping track of all intervals that must be kept uncompressed across all samples. In general this is a list generated by a tumor sample that all normal samples must abide by.
- Updated ReduceReads integration tests
- Sliding Window is now using the CompressionStash (single sample).
DEV-104 #resolve #time 3m
-- Ensures that the posteriors remain within reasonable ranges. Fixed bug where normalization of posteriors = {-1e30, 0.0} => {-100000, 0.0} which isn't good. Now tests ensure that the normalization process preserves log10 precision where possible
-- Updated MathUtils to make this possible
-- Remove capability to truncate genotype likelihoods -- this wasn't used and isn't really useful after all
-- Added lots of contracts and docs, still more to come.
-- Created a default makeMaxLikelihoods function in ReferenceDiploidExactAFCalc and DiploidExactAFCalc so that multiple subclasses don't each have to reimplement the default behavior
-- Generalized reference bi-allelic model in IndependentAllelesDiploidExactAFCalc so that in principle any bi-allelic reference model can be used.
-- Fixed minor numerical stability issue in AFCalcResult
-- The posterior of the joint A/B/C is 1 - (1 - P(D | AF_b == 0)) x (1 - P(D | AF_c == 0)), for any number of alleles, obviously. Now computes the joint posterior like this, and then back-calculates likelihoods that generate these posteriors given the priors. It's not pretty but it's the best thing to do (see the sketch below)
-- Superseded by IndependentAFCalc
-- Added support to read in an ExactModelLog in AFCalcPerformanceTest and run the independent alleles model on it.
-- A few misc. bug fixes discovered during running the performance test
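A hedged log-space sketch of that joint-posterior combination, reading each factor as the per-allele posterior of segregating so that the joint P(AF == 0) is the product of the per-allele P(AF_i == 0); illustrative, not the actual AFCalc code:

    // Under allele independence, the site is monomorphic only if EVERY alt
    // allele is monomorphic, so in log10 space:
    //   log10 P(AF == 0 jointly) = sum_i log10 P(AF_i == 0 | D)
    // and the joint posterior of segregating is 1 - P(all alleles mono).
    final class JointPosteriorSketch {
        static double jointLog10PNonRef(final double[] log10PRefPerAllele) {
            double log10JointPRef = 0.0;
            for (final double lp : log10PRefPerAllele)
                log10JointPRef += lp; // product of per-allele P(AF_i == 0)
            // NB: a real implementation floors this instead of returning -Infinity
            return Math.log10(1.0 - Math.pow(10.0, log10JointPRef));
        }
    }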
Added targets to build.xml to effectively 'mvn install' packaged GATK/Queue from ant.
TODO: Versions during 'mvn install' are hardcoded at 0.0.1 until a better versioning scheme that works with maven dependencies has been identified.
Modified the SAMFileWriterArgumentTypeDescriptor to accept output bam files that are null if they're not required (in the @Output annotation).
This change enables the nWayOut parameter for the IndelRealigner and ReduceReads to operate optionally while maintaining the original single way out.
[#DEV-10 transition:31 resolution:1]
-- Before this branch, the EXACT calculation implementation was largely based on historical choices in the UnifiedGenotyper. The code was badly organized, there were no unit tests, and the Diploid EXACT calculation was super slow O(n.samples ^ n.alt.alleles)
-- Reorganized the code into a single AFCalc superclass that carries out the calculation and an AFCalcResult object that contains only the information we should expose to code users, and is well validated.
-- Implement a new model for the multi-allelic exact calculation that, for each alt allele B, sweeps all likelihoods into a bi-allelic model XB, where X is all alleles != B, and calls each of these separately using the reference bi-allelic model. It produces identical quals in the bi-allelic case but slightly different results for multi-allelics, due to a genuine model difference: the Independent model doesn't fully penalize all genotype configurations as the Reference multi-allelic implementation does. However, it seems after much debate that the reference model is doing the wrong thing, so in fact the Independent model seems correct. This code isn't the default implementation yet, simply because I want to do some cleanup and discuss with the methods group before enabling it.
-- Constrained search model implemented, but will be deleted in a subsequent code cleanup
-- Massive (40K) suite of unit tests for the exact models, which are passing for the reference and the independent alleles exact models.
-- Restored -- though it isn't 100% hooked up -- the original clean bi-allelic model for Ryan to base his optimized logless version on.
-- The only way to create these AFCalc objects is through an AFCalcFactory, which again validates its arguments. The AFCalcFactory.Calculation enum exposes calculations to the UG / HC as the AFModel.
-- Separated AFCalc from UG, into its own package that could in principle be pushed into utils now
-- Created a simple main() function to run performance tests of the EXACT model.
-- Updating integration tests, confirming that results for the original EXACT model are as expected given our new more rigorous application of likelihoods, priors, and posteriors
-- Fix basic logic bug in AFCalcResult.isPolymorphic and UnifiedGenotyperEngine, where isNonRef really meant isRef. Not ideal. Finally caught by some tests, but good god, it almost made it into the code
-- Now takes the Math.abs of the phred-scaled confidence so that we don't see -0.0
-- Massive new suite of unit tests to ensure that bi-allelic and tri-allelic events are called properly with all models, and that IndependentAllelesDiploidExactAFCalc calls events with up to 4 alt alleles correctly. ID'd some of the bugs below
-- Fix sort order bug in IndependentAllelesDiploidExactAFCalc caught by new unit tests
-- Fix bug in GeneralPloidyExactAFCalc where the AFCalcResult had meaningless values in the likelihoods when there were no informative GLs.
-- UnifiedGenotyperEngine uses only the alleles used in genotyping, not the original alleles, when considering which alleles to include in output
-- AFCalcFactory has a more informative info message when looking for and selecting an exact model to use in genotyping
-- New capabilities in IndependentAllelesDiploidExactAFCalc to actually apply the correct theta^n.alt.alleles prior.
-- Tests that theta^n.alt.alleles is being applied correctly
-- Bugfix: keep in logspace when computing posterior probability in toAFCalcResult in AFCalcResultTracker.java
-- Bugfix: use only the alleles used in genotyping when assessing if an allele is polymorphic in a sample in UnifiedGenotyperEngine
-- AFCalcFactory is the only way to make AFCalcs now. There's a nice ordered enum there describing the models and their ploidy and max alt allele restrictions. The factory makes it easy to create them, and to find models that work for you given your ploidy and max alt alleles.
-- AFCalc no longer has UAC constructor -- only AFCalcFactory does. Code cleanup throughout
-- Enabling more unit tests, almost all of which pass now (except for IndependentAllelesDiploidExactAFCalc, which will be fixed next)
-- It's now possible to run the UG / HC with any of the exact models currently in the system.
-- Code cleanup throughout the system, reorganizing the unit tests in particular
-- Continuing to get IndependentAllelesDiploidExactAFCalc working correctly. A long way towards the right answer now, but still not there
-- Restored (but not tested) OriginalDiploidExactAFCalc, the clean diploid O(N) version for Ryan
-- MathUtils.normalizeFromLog10 no longer returns -Infinity when kept in log space, enforces the min log10 value there
-- New convenience method in VariantContext that looks up the allele index in the alleles
-- AFCalcResult now sports isPolymorphic and getLog10PosteriorAFGt0ForAllele functions that allow you to ask individually whether specific alleles we've tried to genotype are polymorphic, given some confidence threshold (see the usage sketch below)
-- Lots of contracts for AFCalcResult
-- Slowly killing off AFCalcResultsTracker
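A hedged usage sketch of that per-allele API; the interfaces below are simplified stand-ins for the real GATK types, with signatures guessed from the description above:

    import java.util.List;

    // Simplified stand-ins for the real GATK types, just to make the sketch compile.
    interface Allele { String getDisplayString(); }
    interface AFCalcResult {
        boolean isPolymorphic(Allele allele, double log10Threshold);
        double getLog10PosteriorAFGt0ForAllele(Allele allele);
    }

    final class EmitDecisionSketch {
        // Emit the site if ANY genotyped alt allele is polymorphic at the threshold.
        static boolean shouldEmit(final AFCalcResult result, final List<Allele> altAlleles,
                                  final double log10Threshold) {
            for (final Allele alt : altAlleles)
                if (result.isPolymorphic(alt, log10Threshold))
                    return true;
            return false;
        }
    }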
-- Fix for the way UG checks for alt alleles being polymorphic, which is now properly conditioned on the alt allele
-- Change in behavior for normalizeFromLog10 in MathUtils: it now sets the log10 for 0 values to -10000 instead of -Infinity, since this is really better to ensure that we don't have -Infinity values traveling around the system (see the sketch below)
-- ExactAFCalculationModelUnitTest now checks for meaningful pNonRef values for each allele, uncovering a bug in the GeneralPloidy model (not fixed; related to Eric's summation issue from long ago that was reverted), in that we get different results for the diploid and general-ploidy == 2 models for multi-allelics.
-- All of the code now uses the AFCalc object, not the now package-protected AFCalcResultTracker. Nearly all unit tests pass (except for one failing a contract, due to -Infinity values from normalizeLog10, which will be dealt with in a subsequent commit).
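A minimal sketch of a floored log-space normalization consistent with the -10000 behavior described above (implementation details are assumptions, not the actual MathUtils code):

    // Normalize log10 probabilities while staying in log space, flooring
    // zero-probability entries at -10000 rather than letting them hit -Infinity.
    final class Log10NormalizeSketch {
        static double[] normalizeFromLog10(final double[] log10Values) {
            final double LOG10_FLOOR = -10000.0;
            double max = Double.NEGATIVE_INFINITY;
            for (final double v : log10Values) max = Math.max(max, v);
            double sum = 0.0;
            for (final double v : log10Values)
                sum += Math.pow(10.0, v - max); // subtract the max for numerical stability
            final double log10Sum = max + Math.log10(sum);
            final double[] normalized = new double[log10Values.length];
            for (int i = 0; i < log10Values.length; i++)
                normalized[i] = Math.max(log10Values[i] - log10Sum, LOG10_FLOOR);
            return normalized;
        }
    }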
-- Changed the way that UnifiedGenotyper decides if the best model is non-ref. Previously looked at the MAP AC, but the MAP AC values are no longer provided by AFCalcResult. This is on purpose, because the MAP isn't a meaningful quantity for the exact model (i.e., everything is going to go to MLE AC in some upcoming commit). If you want to understand why come talk to me. Now uses the isPolymorphic function and the EMIT confidence, so that if pNonRef > EMIT then the site is poly, otherwise it's mono.
-- Renamed old class AFCalcResultTracker. This object is now allocated by the AFCalc itself, since it is heavy-weight and was badly optimized in the UG with a thread-local variable. Now, since there's already a AFCalc thread-local there, we get that optimization for free.
-- Removed the interface to provide the AFCalcResultTracker to getlog10PNonRef.
-- Wrote new, clean but unused AFCalcResult object that will soon replace the tracker as the external interface to the AFCalc model results, leaving the tracker as an internal tracker structure. This will allow me to (1) finally test things exhaustively, as the contracts on this class are clear (2) finalize the IndependentAllelesDiploidExactAFCalc class as it can work with a meaningfully defined result across each object
-- This model separates each of the N alt alleles, combines the genotype likelihoods into the X/X, X/N_i, and N_i/N_i biallelic cases, and runs the exact model on each independently to handle the multi-allelic case. This is very fast, scaling as O(n.alt.alleles x n.samples)
-- Many outstanding TODOs in order to truly pass unit tests
-- Added proper unit tests for the pNonRef calculation, which all of the models pass
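A hedged sketch of the per-allele pooling step for diploid log10 GLs in the usual AA, AB, BB, AC, BC, CC ordering. Pooling by log10-summing the GLs that carry 0, 1, or 2 copies of the target allele is one plausible reading of the description, not necessarily the exact GATK rule:

    // Pool a multi-allelic diploid GL vector into the biallelic { X/X, X/B, B/B }
    // vector for target alt allele B, where X = "any allele that isn't B".
    // Diploid GL index for genotype (j,k) with j <= k is k*(k+1)/2 + j; allele 0 is ref.
    final class BiallelicPoolingSketch {
        static double[] combineGLsForAllele(final double[] log10GLs, final int nAlleles, final int targetAllele) {
            final double[] pooled = { Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY };
            for (int k = 0; k < nAlleles; k++)
                for (int j = 0; j <= k; j++) {
                    final int copies = (j == targetAllele ? 1 : 0) + (k == targetAllele ? 1 : 0);
                    pooled[copies] = log10SumLog10(pooled[copies], log10GLs[k * (k + 1) / 2 + j]);
                }
            return pooled; // { log10 GL(X/X), log10 GL(X/B), log10 GL(B/B) }
        }

        static double log10SumLog10(final double a, final double b) {
            if (a == Double.NEGATIVE_INFINITY) return b;
            if (b == Double.NEGATIVE_INFINITY) return a;
            final double max = Math.max(a, b);
            return max + Math.log10(Math.pow(10.0, a - max) + Math.pow(10.0, b - max));
        }
    }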
-- Now contained in a package called afcalc
-- Extracted standalone classes from private static classes in ExactAF
-- Most fields are now private, with accessors
-- Overall cleaner organization now
-- Now there's no duplication between the old exact and constrained models. The behavior is controlled by an overloaded abstract function
-- No more static function to access the linear exact model -- you have to create the surrounding class. Updated code in the system
-- Everything passes unit tests
-- Walks over the genotypes in the VC and computes, for each alt allele, the maximum AC we need to consider in that alt allele's dimension. Does the calculation based on the PLs of each genotype g, updating the max AC for the alt alleles corresponding to the best PL. Only takes the first lowest PL if there are multiple genotype configurations with the same PL value. It takes values in the order of the alt alleles. (See the sketch below.)
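A hedged sketch of that per-genotype scan for the diploid case (names illustrative):

    // For each genotype, find the FIRST lowest PL, decode its diploid allele
    // pair (j,k) from the linear index k*(k+1)/2 + j, and add that genotype's
    // alt-allele copies to the per-allele max-AC vector (allele 0 is the ref).
    final class MaxACFromPLsSketch {
        static int[] maxACsToConsider(final int[][] genotypePLs, final int nAltAlleles) {
            final int[] maxACs = new int[nAltAlleles]; // ordered as the alt alleles are
            for (final int[] pls : genotypePLs) {
                int best = 0;
                for (int i = 1; i < pls.length; i++)
                    if (pls[i] < pls[best]) best = i; // strict <: keeps the first lowest PL
                int k = 0;
                while ((k + 1) * (k + 2) / 2 <= best) k++; // largest k with k*(k+1)/2 <= best
                final int j = best - k * (k + 1) / 2;
                if (j > 0) maxACs[j - 1]++;
                if (k > 0) maxACs[k - 1]++;
            }
            return maxACs;
        }
    }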
Validation of GenomeLocs in the FilePointer class was extremely inefficient
when the GenomeLocs were added one at a time rather than all at once.
Appears to mostly fix GSA-604
-- Right now the state of the AFCalcResult can be corrupt (i.e., log10 likelihoods can be -Infinity). This forced me to disable reasonable contracts. Needs to be thought through
-- exactCallsLog should be optional
-- Update UG integration tests as the calculation of the normalized posteriors is done in a marginally different way so the output is rounded slightly differently.
-- UnifiedGenotyperEngine no longer keeps a thread local double[2] array for the normalized posteriors array. This is way heavy-weight compared to just making the array each time.
-- Added getNormalizedPosteriorOfAFGTZero and getNormalizedPosteriorOfAFzero to AFResult object. That's the place it should really live
-- Add tests for priors, uncovering bugs in the contracts of the tri-allelic priors w.r.t. the AC of the MAP. Added TODOs
-- AFResult now tracks the number of evaluations (turns through the model calculation), so we can now compute the scaling of the exact model itself as a function of the number of samples
-- Added unit tests for priors (flat and human)
-- Discovered nasty general ploidy bug (enabled with Guillermo_FIXME)