In general, test classes cannot use 3rd-party libraries that are not
also dependencies of the GATK proper without causing problems when,
at release time, we test that the GATK jar has been packaged correctly
with all required dependencies.
If a test class needs to use a 3rd-party library that is not a GATK
dependency, write wrapper methods in the GATK utils/* classes, and
invoke those wrapper methods from the test class.
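A minimal illustration of the wrapper pattern (hypothetical names; the "third-party" call is stood in for by the stdlib `java.util.Base64` class so the sketch compiles on its own — in practice the wrapped call would come from the extra library):

```java
// Hypothetical sketch of the wrapper pattern described above. Test classes call
// only the wrapper; the wrapped library stays a dependency of GATK proper, so
// it is guaranteed to be packaged into the release jar.
public class TestLibraryWrappers {
    /** Wrapper that test classes invoke instead of touching the library directly. */
    public static String encodeForTest(final byte[] data) {
        // Stand-in for a third-party call; here it's just the JDK Base64 encoder.
        return java.util.Base64.getEncoder().encodeToString(data);
    }
}
```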
Previously, we would strip out the PL and AD values since they were no longer accurate. However, this is not ideal, because
that information is simply lost; 1) users complain on the forum and post it as a bug, and 2) it causes problems in both
the current and future (single-sample) calling pipelines, because we subset samples/alleles all the time and lose information.
Now the PLs and ADs get correctly selected down.
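A minimal sketch of what "selecting down" diploid PLs looks like (hypothetical helper, not the actual GATK code; it relies on the VCF genotype ordering, where genotype (j,k) with j <= k sits at index k*(k+1)/2 + j, and it omits the renormalization that shifts the minimum PL back to 0):

```java
// Hypothetical sketch: subset a diploid PL array to a subset of alleles by
// picking out the genotype entries that only involve the kept alleles.
public class PLSubsetter {
    /** keptAlleleIndices are the original allele indices, in their new order. */
    public static int[] subsetDiploidPLs(final int[] pls, final int[] keptAlleleIndices) {
        final int m = keptAlleleIndices.length;
        final int[] out = new int[m * (m + 1) / 2];
        for (int k = 0; k < m; k++) {
            for (int j = 0; j <= k; j++) {
                // Map the new genotype (j,k) back to its old genotype index.
                final int oldJ = Math.min(keptAlleleIndices[j], keptAlleleIndices[k]);
                final int oldK = Math.max(keptAlleleIndices[j], keptAlleleIndices[k]);
                out[k * (k + 1) / 2 + j] = pls[oldK * (oldK + 1) / 2 + oldJ];
            }
        }
        return out;
    }
}
```

For a triallelic site with PLs [10,20,30,40,50,60], keeping alleles {0,2} selects the (0,0), (0,2), and (2,2) entries.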
While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just
broke it out into smaller chunks as per our best practices.
Added unit tests and updated integration tests.
Addressed reviews.
Adding Graph-based likelihood ratio calculation to HC
To activate this feature, add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies what engine should be used to generate read vs haplotype likelihoods.
PairHMM : standard full-PairHMM approach.
GraphBased : using the assembly graph to accelerate the process.
Random : generate random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
Indicates how to merge haplotypes produced using different kmerSizes.
Only has an effect when used in combination with --likelihoodCalculationEngine GraphBased.
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the largest kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with the haplotypes assembled using it.
MAX_ONLY : use the largest kmerSize with the haplotypes assembled using it.
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSizes are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to support
local PairHMM calculations (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: new PairHMM implementation that uses heuristics to speed up partial PairHMM calculations.
* GraphBasedLikelihoodCalculationEngine: delegates the execution of the graph-based likelihood
approach to GraphBasedLikelihoodCalculationEngineInstance.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active region; implements the graph traversals
to calculate the likelihoods using the graph as a scaffold.
* HaplotypeGraph: haplotype threading graph built from the assembly haplotypes. This is the structure
GraphBasedLikelihoodCalculationEngineInstance uses to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information about how a read maps onto the HaplotypeGraph,
which GraphBasedLikelihoodCalculationEngineInstance uses to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed problems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed unnecessary parameter from ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private -> protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed problem with HaplotypeGraph unit test
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.
Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
while not converged:
* AC = [external AC] + [sample AC]
* Prior = DirichletMultinomial[AC]
* Posteriors = [sample GL + Prior]
* sample AC = MLEAC(Posteriors)
This is more useful for large callsets with small panels than for small callsets with large panels -- the latter being the more common use case.
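For the implemented ploidy=2 case, the Dirichlet-multinomial prior has a simple closed form; a sketch (hypothetical helper, not the walker's actual code), with the Dirichlet parameters taken as the panel's observed allele counts for a biallelic site:

```java
// Hypothetical sketch: Dirichlet-multinomial genotype priors for a diploid,
// biallelic site. alphaRef/alphaAlt are the panel's observed allele counts
// (optionally plus a pseudocount). With A = alphaRef + alphaAlt, two HWE draws
// give P(hom-ref) = aR(aR+1)/(A(A+1)), P(het) = 2*aR*aA/(A(A+1)), etc.
public class DirichletMultinomialPrior {
    /** Returns {P(hom-ref), P(het), P(hom-alt)}; the three values sum to 1. */
    public static double[] diploidPriors(final double alphaRef, final double alphaAlt) {
        final double total = alphaRef + alphaAlt;
        final double norm = total * (total + 1.0);          // A(A+1)
        return new double[] {
            alphaRef * (alphaRef + 1.0) / norm,             // both draws ref
            2.0 * alphaRef * alphaAlt / norm,               // one of each (2 orderings)
            alphaAlt * (alphaAlt + 1.0) / norm              // both draws alt
        };
    }
}
```

With panel counts of 99 ref and 1 alt, the prior strongly favors hom-ref (about 0.980) over het (about 0.0196) and hom-alt (about 0.0002).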
Fully unit tested.
Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
Motivation:
The API was different between the regular PairHMM and the FPGA-implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to account for this. The goal is to change the API to be the same
for all implementations, and make it easier to access.
PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotypes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops over the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.
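The shape of the new entry point can be sketched roughly like this (hypothetical, simplified types; the real method delegates each pair to computeReadLikelihoodGivenHaplotypeLog10, which the stand-in scoring below does not reproduce):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the unified PairHMM entry point: one call takes all
// reads and haplotypes and returns the full per-read, per-haplotype map.
public class PairHMMApiSketch {
    public static Map<String, Map<String, Double>> computeLikelihoods(
            final List<String> reads, final List<String> haplotypes) {
        final Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (final String read : reads) {
            final Map<String, Double> perHaplotype = new LinkedHashMap<>();
            for (final String hap : haplotypes) {
                // Stand-in for computeReadLikelihoodGivenHaplotypeLog10():
                // penalize mismatches so the loop has something concrete to do.
                perHaplotype.put(hap, -1.0 * mismatches(read, hap));
            }
            result.put(read, perHaplotype);
        }
        return result;
    }

    private static int mismatches(final String a, final String b) {
        final int n = Math.min(a.length(), b.length());
        int d = Math.abs(a.length() - b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }
}
```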
CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.
LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods.
Commented out the (unused) system for finding best N haplotypes for genotyping.
PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
We have generalized the processing script to be able to handle multiple scenarios. Originally it was
designed for PCR-free data only; we added all the steps necessary to start from fastq and process
RNA-seq as well as non-human data. This is our go-to script in TechDev.
* add optional "starting from fastq" path to the pipeline
* add mark duplicates (optionally) to the pipeline
* add an option to run with the mouse data (without dbsnp and with single ended fastq)
* add option to process RNA-seq data from TopHat (add RG and reassign mapping quality if necessary)
* add option to filter or include reads with N in the cigar string
* add parameter to allow keeping the intermediate files
-- We use the RegenotypeVariants walker to recompute the qual field (instead of the previously discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
Quick fix for the missing column header in the QualifyMissingIntervals
report.
Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode (not perfect yet; turn off with --dontTrimActiveRegions)
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block
-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
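The -combineAnnotations median merge reduces, for each INFO annotation, the per-input values to a single number; a minimal sketch of that reduction (hypothetical helper, not the actual CombineVariants code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of merging one INFO annotation across input VCFs
// by taking the median of the per-file values.
public class MedianAnnotationMerge {
    public static double median(final List<Double> values) {
        final List<Double> sorted = new ArrayList<>(values);  // copy, then sort
        Collections.sort(sorted);
        final int n = sorted.size();
        return (n % 2 == 1)
            ? sorted.get(n / 2)                                       // odd count: middle value
            : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;      // even count: mean of middle two
    }
}
```

The median is a natural choice here because it is robust to a single outlier annotation value among the inputs.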
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.
Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.
This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.
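The 1D-array idea is the classic rolling-row trick; this sketch shows it on plain edit distance rather than the HMM itself: two reusable rows replace the full matrix, at the cost of having to save any intermediate values you still need before they are overwritten.

```java
// Illustration of the memory-saving idea (not the actual ARRAY_LOGLESS PairHMM):
// a dynamic program over a full matrix can often be computed with two 1D rows
// that are swapped and overwritten as the computation advances.
public class RollingRowDP {
    public static int editDistance(final String a, final String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;  // distance from the empty prefix
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                final int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            final int[] tmp = prev; prev = curr; curr = tmp;  // reuse rows instead of a full matrix
        }
        return prev[b.length()];
    }
}
```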
We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)
Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
-This was a dependency of the test suite, but not the GATK proper,
which caused problems when running the test suite on the packaged
GATK jar at release time
-Use GATKVCFUtils.readVCF() instead
-Switch to using new GSA AWS account for storage of phone home data
-Use DNS-compliant bucket names, as per Amazon's best practices
-Encrypt publicly-distributed version of credentials. Grant only PutObject
permission, and only for the relevant buckets.
-Store non-distributed credentials in private/GATKLogs/newAWSAccountCredentials
for now -- need to integrate with existing python/shell scripts
later to get the log downloading working with the new account
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing of generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Eric's comments