-Added docs for ERC mode in HC
-Move RecalibrationPerformance walker since to private since it is experimental and unsupported
-Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them
-Improved error msg for conflict between per-interval aggregation and -nt
-Minor clean up in exception docs
-Added Toy Walkers category for devs and dev supercat (to build out docs for developers)
-Added more detailed info to GenotypeConcordance doc based on Chris forum post
-Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples)
-Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it)
-Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.
For VariantContext tags with a list, by default the tags for the -V argument are applied
unless overridden by the individual line. In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path). For example:
foo.vcf
VCF,name=bar bar.vcf
Note that a VCF list file name must end with '.list'.
Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elemments is plit to k+1 reads. The split is done by hard clipping the bases rest of the bases.
In order to do it, few changes were introduced to some other clipping methods:
- make a segnificant change in ClippingOp.hardClip() that prevent the spliting of read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns
create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils
create integration test for that class
small change in a comment in FullProcessingPipeline
last commit:
Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficiant
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)
- fix ReadUtilsUnitTest.testReadWithNs acoording to the defult behaviour of getReadCoordinateForReferenceCoordinate (in case of refernce index that fall into deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs
- Allow the user to print the split positions (not working proparly currently)
* add overall walker GATKDocs
* add explanation for skip parameter and make it advanced
* reverse the logic on exculding unmapped reads for clarity
* fix read length calculation to no longer include indels
ps: I am not sure how useful this walker is (I didn't write it) but the skip logic is poor and
calculates the entire statistic for the reads it is eventually going to skip. This would be an easy
fix, but only worth our time if people actually use this.
The update contains:
1. documentation changes for VariantContext and Allele (which used to discuss the now obsolete null allele)
2. better error messages for VCFs containing complex rearrangements with breakends
3. instead of failing badly on format field lists with '.'s, just ignore them
Also, there is a trivial change to use a more efficient method to remove a bunch of attributes from a VC.
Delivers PT#s 59675378, 59496612, and 60524016.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine
In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants. I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.
Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
-You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes
to @Argument annotations for command-line arguments
-"minValue" and "maxValue" specify hard limits that generate an exception if violated
-"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated
-Works only for numeric arguments (int, double, etc.) with @Argument annotations
-Only considers values actually specified by the user on the command line, not default values
assigned in the code
As requested by Geraldine
In general, test classes cannot use 3rd-party libraries that are not
also dependencies of the GATK proper without causing problems when,
at release time, we test that the GATK jar has been packaged correctly
with all required dependencies.
If a test class needs to use a 3rd-party library that is not a GATK
dependency, write wrapper methods in the GATK utils/* classes, and
invoke those wrapper methods from the test class.
Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.
Now the PLs and AD get correctly selected down.
While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just
broke it out into smaller chunks as per our best practices.
Added unit tests and updated integration tests.
Addressed reviews.
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies what engine should be used to generate read vs haplotype likelihoods.
PairHMM : standard full-PairHMM approach.
GraphBased : using the assembly graph to accelarate the process.
Random : generate random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the larger kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to spport
local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations
* GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
of the graph-based likelihood approach.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
to calcualte the likelihoods using the graph as an scafold.
* HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed probrems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private - protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed probrem with HaplotypeGraph unit test
Adding Graph-based likelihood ratio calculation to HC
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies what engine should be used to generate read vs haplotype likelihoods.
PairHMM : standard full-PairHMM approach.
GraphBased : using the assembly graph to accelarate the process.
Random : generate random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the larger kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to spport
local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations
* GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
of the graph-based likelihood approach.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
to calcualte the likelihoods using the graph as an scafold.
* HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed probrems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private - protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed probrem with HaplotypeGraph unit test
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.
Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
while not converged:
* AC = [external AC] + [sample AC]
* Prior = DirichletMultinomial[AC]
* Posteriors = [sample GL + Prior]
* sample AC = MLEAC(Posteriors)
This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase.
Fully unit tested.
Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
Motivation:
The API was different between the regular PairHMM and the FPGA-implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to use account for this. The goal is to change the API to be the same
for all implementations, and make it easier to access.
PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotpes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.
CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.
LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods
Commented out the (unused) system for finding best N haplotypes for genotyping.
PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
We have generalized the processing script to be able to handle multiple scenarios. Originally it was
designed for PCR free data only, we added all the steps necessary to start from fastq and process
RNA-seq as well as non-human data. This is our go to script in TechDev.
* add optional "starting from fastq" path to the pipeline
* add mark duplicates (optionally) to the pipeline
* add an option to run with the mouse data (without dbsnp and with single ended fastq)
* add option to process RNA-seq data from topHat (add RG and reassign mapping quality if necessary)
* add option to filter or include reads with N in the cigar string
* add parameter to allow keeping the intermediate files
-- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
Quick fix the missing column header in the QualifyMissingIntervals
report.
Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block
-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.
Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.
This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.
We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)
Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel