Here are the directory moves made with git, in case other files need to be moved during a merge:
git mv private/java/src/ private/gatk-private/src/main/java/
git mv private/R/scripts/ private/gatk-private/src/main/resources/
git mv private/java/test/ private/gatk-private/src/test/java/
git mv private/testdata/ private/gatk-private/src/test/resources/
git mv private/scala/qscript/ private/queue-private/src/main/qscripts/
git mv private/scala/src/ private/queue-private/src/main/scala/
git mv protected/java/src/ protected/gatk-protected/src/main/java/
git mv protected/java/test/ protected/gatk-protected/src/test/java/
git mv public/java/src/ public/gatk-framework/src/main/java/
git mv public/java/test/ public/gatk-framework/src/test/java/
git mv public/testdata/ public/gatk-framework/src/test/resources/
git mv public/scala/qscript/ public/queue-framework/src/main/qscripts/
git mv public/scala/src/ public/queue-framework/src/main/scala/
git mv public/scala/test/ public/queue-framework/src/test/scala/
Changes:
-------
<NON_REF> likelihood at variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for each read, it is calculated as the second-best likelihood amongst the reported alleles.
When -ERC gVCF is used, stand_conf_emit and stand_conf_call are forcefully set to 0. Also, dontGenotype is set to false for consistency's sake.
Integration test MD5s have been updated accordingly.
Additional fix:
--------------
Especially after adding the <NON_REF> allele, but also happening without it, QUAL values tend to go to 0 (a very large number in log10 space) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix this, combineGLs has been replaced by combineGLsPrecise, which uses the log-sum-exp trick.
In just a few cases this change results in genotype changes in integration tests, but after double-checking with unit tests and the difference between combineGLs and combineGLsPrecise in the affected integration tests, the previous GT calls were either borderline cases or due to the underflow.
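For reference, a minimal sketch of the log-sum-exp trick in log10 space, as combineGLsPrecise presumably applies it (the helper below is illustrative, not the actual GATK code):

    // Computes log10(sum_i 10^values[i]) without underflowing to -infinity
    // when every term is a very small probability: factor out the maximum.
    static double log10SumLog10(final double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (final double v : values)
            max = Math.max(max, v);
        if (max == Double.NEGATIVE_INFINITY)
            return max; // all terms are zero probabilities
        double sum = 0.0;
        for (final double v : values)
            sum += Math.pow(10.0, v - max); // each term is now in (0, 1]
        return max + Math.log10(sum);
    }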
Problem:
The matchToMatch transition calculation was wrong, resulting in transition probabilities coming out of the Match state that summed to more than 1.
Reports:
https://www.pivotaltracker.com/s/projects/793457/stories/62471780
https://www.pivotaltracker.com/s/projects/793457/stories/61082450
Changes:
The transition matrix update code has been moved to a common place in PairHMMModel to eliminate its multiple duplicated copies.
The matchToMatch transition calculation has been fixed and implemented in PairHMMModel (see the sketch below).
Affected integration test MD5s have been updated; there were no differences in GT fields, and the sampled differences always amounted to
small changes in likelihoods, which is what is expected.
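A sketch of the corrected invariant (assuming Phred-scaled insertion/deletion gap-open qualities; names are illustrative, not the PairHMMModel API): the three transitions leaving the Match state must sum to exactly 1.

    // Match->Insertion and Match->Deletion come from the gap-open qualities;
    // Match->Match takes the remainder so the row sums to 1. A formulation
    // like (1 - mToI) * (1 - mToD) overshoots the row sum by mToI * mToD,
    // which is the kind of bug described above.
    static double[] matchTransitions(final byte insQual, final byte delQual) {
        final double matchToInsertion = Math.pow(10.0, -insQual / 10.0);
        final double matchToDeletion = Math.pow(10.0, -delQual / 10.0);
        final double matchToMatch = 1.0 - (matchToInsertion + matchToDeletion);
        return new double[] { matchToMatch, matchToInsertion, matchToDeletion };
    }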
Problem: the codec was written to take in consensus pileups produced with the pileup -c option (which consist of 10 or 13 fields per line depending on the variant type) but errored out on the basic pileup format (which has only 6 fields per line). This was inconsistent and confusing to users.
Solution: I added a switch in the parsing to recognize and handle both cases appropriately, and updated the related docs. While I was at it I also improved the error messages in CheckPileup, which now emits User Error: Bad Input exceptions when reporting mismatches. This may not be the best thing to do (ultimately these are not really errors, they are just reporting unwelcome results), but it beats emitting RuntimeExceptions.
Tested by CheckPileupIntegrationTest which tests both format cases.
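A hypothetical sketch of the field-count switch (the counts match the formats described above, but this is not the codec's actual parsing code):

    // Basic pileup lines have 6 tab-separated fields (chrom, pos, ref, depth,
    // bases, quals); consensus pileups from 'pileup -c' have 10 or 13.
    static boolean isConsensusPileup(final String line) {
        final int nFields = line.split("\t").length;
        if (nFields == 6)
            return false;
        if (nFields == 10 || nFields == 13)
            return true;
        throw new IllegalArgumentException("Unrecognized pileup line with " + nFields + " fields");
    }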
-Added docs for ERC mode in HC
-Moved the RecalibrationPerformance walker to private since it is experimental and unsupported
-Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them
-Improved error msg for conflict between per-interval aggregation and -nt
-Minor clean up in exception docs
-Added Toy Walkers category for devs and dev supercat (to build out docs for developers)
-Added more detailed info to GenotypeConcordance doc based on Chris's forum post
-Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples)
-Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it)
-Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.
For VariantContext tags with a list, by default the tags given to the -V argument are applied,
unless overridden by an individual line. In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path). For example:
foo.vcf
VCF,name=bar bar.vcf
Note that a VCF list file name must end with '.list'.
Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
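A sketch of the per-line convention (illustrative only; defaultTags stands for the tags given to the -V argument):

    // Each line of a '.list' file is either "path.vcf" or "TAGS path.vcf"
    // (e.g. "VCF,name=bar bar.vcf"); a line without tags inherits the defaults.
    static String[] parseListLine(final String line, final String defaultTags) {
        final String[] tokens = line.trim().split("\\s+", 2);
        return tokens.length == 1
                ? new String[] { defaultTags, tokens[0] }
                : new String[] { tokens[0], tokens[1] };
    }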
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elements is split into k+1 reads. The split is done by hard-clipping the rest of the bases (see the sketch after this list).
In order to do this, a few changes were introduced to some other clipping methods:
- made a significant change in ClippingOp.hardClip() that prevents incorrect splitting of a read with a cigar such as 1M2I1N1M3I.
- changed getReadCoordinateForReferenceCoordinate in ReadUtils to recognize Ns
Created unit tests for that walker:
- changed ReadClipperTestUtils to be more general, in order to reuse its code and avoid duplication
- moved some useful methods from ReadClipperTestUtils to CigarUtils
Created an integration test for that class.
Made a small change to a comment in FullProcessingPipeline.
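The promised sketch of the split, in cigar-element space (purely illustrative; the walker operates on reads and realizes each segment by hard-clipping everything outside it):

    import java.util.ArrayList;
    import java.util.List;

    // Splitting a cigar such as [3M, 2N, 4M, 1N, 5M] on its N elements yields
    // the segments [3M], [4M], [5M]: k N elements produce k+1 segments.
    static List<List<String>> splitOnN(final List<String> cigarElements) {
        final List<List<String>> segments = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (final String element : cigarElements) {
            if (element.endsWith("N")) { // skipped region: close the current segment
                segments.add(current);
                current = new ArrayList<>();
            } else {
                current.add(element);
            }
        }
        segments.add(current);
        return segments;
    }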
last commit:
Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficient
- make minor changes to some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to account for possible Ns in the cigar
- address the rest of the review comments (minor changes)
- fix ReadUtilsUnitTest.testReadWithNs according to the default behaviour of getReadCoordinateForReferenceCoordinate (in the case of a reference index that falls into a deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs
- allow the user to print the split positions (not working properly currently)
* add overall walker GATKDocs
* add explanation for skip parameter and make it advanced
* reverse the logic on excluding unmapped reads for clarity
* fix read length calculation to no longer include indels
P.S. I am not sure how useful this walker is (I didn't write it), but the skip logic is poor: it
calculates the entire statistic for the reads it is eventually going to skip. This would be an easy
fix, but only worth our time if people actually use this.
The update contains:
1. documentation changes for VariantContext and Allele (which used to discuss the now obsolete null allele)
2. better error messages for VCFs containing complex rearrangements with breakends
3. instead of failing badly on format field lists with '.'s, just ignore them
Also, there is a trivial change to use a more efficient method to remove a bunch of attributes from a VC.
Delivers PT#s 59675378, 59496612, and 60524016.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine
In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants. I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.
Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
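A purely illustrative skeleton of the three per-position steps (VariantContext here is a stand-in type parameter, and the three helpers are hypothetical names, not the actual GATK API):

    import java.util.List;

    // The per-position flow: merge, re-genotype, re-annotate.
    abstract class GVCFGenotyperSketch<VariantContext> {
        abstract VariantContext mergeGVCFRecords(List<VariantContext> records); // 1. all alleles, combined PLs
        abstract VariantContext regenotypeExact(VariantContext merged);         // 2. exact AF calculation model
        abstract VariantContext reannotate(VariantContext genotyped);           // 3. VariantAnnotatorEngine

        final VariantContext processPosition(final List<VariantContext> gvcfRecords) {
            return reannotate(regenotypeExact(mergeGVCFRecords(gvcfRecords)));
        }
    }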
-You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes
to @Argument annotations for command-line arguments
-"minValue" and "maxValue" specify hard limits that generate an exception if violated
-"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated
-Works only for numeric arguments (int, double, etc.) with @Argument annotations
-Only considers values actually specified by the user on the command line, not default values
assigned in the code
As requested by Geraldine
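An example use of the new attributes (the class, argument, and values below are illustrative; the attribute names are from this change and are assumed to take numeric literals, with the package path as in the GATK 2.x source tree):

    import org.broadinstitute.sting.commandline.Argument;

    public class ExampleArgumentCollection {
        // Violating minValue/maxValue throws an exception; violating the
        // recommended bounds only produces a warning.
        @Argument(fullName = "exampleDepth", doc = "Example numeric argument", required = false,
                  minValue = 0, maxValue = 10000, minRecommendedValue = 10, maxRecommendedValue = 1000)
        public int exampleDepth = 250;
    }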
In general, test classes cannot use 3rd-party libraries that are not
also dependencies of the GATK proper without causing problems when,
at release time, we test that the GATK jar has been packaged correctly
with all required dependencies.
If a test class needs to use a 3rd-party library that is not a GATK
dependency, write wrapper methods in the GATK utils/* classes, and
invoke those wrapper methods from the test class.
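A hypothetical illustration of the pattern (all names made up): the wrapper lives in a GATK utils/* class, so the library becomes a real, packaged GATK dependency, and the test class touches only the wrapper.

    // In GATK proper, e.g. under utils/:
    public final class JsonUtils {
        private JsonUtils() {}

        // The 3rd-party call is isolated here instead of in the test class.
        public static String toJson(final Object value) {
            return new com.google.gson.Gson().toJson(value);
        }
    }

    // In the test class, only the wrapper is invoked:
    //   Assert.assertEquals(JsonUtils.toJson(input), expectedJson);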
Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because
that information is then just lost: 1) users complain on the forum and post it as a bug, and 2) it gives us problems in both
the current and future (single sample) calling pipelines, because we subset samples/alleles all the time and lose info.
Now the PLs and AD get correctly selected down.
While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just
broke it out into smaller chunks as per our best practices.
Added unit tests and updated integration tests.
Addressed reviews.
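For reference, an illustrative version of diploid PL subsetting (not the actual subsetDiploidAlleles code): keep only the genotypes built from the retained alleles, then renormalize so the best genotype has PL 0.

    import java.util.ArrayList;
    import java.util.List;

    // VCF PL ordering for a diploid genotype (j,k) with j <= k is k*(k+1)/2 + j.
    // keptAlleleIndices maps each new allele index to its old allele index.
    static int[] subsetDiploidPLs(final int[] originalPLs, final int[] keptAlleleIndices) {
        final List<Integer> kept = new ArrayList<>();
        for (int k = 0; k < keptAlleleIndices.length; k++) {
            for (int j = 0; j <= k; j++) {
                final int a = Math.min(keptAlleleIndices[j], keptAlleleIndices[k]);
                final int b = Math.max(keptAlleleIndices[j], keptAlleleIndices[k]);
                kept.add(originalPLs[b * (b + 1) / 2 + a]);
            }
        }
        int min = Integer.MAX_VALUE;
        for (final int pl : kept)
            min = Math.min(min, pl);
        final int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++)
            result[i] = kept.get(i) - min; // renormalize: best genotype gets PL 0
        return result;
    }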
Adding Graph-based likelihood ratio calculation to HC
To activate this feature, add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies which engine should be used to generate read vs haplotype likelihoods.
PairHMM : the standard full-PairHMM approach.
GraphBased : uses the assembly graph to accelerate the process.
Random : generates random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
Indicates how to merge haplotypes produced using different kmerSizes.
Only has an effect when used in combination with --likelihoodCalculationEngine GraphBased.
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the largest kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with only the haplotypes assembled using it.
MAX_ONLY : use the largest kmerSize with only the haplotypes assembled using it.
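The resolution strategies rendered as an enum for reference (constant names are from the option above; the type itself is illustrative):

    enum HeterogeneousKmerSizeResolution {
        COMBO_MIN, // smallest kmerSize, all haplotypes
        COMBO_MAX, // largest kmerSize, all haplotypes
        MIN_ONLY,  // smallest kmerSize, only the haplotypes assembled at it
        MAX_ONLY   // largest kmerSize, only the haplotypes assembled at it
    }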
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSizes are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to support
local PairHMM calculations (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: new PairHMM implementation using heuristics to speed up partial PairHMM calculations.
* GraphBasedLikelihoodCalculationEngine: delegates to GraphBasedLikelihoodCalculationEngineInstance the execution
of the graph-based likelihood approach.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active region; implements the graph traversals
to calculate the likelihoods using the graph as a scaffold.
* HaplotypeGraph: haplotype threading graph built from the assembly haplotypes. This is the structure
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information on how a read maps onto the HaplotypeGraph, which is
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed problems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed an unnecessary parameter from the ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private - protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed a problem with the HaplotypeGraph unit test
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.
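A sketch of the resulting log10 prior for a diploid genotype (illustrative, not the walker's API; assumes Apache Commons Math's logGamma is available):

    import org.apache.commons.math.special.Gamma;

    // alphas: panel allele counts (the Dirichlet parameters); counts: the
    // genotype's per-allele count vector, summing to 2 for diploid.
    // Dirichlet-multinomial pmf: (2! / prod(n_k!)) * Gamma(A) / Gamma(A + 2)
    //   * prod(Gamma(alpha_k + n_k) / Gamma(alpha_k)), with A = sum(alpha_k).
    static double log10GenotypePrior(final double[] alphas, final int[] counts) {
        double alphaSum = 0.0;
        for (final double a : alphas)
            alphaSum += a;
        double log = Math.log(2.0); // log(2!): multinomial coefficient numerator
        for (final int n : counts)
            log -= Gamma.logGamma(n + 1.0); // divide by each n_k!
        log += Gamma.logGamma(alphaSum) - Gamma.logGamma(alphaSum + 2.0);
        for (int k = 0; k < alphas.length; k++)
            log += Gamma.logGamma(alphas[k] + counts[k]) - Gamma.logGamma(alphas[k]);
        return log / Math.log(10.0); // natural log -> log10
    }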
Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
while not converged:
* AC = [external AC] + [sample AC]
* Prior = DirichletMultinomial[AC]
* Posteriors = [sample GL + Prior]
* sample AC = MLEAC(Posteriors)
This is more useful for large callsets with small panels than for small callsets with large panels -- the latter being the more common use case.
Fully unit tested.
Reviewer (Eric) jumped in to address many of his own comments and removed public->protected dependencies.
Motivation:
The API was different between the regular PairHMM and the FPGA implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to account for this. The goal is to change the API to be the same
for all implementations, and to make it easier to access.
PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotypes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.
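A hedged sketch of the unified entry point's shape (types are simplified stand-ins for the GATK classes, and the result is flattened to a nested map rather than a PerReadAlleleLikelihoodMap):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // The primary method loops over reads and haplotypes and delegates each
    // pair to the per-implementation compute method, which this change left untouched.
    abstract class PairHMMSketch {
        protected abstract double computeReadLikelihoodGivenHaplotypeLog10(byte[] readBases,
                                                                           byte[] readQuals,
                                                                           byte[] haplotypeBases);

        public Map<String, Map<String, Double>> computeLikelihoods(final List<Read> reads,
                                                                   final List<Haplotype> haplotypes) {
            final Map<String, Map<String, Double>> result = new HashMap<>();
            for (final Read read : reads) {
                final Map<String, Double> perHaplotype = new HashMap<>();
                for (final Haplotype hap : haplotypes)
                    perHaplotype.put(hap.name,
                            computeReadLikelihoodGivenHaplotypeLog10(read.bases, read.quals, hap.bases));
                result.put(read.name, perHaplotype);
            }
            return result;
        }

        static final class Read { String name; byte[] bases; byte[] quals; }
        static final class Haplotype { String name; byte[] bases; }
    }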
CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.
LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods
Commented out the (unused) system for finding best N haplotypes for genotyping.
PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
We have generalized the processing script to be able to handle multiple scenarios. Originally it was
designed for PCR-free data only; we added all the steps necessary to start from fastq and process
RNA-seq as well as non-human data. This is our go-to script in TechDev.
* add optional "starting from fastq" path to the pipeline
* add mark duplicates (optionally) to the pipeline
* add an option to run with the mouse data (without dbsnp and with single-end fastq)
* add an option to process RNA-seq data from TopHat (add RG and reassign mapping quality if necessary)
* add option to filter or include reads with N in the cigar string
* add parameter to allow keeping the intermediate files
-- We use the RegenotypeVariants walker to recompute the qual field (instead of the discussed idea of adding this functionality to CombineVariants).
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. It doesn't work for multi-allelics, since the qual can't be recomputed.
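An illustrative fallback for the recomputation (assumption: when stratified contexts are missing, divide QUAL by the summed depth of the samples carrying a variant genotype):

    // QD = QUAL / total depth over variant samples; names are hypothetical.
    static double qualByDepth(final double qual, final int variantSampleDepthSum) {
        return variantSampleDepthSum > 0 ? qual / variantSampleDepthSum : 0.0;
    }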
Quick fix for the missing column header in the QualifyMissingIntervals
report.
Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.