Same changes fixed the problem for GenotypeGVCFs and CombineGVCFs.
Stories:
- https://www.pivotaltracker.com/story/show/77626044
- https://www.pivotaltracker.com/story/show/77626854
Changes:
- Generalized the code for the merging in GATKVariantContextUtils to cope
with ploidy != 2.
- GenotypeGVCFs now checks that the input's ploidy conforms to the '-ploidy'
argument.
- Moved out Reference Confidence VC merging code from GATKVariantContextUtils
so that we can keep new code in protected.
Caveats:
- GenotypeGVCFs can only deal with input files that have the same ploidy in
all positions; the user MUST indicate that ploidy in the -ploidy argument
(if different from the default of 2).
- CombineGVCFs won't necessarily complain if it is passed mixed-ploidy
inputs, but you won't be able to genotype the result with GenotypeGVCFs.
Test:
- Removed deprecated unit tests for GATKVariantContextUtils.
- Moved unit-tests regarding GVCF merging from GATKVariantContextUtilsUnitTest
to ReferenceConfidenceVariantContextUtilsUnitTest.
- Added unit test for new code for mapping genotype indices between allele
index encoding in GenotypeLikelihoodCalculator.
- The original GenotypeGVCFs and CombineGVCFs integration tests are unaffected
by the change.
- Added tetraploid run integration tests to check on non-diploid execution
of GenotypeGVCFs and CombineGVCFs.
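
For readers unfamiliar with why ploidy != 2 needs special handling, here is a minimal sketch. It assumes nothing about the real code: PloidySketch, genotypeCount and checkPloidy are hypothetical names, and a genotype is reduced to its ploidy. The number of genotypes (and therefore PL entries) per site is C(alleles + ploidy - 1, ploidy), which is only n(n+1)/2 in the diploid case.

```java
import java.util.List;

/** Hypothetical helper for illustration; not the GenotypeGVCFs code. */
final class PloidySketch {

    private PloidySketch() {}

    /**
     * Number of distinct unordered genotypes for a given ploidy and allele count,
     * i.e. the binomial coefficient C(alleleCount + ploidy - 1, ploidy).
     */
    static long genotypeCount(final int ploidy, final int alleleCount) {
        if (ploidy < 0 || alleleCount < 1) {
            throw new IllegalArgumentException("ploidy must be >= 0 and alleleCount >= 1");
        }
        long result = 1;
        for (int i = 1; i <= ploidy; i++) {
            // After this step result == C(alleleCount - 1 + i, i); the division is always exact.
            result = result * (alleleCount - 1 + i) / i;
        }
        return result;
    }

    /**
     * Rejects inputs whose ploidy differs from the requested one.
     *
     * @param genotypePloidies ploidy of each input genotype (assumed representation).
     * @param expectedPloidy value of the '-ploidy' argument (2 when not specified).
     */
    static void checkPloidy(final List<Integer> genotypePloidies, final int expectedPloidy) {
        for (int i = 0; i < genotypePloidies.size(); i++) {
            final int ploidy = genotypePloidies.get(i);
            if (ploidy != expectedPloidy) {
                throw new IllegalArgumentException(String.format(
                        "genotype %d has ploidy %d but -ploidy is %d", i, ploidy, expectedPloidy));
            }
        }
    }
}
```

With three alleles, a diploid sample has 6 possible genotypes and a tetraploid sample has 15, so any merging code that hardcodes the diploid layout breaks on tetraploid input.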
Changed tests and scripts to use gatkdir full path instead of relative testdata/qscripts symbolic links.
Although the symlinks are no longer created, left the symlink deletion script execution in place with a comment about future removal.
Re-enabled example UG pipeline queue test.
Replaced all hardcoded strings of {public,private}/testdata with BaseTest variables.
Refactored temp list creation method from ListFileUtilsUnitTest to BaseTest.createTempListFile.
Removed list files with hardcoded paths, now using createTempListFile instead with private test dir variable.
We do this for technical reasons, mostly because we don't genotype in the HC anymore; it's all
done downstream by GenotypeGVCFs so we can't be sure that the genotype will be hom var. Also,
there are steps in the downstream pipeline where genotypes can change, so assuming anything in
the HC is a bad idea, and if we have phasing info in the het state, we want to propagate that forward.
Now, PGT tag fixing happens downstream in GenotypeGVCFs.
While I was in there I also cleaned up the code a bit and fixed a bug where annotation was happening
before genotype creation when using the --includeNonVariantSites argument.
Added tests accordingly.
* This is a shortcut for people who have multi-sample BAMs but would like to use GVCF mode. Rather than creating single-sample BAMs with PrintReads, one could use the --sample_name argument to HaplotypeCaller to specify the single sample to make calls on
* Completes PT 73075482
Story:
https://www.pivotaltracker.com/story/show/77250524
Changes:
- Remove the annotating code in GeneralPloidyExactAFCalc (GPEAFC) class.
- Added asAlleleList to the GenotypeAlleleCounts class and changed GPEAFC to use it instead of implementing its own (nicer and more reusable code).
- Removed the explicit addition of AlleleCountBySample fields to the VCF header by the walker initialize
- Added utility methods in Utils to wrap an int[] array into a List<Integer>, and a double[] array into a List<Double>, efficiently.
Test:
- Added unit-testing for asAlleleList in GenotypeAlleleCountsUnitTest (within testFirst and testNext).
- Added unit-testing for new methods in Utils : asList(int[]) and asList(double[])
- Changed UG General Ploidy test to add explicitly those annotations.
- Non-trivial changes in integration tests involving non-diploid runs (namely haploid and tetraploid) as they no longer show
those annotations, so the MD5s have been changed accordingly.
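
One way to wrap a primitive array into a List without copying it is a read-only view built on AbstractList. The sketch below is an assumption about the general approach, not the actual Utils code; the class name ArrayViews is hypothetical.

```java
import java.util.AbstractList;
import java.util.List;

/** Hypothetical sketch of array-backed list views; not the actual Utils class. */
final class ArrayViews {

    private ArrayViews() {}

    /** Returns an unmodifiable view of the array: no copy, elements are boxed only on access. */
    static List<Integer> asList(final int[] values) {
        if (values == null) {
            throw new IllegalArgumentException("the input array cannot be null");
        }
        return new AbstractList<Integer>() {
            @Override
            public Integer get(final int index) {
                return values[index];
            }

            @Override
            public int size() {
                return values.length;
            }
        };
    }

    /** Same idea for double[] arrays. */
    static List<Double> asList(final double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("the input array cannot be null");
        }
        return new AbstractList<Double>() {
            @Override
            public Double get(final int index) {
                return values[index];
            }

            @Override
            public int size() {
                return values.length;
            }
        };
    }
}
```

Because the list is a view, later writes to the array are visible through it. AbstractList leaves set and add unsupported, so the view cannot be modified through the List interface.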
It turns out that there can be some really complex situations even with a single sample where
there are lots of unphasable hets around a hom. Previously we were trying to phase each of the
hets against the hom, but that wasn't correct. Instead we now detect that situation and don't
attempt to phase anything.
Added a unit test to cover this situation.
New annotation for low- and high-confidence de novos (only annotates biallelics)
FamilyLikelihoodsUtils now adds joint likelihood and joint posterior annotations
Restrict population priors based on discovered allele count to be valid for 10 or more samples.
VariantAnnotator/FS behavior changes slightly: VA used to output zeros for FS if there was no strand bias info; it now skips FS output (but FS will still appear in the header)
Changes in several walkers to use the new sample and allele closed lists and the new GenotypingEngine constructor signatures
Rebase adoption of new calculation system in walkers
1. It is now turned on by default
2. It now phases homozygous variants
3. Most importantly, it also phases variants that are always on opposite haplotypes
Changed the INFO keys to be PID and PGT, as described in the header.
If any pair of variants occurs on all used haplotypes together, then we propagate that information into the gVCF.
Can be enabled with the --tryPhysicalPhasing argument.
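
The phasing rule can be pictured with a simplified sketch. This is an illustration under stated assumptions, not the HaplotypeCaller logic: haplotypes are modelled as sets of variant identifiers, and the class, enum and method names are all hypothetical.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of pairwise physical phasing; not the HaplotypeCaller implementation. */
final class PhysicalPhasingSketch {

    enum Relation { SAME_HAPLOTYPE, OPPOSITE_HAPLOTYPES, UNPHASED }

    private PhysicalPhasingSketch() {}

    /**
     * Classifies a pair of variants by how they co-occur across the used haplotypes.
     *
     * @param haplotypes each haplotype as the set of variant identifiers it carries (assumed model).
     */
    static Relation classify(final List<Set<String>> haplotypes, final String first, final String second) {
        int both = 0;
        int onlyFirst = 0;
        int onlySecond = 0;
        for (final Set<String> haplotype : haplotypes) {
            final boolean hasFirst = haplotype.contains(first);
            final boolean hasSecond = haplotype.contains(second);
            if (hasFirst && hasSecond) {
                both++;
            } else if (hasFirst) {
                onlyFirst++;
            } else if (hasSecond) {
                onlySecond++;
            }
        }
        // Always seen together: the two variants share a haplotype.
        if (both > 0 && onlyFirst == 0 && onlySecond == 0) {
            return Relation.SAME_HAPLOTYPE;
        }
        // Each is seen, but never together: they sit on opposite haplotypes.
        if (both == 0 && onlyFirst > 0 && onlySecond > 0) {
            return Relation.OPPOSITE_HAPLOTYPES;
        }
        return Relation.UNPHASED;
    }
}
```

Any mixed evidence (a pair seen together on one haplotype and apart on another) yields UNPHASED in this sketch, which matches the conservative spirit of the messages here.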
Stories:
https://www.pivotaltracker.com/story/show/70222086
https://www.pivotaltracker.com/story/show/67961652
Changes:
Made some changes that I had previously missed to make sure that all PairHMM implementations use the same interface; because they were missing, we were always running the standard PairHMM.
Fixed some additional bugs detected when running on full WGS single-sample and exome multi-sample data sets.
Updated some integration test md5s.
Fixing GraphBased bugs with new master code
Fixed a difficult-to-spot bug in ReadLikelihoods.changeReads.
Changed PairHMM interface to fix a bug
Fixed missing changes for various PairHMM implementations to get them to use the new structure.
Fixed various bugs only detectable when running with full sample(s).
I believe this fixes the lack of annotations in UG runs
Fixed integration test MD5s
Updating some md5s
Fixed yet another md5 probably left out by mistake
The array structure should be faster to populate and query (not properly benchmarked) and should reduce the memory footprint considerably.
Nevertheless, after removing the PairHMM factor (using likelihoodEngine Random), it only achieves a speed-up of 15% on an example WGS dataset,
i.e. there are other, bigger bottlenecks in the system. Bamboo tests also seem to run significantly faster with this change.
Stories:
https://www.pivotaltracker.com/story/show/70222086
https://www.pivotaltracker.com/story/show/67961652
Changes:
- ReadLikelihoods added to substitute Map<String,PerSampleReadLikelihoods>
- Operations that involve changes in full sets of ReadLikelihoods have been moved into that class.
- Simplified the code that handles the downsampling of reads based on contamination a bit.
Caveats:
- We still keep Map<String,PerReadAlleleLikelihoodsMap> around to pass to annotators; I didn't feel like changing the interface of so many public classes in this pull request.
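
To illustrate why an array layout can be cheaper than nested maps (primitive doubles, no boxing, no hashing), here is a minimal sketch of a per-sample likelihood matrix indexed by allele and read. The class name and layout are assumptions for illustration; this is not the ReadLikelihoods implementation.

```java
import java.util.Arrays;

/** Hypothetical per-sample likelihood matrix; not the actual ReadLikelihoods class. */
final class SampleLikelihoodMatrix {

    // values[alleleIndex][readIndex] holds a log10 likelihood.
    private final double[][] values;

    SampleLikelihoodMatrix(final int alleleCount, final int readCount) {
        if (alleleCount < 1 || readCount < 0) {
            throw new IllegalArgumentException("need at least one allele and a non-negative read count");
        }
        values = new double[alleleCount][readCount];
        for (final double[] row : values) {
            Arrays.fill(row, Double.NEGATIVE_INFINITY); // log10(0): no support until set
        }
    }

    void set(final int alleleIndex, final int readIndex, final double log10Likelihood) {
        values[alleleIndex][readIndex] = log10Likelihood;
    }

    double get(final int alleleIndex, final int readIndex) {
        return values[alleleIndex][readIndex];
    }

    /** Index of the allele with the highest likelihood for the given read (first one wins ties). */
    int bestAlleleIndex(final int readIndex) {
        int best = 0;
        for (int allele = 1; allele < values.length; allele++) {
            if (values[allele][readIndex] > values[best][readIndex]) {
                best = allele;
            }
        }
        return best;
    }
}
```

Lookups become two array indexings instead of two hash lookups, at the cost of having to maintain stable allele and read indices elsewhere.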
In particular, it was possible to specify arguments for Files or Compound types without values
Added a special "none" value for annotations, since a bare "-A" is no longer allowed
Delivers PT 71792842 and 59360374
Story:
https://www.pivotaltracker.com/story/show/73440292
Changes:
- Just add the conditional in HaplotypeCaller#initialize
Testing:
- Nothing added, checked locally, trivial change that would eventually be removed anyway.
Don't expand out source nodes for tail merging, since that's a head merging action only.
This shows up as a bug only because we now allow merging tails against non-reference paths.
- Edited intervals merging docs for correctness & clarity
- Edited VQSR arg docs and made mode required (+added -mode SNP to VQSR tests)
- Moved PaperGenotyper to Toy Walkers to declutter the actually useful docs
- Moved GenotypeGVCFs to Variant Discovery category and clarified a few points
- Clarified that the -resource argument depends on using the -V:tag format
- Clarified how the pcr indel model works
- Added caveat for -U ALLOW_N_CIGAR_READS
- Added MathJax support for displaying equations in GATKDocs
- Updated HC example commands and caveats
This is useful, for example, in cases where there are SNPs on insertions. Previously, tails were forced to be merged
(incorrectly) only to a reference node, but now they can be merged to any path in the graph from which they
directly branch.
Also, I've transferred over Ryan's code to refuse to process kmer sizes for which the reference sequence
contains non-unique kmers.
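
The kmer-size check can be sketched as follows. This is a minimal illustration that treats the reference as a plain String; the class and method names are hypothetical and it is not the transferred code itself.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a kmer uniqueness check; not the actual assembler code. */
final class KmerChecks {

    private KmerChecks() {}

    /** Returns true if any kmer of the given size occurs more than once in the reference. */
    static boolean hasNonUniqueKmers(final String reference, final int kmerSize) {
        if (kmerSize <= 0) {
            throw new IllegalArgumentException("kmerSize must be positive");
        }
        final Set<String> seen = new HashSet<>();
        for (int start = 0; start + kmerSize <= reference.length(); start++) {
            // Set.add returns false when the kmer was already present.
            if (!seen.add(reference.substring(start, start + kmerSize))) {
                return true;
            }
        }
        return false;
    }
}
```

For the reference ACGTACGA, size 3 is rejected because ACG occurs twice, while sizes 4 and 5 are accepted.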
-- Global mismapping penalty was only applied to the reference haplotype. This led to problems with overlapping events, mostly STR haplotypes. Now the penalty is applied to every haplotype.
-- We subset the reads down to only those which overlap the event (after assembly based realignment) for likelihood calculations.
In these cases, where the alignment contains multiple indels, we output a single complex
variant instead of the multiple partial indels.
We also re-enable dangling tail recovery by default.
-- AD,DP will now correspond directly to the reads that were used to construct the PLs
-- RankSumTests, etc. will use the bases from the realigned reads instead of the original alignments
-- There is now no additional runtime cost to realign the reads when using bamout or GVCF mode
-- bamout mode no longer sets the mapping quality to zero for uninformative reads; instead, the read will not be given an HC tag
(Right now it only works if all members of the trio are called.)
Takes posteriors as input, defaulting to PLs
Added annotations for possible de novos for use in the full genotype refinement pipeline
Added family priors to CGP integration test.
Changed CGP to use PP tag instead of GP tag because posteriors are Phred-scaled. Updated CGP integration test md5s to reflect change.
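
As a reminder of what Phred-scaled means here, the sketch below converts linear posteriors into rounded -10*log10 values. Normalising so that the best genotype scores 0 follows the PL convention and is an assumption about the PP values, as are the class and method names.

```java
/** Hypothetical sketch of Phred scaling; not the CalculateGenotypePosteriors code. */
final class PhredSketch {

    private PhredSketch() {}

    /**
     * Converts linear-scale posteriors into Phred-scaled integers, normalised so that
     * the most likely genotype gets 0 (assumed convention, as used for PLs).
     */
    static int[] toPhredScaled(final double[] posteriors) {
        double best = 0.0;
        for (final double posterior : posteriors) {
            if (Double.isNaN(posterior) || posterior < 0.0) {
                throw new IllegalArgumentException("posteriors must be non-negative numbers");
            }
            best = Math.max(best, posterior);
        }
        if (best <= 0.0) {
            throw new IllegalArgumentException("at least one posterior must be positive");
        }
        final int[] result = new int[posteriors.length];
        for (int i = 0; i < posteriors.length; i++) {
            final double phred = -10.0 * Math.log10(posteriors[i] / best);
            // A zero posterior gives +Infinity; cap it at the largest int.
            result[i] = phred > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) Math.round(phred);
        }
        return result;
    }
}
```

Higher values mean less likely genotypes, which is the opposite direction from linear-scale probabilities and the reason the two representations should not share a tag.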
- New arguments are nda, hets, indelHeterozygosity, stand_call_conf, stand_emit_conf, ploidy, and maxAltAlleles
- Addresses PT 70110918
- To do this, moved those arguments out of the StandardCallerArgumentCollection into a new GenotypeCalculationArgumentCollection, which is now included as a member of SCAC
-They are now only computed when necessary
-Log10Cache is dynamically resizable, either by calling get() on an out-of-range value or by calling ensureCacheContains
-Log10FactorialCache and JacobianLogTable are initialized to a fixed size on first access and are not resizable
-Addresses PT 69124396
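
A lazily grown cache of this kind could look like the sketch below. It is instance-based and not thread-safe to keep it short; the growth policy (at least doubling) is an assumption, and only the names get and ensureCacheContains come from the message above.

```java
import java.util.Arrays;

/** Hypothetical sketch of a resizable log10 cache; not the actual MathUtils code. */
final class Log10Cache {

    // cache[n] == log10(n); starts with only log10(0) = -Infinity.
    private double[] cache = new double[] { Double.NEGATIVE_INFINITY };

    /** Returns log10(n), growing the cache first if n is out of range. */
    double get(final int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative");
        }
        ensureCacheContains(n);
        return cache[n];
    }

    /** Grows the cache so that it covers at least 0..n. */
    void ensureCacheContains(final int n) {
        if (n < cache.length) {
            return;
        }
        // Assumed policy: at least double the size to amortise repeated growth.
        final int newSize = Math.max(n + 1, 2 * cache.length);
        final double[] expanded = Arrays.copyOf(cache, newSize);
        for (int i = cache.length; i < newSize; i++) {
            expanded[i] = Math.log10(i);
        }
        cache = expanded;
    }

    int size() {
        return cache.length;
    }
}
```

Nothing beyond the first entry is computed until a value is requested, which is the point of computing the caches only when necessary.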
-Make BaseTest.createTempFile() mark any possible corresponding index files for deletion on exit
-Make WalkerTest mark shadow BCF files and auxiliary for deletion on exit
-Make VariantRecalibrationWalkersIntegrationTest mark PDF files for deletion on exit
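
The cleanup idea can be sketched with File.deleteOnExit, which is safe to call for files that may never be created. The suffix list and the class name below are assumptions for illustration; the real BaseTest.createTempFile may cover different index types.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of temp-file cleanup; not the actual BaseTest code. */
final class TempFileSketch {

    // Assumed index suffixes, for illustration only.
    private static final String[] INDEX_SUFFIXES = { ".idx", ".bai" };

    private TempFileSketch() {}

    /**
     * Creates a temp file and marks it, plus any index file that a tool might later
     * write next to it, for deletion when the JVM exits.
     *
     * @param prefix at least three characters long, as required by File.createTempFile.
     */
    static File createTempFile(final String prefix, final String extension) {
        try {
            final File file = File.createTempFile(prefix, extension);
            file.deleteOnExit();
            for (final String suffix : INDEX_SUFFIXES) {
                // Registering a path that does not exist yet is harmless.
                new File(file.getAbsolutePath() + suffix).deleteOnExit();
            }
            return file;
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```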
-- disabling HC+VA integration test because, as noted in the comments, it keeps switching PairHMM implementations and giving different results at a particular site used in that particular test
Stories:
- https://www.pivotaltracker.com/story/show/69577868
Changes:
- Added an epsilon difference tolerance in weight comparisons.
Tests:
- Added HaplotypeCallerIntegrationTest#testDifferentIndelLocationsDueToSWExactDoubleComparisonsFix
- Updated md5 due to minor likelihood changes.
- Disabled a test for PathUtils.calculateCigar since it does not work and it is unclear what is causing the error (needs original author input)
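
The epsilon tolerance can be illustrated with a minimal comparison helper. The tolerance value and all names are hypothetical; the actual constant used in the fix is not stated here.

```java
/** Hypothetical sketch of a tolerant weight comparison; not the actual fix. */
final class WeightComparison {

    // Assumed tolerance, for illustration only.
    static final double EPSILON = 1e-6;

    private WeightComparison() {}

    /** Returns 0 when the weights differ by at most EPSILON, otherwise -1 or 1. */
    static int compareWeights(final double first, final double second) {
        final double difference = first - second;
        if (Math.abs(difference) <= EPSILON) {
            return 0;
        }
        return difference < 0 ? -1 : 1;
    }
}
```

An exact comparison treats 0.1 + 0.2 and 0.3 as different, so a tie between two alignments can be broken by rounding noise alone; the tolerant version reports a tie. Note that such a comparison is not transitive, so it is better suited to pairwise decisions than to sorting.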
To reduce merge conflicts, this commit modifies contents of files, while file renamings are in previous commit.
See previous commit message for list of changes.
To reduce merge conflicts, this commit only renames files, while file modifications are in next commit.
Some updates/fixes here are actually included in the next commit.
= Maven updates
Moved artifacts to new package names:
* private/queue-private -> private/gatk-queue-private
* private/gatk-private -> private/gatk-tools-private
* public/gatk-package -> protected/gatk-package-distribution
* public/queue-package -> protected/gatk-queue-package-distribution
* protected/gatk-protected -> protected/gatk-tools-protected
* public/queue-framework -> public/gatk-queue
* public/gatk-framework -> public/gatk-tools-public
New poms for new artifacts and packages:
* private/gatk-package-internal
* private/gatk-queue-package-internal
* private/gatk-queue-extensions-internal
* protected/gatk-queue-extensions-distribution
* public/gatk-engine
Updated references to StingText.properties to GATKText.properties.
Updated ant-bridge.sh to use gatk.* properties instead of sting.*.
= Engine updates
Renaming files containing engine parts from o.b.gatk.tools to o.b.gatk.engine.
Changed package references from tools to engine for CommandLineGATK, GenomeAnalysisEngine, ReadMetrics, ReadProperties, and WalkerManager.
Changed package reference tools.phonehome to engine.phonehome.
Renamed classes *Sting* to *GATK*, such as ReviewedGATKException.
= Test updates
Moved gatk example resources.
Moved test engine files from tools to engine packages.
Moved resources for phonehome to proper package.
Moved test classes under o.b.gatk into packages:
* o.b.g.utils.{BaseTest,ExampleToCopyUnitTest,GATKTextReporter,MD5DB,MD5Mismatch,TestNGTestTransformer}
* o.b.g.engine.walkers.WalkerTest
Updated package names in DependencyAnalyzerOutputLoaderUnitTest's data.
= Queue updates
Moving queue scripts to location where generated extensions can be used.
Renamed *.q to *.scala, updating licenses previously missed by git hooks.
Moved queue extensions to new artifact gatk-queue-extensions.
Fixed import statements that frequently caused merge conflicts in FullProcessingPipeline.scala.
= BWA
Added README on how to obtain and include bwa as a library.
Updated libbwa build.
Fixed package names under the bwa/java implementation.
Updated contents of BWCAligner native implementation.
= Other fixes
Don't duplicate the resource bundle entries by both unpacking *and* appending.
(partial fix) Staged engine and utils poms to build GATKText.properties, once Utils random generator dependency on GATK engine is fixed.
Re-enabled custom testng listeners/reporters and moved testng dependencies to the gatk-root.
Updated comments referencing Sting with GATK.
Moved a couple untangled classes from gatk-tools-public to gatk-utils and gatk-engine.