Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has it's own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric.
Metrics are generalized by the Metric interface. (for generic access)
Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)
-- In case there are no informative bases in a pileup but pileup isn't empty (like when all bases have Q < min base quality) the GLs were still computed (but were all zeros) and fed to the exact model. Now, mimic case of diploid Gl computation where GLs are only added if # good bases > 0
-- I believe general case where only non-informative GLs are fed into AF calc model is broken and yields bogus QUAL, will investigate separately.
* Make most classes final, others package local
* Move to diagnostics.diagnosetargets package
* Aggregate statistics and walker classes on the same package for simplified visibility.
* Make status list a LinkedList instead of a HashSet
A plugin enabled implementation of DiagnoseTargets
Summarized Changes:
-------------------
* move argument collection into Thresholder object
* make thresholder object private member of all statistics classes
* rework the logic of the mate pairing thresholds
* update unit and integration tests to reflect the new behavior
* Implements Locus Statistic plugins
* Extend Locus Statistic plugins to determine sample status
* Export all common plugin functionality into utility class
* Update tests accordingly
[fixes#48465557]
* remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps.
* add gatkdocs to all the parameters
* clean up the logic on callable status a bit (still need to be re-worked into a plugin system)
* update integration tests
This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply.
[fixes#48442663]
Problem
-------
Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval
Solution
--------
Only allow low median coverage check if there are reads
[fixes#48442675]
Problem
-------
Diagnose targets was skipping intervals when they were not covered by any reads.
Solution
--------
Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of teh BAM file.
Summarized Changes
------------------
* Outputs all intervals it iterates over, even if uncovered
* Outputs leftover intervals in the end of the traversal
* Updated integration tests
[fixes#47813825]
-- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases. In this case we'd eliminate the reference source vertex. Fixed by checking for this condition and aborting the simplification
-- Update MD5s, including minor improvements
-- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC. Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data
-- Reduce the min usable base qual to 8 by default in the HC. In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL. This is a bit too aggressive of a requirement, so I lowered it to 8.
-- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller
NA12878 MEM chr20:10-11
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
branch SNPS 1216 0 2 194 0
branch INDELS 312 2 13 71 7
master SNPS 1214 0 4 194 1
master INDELS 309 2 16 71 10
-- Update MD5s in the integration tests to reflect these two new changes
* Moved redundant code out of UGEngine
* Added overloaded methods that assume p=0.5 for speed efficiency
* Added unit test for the binomialCumulativeProbability method
The Problem:
Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
regions in an exome) causes RR (with default settings) to consider it a variant region. This
seriously hurts compression performance.
The Solution:
1. We now use a probabilistic model for determining whether we can create a consensus (in other
words, whether we can error correct a site) instead of the old ratio threshold. We calculate
the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
2. We also allow het compression globally, not just at known sites. So if we cannot create a
consensus at a given site then we try to perform het compression; and if we cannot perform het
compression that we just don't reduce the variant region. This way very wonky regions stay
uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
locus get het compressed.
Details:
1. -minvar is now deprecated in favor of -min_pvalue.
2. Added integration test for bad pvalue input.
3. -known argument still works to force het compression only at known sites; if it's not included
then we allow het compression anywhere. Added unit tests for this.
4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
Before finalizing het compression, we now check for insertions or other variant regions (usually due
to multi-allelics) which can render a region incompressible (and we back out if we find one). We
were checking for excessive softclips before, but now we add these tests too.
5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
out instead of backing out all of them.
6. We no longer create a mini read at the stop of the variant window for het compression. Instead, we
allow it to be part of the next global consensus.
7. The coverage test is no longer run systematically on all integration tests because the quals test
supercedes it. The systematic quals test is now much stricter in order to catch bugs and edge cases
(very useful!).
8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
for a consensus was affected by good and bad bases/reads).
9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
with insertions from a header.
Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples. This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes. The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects. After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model. It's worse than the current version in both single and multiple samples:
1000G EUR samples:
10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
per_sample SNPS 50 0 5 8 1
per_sample INDELS 6 0 7 2 1
pooled SNPS 49 0 6 8 1
pooled INDELS 5 0 8 2 1
100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
per_sample SNPS 144 0 22 28 1
per_sample INDELS 28 1 16 9 11
pooled SNPS 143 0 23 28 1
pooled INDELS 27 1 17 9 11
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
-- VariantRecalibrator now emits plots with denormlized values (original values) instead of their normalized (x - mu / sigma) which helps to understand the distribution of values that are good and bad
-- It's useful to know which sites have been used in the training of the model. The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE
-- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels
Problem
--------
the logless HMM scale factor (to avoid double under-flows) was 10^300. Although this serves the purpose this value results in a complex mantissa that further complicates cpu calculations.
Solution
---------
initialize with 2^1020 (2^1023 is the max value), and adjust the scale factor accordingly.
-- The PairHMM no longer allows us to create haplotypes with 0 bases. The UG indel caller used to create such haplotypes. Now we assign -Double.MAX_VALUE likelihoods to such haplotypes.
-- Add integration test to cover this case, along with private/testdata BAM
-- [Fixes#47523579]
The Problem
----------
Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result.
Solution
--------
The HaplotypeCaller only initializes the PairHMM once and then feed it with a set of reads and haplotypes. The PairHMM always caches the matrix when the previous haplotype length is the same as the current one. This is not true when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes.
Summarized Changes
------------------
* Added the recacheReadValue check to flush the matrix (hapStartIndex = 0)
* Updated related MD5's
Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9
-- Decreasing the match value means that we no longer think that ACTG vs. ATCG is best modeled by 1M1D1M1I1M, since we don't get so much value for the middle C match that we can pay two gap open penalties to get it.
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:
ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
in which the alt and ref haplotypes differ in having indel at both the start and end of the bubble. The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in the either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble.
ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls. I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.
-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat sysmatically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all
General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly. Depreciated old call to inline constants. This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default. Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
-- Ensure that BQSR works properly for an Ion Torrent BAM. (Added integration test and bam)
-- Improve the error message when a unknown platform is found (integration test added)
-- old algorithm was O(kmerSize * readLen) for each read. New algorithm is O(readLen)
-- Added real unit tests for the addKmersFromReads to the graph. Using a builder is great because we can create a MockBuilder that captures all of the calls, and then verify that all of the added kmers are the ones we'd expect.
-- The previous creation algorithm used the following algorithm:
for each kmer1 -> kmer2 in each read
add kmers 1 and 2 to the graph
add edge kmer1 -> kmer2 in the graph, if it's not present (does check)
update edge count by 1 if kmer1 -> kmer2 already existed in the graph
-- This algorithm had O(reads * kmers / read * (getEdge cost + addEdge cost)). This is actually pretty expensive because get and add edges is expensive in jgrapht.
-- The new approach uses the following algorithm:
for each kmer1 -> kmer2 in each read
add kmers 1 and 2 to a kmer counter, that counts kmer1+kmer2 in a fast hashmap
for each kmer pair 1 and 2 in the hash counter
add edge kmer1 -> kmer2 in the graph, if it's not present (does check) with multiplicity count from map
update edge count by count from map if kmer1 -> kmer2 already existed in the graph
-- This algorithm ensures that we add very much fewer edges
-- Additionally, created a fast kmer class that lets us create kmers from larger byte[]s of bases without cutting up the byte[] itself.
-- Overall runtimes are greatly reduced using this algorith
-- When the alignments are sufficiently apart from each other all the scores in the sw matrix could be negative which screwed up the max score calculation since it started at zero.
-- The previous version would enter into an infinite loop in the case where we have a graph that looks like:
X -> A -> B
Y -> A -> B
So that the incoming vertices of B all have the same sequence. This would cause us to remodel the graph endless by extracting the common sequence A and rebuilding exactly the same graph. Fixed and unit tested
-- Additionally add a max to the number of simplification cycles that are run (100), which will throw an error and write out the graph for future debugging. So the GATK will always error out, rather than just go on forever
-- After 5 rounds of simplification we start keeping a copy of the previous graph, and then check if the current graph is actually different from the previous graph. Equals here means that all vertices have equivalents in both graphs, as do all edges. If the two graphs are equal we stop simplifying. It can be a bit expensive but it only happens when we end up cycling due to the structure of the graph.
-- Added a unittest that goes into an infinite loop (found empirically in running the CEU trio) and confirmed that the new approach aborts out correctly
-- #resolves GSA-924
-- See https://jira.broadinstitute.org/browse/GSA-924 for more details
-- Update MD5s due to change in assembly graph construction
-- HC now throws a UserException if this model is provided. Documented this option as not being supported in the HC in the docs for EXACT_GENERAL_PLOIDY
-- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5]. The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding. Now GATKSAMRecord decodes the tag into the proper depths vector so that getReduceCounts() returns what one reasonably expects it to, and getReduceCount(i) merely looks up the value at i. Added unit test to ensure this behavior going forward.
-- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.
-- Extension increased to 200 bp
-- Min prune factor defaults to 0
-- LD merging enabled by default for complex variants, only when there are 10+ samples for SNP + SNP merging
-- Active region trimming enabled by default
-- The kbest paths algorithm now takes an explicit set of starting and ending vertices, which is conceptually cleaner and works for either the cycle or no-cycle models. Allowing cycles can be re-enabled with an HC command line switch.
-- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes. This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype. All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).
-- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension). Radically speeds up calculations when using large active region extensions. The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extend as possible. The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes
-- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter. The previous error corrector was just broken (conceptually) and was disabled by default in the engine. Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java
-- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler
-- outgoingVerticesOf and incomingVerticesOf return a list not a set now, as the corresponding values must be unique since our super directed graph doesn't allow multiple edges between vertices
-- Make DeBruijnGraph, SeqGraph, SeqVertex, and DeBruijnVertex all final
-- Cache HashCode calculation in BaseVertex
-- Better docs before the pruneGraph call
-- The previous version of the head merging (and tail merging to a lesser degree) would inappropriately merge source and sinks without sufficient evidence to do so. This would introduce large deletion events at the start / end of the assemblies. Refcatored code to require 20 bp of overlap in the head or tail nodes, as well as unit tested functions to support this.
-- Goes through the graph looking for chains to zip, accumulates the vertices of the chains, and then finally go through and updates the graph in one big go. Vastly more efficient than the previous version, but unfortunately doesn't actually work now
-- Also incorporate edge weight propagation into SeqGraph zipLinearChains. The edge weights for all incoming and outgoing edges are now their previous value, plus the sum of the internal chain edges / n such edges
-- Moved R^2 LD haplotype merging system to the utils.haplotype package
-- New LD merging only enabled with HC argument.
-- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enabled it via a static variable
-- A few misc. bug fixes in LDMerger itself
-- Refactoring of Haplotype event splitting and merging code
-- Renamed EventExtractor to EventMap
-- EventMap has a static method that computes the event maps among n haplotypes
-- Refactor Haplotype score and base comparators into their own classes and unit tested them
-- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it.
-- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator. For those who haven't seen this testing idiom, have a look, and very useful
-- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population.
-- Fixed fundamental bug in the way the previous R^2 implementation worked
-- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are
-- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed. Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events
-- Bugfixes for EventMap in the HaplotypeCaller as well. Previous version was overly restrictive, requiring that the first event to make into a block substitution was a snp. In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M. The new code supports this. UnitTested and documented as well. LDMerger handles case where merging two alleles results in a no-op event. Merging CA/C + A/AA -> CAA/CAA -> no op. Handles this case by removing the two events. UnitTested
-- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled
-- This new version does a much more specific test (that's actually right). Here's the new algorithm:
* Compute probability that two variants are in phase with each other and that no
* compound hets exist in the population.
*
* Implemented as a likelihood ratio test of the hypothesis:
*
* x11 and x22 are the only haplotypes in the populations
*
* vs.
*
* all four haplotype combinations (x11, x12, x21, and x22) all exist in the population.
*
* Now, since we have to have both variants in the population, we exclude the x11 & x11 state. So the
* p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22).
*
* Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are:
*
* - P(x11 & x12 & x21) -- we have hom-ref and both hets
* - P(x22 & x12 & x21) -- we have hom-alt and both hets
* - P(x22 & x12) -- one haplotype is 22 and the other is het 12
* - P(x22 & x21) -- one haplotype is 22 and the other is het 21
Problem:
--------
PairHMM was generating positive likelihoods (even after the re-work of the model)
Solution:
---------
The caching idices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM.
Summarized Changes:
------------------
* Redesign the matrices to have only 1 col/row of padding instead of 2.
* PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start)
* Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version)
* Adjust the prior and probability matrices to be one based (logless)
* Update Log10PairHMM to work with prior and probability matrices as well
* Move prior and probability matrices to parent class
* Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations
* Simple cleanup of PairHMMUnitTest class for a little speedup
* Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base).
* Create static indices for the transition probabilities (for better readability)
[fixes#47399227]
* As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270
* This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF
* Added integration test to confirm failure with User Error
* Removed illegal header line in KB test VCF that was causing related tests to fail.
-- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create
duplicate active regions.
-- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside
of the current interval.
-- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.
-- Graphs with cycles from the bottom node to one of the middle nodes would introduce an infinite cycle in the algorithm. Created unit test that reproduced the issue, and then fixed the underlying issue.
-- Only try to genotype PASSing records in the alleles file
-- Don't attempt to genotype multiple records with the same start location. Instead take the first record and throw a warning message.
-- Sometimes it's desireable to specify a set of "good" regions and filter out other stuff (like say an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. New argument -filterNotInMask will reverse default logic and filter outside of a given mask.
-- Added integration test, and made sure we also test with a BED rod.
* Moved to protected for packaging purposes.
* Cleaned up and removed debugging output.
* Fixed logic for epsilons so that we really only test significant differences between BAMs.
* Other small fixes (e.g. don't include low quality reduced reads in overall qual).
* Most RR integration tests now automatically run the quals test on output.
* A few are disabled because we expect them to fail in various locations (e.g. due to downsampling).
The Problem:
------------
the SAM spec does not allow multiple @PG tags with the same id. Our @PG tag writing routines were allowing that to happen with the boolean parameter "keep_all_pg_records".
How this fixes it:
------------------
This commit removes that option from all the utility functions and cleans up the code around the classes that used these methods off-spec.
Summarized changes:
-------------------
* Remove keep_all_pg_records option from setupWriter utility methos in Util
* Update all walkers to now replace the last @PG tag of the same walker (if it already exists)
* Cleanup NWaySamFileWriter now that it doesn't need to keep track of the keep_all_pg_records variable
* Simplify the multiple implementations to setupWriter
Bamboo:
-------
http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL31
Issue Tracker:
--------------
[fixes 47100885]
-- Corrected logic to pick biallelic vc to left align.
-- Added integration test to make sure this feature is tested and feature to trim bases is also tested.
The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because:
# Initial conditions were not being set properly
# Emission probabilities in the last row were not adding up to 1
The following commit fixes both by
# averaging all potential start locations (giving an equal prior to the state machine in it's first iteration -- allowing the read to start it's alignment anywhere in the haplotype with equal probability)
# discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state)
Summarized changes:
* Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm
* Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1.
* Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now)
* Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity
* Rename metric lengths to read and haplotype lengths for clarity
* Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity
* Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix
* Remove unnecessary parameters from updateCell()
* Fix the expected probabilities coming from the exact model in PairHMMUnitTest
* Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables)
* Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s.
[fix 47164949]
that are tested), resulting in slightly different numbers of calls to the RNG, and ultimately
different sets of selected variants.
This commits updates the md5 values for the validation site selector integration test to reflect
these new random subsets of variants that are selected.
-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken by complext records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-aligneable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.
Currently, the multi-allelic test is covering the following case:
Eval A T,C
Comp A C
reciprocate this so that the reverse can be covered.
Eval A C
Comp A T,C
And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.
This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine:
Eval: A G,T 0/0 2/0 2/2 1/1
Comp: A C,T 0/0 1/0 0/0 0/0
Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:
Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)
Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.
Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed:
- dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
- vectorSum
- array shuffle, random subset
- countOccurances (general forms, the char form is used in the codebase)
- getNMaxElements
- array permutation
- sorted array permutation
- compare floats
- sum() (for integer arrays and lists).
Final keyword was extensively added to MathUtils.
The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).
The PaperGenotyper still made use of one of these array methods, since it was the only walker it was migrated into the genotyper itself.
In addition, more extensive tests were added for
- logBinomialCoefficient (Newton's identity should always hold)
- logFactorial
- log10sumlog10 and its approximation
All unit tests pass
-- These new algorithms are more powerful than the restricted diamond merging algoriths, in that they can merge nodes with multiple incoming and outgoing edges. Together the splitter + merger algorithms will correctly merge many more cases than the original headless and tailless diamond merger.
-- Refactored haplotype caller infrastructure into graphs package, code cleanup
-- Cleanup new merging / splitting algorithms, with proper docs and unit tests
-- Fix bug in zipping of linear chains. Because the multiplicity can be 0, protect ourselves with a max function call
-- Fix BaseEdge.max unit test
-- Add docs and some more unit tests
-- Move error correct from DeBruijnGraph to DeBruijnAssembler
-- Replaced uses of System.out.println with logger.info
-- Don't make multiplicity == 0 nodes look like they should be pruned
-- Fix toString of Path
-- Previous algorithms were applying pruneGraph inappropriately on the raw sequence graph (where each vertex is a single base). This results in overpruning of the graph, as prunegraph really relied on the zipping of linear chains (and the sharing of weight this provides) to avoid over-pruning the graph. Probably we should think hard about this. This commit fixes this logic, so we zip the graph between pruning
-- In this process ID's a fundamental problem with how we were trimming away vertices that occur on a path from the reference source to sink. In fact, we were leaving in any vertex that happened to be accessible from source, any vertices in cycles, and any vertex that wasn't the absolute end of a chain going to a sink. The new algorithm fixes all of this, using a BaseGraphIterator that's a general approach to walking the base graph. Other routines that use the same traversal idiom refactored to use this iterator. Added unit tests for all of these capabilities.
-- Created new BaseGraphIterator, which abstracts common access patterns to graph, and use this where appropriate
-- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative.
-- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads. All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as they reads only span a single event. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting)
-- Though not intended, it was possible to create reference graphs with cycles in the case where you started the graph with a homopolymer of length > the kmer. The previous test would fail to catch this case. Now its not possible
-- Lots of code cleanup and refactoring in this push. Split the monolithic createGraphFromSequences into simple calls to addReferenceKmersToGraph and addReadKmersToGraph which themselves share lower level functions like addKmerPairFromSeqToGraph.
-- Fix performance problem with reduced reads and the HC, where we were calling add kmer pair for each count in the reduced read, instead of just calling it once with a multiplicity of count.
-- Refactor addKmersToGraph() to use things like addOrUpdateEdge, now the code is very clear
-- The previous version would generate graphs that had no reference bases at all in the situation where the reference haplotype was < the longer read length, which would cause the kmer size to exceed the reference haplotype length. Now return immediately with a null graph when this occurs as opposed to continuing and eventually causing an error
-- The error correction algorithm can break the reference graph in some cases by error correcting us into a bad state for the reference sequence. Because we know that the error correction algorithm isn't ideal, and worse, doesn't actually seem to improve the calling itself on chr20, I've simply disabled error correction by default and allowed it to be turned on with a hidden argument.
-- In the process I've changed a bit the assembly interface, moving some common arguments us into the LocalAssemblyEngine, which are turned on/off via setter methods.
-- Went through the updated arguments in the HC to be @Hidden and @Advanced as appropriate
-- Don't write out an errorcorrected graph when debugging and error correction isn't enabled
* It is now cleaner and easier to test; added tests for newly implemented methods.
* Many fixes to the logic to make it work
* The most important change was that after triggering het compression we actually need to back it out if it
creates reads that incorporated too many softclips at any one position (because they get unclipped).
* There was also an off-by-one error in the general code that only manifested itself with het compression.
* Removed support for creating a het consensus around deletions (which was broken anyways).
* Mauricio gave his blessing for this.
* Het compression now works only against known sites (with -known argument).
* The user can pass in one or more VCFs with known SNPs (other variants are ignored).
* If no known SNPs are provided het compression will automatically be disabled.
* Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
strandedness from normal reduced reads.
* GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
* This allows us to update the FisherStrand annotation to count het compressed reduced reads
towards the FS calculation.
* [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
backwards compatible.]
* Updated integration tests accordingly with new het compressed bams (both for RR and UG).
* In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
RR properly, so I fixed it too.
* Also, the test in the UG engine for determining whether there are too many overlapping
deletions is updated to handle RR.
* I added a special hook in the RR integration tests to additionally run the systematic
coverage checking tool I wrote earlier.
* AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
not lost from original to reduced bam.
* This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
from all but 1 sample (now fixed).
* AssessReducedCoverage moved from private to protected for packaging reasons.
* #resolve GSA-639
At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
-- Generalizes previous node merging and splitting approaches. Can split common prefixes and suffices among nodes, build a subgraph representing this new structure, and incorporate it into the original graph. Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons). Fully UnitTested. These new algorithms can now really simplify diamond configurations as well as ones sources and sinks that arrive / depart linearly at a common single root node.
-- This new suite of algorithms is fully integrated into the HC, replacing previous approaches
-- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph. This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements. The new version simply uses a read filter to remove them upfront. This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
-- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph. If the number of ops would exceed this threshold, the original graph is used.
-- Overall the algorithm is just extremely computational expensive, and actually doesn't implement the correct correction. So we live with this limitations while we continue to explore better algorithms
-- Updating MD5s to reflect changes in assembly algorithms
-- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices. Fixed. In the process simplified the entire algorithm and codebase
-- Resolves https://jira.broadinstitute.org/browse/GSA-884
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful
-- Don't clone sequence upon construction or in getSequence(), as these are frequently called, memory allocating routines and cloning will be prohibitively expensive
-- UnitTest for isRootOfDiamond along with key bugfix detected while testing
-- Fix up the equals methods in BaseEdge. Now called hasSameSourceAndTarget and seqEquals. A much more meaningful naming
-- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs
-- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence
-- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself
-- Massive expansion of the SeqGraph unit tests. We now really test out the zipping and branch merging code.
-- Near final cleanup of the current codebase
-- DeBruijnVertex cleanup and optimizations. Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[]. Optimize the code to make use of this constraint
-- Only minor differences, with improvement in allele discovery where the sites differ. The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome
-- Split Path from inner class of KBestPaths
-- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation
-- Path now properly typed throughout the code
-- Path maintains a on-demand hashset of BaseEdges so that path.containsEdge is fast