-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken by complext records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-aligneable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.
Currently, the multi-allelic test is covering the following case:
Eval A T,C
Comp A C
reciprocate this so that the reverse can be covered.
Eval A C
Comp A T,C
And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.
This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine:
Eval: A G,T 0/0 2/0 2/2 1/1
Comp: A C,T 0/0 1/0 0/0 0/0
Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:
Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)
Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.
Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed:
- dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
- vectorSum
- array shuffle, random subset
- countOccurances (general forms, the char form is used in the codebase)
- getNMaxElements
- array permutation
- sorted array permutation
- compare floats
- sum() (for integer arrays and lists).
Final keyword was extensively added to MathUtils.
The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).
The PaperGenotyper still made use of one of these array methods, since it was the only walker it was migrated into the genotyper itself.
In addition, more extensive tests were added for
- logBinomialCoefficient (Newton's identity should always hold)
- logFactorial
- log10sumlog10 and its approximation
All unit tests pass
-- These new algorithms are more powerful than the restricted diamond merging algoriths, in that they can merge nodes with multiple incoming and outgoing edges. Together the splitter + merger algorithms will correctly merge many more cases than the original headless and tailless diamond merger.
-- Refactored haplotype caller infrastructure into graphs package, code cleanup
-- Cleanup new merging / splitting algorithms, with proper docs and unit tests
-- Fix bug in zipping of linear chains. Because the multiplicity can be 0, protect ourselves with a max function call
-- Fix BaseEdge.max unit test
-- Add docs and some more unit tests
-- Move error correct from DeBruijnGraph to DeBruijnAssembler
-- Replaced uses of System.out.println with logger.info
-- Don't make multiplicity == 0 nodes look like they should be pruned
-- Fix toString of Path
-- Previous algorithms were applying pruneGraph inappropriately on the raw sequence graph (where each vertex is a single base). This results in overpruning of the graph, as prunegraph really relied on the zipping of linear chains (and the sharing of weight this provides) to avoid over-pruning the graph. Probably we should think hard about this. This commit fixes this logic, so we zip the graph between pruning
-- In this process ID's a fundamental problem with how we were trimming away vertices that occur on a path from the reference source to sink. In fact, we were leaving in any vertex that happened to be accessible from source, any vertices in cycles, and any vertex that wasn't the absolute end of a chain going to a sink. The new algorithm fixes all of this, using a BaseGraphIterator that's a general approach to walking the base graph. Other routines that use the same traversal idiom refactored to use this iterator. Added unit tests for all of these capabilities.
-- Created new BaseGraphIterator, which abstracts common access patterns to graph, and use this where appropriate
-- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative.
-- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads. All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as they reads only span a single event. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting)
-- Though not intended, it was possible to create reference graphs with cycles in the case where you started the graph with a homopolymer of length > the kmer. The previous test would fail to catch this case. Now its not possible
-- Lots of code cleanup and refactoring in this push. Split the monolithic createGraphFromSequences into simple calls to addReferenceKmersToGraph and addReadKmersToGraph which themselves share lower level functions like addKmerPairFromSeqToGraph.
-- Fix performance problem with reduced reads and the HC, where we were calling add kmer pair for each count in the reduced read, instead of just calling it once with a multiplicity of count.
-- Refactor addKmersToGraph() to use things like addOrUpdateEdge, now the code is very clear
-- The previous version would generate graphs that had no reference bases at all in the situation where the reference haplotype was < the longer read length, which would cause the kmer size to exceed the reference haplotype length. Now return immediately with a null graph when this occurs as opposed to continuing and eventually causing an error
-- The error correction algorithm can break the reference graph in some cases by error correcting us into a bad state for the reference sequence. Because we know that the error correction algorithm isn't ideal, and worse, doesn't actually seem to improve the calling itself on chr20, I've simply disabled error correction by default and allowed it to be turned on with a hidden argument.
-- In the process I've changed a bit the assembly interface, moving some common arguments us into the LocalAssemblyEngine, which are turned on/off via setter methods.
-- Went through the updated arguments in the HC to be @Hidden and @Advanced as appropriate
-- Don't write out an errorcorrected graph when debugging and error correction isn't enabled
* It is now cleaner and easier to test; added tests for newly implemented methods.
* Many fixes to the logic to make it work
* The most important change was that after triggering het compression we actually need to back it out if it
creates reads that incorporated too many softclips at any one position (because they get unclipped).
* There was also an off-by-one error in the general code that only manifested itself with het compression.
* Removed support for creating a het consensus around deletions (which was broken anyways).
* Mauricio gave his blessing for this.
* Het compression now works only against known sites (with -known argument).
* The user can pass in one or more VCFs with known SNPs (other variants are ignored).
* If no known SNPs are provided het compression will automatically be disabled.
* Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
strandedness from normal reduced reads.
* GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
* This allows us to update the FisherStrand annotation to count het compressed reduced reads
towards the FS calculation.
* [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
backwards compatible.]
* Updated integration tests accordingly with new het compressed bams (both for RR and UG).
* In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
RR properly, so I fixed it too.
* Also, the test in the UG engine for determining whether there are too many overlapping
deletions is updated to handle RR.
* I added a special hook in the RR integration tests to additionally run the systematic
coverage checking tool I wrote earlier.
* AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
not lost from original to reduced bam.
* This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
from all but 1 sample (now fixed).
* AssessReducedCoverage moved from private to protected for packaging reasons.
* #resolve GSA-639
At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
-- Generalizes previous node merging and splitting approaches. Can split common prefixes and suffices among nodes, build a subgraph representing this new structure, and incorporate it into the original graph. Introduces the concept of edges with 0 multiplicity (for purely structural reasons) as well as vertices with no sequence (again, for structural reasons). Fully UnitTested. These new algorithms can now really simplify diamond configurations as well as ones sources and sinks that arrive / depart linearly at a common single root node.
-- This new suite of algorithms is fully integrated into the HC, replacing previous approaches
-- SeqGraph transformations are applied iteratively (zipping, splitting, merging) until no operations can be performed on the graph. This further simplifies the graphs, as splitting nodes may enable other merging / zip operations to go.
-- added calls to representativeCount() of the pileup instead of using ++
-- renamed CallableLoci integration test
-- added integration test for reduce read support on callable loci
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements. The new version simply uses a read filter to remove them upfront. This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
-- Simply don't do more than MAX_CORRECTION_OPS_TO_ALLOW = 5000 * 1000 operations to correct a graph. If the number of ops would exceed this threshold, the original graph is used.
-- Overall the algorithm is just extremely computational expensive, and actually doesn't implement the correct correction. So we live with this limitations while we continue to explore better algorithms
-- Updating MD5s to reflect changes in assembly algorithms
-- Previous version was just incorrectly accumulating information about nodes that were completely eliminated by the common suffix, so we were dropping some reference connections between vertices. Fixed. In the process simplified the entire algorithm and codebase
-- Resolves https://jira.broadinstitute.org/browse/GSA-884
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful
-- Don't clone sequence upon construction or in getSequence(), as these are frequently called, memory allocating routines and cloning will be prohibitively expensive
-- UnitTest for isRootOfDiamond along with key bugfix detected while testing
-- Fix up the equals methods in BaseEdge. Now called hasSameSourceAndTarget and seqEquals. A much more meaningful naming
-- Generalize graphEquals to use seqEquals, so it works equally well with Debruijn and SeqGraphs
-- Add BaseVertex method called seqEquals that returns true if two BaseVertex objects have the same sequence
-- Reorganize SeqGraph mergeNodes into a single master function that does zipping, branch merging, and zipping again, rather than having this done in the DeBruijnAssembler itself
-- Massive expansion of the SeqGraph unit tests. We now really test out the zipping and branch merging code.
-- Near final cleanup of the current codebase
-- DeBruijnVertex cleanup and optimizations. Since kmer graphs don't allow sequences longer than the kmer size, the suffix is always a byte, not a byte[]. Optimize the code to make use of this constraint
-- Only minor differences, with improvement in allele discovery where the sites differ. The test of an insertion at the start of the MT no longer calls a 1 bp indel at position 0 in the genome
-- Split Path from inner class of KBestPaths
-- Use google MinMaxPriorityQueue to track best k paths, a more efficient implementation
-- Path now properly typed throughout the code
-- Path maintains a on-demand hashset of BaseEdges so that path.containsEdge is fast
-- DeBruijnAssembler functions are no longer static. This isn't the right way to unit test your code
-- An a HaplotypeCaller command line option to use low-quality bases in the assembly
-- Refactored DeBruijnGraph and associated libraries into base class
-- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents. These DeBruijn versions now inherit from these base classes. Added some reasonable unit tests for the base and Debruijn edges and vertex classes.
-- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct
-- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split
-- Moved generic methods in DeBruijnAssembler into BaseGraph
-- Created a simple SeqGraph that contains SeqVertex objects
-- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs
-- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -Y -> a, a -> Z
-- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges
-- Fix all unit tests so they work with the new SeqGraph system. All tests passed without modification.
-- Debugging makes it easier to tell which kmer graph contributes to a haplotype
-- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector
-- Remove unnecessary printing of cleaning info in BaseGraph
-- Turn off kmer graph creation in DeBruijnAssembler.java
-- Only print SeqGraphs when debugGraphTransformations is set to true
-- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest. Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality.
-- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs
-- DebruijnVertex now longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now
-- Error correction algorithm for the assembler. Only error correct reads to others that are exactly 1 mismatch away
-- The assembler logic is now: build initial graph, error correct*, merge nodes*, prune dead nodes, merge again, make haplotypes. The * elements are new
-- Refactored the printing routines a bit so it's easy to write a single graph to disk for testing.
-- Easier way to control the testing of the graph assembly algorithms
-- Move graph printing function to DeBruijnAssemblyGraph from DeBruijnAssembler
-- Simple protected parsing function for making DeBruijnAssemblyGraph
-- Change the default prune factor for the graph to 1, from 2
-- debugging graph transformations are controllable from command line
-- Previous version would not trim down soft clip bases that extend beyond the active region, causing the assembly graph to go haywire. The new code explicitly reverts soft clips to M bases with the ever useful ReadClipper, and then trims. Note this isn't a 100% fix for the issue, as it's possible that the newly unclipped bases might in reality extend beyond the active region, should their true alignment include a deletion in the reference. Needs to be fixed. JIRA added
-- See https://jira.broadinstitute.org/browse/GSA-822
-- #resolve #fix GSA-822
-- Added a -dontGenotype mode for testing assembly efficiency
-- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted
-- Annotations were being called on VariantContext that might needed to be trimmed. Simply inverted the order of operations so trimming occurs before the annotations are added.
-- Minor cleanup of call to PairHMM in LikelihoodCalculationEngine
-- Use default VQSR params of QD, FS, DP and MQ for SNPs, with ReadPosRankSum and HaplotypeScore for UG SNPs
-- Add combine variants to GeneralCallingPipelin
-- Fix incorrect intervals in HaplotypeCaller in GeneralCallingPipeline.scala
-- GCP now emits tables for VCFs by default
-- GCP runs HC first before UG
-- GeneralCallingPipeline now jointly calls input BAMs, not separately processes them. Ready to handle CEU trio calling
-- Assess NA12878 on the particularly well reviewed 10-11mb in addition to all of 20
-- Use 4G for HC
In particular, someone produced a tandem repeat site with 57 alt alleles (sic) which made the caller blow up.
Inelegant fix is to detect if # of alleles is > our max cached capacity, and if so, emit an informative warning and skip site.
-- Added unit test to UG engine to cover this case.
-- Commit to posterity private scala script currently used for 1000G indel consensus (still very much subject to changes).
GSA-878 #resolve
--Based on existing code in GenomeAnalysisEngine
--Hashmaps hold mapping of deprecated tool name to version number and recommended replacement (if any)
--Using FastUtils for maps; specifically Object2ObjectMap but there could be a better type for Strings...
--Added user exception for deprecated annotations
--Added deprecation check to AnnotationInterfaceManager.validateAnnotations
--Run when annotations are initialized
--Made annotation sets instead of lists
--Refactored listAnnotations basic method out of VA into HelpUtils
--HelpUtils.listAnnotations() is now called by both VA and the new ListAnnotations utility (lives in sting.tools)
--This way we keep the VA --list option but we also offer a way to list annotations without a full valid VA command-line, which was a pain users continually complained about
--We could get rid of the VA --list option altogether ...?
--Mostly doc block tweaks
--Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.