-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken by complext records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-aligneable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.
Currently, the multi-allelic test is covering the following case:
Eval A T,C
Comp A C
reciprocate this so that the reverse can be covered.
Eval A C
Comp A T,C
And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.
This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine:
Eval: A G,T 0/0 2/0 2/2 1/1
Comp: A C,T 0/0 1/0 0/0 0/0
Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:
Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)
Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.
Also in this commit,major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase, these methods were removed:
- dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
- vectorSum
- array shuffle, random subset
- countOccurances (general forms, the char form is used in the codebase)
- getNMaxElements
- array permutation
- sorted array permutation
- compare floats
- sum() (for integer arrays and lists).
Final keyword was extensively added to MathUtils.
The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0 (which returns 0.0 (ratio), or 0.0% (percentage)). Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).
The PaperGenotyper still made use of one of these array methods, since it was the only walker it was migrated into the genotyper itself.
In addition, more extensive tests were added for
- logBinomialCoefficient (Newton's identity should always hold)
- logFactorial
- log10sumlog10 and its approximation
All unit tests pass
-- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative.
-- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event, due to not being informative about the second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads. All reads would be non-informative as no read provides information about the full combination of 4 haplotypes, as they reads only span a single event. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting)
* It is now cleaner and easier to test; added tests for newly implemented methods.
* Many fixes to the logic to make it work
* The most important change was that after triggering het compression we actually need to back it out if it
creates reads that incorporated too many softclips at any one position (because they get unclipped).
* There was also an off-by-one error in the general code that only manifested itself with het compression.
* Removed support for creating a het consensus around deletions (which was broken anyways).
* Mauricio gave his blessing for this.
* Het compression now works only against known sites (with -known argument).
* The user can pass in one or more VCFs with known SNPs (other variants are ignored).
* If no known SNPs are provided het compression will automatically be disabled.
* Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
strandedness from normal reduced reads.
* GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
* This allows us to update the FisherStrand annotation to count het compressed reduced reads
towards the FS calculation.
* [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
backwards compatible.]
* Updated integration tests accordingly with new het compressed bams (both for RR and UG).
* In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
RR properly, so I fixed it too.
* Also, the test in the UG engine for determining whether there are too many overlapping
deletions is updated to handle RR.
* I added a special hook in the RR integration tests to additionally run the systematic
coverage checking tool I wrote earlier.
* AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
not lost from original to reduced bam.
* This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
from all but 1 sample (now fixed).
* AssessReducedCoverage moved from private to protected for packaging reasons.
* #resolve GSA-639
At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
-- added calls to representativeCount() of the pileup instead of using ++
-- renamed CallableLoci integration test
-- added integration test for reduce read support on callable loci
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we effectively were just filtering them out anyway while generating an enormous amount of computational expense to handle them, as well as much larger memory requirements. The new version simply uses a read filter to remove them upfront. This causes no major problems -- at least, none that don't have other underlying causes -- compared to 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful
-- DeBruijnAssembler functions are no longer static. This isn't the right way to unit test your code
-- An a HaplotypeCaller command line option to use low-quality bases in the assembly
-- Refactored DeBruijnGraph and associated libraries into base class
-- Refactored out BaseEdge, BaseGraph, and BaseVertex from DeBruijn equivalents. These DeBruijn versions now inherit from these base classes. Added some reasonable unit tests for the base and Debruijn edges and vertex classes.
-- SeqVertex: allows multiple vertices in the sequence graph to have the same sequence and yet be distinct
-- Further refactoring of DeBruijnAssembler in preparation for the full SeqGraph <-> DeBruijnGraph split
-- Moved generic methods in DeBruijnAssembler into BaseGraph
-- Created a simple SeqGraph that contains SeqVertex objects
-- Simple chain zipper for SeqGraph that reproduces the results for the mergeNode function on DeBruijnGraphs
-- A working version of the diamond remodeling algorithm in SeqGraph that converts graphs that look like A -> Xa, A -> Ya, Xa -> Z, Ya -> Z into A -> X -> a, A -Y -> a, a -> Z
-- Allow SeqGraph zip merging of vertices where the in vertex has multiple incoming edges or the out vertex has multiple outgoing edges
-- Fix all unit tests so they work with the new SeqGraph system. All tests passed without modification.
-- Debugging makes it easier to tell which kmer graph contributes to a haplotype
-- Better docs and unit tests for BaseVertex, SeqVertex, BaseEdge, and KMerErrorCorrector
-- Remove unnecessary printing of cleaning info in BaseGraph
-- Turn off kmer graph creation in DeBruijnAssembler.java
-- Only print SeqGraphs when debugGraphTransformations is set to true
-- Rename DeBruijnGraphUnitTest to SeqGraphUnitTest. Now builds DeBruijnGraph, converts to SeqGraph, uses SeqGraph.mergenodes and tests for equality.
-- Update KBestPathsUnitTest to use SeqGraphs not DebruijnGraphs
-- DebruijnVertex now longer takes kmer argument -- it's implicit that the kmer length is the sequence.length now
-- Added a -dontGenotype mode for testing assembly efficiency
-- However, it looks like this has a very negative impact on the quality of the results, so the code should be deleted
--Based on existing code in GenomeAnalysisEngine
--Hashmaps hold mapping of deprecated tool name to version number and recommended replacement (if any)
--Using FastUtils for maps; specifically Object2ObjectMap but there could be a better type for Strings...
--Added user exception for deprecated annotations
--Added deprecation check to AnnotationInterfaceManager.validateAnnotations
--Run when annotations are initialized
--Made annotation sets instead of lists
--Refactored listAnnotations basic method out of VA into HelpUtils
--HelpUtils.listAnnotations() is now called by both VA and the new ListAnnotations utility (lives in sting.tools)
--This way we keep the VA --list option but we also offer a way to list annotations without a full valid VA command-line, which was a pain users continually complained about
--We could get rid of the VA --list option altogether ...?
--Mostly doc block tweaks
--Added @DocumentedGATKFeature to some walkers that were undocumented because they were ending up in "uncategorized". Very important for GSA: if a walker is in public or protected, it HAS to be properly tagged-in. If it's not ready for the public, it should be in private.
-- @Output isn't required for AssessNA12878
-- Previous version would could non-variant sites in NA12878 that resulted from subsetting a multi-sample VC to NA12878 as CALLED_BUT_NOT_IN_DB sites. Now they are properly skipped
-- Bugfix for subsetting samples to NA12878. Previous version wouldn't trim the alleles when subsetting down a multi-sample VCF, so we'd have false FN/FP sites at indels when the multi-sample VCF has alleles that result in the subset for NA12878 having non-trimmed alleles. Fixed and unit tested now.
-- Simply caps PairHMM likelihoods from rising above 0 by taking the min of the likelihood and 0. Will be properly fixed in GATK 2.5 with better PairHMM implementation.
Increase one timeout, restore others that were only timing out due to the
Java crypto lib bug to their original values.
-DOUBLE timeout for NanoSchedulerUnitTest.testNanoSchedulerInLoop()
-REDUCE timeout for EngineFeaturesIntegrationTest to its original value
-REDUCE timeout for MaxRuntimeIntegrationTest to its original value
-REDUCE timeout for GATKRunReportUnitTest to its original value
- Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private
- VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental
- AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message
- Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives
- Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden
- Removed unnecessary check in AverageAltAlleleLength
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:
I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
* The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided.
* The logic for @Output is now:
* if required==true then -o MUST be provided or a User Error is generated.
* if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
* this is the default behavior (i.e. @Output with no modifiers).
* if required==false and defaultToStdout==false then the output object is null.
* use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).
* I have updated walkers so that previous behavior has been maintained (as best I could).
* In general, all @Outputs with default long/short names have required=false.
* Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
* I added unit tests for @Output changes with David's help (thanks!).
* #resolve GSA-837
* ClippingOp updated to incorporate Ns in the hard clips.
* ReadUtils.getReadCoordinateForReferenceCoordinate() updated to account for Ns.
* Added test that covers the BQSR case we saw.
* Created GSA-856 (for Mauricio) to add lots of tests to ReadUtils.
* It will require refactoring code and not in the scope of what I was willing to do to fix this.
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads. Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands. This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called. Added new GATKSAMRecord method setReducedCounts() that does the right thing. Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation. Differences are just minor updates to the FS
-- Code was undocumented, big, and not well tested. All three things fixed.
-- Currently not passing, but the framework works well for testing
-- Added concat(byte[] ... arrays) to utils
-Allow the default S3 put timeout of 30 seconds for GATKRunReports
to be overridden via a constructor argument, and use a timeout
of 300 seconds for tests. The timeout remains 30 seconds in all
other cases.
-Change integration tests that themselves dispatch farm jobs
into pipeline tests. Necessary because some farm nodes are
not set up as submit hosts. Pipeline tests are still run
directly on gsa4.
-Bump up the timeout for the MaxRuntimeIntegrationTest even more
(was still occasionally failing on the farm!)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).
Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
-Make MaxRuntimeIntegrationTest more lenient by assuming that startup overhead
might be as long as 120 seconds on a very slow node, rather than the original
assumption of 20 seconds
-In TraverseActiveRegionsUnitTest, write temp bam file to the temp directory, not
to the current working directory
-SimpleTimerUnitTest: This test was internally inconsistent. It asserted that
a particular operation should take no more than 10 milliseconds, and then asserted
again that this same operation should take no more than 100 microseconds (= 0.1 millisecond).
On a slow node it could take slightly longer than 100 microseconds, however.
Changed the test to assert that the operation should require no more than 10000 microseconds
(= 10 milliseconds)
-change global default test timeout from 20 to 40 minutes (things just take longer
on the farm!)
-build.xml: allow runtestonly target to work with scala test classes
* ReadTransformers can say they must be first, must be last, or don't care.
* By default, none of the existing ones care about ordering except BQSR (must be first).
* This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
* The engine now orders the read transformers up front before applying iterators.
* The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
* Added unit tests.
By default all output is assigned to stdout if a -o is not provided. Technically this makes @Output a not required parameter, and the documentation is misleading because it's reading from the annotation.
GSA-820 #resolve
Too many users (with RNASeq reads) are hitting these exceptions that were never supposed to happen. Let's give them (and us) a better and clearer error message.
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment. Refactored out the big main and supplementary static methods. Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode. This test only ensures that the code runs and doesn't exception out. It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype. Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made. Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter