Name cache was filling up with names of all reads in entire file, which for large file eventually
consumes all of memory. Only keep read name cache for the reads that are together in one variant
region, so that a pair of reads within the same variant region will still be joined via read name.
Otherwise the ability to connect a read to its mate is lost.
Update MD5s in integration test to reflect altered output.
Add new integration test that confirms that pair within variant region is joined by read name.
-- This is a temporarily fix / hack to deal with the very high QD values that are generated by the haplotype caller when nearby events occur within reads. In that case, the QUAL field can be many fold higher than normal, and results in an inflated QD value. This hack projects such high QD values back into the good range (as these are good variants in general) so they aren't filtered away by VQSR.
-- The long-term solution to this problem is to move the HaplotypeCaller to the full bubble calling algorithm
-- Update md5s
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads. Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands. This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called. Added new GATKSAMRecord method setReducedCounts() that does the right thing. Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation. Differences are just minor updates to the FS
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
* Allow RR to write its BAM to stdout by setting required=true for @Output.
* Fixed bug in sliding window where a break in coverage after a long stretch without
a variant region was causing a doubling of all the reads before the break.
* Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods.
* Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).
Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment. Refactored out the big main and supplementary static methods. Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode. This test only ensures that the code runs and doesn't exception out. It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype. Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made. Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
This is to facilitate the current experiment with class-level test
suite parallelism. It's our hope that with these changes, we can get
the runtime of the integration test suite down to 20 minutes or so.
-UnifiedGenotyper tests: these divided nicely into logical categories
that also happened to distribute the runtime fairly evenly
-UnifiedGenotyperPloidy: these had to be divided arbitrarily into two
classes in order to halve the runtime
-HaplotypeCaller: turns out that the tests for complex and symbolic
variants make up half the runtime here, so merely moving these into
a separate class was sufficient
-BiasedDownsampling: most of these tests use excessively large intervals
that likely can't be reduced without defeating the goals of the tests. I'm
disabling these tests for now until they can either be redesigned to use smaller
intervals around the variants of interest, or refactored into unit tests
(creating a JIRA for Yossi for this task)
* Removed from codebase NestedHashMap since it is unused and untested.
* Integration tests change because the BQSR CSV is now sorted automatically.
* Resolves GSA-732
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
with PrintReads
-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
on the UnifiedGenotyper
-was previously set to 30, which seems far too aggressive given that with
ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any
pileup returned by LIBS
-250 is a more conservative default used by the UG
-can adjust down/up later based on further experiments (GSA-699 will
remain open)
-verified with Ryan that all integration test differences are either
innocent or represent an improvement
GSA-699
The issue here is that the OptimizedLikelihoodTestProvider uses the same basic underlying class as the
BasicLikelihoodTestProvider and we were using the BasicTestProvider functionality to pull out tests of
that class; so if the optimized tests were run first we were unintentionally running those same tests
again with the basic ones (but expecting different results).
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads
GSA-774 #resolve
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet. Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
-- Increase the allowed runtime of one UG integration test
-- The GGA indels mode runs two UG commands, and was barely under the 10 minute limit before. Some updates can push this right over the edge. Increased limit
-- CalibrateGenotypeLikelihoods runs on a small data set now, so it's faster
-- Updating MD5s due to more correct quality utils. DuplicatesWalkers quality estimates have changed. One UG test has different FS and rank sum tests because the conversion to phred scores are slightly (second decimal place) different
-- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and was as a side effect other utils converted this to a meaningless 0.0. This is all because there wasn't a unit test.
-- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
- got md5s from a interim version that does not have the per-sample downsampling hookedup
- added an integration test that forces the result from flat-downsampling to equal that which results from an equivalent flat contamination file
-- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads.
-- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast)
-- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling).
-- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test
-- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).
-- New HMM has two impacts on MD5s. First, all indel calls with UG and all calls by HC no longer have the HaplotypeScore computed. This is for the good, especially given the computational cost of this annotationa and unclear value for HC. Second, the BaseQualityRankSum values are changing by tiny amounts because of the changes in the HMM likelihoods.
-- Disabled three tests from Yossi that cause strange MD5 differences with calls for HC, created a JIRA for him to enable and fix
-- Disabled the non-deterministic GGA test. Assigned JIRA to Guillermo
-- With this push I expect all integration tests to pass
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now
-- Added locs of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
-- Base distribution optionally includes deletions
-- Implemented an optional filtered coverage distribution option
-- Integration tests added for every feature of the traversal
This walker is specially fast for the task due to the ability to calculate uncovered bases without having to visit the loci. This capability should be made generic in the future for the advantage of DiagnoseTargets and DepthOfCoverage.
GSATDG-45 #resolve
* After consulting Tim/David/Mauricio we determined that the md5 changes were due to different encodings of binary arrays in samjdk
* However, it made no functional difference to the results (confirmed by Eric) so we agreed to update md5s
* Also, the header of one of the test bams was malformed but old picard jar didn't perform checks so it only started failing now
* Fixed the bam
- Throws user exception if it is.
- Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
- Added test to check this is working.
- Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
- Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
- Added comment about not changing the program record name, as per reviewer comments.
- Removed unused variable.
- It's now written into the recal report so that it can be used in the PrintReads step.
- Note that we also now write the --deletions_default_quality value which accidentally wasn't being written before!
- Added tests to make sure that the value of the --maximum_cycle_value is being used properly by PR with -BQSR.
(This is my last non-branch commit; all future pushes will follow new GATK practices)
The migration of org.broadinstitute.variant into the Picard repo is
complete. This commit deletes the org.broadinstitute.variant sources
from our repo and replaces it with a jar built from a checkout of the
latest Picard-public svn revision.
- Uncovered small bug in the fix that I added yesterday, which is now fixed properly.
- Uncovered massive general bug: polyploid consensus is totally busted for deletions (because of call to read.getReadBases()[readPos]).
- Need to consult Mauricio on what to do here (are we supporting het compression for deletions? (Insertions are definitely not supported)
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixure of two broadly concented samples) were added
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is
_per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests to the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or unifiedGenotyper, into a new IntegrationTest class.
* Fixed implementation of polyploid (het) compression in RR.
* The test for a usable site was all wrong. Worked out details with Mauricio to get it right.
* Added comprehensive unit tests in HeaderElement class to make sure this is done right.
* Still need to add tests for the actual polyploid compression.
* No longer allow non-diploid het compression; I don't want to test/handle it, do you?
* Added nearly full coverage of tests for the BaseCounts class.
-- Testing that cycles in the reference graph fail graph construction appropriately.
-- Minor bug fix in assembly with reduced reads.
Added some docs and contracts to SimpleDeBruijnAssembler
Added a unit test to SimpleDeBruijnAssembler
Part 1 of Variant Annotator Unit tests: PerReadAlleleLikelihoodMap
- Added contract enforcement for public methods
- Refactored the conversion from read -> (allele -> likelihood) to allele -> list[read] into its own method
- added method documentation for non getters/setters
- finals, finals everywhere
- Add in a unit test for the PerReadAlleleLikelihoodMap. Complete coverage except for .clear() and a method that is a straight call into a separately-tested utility class.
- ReduceReads by default now sets up-front ReadWalker downsampling to 40x per start position.
- This is the value I used in my tests with Picard to show that memory issues pretty much disappeared.
- This should hopefully take care of the memory issues being reported on the forum.
- Added javadocs to SlidingWindow (the main RR class) to follow GATK conventions.
- Added more unit tests to increase coverage of BaseCounts class.
- Added more unit tests to test I/D operators in the SlidingWindow class.
- Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure).
- Also added minor cleanup of BaseUtils
I've confirmed via a script that all of these differences only
involve the version number bump in the BAM headers and nothing
else:
< @HD VN:1.0 GO:none SO:coordinate
---
> @HD VN:1.4 GO:none SO:coordinate
testing the adding of reads into the SlidingWindow plus consensus creation. Will flesh these out more after I take care of
some other items on my plate.
-- All functions tested. In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads. New version should be correct
-- Enforce correct ordering of supporting states in constructor
-- Enforce read ordering when adding reads to an active region in add
-- Fix bug in HaplotypeCaller map with new updating read spans. Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence
-- Encapsulate isActive field with an accessor function
-- Make sure that all state lists are unmodifiable, and that the docs are clear about this
-- ActiveRegion equalsExceptReads is for testing only, so make it package protected
-- ActiveRegion.hardClipToRegion must resort reads as they can become out of order
-- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region. The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed. Has a minor impact on PLs and RankSumTest values
-- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy
Please check that your commit hook is properly pointing at ../../private/shell/pre-commit
Conflicts:
public/java/test/org/broadinstitute/variant/VariantBaseTest.java
-Moved some of the more specialized / complex VariantContext and VCF utility
methods back to the GATK.
-Due to this re-shuffling, was able to return things like the Pair class back
to the GATK as well.
One of the fixes was critical: SlidingWindow was not converting between global and relative positions correctly.
Besides not being correct, it was resulting in a massive slow down of the RR traversal.
That fix definitely breaks at least one of the integration tests, but it's not worth changing md5s now because I'll be
changing things all over RR for the next few days, so I am going to let that test fail indefinitely until I can confirm
general correctness of the tool.
a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
Testing for moltenized output, and for genotype-level filtering. This tool is now fully functional. There are three todo items:
1) Docs
2) An additional output table that gives concordance proportions normalized by records in both eval and comp (not just total in eval or total in comp)
3) Code cleanup for table creation (putting a table together the way I do takes -way- too many lines of code)
2. While making the previous fix and unifying FS for SNPs and indels, I noticed that FS was slightly broken in the general case for indels too; fixed.
3. I also fixed a minor bug in the Allele Biased Downsampling code for reduced reads.
I've resigned myself instead to create a mapping from Allele to Haplotype. It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0? When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).
As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases. This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests. Fortunately that walker is being deprecated soon...
Completed todo item: for sites like
(eval)
20 12345 A C
20 12345 A AC
(comp)
20 12345 A C
20 12345 A ACCC
the records will be matched by the presence of a non-empty intersection of alleles. Any leftover records are then paired with an empty variant context (as though the call was unique). This has one somewhat counterintuitive feature, which is that normally
(eval)
20 12345 A AC
(comp)
20 12345 A ACCC
would be classified as 'ALLELES_DO_NOT_MATCH' (and not counted in genotype tables), in the presence of the SNP, they're counted as EVAL_ONLY and TRUTH_ONLY respectively.
+ integration test
2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs. Until the HC uses meta data though, this won't work.
-- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in input list. Not ideal)
-- Trivial read count differences in HC
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
-- Added HCPerformance evaluation Qscript
-- Added some docs about one of the HC integration tests
-- HaplotypeCaller / ART performance evaluation script
- Added an optional argument to BaseRecalibrator to produce sorted GATKReport Tables
- Modified BSQR Integration Tests to include the optional argument. Tests now produce sorted tables
The weird part is that the comments claimed it was doing what it was supposed to, but it didn't actually do it.
Now we maintain the last header element of the consensus (but without bases and quals) if it adjoins an element with an insertion.
Added the user's test file as an integration test.
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.
-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)
-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)
-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
Reads that are soft-clipped off the contig (before the beginning of the contig) were being soft-clipped to position 0 instead of 1 because of an off-by-one issue. Fixed and included in the integration test.
-- Uses high-performance local writer backed by byte array that writes the entire VCF line in some write operation to the underlying output stream.
-- Fixes problems with indexing of unflushed writes while still allowing efficient block zipping
-- Same (or better) IO performance as previous implementation
-- IndexingVariantContextWriter now properly closes the underlying output stream when it's closed
-- Updated compressed VCF output file
-Switch back to the old implementation, if needed, with --use_legacy_downsampler
-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState
-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer
-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.
-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.
-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
As reported by Menachem Fromer: a critical bug in AFCalcResult:
Specifically, the implementation:
public boolean isPolymorphic(final Allele allele, final double log10minPNonRef) {
return getLog10PosteriorOfAFGt0ForAllele(allele) >= log10minPNonRef;
}
seems incorrect and should probably be:
getLog10PosteriorOfAFEq0ForAllele(allele) <= log10minPNonRef
The issue here is that the 30 represents a Phred-scaled probability of *error* and it's currently being compared to a log probability of *non-error*.
Instead, we need to require that our probability of error be less than the error threshold.
This bug has only a minor impact on the calls -- hardly any sites change -- which is good. But the inverted logic effects multi-allelic sites significantly. Basically you only hit this logic with multiple alleles, and in that case it'\s including extra alt alleles incorrectly, and throwing out good ones.
Change was to create a new function that properly handles thresholds that are PhredScaled quality scores:
/**
* Same as #isPolymorphic but takes a phred-scaled quality score as input
*/
public boolean isPolymorphicPhredScaledQual(final Allele allele, final double minPNonRefPhredScaledQual) {
if ( minPNonRefPhredScaledQual < 0 ) throw new IllegalArgumentException("phredScaledQual " + minPNonRefPhredScaledQual + " < 0 ");
final double log10Threshold = Math.log10(QualityUtils.qualToProb(minPNonRefPhredScaledQual));
return isPolymorphic(allele, log10Threshold);
}
ReduceReads now co-reduces bams if they're passed in toghether with multiple -I. Co-reduction forces every variant region in one sample to be a variant region in all samples.
Also:
* Added integrationtest for co-reduction
* Fixed bug with new no-recalculation implementation of the marksites object where the last object wasn't being removed after finalizing a variant region (updated MD5's accordingly)
DEV-200 #resolve #time 8m
-- The logic for determining active regions was a bit broken in the HC when intervals were used in the system
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered. Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determing active / inactive state. This means that each exon will stand on its own when deciding if it should be active or inactive
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
The MD5s for these tests were changed in commit 87435f1074615b2cd016f042980109fd53962c8d
to match the output of a broken version of BaseRecalibration. With the patch in
commit c397102ecc1fd1d2cd8f209a8f358ab4a60b50a7, the output once again matches the
*original* MD5s for these tests, and does not vary as you increase -nct.
Final resolution to GSA-632
-- I'm committing because there's some kind of fundamental problem with the ReadCovariates cache, in that historical data isn't being cleared / computed properly, and I'd rather it fail for a while than leave it in JIRA.
-- The integration tests test the -nct with PrintReads to get 1, 2, 4 and the 4 fails. But that's because of this incorrect calculation
-- Updating GATKPerformanceOverTime with the new @ClassType annotation
The old BaseRecalibrator walker is and never will be thread-safe, since it's a
LocusWalker that uses read attributes to track state.
ONLY the newer DelocalizedBaseRecalibrator is believed likely to be thread-safe
at this point. It is safe to run the DelocalizedBaseRecalibrator with -nct > 1
for testing purposes, but wait for further testing to be done before using it
for production purposes in multithreaded mode.
Bringing in the following relevant changes:
* Fixes the indel realigner N-Way out null pointer exception DEV-10
* Optimizations to ReduceReads that bring the run time to 1/3rd.
Conflicts:
protected/java/src/org/broadinstitute/sting/gatk/walkers/compression/reducereads/SlidingWindow.java
DEV-10 #resolve #time 2m
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles
-- All constructores, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic. That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele. The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating. This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime. Might be worth it through
-- Created a JIRA ticket https://jira.broadinstitute.org/browse/GSA-623 for Guillermo to look at the differences as the multi-allelic nature of many sites seems to change with the new more protected infrastructure. This may be due to implementation issues in the pooled caller, problems with my interface, or could be a genuine improvement.
-- Potentially a very fast implementation (it's very clean) but restricted to the biallelic case
-- A starting point for future bi-allelic only optimized (logless) or generalized (bi-allelic general ploidy) implementations
-- Added systematic unit tests covering this implementation, and comparing it to others
-- Uncovered a nasty normalization bug in StateTracker that was capping our likelihoods at 0, even after summing up multiple likelihoods, which is just not safe to do and was causing us to lose likelihood in some cases
-- Removed the restriction that a likelihood be <= 0 in StateTracker, and the protection for these cases in GeneralPloidyExactAFCalc which just wasn't right
-- Changed UG / HC to use this one via the StandardCallerArgumentCollection
-- Update the AFCalcFactory.Calculation to have a getDefault() value instead of having a duplicate entry in the enums
-- GeneralPloidyExactAFCalc turns -Infinity values into -Double.MAX_VALUE, so our calculations pass unit tests
-- Bugfix for GeneralPloidyGenotypeLikelihoodsCalculationModel, return a null VC when the only allele we get from our final alleles to use method is the reference base
-- Fix calculation of reference posteriors when P(AF == 0) = 0.0 and P(AF == 0) = X for some meaningful value of X. Added unit test to ensure this behavior is correct
-- Fix horrible sorting bug in IndependentAllelesDiploidExactAFCalc that applied the theta^N priors in the wrong order. Add contract to ensure this doesn't ever happen again
-- Bugfix in GLBasedSampleSelector, where VCs without any polymorphic alleles were being sent to the exact model
--
-- These two classes were really the same, and now they are actually the same!
-- Cleanuped the interfaces, removed duplicate data
-- Added lots of contracts, some of which found numerical issues with GeneralPloidyExactAFCalc (which have been patched over but not fixed)
-- Moved goodProbability and goodProbabilityVector utilities to MathUtils. Very useful for contracts!