-- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools
-- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities
-- Reworded some category names and descriptions to be more explicit and user-friendly
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now
-- Added lots of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now computes constants, if they haven't yet been initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
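
The split between the public computeReadLikelihoodGivenHaplotypeLog10 and the protected subComputeReadLikelihoodGivenHaplotypeLog10, together with the 1/N start-site probability, can be sketched roughly as follows. This is a simplified illustration and not the actual PairHMM code: the class names SketchPairHMM and StartPriorOnlyHMM, the two-argument signatures, the helper log10StartSitePrior, and the assumption that N = haplotype length - read length + 1 are all hypothetical.

```java
import java.util.Objects;

/** Hypothetical, simplified stand-in for the PairHMM base class. */
abstract class SketchPairHMM {
    private int maxHaplotypeLength = -1;
    private int maxReadLength = -1;

    /** Declares the largest haplotype and read this instance will be asked to handle. */
    public void initialize(final int maxHaplotypeLength, final int maxReadLength) {
        if (maxHaplotypeLength <= 0 || maxReadLength <= 0)
            throw new IllegalArgumentException("max lengths must be positive");
        this.maxHaplotypeLength = maxHaplotypeLength;
        this.maxReadLength = maxReadLength;
    }

    /** Public entry point: checks arguments, delegates, then checks the result. */
    public final double computeReadLikelihoodGivenHaplotypeLog10(final byte[] haplotype, final byte[] read) {
        if (maxReadLength < 0)
            throw new IllegalStateException("initialize() must be called first");
        Objects.requireNonNull(haplotype, "haplotype");
        Objects.requireNonNull(read, "read");
        if (haplotype.length > maxHaplotypeLength)
            throw new IllegalArgumentException("haplotype longer than the declared maximum");
        if (read.length > maxReadLength)
            throw new IllegalArgumentException("read longer than the declared maximum");

        final double result = subComputeReadLikelihoodGivenHaplotypeLog10(haplotype, read);

        // A log10 likelihood can never be positive or NaN.
        if (Double.isNaN(result) || result > 0.0)
            throw new IllegalStateException("bad log10 likelihood: " + result);
        return result;
    }

    /** Subclasses do the real work here. */
    protected abstract double subComputeReadLikelihoodGivenHaplotypeLog10(byte[] haplotype, byte[] read);

    /**
     * log10(1/N) for N potential start sites, computed from the actual sizes
     * rather than the maximum sizes. N = haplotypeLength - readLength + 1 is an
     * assumption made for this sketch.
     */
    static double log10StartSitePrior(final int haplotypeLength, final int readLength) {
        final int n = haplotypeLength - readLength + 1;
        if (n < 1)
            throw new IllegalArgumentException("read must not be longer than the haplotype");
        return -Math.log10(n);
    }
}

/** Toy subclass that returns only the uniform start-site prior and does no alignment. */
final class StartPriorOnlyHMM extends SketchPairHMM {
    @Override
    protected double subComputeReadLikelihoodGivenHaplotypeLog10(final byte[] haplotype, final byte[] read) {
        return log10StartSitePrior(haplotype.length, read.length);
    }
}
```

Keeping the checks in one final public method means every subclass gets the same argument and result QC without repeating it.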
-- Added CAPILLARY and HELICOS platforms as required by spec 1.4
-- Added extensive unit tests to ensure NGSPlatform functions work as expected.
-- Fixed some NPE bugs for reads that don't have RGs or PLs in their RG fields
- Throws user exception if it is.
- Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
- Added test to check this is working.
- Added docs to BQSRReadTransformer explaining why this check is not performed on the PrintReads end.
- Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
- Added comment about not changing the program record name, as per reviewer comments.
- Removed unused variable.
- I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null.
- Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.
The migration of org.broadinstitute.variant into the Picard repo is
complete. This commit deletes the org.broadinstitute.variant sources
from our repo and replaces it with a jar built from a checkout of the
latest Picard-public svn revision.
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Added integration tests for UnifiedGenotyper (using artificially contaminated BAMs created from a mixture of two broadly consented samples)
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination, i.e. using the same fraction for all samples. What it doesn't have is
_per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.)
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests for the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or UnifiedGenotyper, into a new IntegrationTest class.
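
A defaulting map and a two-column Sample/Fraction loader like the ones described above might look roughly like this. It is only a sketch: the class names DefaultHashMapSketch and ContaminationFileSketch, the Reader-based signature, the absence of a header line, and the restriction of fractions to [0, 1] are assumptions, not the actual implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;

/** Hypothetical HashMap that returns a fixed default for keys it does not contain. */
class DefaultHashMapSketch<K, V> extends HashMap<K, V> {
    private final V defaultValue;

    DefaultHashMapSketch(final V defaultValue) {
        this.defaultValue = defaultValue;
    }

    @Override
    public V get(final Object key) {
        return containsKey(key) ? super.get(key) : defaultValue;
    }
}

final class ContaminationFileSketch {
    /**
     * Reads "Sample Fraction" lines separated by spaces and/or tabs. Blank lines
     * are skipped; anything else that is malformed raises an exception.
     */
    static DefaultHashMapSketch<String, Double> loadContaminationFile(
            final Reader source, final double defaultFraction) throws IOException {
        final DefaultHashMapSketch<String, Double> fractions = new DefaultHashMapSketch<>(defaultFraction);
        final BufferedReader reader = new BufferedReader(source);
        String line;
        int lineNumber = 0;
        while ((line = reader.readLine()) != null) {
            lineNumber++;
            final String trimmed = line.trim();
            if (trimmed.isEmpty()) continue;

            final String[] fields = trimmed.split("\\s+");
            if (fields.length != 2)
                throw new IllegalArgumentException(
                        "line " + lineNumber + ": expected 2 columns, found " + fields.length);

            final double fraction;
            try {
                fraction = Double.parseDouble(fields[1]);
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException("line " + lineNumber + ": fraction is not a number", e);
            }
            // Written this way so that NaN is rejected as well.
            if (!(fraction >= 0.0 && fraction <= 1.0))
                throw new IllegalArgumentException("line " + lineNumber + ": fraction must be in [0, 1]");
            if (fractions.containsKey(fields[0]))
                throw new IllegalArgumentException("line " + lineNumber + ": duplicate sample " + fields[0]);

            fractions.put(fields[0], fraction);
        }
        return fractions;
    }
}
```

Samples missing from the file fall back to the default fraction on lookup, so callers do not need a separate containsKey check.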
-- The progress meter isn't started until the GATK actually calls execute on the microscheduler. Now we get a message saying "Creating shard strategy" while this (expensive) operation runs
- Added contract enforcement for public methods
- Refactored the conversion from read -> (allele -> likelihood) to allele -> list[read] into its own method
- added method documentation for non getters/setters
- finals, finals everywhere
- Add in a unit test for the PerReadAlleleLikelihoodMap. Complete coverage except for .clear() and a method that is a straight call into a separately-tested utility class.
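
The read -> (allele -> likelihood) to allele -> list[read] conversion mentioned above could be sketched as below. The rule used here, assigning each read to its single most likely allele, is an assumption made for illustration, as are the class name AlleleStratifierSketch, the method name readsByBestAllele, and the generic signature; the real method may apply different criteria.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AlleleStratifierSketch {
    /**
     * Inverts read -> (allele -> log10 likelihood) into allele -> list of reads.
     * Each read is assigned to its highest-likelihood allele; on a tie the allele
     * seen first in iteration order wins. Reads with no alleles are skipped.
     */
    static <R, A> Map<A, List<R>> readsByBestAllele(final Map<R, Map<A, Double>> likelihoods) {
        if (likelihoods == null)
            throw new IllegalArgumentException("likelihoods must not be null");

        final Map<A, List<R>> byAllele = new LinkedHashMap<>();
        for (final Map.Entry<R, Map<A, Double>> readEntry : likelihoods.entrySet()) {
            A best = null;
            double bestLog10 = Double.NEGATIVE_INFINITY;
            for (final Map.Entry<A, Double> alleleEntry : readEntry.getValue().entrySet()) {
                if (best == null || alleleEntry.getValue() > bestLog10) {
                    best = alleleEntry.getKey();
                    bestLog10 = alleleEntry.getValue();
                }
            }
            if (best != null)
                byAllele.computeIfAbsent(best, k -> new ArrayList<>()).add(readEntry.getKey());
        }
        return byAllele;
    }
}
```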
-- Bringing code up to document, style, and code coverage specs
-- Move GATKRunReportUnitTest to private
-- Fully expand GATKRunReportUnitTests to cover writing and reading GATKRunReport to local disk, to standard out, to AWS.
-- Move documentation URL from GATKRunReport to UserException
-- Delete a few unused files from s3GATKReport
-- Added capabilities to GATKRunReport to make testing easier
-- Added capabilities to deserialize GATKRunReports from an InputStream
- Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure).
- Also added minor cleanup of BaseUtils
I've confirmed via a script that all of these differences only
involve the version number bump in the BAM headers and nothing
else:
< @HD VN:1.0 GO:none SO:coordinate
---
> @HD VN:1.4 GO:none SO:coordinate
These patches to GATKBAMIndex are causing massive BAM index reading errors in
combination with the latest version of Picard. The bug is either in the patches
themselves or in the underlying SeekableBufferedStream class they rely on. Until
the cause can be identified, we are temporarily backing out these changes so that
we can continue to run with the latest Picard/Tribble.
This reverts commits:
81483ec21e528790dfa719d18cdee27d577ca98e
68cf0309db490b79eecdabb4034987ff825ffea8
54bb68f28ad5fe1b3df01702e9c5e108106a0176
This is a necessary prerequisite for the org.broadinstitute.variant migration.
-Picard and sam-jdk go from version 1.67.1197 to 1.84.1337
-Picard-private goes from version 2375 to 2662
-Tribble goes from version 119 to 1.84.1337
-RADICALLY trimmed down the list of classes we extract from Picard-private
(jar goes from 326993 bytes to 6445 bytes!)
Resources must be in a subdirectory called "resources" in the package
hierarchy to be picked up by the packaging system. Adding each resource
manually to the jars in build.xml does not cause the resource to be
added to the standalone GATK jar when we package the GATK, so it's best
to always use this convention.
If a read had an existing BAQ tag, was clipped by our engine, and couldn't have the BAQ recalculated (for whatever reason), then we would
fail in the BQSR because we would default to using the old tag (which no longer matched the length of the read bases).
The right thing to do here is to remove the old BAQ tag when RECALCULATE and ADD_TAG are the BAQ modes used but BAQ cannot be recalculated.
Added a unit test to ensure that the tags are removed in such a case.
-- Has the overall effect that the GATK user AWS keys are no longer visible in the gatk source as plain text. This will stop AWS from emailing me (they crawl the web looking for keys)
-- Added utility EncryptAWSKeys that takes as command line arguments the GATK user AWS access and secret keys, encrypts them with the GATK private key, and writes out the resulting file to resources in phonehome.
-- GATKRunReport now decrypts these keys as needed, using the GATK public key stored as a resource in the GATK bundle
-- Refactored the essential function of Resource (reading the resource) from IOUtils into the class itself. Getting the data in the resource is now straightforward
-- Refactored md5 calculation code from a byte[] into Utils. Added unit tests
-- Committing the encrypted AWS keys
-- #resolves https://jira.broadinstitute.org/browse/GSA-730
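
For reference, an md5 calculation over a byte[] of the kind mentioned above can be written with the standard java.security.MessageDigest API. This is a minimal sketch, not the code that was moved into Utils; the class name Md5Sketch and the method name md5Hex are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5Sketch {
    /** Returns the MD5 digest of the given bytes as 32 lowercase hex characters. */
    static String md5Hex(final byte[] bytes) {
        try {
            final byte[] digest = MessageDigest.getInstance("MD5").digest(bytes);
            final StringBuilder hex = new StringBuilder(digest.length * 2);
            for (final byte b : digest) {
                // Mask to 0..255 so negative bytes are not sign-extended.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this is unexpected.
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```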
-- Example combinatorial unit tests, plus unit tests that create reads and bam files, pileups, variant context (from scratch and from a file), and genome locs