- Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private
- VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental
- AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message
- Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives
- Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden
- Removed unnecessary check in AverageAltAlleleLength
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:
I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
* The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided.
* The logic for @Output is now:
* if required==true then -o MUST be provided or a User Error is generated.
* if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
* this is the default behavior (i.e. @Output with no modifiers).
* if required==false and defaultToStdout==false then the output object is null.
* use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).
* I have updated walkers so that previous behavior has been maintained (as best I could).
* In general, all @Outputs with default long/short names have required=false.
* Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
* I added unit tests for @Output changes with David's help (thanks!).
* #resolve GSA-837
* ClippingOp updated to incorporate Ns in the hard clips.
* ReadUtils.getReadCoordinateForReferenceCoordinate() updated to account for Ns.
* Added test that covers the BQSR case we saw.
* Created GSA-856 (for Mauricio) to add lots of tests to ReadUtils.
* It will require refactoring code and not in the scope of what I was willing to do to fix this.
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads. Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands. This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called. Added new GATKSAMRecord method setReducedCounts() that does the right thing. Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation. Differences are just minor updates to the FS
-- Code was undocumented, big, and not well tested. All three things fixed.
-- Currently not passing, but the framework works well for testing
-- Added concat(byte[] ... arrays) to utils
-Allow the default S3 put timeout of 30 seconds for GATKRunReports
to be overridden via a constructor argument, and use a timeout
of 300 seconds for tests. The timeout remains 30 seconds in all
other cases.
-Change integration tests that themselves dispatch farm jobs
into pipeline tests. Necessary because some farm nodes are
not set up as submit hosts. Pipeline tests are still run
directly on gsa4.
-Bump up the timeout for the MaxRuntimeIntegrationTest even more
(was still occasionally failing on the farm!)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).
Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
* ReadTransformers can say they must be first, must be last, or don't care.
* By default, none of the existing ones care about ordering except BQSR (must be first).
* This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
* The engine now orders the read transformers up front before applying iterators.
* The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
* Added unit tests.
By default all output is assigned to stdout if a -o is not provided. Technically this makes @Output a not required parameter, and the documentation is misleading because it's reading from the annotation.
GSA-820 #resolve
Too many users (with RNASeq reads) are hitting these exceptions that were never supposed to happen. Let's give them (and us) a better and clearer error message.
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment. Refactored out the big main and supplementary static methods. Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode. This test only ensures that the code runs and doesn't exception out. It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype. Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made. Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
* Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown.
* Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases.
* Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs.
* Added more unit tests for add functionality of GLSS.
* Resolves GSA-775.
-- Refactor initialization routine into BadSitesWriter. This now adds the GQ and DP genotype header lines which are necessarily if the input VCF doesn't have proper headers
-- GATKVariantContextUtils subset to biallelics now tolerates samples with bad GL values for multi-allelics, where it just removes the PLs and issues a warning.
* Split the cases into reads that don't have a RG at all vs. those with a RG that's not defined in the header.
* Added integration tests to make sure that the correct error is thrown.
* Resolved GSA-407.
* Removed from codebase NestedHashMap since it is unused and untested.
* Integration tests change because the BQSR CSV is now sorted automatically.
* Resolves GSA-732
This walker was not updated since 2009, and users were getting wrong answers when running it with ReduceReads. I don't want to deal with this because DiagnoseTargets does everything this walker does.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
The GATK engine does not behave correctly when contigs are indexed
differently in the reads sequence dictionaries vs. the reference
sequence dictionary, and the inconsistently-indexed contigs are included
in the user's intervals. For example, given the dictionaries:
Reference dictionary = { chrM, chr1, chr2, ... }
BAM dictionary = { chr1, chr2, ... }
and the interval "-L chr1", the engine would fail to correctly retrieve
the reads from chr1, since chr1 has a different index in the two dictionaries.
With this patch, we throw an exception if there are contig index differences
between the dictionaries for reads and reference, AND the user's intervals
include at least one of the mismatching contigs.
The user can disable this exception via -U ALLOW_SEQ_DICT_INCOMPATIBILITY
In all other cases, dictionary validation behaves as before.
I also added comprehensive unit tests for the (previously-untested)
SequenceDictionaryUtils class.
GSA-768 #resolve
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
-- AssessNA12878 now breaks out multi-allelics into bi-allelic components. This means that we can properly assess multi-allelic calls against the bi-allelic KB
-- Refactor AssessNA12878, moving into assess package in KB. Split out previously private classes in the walker itself into separate classes. Added real docs for all of the classes.
-- Vastly expand (from 0) unit tests for NA12878 assessments
-- Allow sites only VCs to be evaluated by Assessor
-- Move utility for creating simple VCs from a list of string alleles from GATKVariantContextUtilsUnitTest to GATKVariantContextUtils
-- Assessor bugfix for discordant records at a site. Previous version didn't handle properly the case where one had a non-matching call in the callset w.r.t. the KB, so that the KB element was eaten during the analysis. Fixed. UnitTested
-- See GSA-781 -- Handle multi-allelic variants in KB for more information
-- Bugfix for missing site counting in AssessNA12878. Previous version would count N misses for every missed value at a site. Not that this has much impact but it's worth fixing
-- UnitTests for BadSitesWriter
-- UnitTests for filtered and filtering sites in the Assessor
-- Cleanup end report generation code (simply the code). Note that instead of "indel" the new code will print out "INDELS"
-- Assessor DoC calculations now us LIBS and RBPs for the depth calculation. The previous version was broken for reduced reads. Added unit test that reads a complex reduced read example and matches the DoC of this BAM with the output of the GATK DoC tool here.
-- Added convenience constructor for LIBS using just SAMFileReader and an iterator. It's now easy to create a LIBS from a BAM at a locus. Added advanceToLocus function that moves the LIBS to a specific position. UnitTested via the assessor (which isn't ideal, but is a proper test)
-- Now supports trimming the alleles from both the reverse and forward direction.
-- Added lots of unit tests for forwrad allele trimming, as well as creating VC from forward and reverse trimming.
-- Added docs and tests for the code, to bring it up to GATK spec
These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons):
* Don't unnecessarily overload Allele.getBases() in the Haplotype class.
* Haplotype.getBases() was calling clone() on the byte array.
* Added a constructor to Allele (and Haplotype) that takes in an Allele as input.
* It makes a copy of he given allele without having to go through the validation of the bases (since the Allele has already been validated).
* Rev'ed the variant jar accordingly.
For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.
-- top-level walker type (locus, read etc)
-- parallelism options (nt or nct)
-- annotation type (for Variant Annotations)
-- downsampling settings that override engine defaults
-- reference window size
-- active region settings
-- partitionBy info
-- Added getClazzAnnotations() as hub to retrieve various annotations values and class properties through reflection
-- Added getReadFilters() method to retrieve Read Filter annotations
-- getReadFilters() uses recursion to walk up the inheritance to also capture superclass annotations
-- getClazzAnnotations() stores collected info in doc handler root, which is unit.forTemplate in Doclet
-- Modified FreeMarker template to use the Readfilters info (displayed after arg table, before additional capabilities)
-- Tadaaa :-) #GSATDG-63 resolve
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads
GSA-774 #resolve
-- added a logger.error with a more descriptive message of what the most likely cause of the error is
Typical error happens when a walker's global variable is not initialized properly (usually in test conditions). The old error message was very hard to understand "Could not create module because of an exception of type NullPointerException ocurred caused by exception null"
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet. Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
-- log10 functions in QualityUtils allow -Infinity to allow log10(0.0) values
-- Fix edge condition of log10OneMinusX failing with Double.MIN_VALUE
-- Fix another edge condition of log10OneMinusX failing with a small but not min_value double
-- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and was as a side effect other utils converted this to a meaningless 0.0. This is all because there wasn't a unit test.
-- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value. Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual. Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils. Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
@argument (-)-version
(should this be @hidden?)
Prints out the version to System.out and quit(0)
No tests. (any ideas on how to test this would be happily accepted)
This helps a lot since FileChannel is very low-level and traversing the BAMIndex involves lots of short reads.
- Fixed a deterioration in BAMIndex due to rev'ed picard (see below)
- Added unit tests for SeekableBufferedStream
- Added integrationTests for GATKBAMIndex (in PileupWalkerIntegrationTest)
- Added a runtime-test to verify that the amount read equals the amount requested.
- Added failing tests with expectedExceptions
- Used a DataProvider to make code nicer
-- Renamed ValidatePileup to CheckPileup since validation is reserved word
-- Renamed AlignmentValidation to CheckAlignment (same as above)
-- Refactored category definitions to use constants defined in HelpConstants
-- Fixed a couple of minor typos and an example error
-- Reorganized the GATKDocs index template to use supercategories
-- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)
-- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads.
-- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast)
-- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling).
-- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test
-- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).
-- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools
-- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities
-- Reworded some category names and descriptions to be more explicit and user-friendly
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses. This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length. I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10. This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM. All unit tests from the original HMM implementations pass now
-- Added locs of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes. Fixed bug. Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit. Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum. This involved moving some initialize() code into the computeLikelihoods function. That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
-- Added CAPILLARY and HELICOS platforms as required by spec 1.4
-- Added extensive unit tests to ensure NGSPlatform functions work as expected.
-- Fixed some NPE bugs for reads that don't have RGs or PLs in their RG fields
- Throws user exception if it is.
- Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
- Added test to check this is working.
- Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
- Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
- Added comment about not changing the program record name, as per reviewer comments.
- Removed unused variable.
- I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null.
- Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.
The migration of org.broadinstitute.variant into the Picard repo is
complete. This commit deletes the org.broadinstitute.variant sources
from our repo and replaces it with a jar built from a checkout of the
latest Picard-public svn revision.
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixure of two broadly concented samples) were added
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is
_per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests to the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or unifiedGenotyper, into a new IntegrationTest class.
-- The progress meter isn't started until the GATK actually calls execute on the microscheduler. Now we get a message saying "Creating shard strategy" while this (expensive) operation runs
- Added contract enforcement for public methods
- Refactored the conversion from read -> (allele -> likelihood) to allele -> list[read] into its own method
- added method documentation for non getters/setters
- finals, finals everywhere
- Add in a unit test for the PerReadAlleleLikelihoodMap. Complete coverage except for .clear() and a method that is a straight call into a separately-tested utility class.
-- Bringing code up to document, style, and code coverage specs
-- Move GATKRunReportUnitTest to private
-- Fully expand GATKRunReportUnitTests to coverage writing and reading GATKRunReport to local disk, to standard out, to AWS.
-- Move documentation URL from GATKRunReport to UserException
-- Delete a few unused files from s3GATKReport
-- Added capabilities to GATKRunReport to make testing easier
-- Added capabilities to deserialize GATKRunReports from an InputStream
- Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure).
- Also added minor cleanup of BaseUtils
These patches to GATKBAMIndex are causing massive BAM index reading errors in
combination with the latest version of Picard. The bug is either in the patches
themselves or in the underlying SeekableBufferedStream class they rely on. Until
the cause can be identified, we are temporarily backing out these changes so that
we can continue to run with the latest Picard/Tribble.
This reverts commits:
81483ec21e528790dfa719d18cdee27d577ca98e
68cf0309db490b79eecdabb4034987ff825ffea8
54bb68f28ad5fe1b3df01702e9c5e108106a0176
This is a necessary prerequisite for the org.broadinstitute.variant migration.
-Picard and sam-jdk go from version 1.67.1197 to 1.84.1337
-Picard-private goes from version 2375 to 2662
-Tribble goes from version 119 to 1.84.1337
-RADICALLY trimmed down the list of classes we extract from Picard-private
(jar goes from 326993 bytes to 6445 bytes!)
Resources must be in a subdirectory called "resources" in the package
hierarchy to be picked up by the packaging system. Adding each resource
manually to the jars in build.xml does not cause the resource to be
added to the standalone GATK jar when we package the GATK, so it's best
to always use this convention.
If a read had an existing BAQ tag, was clipped by our engine, and couldn't have the BAQ recalculated (for whatever reason), then we would
fail in the BQSR because we would default to using the old tag (which no longer matched the length of the read bases).
The right thing to do here is to remove the old BAQ tag when RECALCULATE and ADD_TAG are the BAQ modes used but BAQ cannot be recalculated.
Added a unit test to ensure that the tags are removed in such a case.
-- Has the overall effect that the GATK user AWS keys are no longer visible in the gatk source as plain text. This will stop AWS from emailing me (they crawl the web looking for keys)
-- Added utility EncryptAWSKeys that takes as command line arguments the GATK user AWS access and secret keys, encrypts them with the GATK private key, and writes out the resulting file to resources in phonehome.
-- GATKRunReport now decrypts as needed these keys using the GATK public key as resources in the GATK bundle
-- Refactored the essential function of Resource (reading the resource) from IOUtils into the class itself. Now how to get the data in the resouce is straightforward
-- Refactored md5 calculation code from a byte[] into Utils. Added unit tests
-- Committing the encrypted AWS keys
-- #resolves https://jira.broadinstitute.org/browse/GSA-730
-- Moved previously inner class to MRUCachingSAMSequenceDictionary, and unit test to 100% coverage
-- Fully document all functions in GenomeLocParser
-- Unit tests for things like parsePosition (shocking it wasn't tested!)
-- Removed function to specifically create GenomeLocs for VariantContexts. The fact that you must incorporate END attributes in the context means that createGenomeLoc(Feature) works correctly
-- Depreciated (and moved functionality) of setStart, setStop, and incPos to GenomeLoc
-- Unit test coverage at like 80%, moving to 100% with next commit
-- The new version is roughly 2x faster than the previous version. The key here was to cleanup the workflow for validateGenomeLoc and remove the now unnecessary synchronization blocks from the CachingSequencingDictionary, since these are now thread local variables
-- #resolves https://jira.broadinstitute.org/browse/GSA-724
-- All functions tested. In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads. New version should be correct
-- Enforce correct ordering of supporting states in constructor
-- Enforce read ordering when adding reads to an active region in add
-- Fix bug in HaplotypeCaller map with new updating read spans. Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence
-- Encapsulate isActive field with an accessor function
-- Make sure that all state lists are unmodifiable, and that the docs are clear about this
-- ActiveRegion equalsExceptReads is for testing only, so make it package protected
-- ActiveRegion.hardClipToRegion must resort reads as they can become out of order
-- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region. The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed. Has a minor impact on PLs and RankSumTest values
-- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy
Please check that your commit hook is properly pointing at ../../private/shell/pre-commit
Conflicts:
public/java/test/org/broadinstitute/variant/VariantBaseTest.java
-Moved some of the more specialized / complex VariantContext and VCF utility
methods back to the GATK.
-Due to this re-shuffling, was able to return things like the Pair class back
to the GATK as well.
a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
With LegacyLocusIteratorByState deleted, the legacy downsampling implementation
was already non-functional. This commit removes all remaining code in the
engine belonging to the legacy implementation.
-- New algorithm will only try to create an active region if there's at least maxREgionSize + propagation distance states in the list. When that's true, we are guaranteed to actually find a region. So this algorithm is not only truly correct but as super fast, as we only ever do the search for the end of the region when we will certainly find one, and actually generate a region.
-- Helped ID more bugs in the ActivityProfile, necessitating a new algorithm for popping off active regions. This new algorithm requires that at least maxRegionSize + prob. propagation distance states have been examined. This ensures that the incremental results are the same as you get reading in an entire profile and running getRegions on the full profile
-- TODO is to remove incremental search start algorithm, as this is no longer necessary, and nicely eliminates a state variable I was always uncomfortable with
-- Now records the position of the current locus, as well as that of the last read. Necessary when passing through regions with no reads. The previous version would keep accumulating empty active regions, and never discharge them until end of traversal (if there was no reads in the future) or until a read was finally found
-- Protected a call to logger.debug with if ( logger.isDebugEnabled()) to avoid a lot of overhead in writing unseen debugger logging information
-- GATKSAMRecords now cache the result of the getAdapterBoundary, allowing us to avoid repeating a lot of work in LIBS
-- Added unittests to cover adapter clipping
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events. The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
-- Previously we allowed band pass filter size to be specified along with the sigma. But now that sigma is controllable from walkers and from the command line, we instead compute the filter size given the kernel from the sigma, including all kernel points with p > 1e-5 in the kernel. This means that if you use a smaller kernel you get a small band size and therefore faster ART
-- Update, as discussed with Ryan, the sigma and band size to 17 bp for HC (default ART wide) and max band size of 50 bp
-- Based on the new incremental activity profile
-- Unit Tested! Fixed a few bugs with the old band pass filter
-- Expand IncrementalActivityProfileUnitTest to test the band pass filter as well for basic properties
-- Add new UnitTest for BandPassIncrementalActivityProfile
-- Added normalizeFromRealSpace to MathUtils
-- Cleanup unused code in new activity profiles
-- The incremental version now processes active regions as soon as they are ready to be processed, instead of waiting until the end of the shard as in the previous version. This means that ART walkers will now take much less memory than previously. On chr20 of NA12878 the majority of regions are processed with as few as 500 reads in memory. Over the whole chr20 only 5K reads were ever held in ART at one time.
-- Fixed bug in the way active regions worked with shard boundaries. The new implementation no longer see shard boundaries in any meaningful way, and that uncovered a problem that active regions were always being closed across shard boundaries. This behavior was actually encoded in the unit tests, so those needed to be updated as well.
-- Changed the way that preset regions work in ART. The new contract ensures that you get exactly the regions you requested. the isActive function is still called, but its result has no impact on the regions. With this functionality is should be possible to use the HC as a generic assembly by forcing it to operate over very large regions
-- Added a few misc. useful functions to IncrementalActivityProfile
-- Required before I jump in an redo the entire activity profile so it's can be run imcrementally
-- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class.
-- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality. Almost all of the misc. walker changes are due to this name update
-- Code cleanup and docs for TraverseActiveRegions
-- Expanded unit tests for ActivityProfile and ActivityProfileState
I've resigned myself instead to create a mapping from Allele to Haplotype. It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
-- Allows us to make a stream of reads or an index BAM file with read having the following properties (coming from n samples, of fixed read length and aligned to the genome with M operator, having N reads per alignment start, skipping N bases between each alignment start, starting at a given alignment start)
-- This stream can be handed back to the caller immediately, or written to an indexed BAM file
-- Update LocusIteratorByStateUnitTest to use this functionality (which was refactored from LIBS unit tests and ArtificialSAMUtils)
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0? When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms.
Updated SelectHeaders to print out full interval arguments.
Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).
As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases. This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests. Fortunately that walker is being deprecated soon...
This way, we don't need to create a new Allele for every read/Haplotype pair to be placed in the PerReadAlleleLikelihoodMap (very inefficient). Also, now we can easily get the Haplotype associated with the best allele for a given read.
2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs. Until the HC uses meta data though, this won't work.
-- Add an option to not allocate always ArrayLists of targetSampleSize, but rather the previous size + MARGIN. This helps for LIBS as most of the time we don't need nearly so much space as we allow
-- consumeFinalizedItems returns an empty list if the reservior is empty, which it often true for our BAM files with low coverage
-- Allow empty sample lists for SamplePartitioner as these are used by the RefTraversals and other non-read based traversals
Make the reservoir downsampler use a linked list, rather than a fixed sized array list, in the expectFewOverflows case
-- Instead of storing a list of list of alignment starts, which is expensive to manipulate, we instead store a linear list of alignment starts. Not grouped as previously. This enables us to simplify iteration and update operations, making them much faster
-- Critically, the downsampler still requires this list of list. We convert back and forth between these two representations as required, which is very rarely for normal data sets (WGS NA12878 on chr20 is 0.2%, 4x WGS is even less).
-- No longer update the total counts in each per-sample state manager, but instead return delta counts that are updated by the overall ReadStateManager
-- One step on the way to improving the underlying representation of the data in PerSampleReadStateManager
-- Make LocusIteratorByState final
-- Use a linked hash map instead of a hash map since we want to iterate through the map fairly often
-- Ensure that we call doneSubmittingReads before getting reads for samples. This function call fell out before and since it wasn't enforced I only noticed the problem while writing comments
-- Don't make unnecessary calls to contains for map. Just use get() and check that the result is null
-- Use a LinkedList in PassThroughDownsampler, since this is faster for add() than the existing ArrayList, and we were's using random access to any resulting
-- Made LIBSPerformance a full featured CommandLineProgram, and it can be used to assess the LIBS performance by reading a provided BAM
-- ReadStateManager now provides a clean interface to iterate in sample order the per-sample read states, allowing us to avoid many map.get calls
-- Moved updateReadStates to ReadStateManager
-- Removed the unnecessary wrapping of an iterator in ReadStateManager
-- readStatesBySample is now a LinkedHashMap so that iteration occurs in LIBS sample order, allowing us to avoid many unnecessary calls to map.get iterating over samples. Now those are just map native iterations
-- Restructured collectPendingReads for simplicity, removing redundant and consolidating common range checks. The new piece is code is much clearer and avoids several unnecessary function calls
-- Only ReadBackedPileupImpl (concrete class) and ReadBackedPileup (interface) live, moved all functionality of AbstractReadBackedPileup into the impl
-- ReadBackedPileupImpl was literally a shell class after we removed extended events. A few bits of code cleanup and we reduced a bunch of class complexity in the gatk
-- ReadBackedPileups no longer accept pre-cached values (size, nMapQ reads, etc) but now lazy load these values as needed
-- Created optimized calculation routines to iterator over all of the reads in the pileup in whatever order is most efficient as well.
-- New LIBS no longer calculates size, n mapq, and n deletion reads while making pileups.
-- Added commons-collections for IteratorChain
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
-- Optimizations to AlignmentStateMachine
-- Properly count deletions. Added unit test for counting routines
-- AlignmentStateMachine.java is no longer recursive
-- Traversals now use new LIBS, not the old one
-- AlignmentStateMachine does what SAMRecordAlignmentState should really do. It's correct in that it's more accurate than the LIB_position tests themselves. This is a non-broken, correct implementation. Needs cleanup, contracts, etc.
-- This version is like 6x slower than the original implementation (according to the google caliper benchmark here). Obvious optimizations for future commit