-- The previously expanded ones are actually the missing values in the range. The previous ranges were correct. Removed the TODO to confirm them, as they are now officially confirmed
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libaries now, include gplots, grid, and gsalib to run correctly
-- Added Write method to BCF2 types that directly converts int value to byte stream. Deleted writeRawBytes(int)
-- encodeTypeDescriptor semi-inlined into encodeType so that the tests for overflow are done in just one place
-- Faster implementation of determineIntegerType for int[] values
-- BCF2Type enum has an overloaded method to read the type as an int from an input stream. This gets rid of a case statement and replaces it with just minimum tiny methods that should be better optimized. As side effect of this optimization is an overall cleaner code organization
-- All low-level reads throw IOException instead of catching it directly. This allows us to not try/catch in readByte, improving performance by 5% or so
-- Optimize encodeTypeDescriptor with final variables. Avoid using Math.min instead do inline comparison
-- Inlined willOverflow directly in its single use
-- Old version converted doubles directly from strings. New version uses VariantContext getAttributeAsDouble() that looks at the values directly to determine how to convert from Object to Double (via Double.valueOf, (Double), or (Double)(Integer)).
-- getAttributeAsDouble() is now smart in converting integers to doubles as needed
-- Removed unnecessary logging info in BCF2Codec
-- Added integration tests to ensure that VQSR works end-to-end with BCF2 using sites version of the file khalid sent to me
-- Added vqsr.bcf_test.snps.unfiltered.bcf file for this integration test
-- Added bonferroni corrected p-value pruning, so you tell it how significant of a different you are willing to collapse in the tree, and it prunes the tree down to this maximum threshold
-- Penalty is now a phred-scaled p-value not the raw chi2 value
-- Split command line arguments in VisualizeContextTree into separate arguments for each type of pruning
-- Basically I was treating the context history in the wrong direction, effectively predicting the further bases in the context based on the closer one. Totally backward. Updated the code to build the tree in the right direction.
-- Added a few more useful outputs for analysis (minPenalty and maxPenalty)
-- Misc. cleanup of the code
-- Overall I'm not 100% certain this is even the right way to think about the problem. Clearly this is producing a reasonable output but the sum of chi2 values over the entire tree is just enormous. Perhaps a MCMC convergence / sampling criterion would be a better way to think about this problem?
-- Better output file name defaults
-- Fixed nasty bug where I included non-existant quals in the contexts to process because they showed up in the Cycle covariate
-- Data is processed in qual order now, so it's easier to see progress
-- Logger messages explaining where we are in the process
-- When in UPDATE mode we still write out the information for an equivalent prune by depth for post analysis
-- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees
-- Docs, algorithm descriptions, etc so that it makes sense what's going on
-- VisualizeContextTree should really be simplified when into a single tool that just visualize the trees when / if we decide to make adaptive contexts standard part of BQSR
-- Misc. cleaning, organization of the code (recalibation tests were in private but corresponding actual files were public)
-- We are no likely to fail with an error when reading old BCF files, rather than just giving bad results
-- Added new class BCFVersion that consolidates all of the version management of BCF
-- Previous version would count all alt alleles as present in a sample, even if only 1 were present, because of the way VariantEval subsetted VCs
-- Updated code for subsetting VCs by sample to be clearer about how it handles rederiving alleles
-- Update a few pieces of code to get previous correct behavior
-- Updated a few MD5s as now ref calls at sites in dbSNP are counted as having a comp sites, and therefore show up in known sites when Novelty strat is on (which I think is correct)
-- Walkers that used old subsetting function with true are now using clearer version that does rederive alleles by default
-- Uses chi2 test for independences to determine if subcontext is worth representing. Give excellent visual results
-- Writes out analysis output file producing excellent results in R
-- Trivial reformatting of MathUtils
-- Reorganize functions in RecalDatum so that error rate can be computed indepentently. Added unit tests. Removed equals() method, which is a buggy without it's associated implementation for hashcode
-- New class RecalDatumTree based on QualIntervals that inherits from RecalDatum but includes the concept of sub data
-- VisualizeContextTree now uses RecalDatumTree and can trivially compute the penalty function for merging nodes, which it displays in the graph
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration. It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers. As code becomes embedded throughout GATK its should be refactored to live in utils
-- Removed unncessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces. This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
-- Moved Datum, the now unnecessary superclass, into RecalDatum
-- Fixed some obviously dangerous synchronization errors in RecalDatum, though these may not have caused problems because they may not have been called in parallel mode
-- Check if a traversal error occurred in the last shard
-- Catch ExecutionException from the TreeReducer and throw as our HMS execption
-- ShardTraverser just throws the exception as formatted by the HMS, rather than wrapping it as a RuntimeException itself
-- EngineFeaturesIntegrationTests now uses public exampleFASTA (faster), and does 1000x iterations (slower)
-- Better error message when a traveral error occurs (a real bug)
-- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50x times
-- A bit of cleanup in WalkerTest
We will use DocumentedGATKFeatures to create categories in our documentation. Eric I guess will be in charge of this. We need to remove walkers and think how to categorize everything.
Tools can be hidden from GATKdocs with the @Hidden annotation
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
-- VariantFiltration now properly sets passFilters in VC
-- BCF2 writer now properly decodes lazy BCF genotype data that it uses. Improper use generated a horrible subtle bug but the good news is that the extra checks I put in (unnecessarily a few days ago) caught the bug!
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
-- Always decode genotypes block when writing out a BCF file. If the header changes (and we currently don't know this easily) then the dictionary keys used in the genotypes block may be invalid. Temporarily added a private static boolean that turns off writing of the blocks until Eric and his team rewrite the header.
Signed-off-by: Mark DePristo <depristo@broadinstitute.org>
GATKDocs looks for a key on gsa4, and updates the forum with new walker if it exists.
More changes were made to the GATKDocs. Works nicely with bootstrap on and offline.
Cleaned up the code as well
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
Parameter wasn't working outside of the BQSR walker. It now takes the information on the recalibration report in other tools (PrintReads for example) and treats all reads as coming from the defined default platform.
* Did not touch archived walkers... those can be named whatever.
* Kept abstract classes that end in Walker untouched (e.g. LocusWalker, ReadWalker, ...)
* Renamed a few inner classes due to conflict when stripping off Walker from their outer classes: ContigStats, FlagStats and FastaStats.
-- Heng wants to use 0x0? to represent any missing type value, which in our implementation was invalid. Updated our codebase to support this construct. Heng said he'll update the BCF2 quick reference.
-- Enabled integration test reading Heng's ex2.bcf file
-- GATK now only warns in the case where the END info field isn't the same (or +1 due to padding) as the getEnd() function as determined by the GATK. Turns out there's a single record in the 1000G SV call set that doesn't have the right length
-- VariantContextTestProvider now tests that X = Y where X -> writing -> reading -> writing -> reading = Y for a variety of variant context inputs X
-- Added integration test reading 1000G SV chr1 calls (from Chris)
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status. This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF. The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
-- TODO: actually run integration tests when I have an internet connection
-- If eval has genotypes and comp has genotypes, then subset the genotypes of comp down to the samples being evaluated when considering TP, FP, FN, TN status. This is important in the case where you want to use this to assess, for example, the quality of calls on NA12878 but you have a CEU trio comp VCF. The previous version was counting sites polymorphic in mom against the calls in NA12878.
-- Added testdata VCF and integrationtests to ensure this behavior continues in the future
-- This actually proved to be a problem with Ion Torrent data where the base quality can be quite low, and so we need to include Q15 bases for calling effectively.
Move the RunReport S3 upload process onto a separate thread with a timeout allowing the parent to continue.
Signed-off-by: Khalid Shakir <kshakir@broadinstitute.org>
* BinaryTag covariate is Experimental, not Standard (this was breaking integration tests)
* New parameter in the Recalibration report requires new MD5 for one of the integration tests.
-- getMetaData now split into getMetaDataInSortedOrder() [old functionality] and getMetaDataInOriginalOrder() [according to the header order]. Important as BCF uses the order of elements in the header in the offsets to keys, and we were automatically sorting the BCF2 header which is out of order in samtools and the whole system was going crazy
-- Updating GATK code to use the appropriate header function (this is why so many files have changed)
-- BCF2 code was busted in not differentiating PASS from . from FILTER in VC (tests coming that will actually stress this)
-- Bugfix for adding contig lines to BCF2 header dictionary
-- VCFHeader metaData no longer sorted internally. The system now maintains the data in header order, and only sorts output as requested in API
-- VCFWriter and BCF2Writer now explictly sort their header lines
-- Don't allow filters to be added that are PASS in the contract
When hard-clipping predict when the read is going to be fully hard clipped to the point where only soft/hard-clips are left in the read and preemptively eliminate the read before the SAMRecord mathematics on malformed cigars kills the GATK.
-- GenotypeBuilder now sorts the list of filter strings so that the output is in a consistent order
-- calculateChromosomeCounts removes the AC/AF fields entirely when there are no alt alleles, to be on VCF spec for A defined info field values
-- Fixed bug in VariantDataManager that this validation mode was intended to detect going forward
-- Still no VariantRecalibrationWalkersIntegrationTest for indels with BCF2 but that's because LowQual is missing from test VCF
-- Bugfix for VCFDiffableReader: don't add null filters to object
-- BCF2Codec uses new VCFAlleleClipper to handle clipping / unclipping of alleles
-- AbstractVCFCodec: decodeLoc uses full decode() [still doesn't decode genotypes] to avoid dangerous code duplication. Refactored code that clipped alleles and determined end position into updateBuilderAllelesAndStop method that uses new VCFAlleleClipper. Fixed bug by ensuring the VCF codec always uses the END field in the INFO when it's provided, not just in the case where the there's a biallelic symbolic allele
-- Brand new home for allele clipping / padding routines in VCFAlleleClipper. Actually documented this code, which results in lots of **** negative comments on the code quality. Eric has promised that he and Ami are going to rethink this code from scratch. Fixed many nasty bugs in here, cleaning up unnecessary branches, etc. Added UnitTests in VCFAlleleClipper that actually test the code full. In the process of testing I discovered lots of edge cases that don't work, and I've commented out failing tests or manually skipped them, noting how this tests need to be fixed. Even introduced some minor optimizations
-- VariantContext: validateAllele was broken in the case where there were mixed symbolic and concrete alleles, failing validation for no reason. Fixed.
-- Added computeEndFromAlleles() function to VariantContextUtils and VariantContextBuilder for convenience calculating where the VC really ends given alleles
--
-- refactored allele clipping / padding code into VCFAlleleClipping class, and added much needed docs and TODOs for methods dev guys
-- Added real unit tests for (some) clipping operations in VCFUtilsUnitTest
-- Previous version was reading the size of the encoded genotypes vector for each genotype. This only worked because I never wrote out genotype field values with > 15 elements. Mauricio's killer DiagnoseTargets VCF uncovered the bug. Unfortunately since symbolic allele clipping is still busted those tests are still diabled
-- GenotypeContext getMaxPloidy was returning -1 in the case where there are no genotypes, but the answer should be 0.
-- They now throw an error, as its really unsafe to write out ./. as a special case in the VCFWriter as occurred previously.
-- Added convenience method in VariantContextUtils.addMissingSamples(vc, allSamples) that returns a complete VC where samples are given ./. Genotype objects
-- This allows us to properly pass tests of creating / writing / reading VCFs and BCFs, which previously differed because the VC from the VCF would actually be different from its original VC
-- Updated UG, UGEngine, GenotypeAndValidateWalker, CombineVariants, and VariantsToVCF to manage the master list of samples they are writing out and addMissingSamples via the VCU function
-- Don't use DP for average interval depth but rather AVG_INTERVAL_DP, which is a float now, not an int
-- Don't add PASS filter value to genotypes, as this is actually considered failing filters in the GATK. Genotype filters should be empty for PASSing sites
Updated HSP to use new padding arguments instead of flank intervals file, plus latest QC evals.
IntervalUtils return unmodifiable lists so that utilities don't mutate the collections.
Added a JavaCommandLineFunction.javaGCThreads option to test reducing java's automatic GC thread allocation based on num cpus.
Added comma to list of characters to convert to underscores in GridEngine job names so that GE JSV doesn't choke on the -N values.
JobRunInfo handles the null done times when jobs crash with strange errors.
-- Previously VCF header lines of count type G assumed that the sample would be diploid.
-- Generalized the code to take a VariantContext and return the right result for G count types by calling into the correct numGenotypes in GenotypeLikelihoods class
-- renamed calcNumGenotypes to numGenotypes, which uses a static cache in the class
-- calcNumGenotypes is private, and is used to build the static cache or to compute on the fly for uncached No. allele / ploidy combinations
-- VariantContext calls into getMaxPloidy in GenotypesContext, which caches the max ploidy among samples
-- Added extensive unit tests that compare A and G type values in genotypes
-- allowMissingVCFHeaders is now part of -U argument. If you want specifically unsafe VCF processing you need -U LENIENT_VCF_PROCESSING. Updated lots of files to use this
-- LENIENT_VCF_PROCESSING disables on the fly VCF header cleanup. This is now implemented via a member variable, not a class variable, which I believe was changing the GATK behavior during integration tests, causing some files to fail that pass when run as a single test because the header reading behavior was changing depending on previous failures.
-- Just completely wrong.
-- BCF2 shadowBCF now checks that the shadow bcf can be written to avoid /dev/null.bcf problem
-- Added samtools ex2.bcf file for decoding to our integrationtests
* field attributesCanBeModified - a null attributes object can't be modified in its current state
* method makeAttributesModifiable() - initialize a null attributes object to empty
-- Added MLEAC and MLEAF format lines to PoolCallerWalker
-- VariantFiltrationWalker now throws an error when JEXL variables cannot be found (XXX < 0.5) but passes through (albeit with a disgusting warning) when a variable is found but its value is a bad type (AF < 0.5) where AF == [0.04,0.00] at multi-allelic variation
-- Allow values to pass assertEquals in VariantContextTestProvider when one file contains X=[null, null] and the other has X missing
-- Update to 2.1.1 from 2.0
-- VariantFiltrationWalker now allows you to run with type unsafe selects, which all default to false when matching. So "AF < 0.5" works even in the presence of multi-allelics now.
--
-- MLAC and MLAF in PoolCaller now use standard MLE_AC and MLE_AF
-- VCFDiffableReader disables onTheFly fixing of VCF header fields so comparisons are easier when headers are changing
-- Flag fields with FLAG_KEY=0 are parsed as though FLAG_KEY were entirely absent in AbstractVCFCodec to fix bug where FLAG_KEY=0 was being translated into FLAG_KEY in output VCF, making a false flag value a true one
-- Fix the GT field value in VariantContextTestProviders so it isn't fixed 1000s of times during testing
-- Keys whose value is null are put into the VariantContext info attributes now
-- Created public static UnifiedGenotyper.getHeaderInfo that loads UG standard header lines, and use this in tools like PoolCaller
-- Created VCFStandardHeaderLines class that keeps standard header lines in the GATK in a single place. Provides convenient methods to add these to a header, as well as functionality to repair standard lines in incoming VCF headers
-- VCF parsers now automatically repair standard VCF header lines when reading the header
-- Updating integration tests to reflect header changes
-- Created private and public testdata directories (public/testdata and private/testdata). Updated tests to use test
-- SelectHeaders now always updates the header to include the contig lines
-- SelectVariants add UG header lines when in regenotype mode
-- Renamed PHRED_GENOTYPE_LIKELIHOODS_KEY to GENOTYPE_PL_KEY
-- Bugfix in BCF2 to handle lists of null elements (can happen in genotype field values from VCFs)
-- Throw error when VCF has unbounded non-flag values that don't have = value bindings
-- By default we no longer allow writing of BCF2 files without contig lines in the header
-- Moved GENOTYPE_KEY vcf header line to VCFConstants. This general migration and cleanup is on Eric's plate now
-- Updated HC to initialize the annotation engine in an order that allows it to write a proper VCF header. Still doesn't work...
-- Updating integration test files. Moved many more files into public/testdata. Updated their headers to all work correctly with new strict VCF header checking.
-- Bugfix for TandemRepeatAnnotation that must be unbounded not A count type as it provides info for the REF as well as each alt
-- No longer add FALSE values to flag values in VCs in VariantAnnotatorEngine. DB = 0 is never seen in the output VCFs now
-- Fixed bug in VCFDiffableReader that didn't differeniate between "." and "PASS" VC filter status
-- Unconditionally add lowQual Filter to UG output VCF files as this is in some cases (EMIT_ALL_SITES) used when the previous check said it wouldn't be
-- VariantsToVCF now properly writes out the GT FORMAT field
-- BCF2 codec explodes when reading symbolic alleles as I literally cannot figure out how to use the allele clipping code. Eric said he and Ami will clean up this whole piece of instructure
-- Fixed bug in BCF2Codec that wasn't setting the phase field correctly. UnitTested now
-- PASS string now added at the end of the BCF2 dictionary after discussion with Heng
-- Fixed bug where I was writing out all field values as BigEndian. Now everything is LittleEndian.
-- VCFHeader detects the case where a count field has size < 0 (some of our files have count = -1) and throws a UserException
-- Cleaned up unused code
-- Fixed bug in BCF2 string encoder that wasn't handling the case of an empty list of strings for encoding
-- Fixed bug where all samples are no called in a VC, in which case we (like the VCFwriter) write out no called diploid genotypes for all samples
-- We always write the number of genotype samples into the BCF2 nSamples header. How we can have a variable number of samples per record isn't clear to me, as we don't have a map from missing samples to header names...
-- Removed old filtersWereAppliedToContext code in VCF as properly handle unfiltered, filtered, and PASS records internally
-- Fastpath function getDisplayBases() in allele that just gives you the raw bytes[] you'd see for an Allele
-- Genotype fields no longer differentiate between unfiltered, filtered, and PASS values. Genotype objects are all PASS implicitly, or explicitly filtered. We only write out the FT values if at least one sample is filtered. Removed interface functions and cleaned up code
-- Refactored padAllele code from createVariantContextWithPaddedAlleles into the function padAllele so that it actually works. In general, **** NEVER COPY CODE **** if you need to share funcitonality make a function, that's why there were invented!
-- Increased the default number of records to read for DiffObjects to 1M
-- The GATK VCFWriter now enforces by default that all INFO, FILTER, and FORMAT fields be properly defined in the header. This helps avoid some of the low-level errors I saw in SelectVariants. This behavior can be disable in the engine with the --allowMissingVCFHeaders argument
-- Fixed broken annotations in TandemRepeat, which were overwriting AD instead of defining RPA
-- Optimizations to VariantEval, removing some obvious low-hanging fruit all in the subsetting of variants by sample
-- SelectVariants header fixes -- Was defining DP for the info field as a FORMAT field, as for AC, AF, and AN original
-- Performance optimizations in BCF2 codec and writer
-- using arrays not lists for intermediate data structures
-- Create once and reuse an array of GenotypeBuilders for the codec, avoiding reallocating this data structure over and over
-- VCFHeader (which needs a complete rewrite, FYI Eric)
-- Warn and fix on the way flag values with counts > 0
-- GenotypeSampleNames are now stored as a List as they are ordered, and the set iteration was slow. Duplicates are detected once at header creation.
-- Explicitly track FILTER fields for efficient lookup in their own hashmap
-- Automatically add PL field when we see a GL field and no PL field
-- Added get and has methods for INFO, FILTER, and FORMAT fields
-- No longer add AC and AF values to the INFO field when there's no ALT allele
-- Memory efficient comparison of VCF and BCF files for shadow BCF testing. Now there's no (memory) constraint on the size of the files we can compare
-- Because of VCF's limited floating point resolution we can only use 1 sig digit for comparing doubles between BCF and VCF
* Sites with more soft clipped bases than regular will force-trigger a variant region
* No more unclipping/reclipping, RR machinery now handles soft clips natively.
* implemented support for base insertion and base deletion quality scores in synthetic and regular reads.
* GATKSAMRecord clone() now creates a fresh object for temporary attributes if one is present.
note: SAMRecords create a shallow copy of the tempAttribute object which was causing multiple reads (that came from the same read) to have their temporary attributes modified by one another inside reduce reads. Beware, if you're not using GATKSAMRecord!
-- Inline encodeString that doesn't go via List<Byte> intermediate
-- Inline encodeString that uses byte[] directly so that we can go from Allele.getBytes() => BCF2
-- Fast paths for Atomic Float and Atomic Integer values avoiding intermediate list creation
-- Final UG integration test update
-- encodeTyped in BCF2Encoder now with specialized versions for int, float, and string, avoiding unnecessary intermediate list creation and dynamic type checking. encodeTypedMissing also includes inline operations now instead of using Collections.emptyList() version. Lots of contracts. User code updated to use specialized versions where possible
-- Misc code refactoring
-- Updated VCF float formating to always include 3 sig digits for values < 1, and 2 for > 1. Updating MD5s accordingly
-- Expanded testing of BCF2Decoder to really use all of the encodeTyped* operations
-- Cleanup a few contracts
-- BCF2FieldManager uses new VCFHeader accessors for specific info and format fields
-- A few simple optimizations
-- VCF header samples stored in String[] in the writer for fast access
-- getCalledChrCount() uses emptySet instead of allocating over and over empty hashset
-- VariantContextWriterStorage now creates a 1MB buffered output writer, which results in 3x performance boost when writing BCF2 files
-- A few editorial comments in VCFHeader
-- Final merge conflicts resolved
-- BCF2Writer now supports case where a sample is present in the header but the sample isn't in the VC, in which case we create an empty sample and encode that
-- Replaced getAttributes with getDP() and not the old style getAttribute, where appropriate
-- Added getAnyAttribute and hasAnyAttribute that actually does the expensive work of seeing if the key is something like GT, AD or another inline datum, and returns it. Very expensive but convenient.
-- Fixed nasty subsetting bug in SelectVariants with excluding samples
-- Generalized VariantsToTable to work with new inline attributes (using getAnyAttribute) as well as GT
-- Bugfix for dropping old style GL field values
-- Added test to VCFWriter to ensure that we have the sample number of samples in the VC as in the header
-- Bugfix for Allele.getBaseString to properly show NO_CALL alleles
-- getGenotypeString in Genotype returns "NA" instead of null for ploidy == 0 genotypes
-- Cleanup some (but not all) VCF3 files. Turns out there are lots so...
-- Refactored gneotype parser from VCFCodec and VCF3Codec into a single shared version in AbstractVCFCodec. Now VCF3 properly handles the new GenotypeBuilder interface
-- Misc. bugfixes in GenotypeBuilder
-- Now only includes leaf nodes in the summary, i.e., summaries of the form "*.*....*.X", which are really the most valuable to see. This calculation can be accomplished in linear time for N differences, rather than the previous O(n^2) algorithm
-- Now computes the max number of elements to read correctly. Counts now the size of the entire element tree, not just the count of the roots, which was painful because the trees vary by orders of magnitude in size.
-- Because of this we can enforce a meaningful, useful value for the max elements in MD5 or 100K, and this works well.
-- Added integration test for new leaf and old pairwise calculations
-- Bugfix for Utils.join(sep, int[]) that was eating the first element of the AD, PL fields
-- BCFFieldEncoder and writers divide up the task of formatting values (atomic or vector, ints, strings, floats, etc) from the task of writing these out at the sites or genotypes level.
-- Allows us to create efficient encoders for specific combinations of header fields, such as int[] encoded values with exactly 3 values
-- Currently only used for INFO fields, but subsequent commit will include optimized genotype field encoder
-- Allowed us to naturally support encoding of lists of strings
-- Bugfixes in VariantContextUtils introduced in genotype -> genotypebuilder conversion
-- Fixes for integration test failures
-- Enabling contig updates
-- WalkerTest now prints out relative paths where possible to make cut/paste/run easier
-- As values in VCs are becoming their native Java types the VCFWriter needs to own proper float formating.
-- Created a smart float formatter in VCFWriter, with unit tests
-- Removed makePrecisionFormatStringFromDenominatorValue and its uses
-- Fix broken contracted
-- Refactored some code from the encoder to utils in BCF2
-- HaplotypeCaller's GenotypingEngine was using old version of subset to context. Replaced with a faster call that I think is correct. Ryan, please confirm.
-- FastGenotypes are the default in the engine. Use --useSlowGenotypes engine argument to return to old representation
-- Cleanup of BCF2Codec. Good error handling. Added contracts and docs.
-- Added a few more contacts and docs to BCF2Decoder
-- Optimized encodePrimitive in BCF2Encoder
-- Removed genotype filter field exceptions
-- Docs and cleanup of BCF2GenotypeFieldDecoders
-- Deleted unused BCF2TestWalker
-- Docs and cleanup of BCF2Types
-- Faster version of decodeInts in VCFCodec
-- BCF2Writer
-- Support for writing a sites only file
-- Lots of TODOs for future optimizations
-- Removed lack of filter field support
-- No longer uses the alleleMap from VCFWriter, which was a Allele -> String, now uses Allele -> Integer which is faster and more natural
-- Lots of docs and contracts
-- Docs for GenotypeBuilder. More filter creation routines (unfiltered, for example)
-- More extensive tests in VariantContextTestProfiler, including variable length strings in genotypes and genotype filters. Better genotype comparisons
-- decodeIntArray in BCF2 decoder allows us to more efficiently read ints and int[] from stream directly into Genotype object
-- Code cleanup / contracts added were appropriate
-- V2 will have a yet more optimized path...
-- Eliminated the large intermediate map from field name to list of list<Integer> values needed to create genotypes without the GenotypeBuilder. The new code is cleaner and simply fills in an array of GenotypeBuilders as it moves through the column layout in BCF2
-- Now we create once decoders specialized for each GT field (GT, AD, etc) that can be optimized for putting data into the GenotypeBuilder. In a subsequent commit these will actually use lower level BCF2 decoders to create the low-level ints and int[], avoiding the intermediate List<Integer> form
-- Reduced the amount of data further to be computed in the DiffEngine. The DiffEngine algorithm needs to be rethought to be efficient...
-- Builder now provides a depreciated log10pError function to make a new GQ value
-- Genotype is an abstract class, with most of the associated functions implemented here and not in the derived Fast and Slow versions
-- Lots of contracts
-- Bugfixes throughout
-- The way I was handling the contig offset ordering wasn't correct. Now the contigs are always indexed in the order in which their corresponding populate() functions are called, so that the order of the contigs is given by the order in which they are in the file, or in our refDict. It has nothing to do with the contig index itself.
-- SelectVariants no longers prints all samples to the screen if you aren't selecting any explicitly
-- Created a new Genotype interface with a more limited set of operations
-- Old genotype object is now SlowGenotype. New genotype object is FastGenotype. They can be used interchangable
-- There's no way to create Genotypes directly any longer. You have to use GenotypeBuilder just like VariantContextBuilder
-- Modified lots and lots of code to use GenotypeBuilder
-- Added a temporary hidden argument to engine to use FastGenotype by default. Current default is SlowGenotype
-- Lots of bug fixes to BCF2 codec and encoder.
-- Feature additions
-- Now properly handles BCF2 -> BCF2 without decoding or encoding from scratch the BCF2 genotype bytes
-- Cleaned up semantics of subContextFromSamples. There's one function that either rederives or not the alleles from the subsetted genotypes
-- MASSIVE BUGFIX in SelectVariants. The code has been decoding genotypes always, even if you were not subsetting down samples. Fixed!
-- Created new clean FastGenotype and GenotypeBuilder classes with contracts to enforce expected behavior and correctness. Tested utility of this approach by rewritting -- and then commenting out -- a path in BCF2Codec that could use this new code. Much cleaner interface now, but not yet hooked up to anything
-- Disabled SHADOW_BCF generation and generating contigs in the output VCFs automatically to ensure that the current code bases integration tests, before switching the code to new Genotype class
-- Code cleanup. Moved "AD" to VCFConstants under GENOTYPE_ALLELIC_DEPTHS. Uses in code replaced with constant
-- Refactored BCF2Codec into a LazyGenotypesDecoder object that provides on-demand genotype decoding of BCF2 data blocks a la VCFCodec.
-- VCFHeader has getters for sampleNamesInOrder and sampleNameToOffset instead of protected variables directly accessed by vcfcodec
This is in response to a request from Mauricio to make it easier
to use the downsamplers with GATKSAMRecords (as opposed to SAMRecords)
without having to do any cumbersome typecasting. Sadly, Java
language limitations make this sort of solution the best choice.
Thanks to Khalid for his feedback on this issue.
Also:
-added a unit test to verify GATKSAMRecord support with no typecasting required
-added some unit tests for the FractionalDownsampler that Mauricio will/might be using
-moved classes from private to public to better sync up with my local development
branch for engine integration
Moved some stuff in the DiagnoseTargets walker to the more general ThresHolder class
Minor tweaks
FindCoveredIntervals supports Gathering
FindCoveredIntervals outputs an interval list instead of GATKReport
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
* Re-wrote the sliding window approach to allow the variant region not to clip the reads that overlap it.
* Updated consensus to include only reads that were not passed on by the variant region, header counts are updated on the fly to avoid recompute
* Added soft clipped bases to ReduceReads analysis by unclipping high quality soft-clips then re-clipping after reduce reads
* Updated all integration tests