- Fix for M_Trieb's error report on the forum, and addition of integration tests to cover the walker.
- Addition of StructuralIndel as a class of variation within the VariantContext. These are for variants with a full alt allele that's >150bp in length.
- Adaptation of the MVLikelihoodRatio to work for a set of trios (takes the max over the trios of the MVLR)
- InsertSizeDistribution changed to use the new gatk report output (it was previously broken)
- RetrogeneDiscovery changed to be compatible with the new gatk report
- A maxIndelSize argument added to SelectVariants
- ByTranscriptEvaluator rewritten for cleanliness
- VariantRecalibrator modified to not exclude structural indels from recalibration if the mode is INDEL
- Documentation added to DepthOfCoverageIntegrationTest (no, don't yell at chartl ;_; )
Also sorry for the long commit history behind this that is the result of fixing merge conflicts. Because this *also* fixes a conflict (from git stash apply), for some reason I can't rebase all of them away. I'm pretty sure some of the commit notes say "this note isn't important because I'm going to rebase it anyway".
-- When merging multiple VCF records at a site, the combined VCF record has the QUAL of the first VCF record with a non-MISSING QUAL value. The previous behavior was to take the max QUAL, which resulted in sometime strange downstream confusion.
-- Removed REFERENCE_BASES option. You only have REFERENCE now. There's no efficiency savings for the REFERENCE_BASES option any longer, since the reference bases are loaded lazy so if you don't use them there's effectively no cost to making the RefContext that could load them.
Major idea is that per-read haplotype likelihoods are now stored in a single unified object of class PerReadAlleleLikelihoodMap. Class implementation in theory hides internal storage details from outside work (still may need work cleaning up interface), and this object(or rather, a Map from Sample->perReadAlleleLikelihoodMap) is produced by UGCalcLikelihoods. The genotype calculation is also able to potentially use this info if needed. All InfoFieldAnnotations now get an extra argument with this map. Currently, this map is only produced for indels in UG, or for all variants within HaplotypeCaller. If this map is absent (SNPs in UG), the old Pileup interface is used, but it's avoided whenever possible. FORMAT annotations are not yet changed but will be focus of second step. Major benefit will be that annotations will be able to very easily discard non-informative reads for certain events. HaplotypeCaller also uses this new class, and no longer hard-codes the mapping of allele ->list(reads) but instead uses the same objects and interfaces as the rest of the modules. Code still needs further testing/cleaning/reviewing/debugging
-- BCF2 now determines whether it can safely write out raw genotype blocks, which is true in the case where the VCF header of the input is a complete, ordered subset of the output header. Added utilities to determine this and extensive unit tests (headerLinesAreOrderedConsistently)
-- Cleanup collapseStringList and exploreStringList for new unit tests of BCF2Utils. Fixed bug in edge case that never occurred in practice
-- VCFContigHeaderLine now provides its own key (VCFHeader.CONTIG_KEY) directly instead of requiring the user to provide it (and hoping its right)
-- More ways to access the data in VCFHeader
-- BCF2Writer uses a cache to avoid recomputing unnecessarily whether raw genotype blocks can be emitted directly into the output
-- Optimization of fullyDecodeAttributes -- attributes.size() is expensive and unnecessary. We just guess that on average we need ~10 elements for the attribute map
-- CombineVariants optimization -- filters are online HashSet but are sorted at the end by creating a TreeSet
-- makeCombinations is now makePermutations, and you can request to create the permutations with or without replacement
-- CombineVariants is now TreeReducible!
-- Integration tests running in parallel all pass except one (will fix) due to incorrect use of db=0 flag on input from old VCF format
-- Includes header page
-- Table of arguments (Arguments)
-- Summary of counts (RecalData0)
-- Summary of counts by qual (RecalData1)
-- Fixed bug in output that resulted in covariates list always being null (updated md5s accordingly)
-- BQSR.R loads all relevant libaries now, include gplots, grid, and gsalib to run correctly