Problem:
matchToMatch transition calculation was wrong resulting in transition probabilites coming out of the Match state that added more than 1.
Reports:
https://www.pivotaltracker.com/s/projects/793457/stories/62471780https://www.pivotaltracker.com/s/projects/793457/stories/61082450
Changes:
The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies.
MatchToMatch transtion calculation has been fixed and implemented in PairHMMModel.
Affected integration test md5 have been updated, there were no differences in GT fields and example differences always implied
small changes in likelihoods that is what is expected.
2. Wrapped _mm_empty() with ifdef SIMD_TYPE_SSE
3. OpenMP disabled
4. Added code for initializing PairHMM's data inside initializePairHMM -
not used yet
SSE compilation warning.
2. Added code to dynamically select between AVX, SSE4.2 and normal C++ (in
that order)
3. Created multiple files to compile with different compilation flags:
avx_function_prototypes.cc is compiled with -xAVX while
sse_function_instantiations.cc is compiled with -xSSE4.2 flag.
4. Added jniClose() and support in Java (HaplotypeCaller,
PairHMMLikelihoodCalculationEngine) to call this function at the end of
the program.
5. Removed debug code, kept assertions and profiling in C++
6. Disabled OpenMP for now.
In unifying the arguments it was clear that the values were inconsistent throughout the code, so now there's a
single value that is intended to be more liberal in what it allows in (in an attempt to increase sensitivity).
Very little code actually changes here, but just about every md5 in the HC integration tests are different (as
expected). Added another integration test for the new argument.
To be used by David R to test his per-branch QC framework: does this commit make the HC look better against the KB?
1. Moved computeLikelihoods from PairHMM to native implementation
2. Disabled debug - debug code still left (hopefully, not part of
bytecode)
3. Added directory PairHMM_JNI in the root which holds the C++
library that contains the PairHMM AVX implementation. See
PairHMM_JNI/JNI_README first
It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now.
Since I thought that it might prove useful to others, I moved it to protected and added integration tests.
GERALDINE: NEW TOOL ALERT!
The code comments very clearly state that INFO fields shouldn't be propagated into the output,
but someone must have accidentally changed it afterwards. This is just a simple one-line fix
to make sure the code adhered to the comments.
Delivers #63333488.
-Added docs for ERC mode in HC
-Move RecalibrationPerformance walker since to private since it is experimental and unsupported
-Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them
-Improved error msg for conflict between per-interval aggregation and -nt
-Minor clean up in exception docs
-Added Toy Walkers category for devs and dev supercat (to build out docs for developers)
-Added more detailed info to GenotypeConcordance doc based on Chris forum post
-Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples)
-Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it)
-Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)
It is true that indels of length > 1 have higher QUALS than those of length = 1. But for the HC those
QUALS are not that much higher, and it doesn't continue scaling up as the indels get larger. So we no
longer normalize by indel length (which massively over-penalizes larger events and effectively drops their
QD to 0).
For the UG the previous normalization also wasn't perfect. Now we divide the indel length by a factor
of 3 to make sure that QD is consistent over the range of indel lengths.
Integration tests change because QD is different for indels.
Also, got permission from Valentin to archive a failing test that no longer applies.
Thanks to Kurt on the GATK forum for pointing this all out.
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs. Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.
For VariantContext tags with a list, by default the tags for the -V argument are applied
unless overridden by the individual line. In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path). For example:
foo.vcf
VCF,name=bar bar.vcf
Note that a VCF list file name must end with '.list'.
Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
-- New -a argument in the VQSR for specifying additional data to be used in the clustering
-- New NA12878KB walker which creates ROC curves by partitioning the data along VQSLOD and calculating how many KB TP/FP's are called.
For example, this tool can be used for processing bowtie RNA-seq data.
Each read with k N-cigar elemments is plit to k+1 reads. The split is done by hard clipping the bases rest of the bases.
In order to do it, few changes were introduced to some other clipping methods:
- make a segnificant change in ClippingOp.hardClip() that prevent the spliting of read with cigar: 1M2I1N1M3I.
- change getReadCoordinateForReferenceCoordinate in ReadUtil to recognize Ns
create unitTests for that walker:
- change ReadClipperTestUtils to be more general in order to use its code and avoid code duplication
- move some useful methods from ReadClipperTestUtils to CigarUtils
create integration test for that class
small change in a comment in FullProcessingPipeline
last commit:
Address review comments:
- move to protected under walkers/rnaseq
- change the read splitting methods to be more readable and more efficiant
- change (minor changes) some methods in ReadClipper to allow the changes in split reads
- add (minor change) one method to CigarUtils to allow the changes in split reads
- change ReadUtils.getReadCoordinateForReferenceCoordinate to include possible N in the cigar
- address the rest of the review comments (minor changes)
- fix ReadUtilsUnitTest.testReadWithNs acoording to the defult behaviour of getReadCoordinateForReferenceCoordinate (in case of refernce index that fall into deletion, return the read index of the base before the deletion).
- add another test to ReadUtilsUnitTest.testReadWithNs
- Allow the user to print the split positions (not working proparly currently)
This is a tool that we use internally validate the ReduceReads development. I think it should be
private. There is no need to improve docs.
[delivers #54703398]
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine
In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process. So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants. I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.
Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
Renamed it CalculateGenotypePosteriors.
Also, moved the utility code to a proper utility class instead of where Chris left it.
No actual code modifications made in this commit.
In general, test classes cannot use 3rd-party libraries that are not
also dependencies of the GATK proper without causing problems when,
at release time, we test that the GATK jar has been packaged correctly
with all required dependencies.
If a test class needs to use a 3rd-party library that is not a GATK
dependency, write wrapper methods in the GATK utils/* classes, and
invoke those wrapper methods from the test class.
Previously, we would strip out the PLs and AD values since they were no longer accurate. However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.
Now the PLs and AD get correctly selected down.
While I was in there I also refactored some related code in subsetDiploidAlleles(). There were no real changes there - I just
broke it out into smaller chunks as per our best practices.
Added unit tests and updated integration tests.
Addressed reviews.
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies what engine should be used to generate read vs haplotype likelihoods.
PairHMM : standard full-PairHMM approach.
GraphBased : using the assembly graph to accelarate the process.
Random : generate random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the larger kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to spport
local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations
* GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
of the graph-based likelihood approach.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
to calcualte the likelihoods using the graph as an scafold.
* HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed probrems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private - protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed probrem with HaplotypeGraph unit test
Adding Graph-based likelihood ratio calculation to HC
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.
New HC Options (both Advanced and Hidden):
==========================================
--likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)
Specifies what engine should be used to generate read vs haplotype likelihoods.
PairHMM : standard full-PairHMM approach.
GraphBased : using the assembly graph to accelarate the process.
Random : generate random likelihoods - used for benchmarking purposes only.
--heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)
It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)
COMBO_MIN : use the smallest kmerSize with all haplotypes.
COMBO_MAX : use the larger kmerSize with all haplotypes.
MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.
Major code changes:
===================
* Introduce multiple likelihood calculation engines (before there was just one).
* Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.
* Added yet another PairHMM implementation with a different API in order to spport
local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).
Major components:
================
* FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations
* GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
of the graph-based likelihood approach.
* GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
to calcualte the likelihoods using the graph as an scafold.
* HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
used by GraphBasedLikelihoodCalculationEngineInstance to do its work.
* ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.
Remove mergeCommonChains from HaplotypeGraph creation
Fixed bamboo issues with HaplotypeGraphUnitTest
Fixed probrems with HaplotypeCallerIntegrationTest
Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest
Fixed ReadThreadingLikelihoodCalculationEngine issues
Moved event-block iteration outside GraphBased*EngineInstance
Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem
Added a bit more documentation to EventBlockSearchEngine
Fixing some private - protected dependency issues
Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments
Fixed FastLoglessPairHMM public -> protected dependency
Fixed probrem with HaplotypeGraph unit test
-- For very large whole genome datasets with over 2M variants overlapping the training data randomly downsample the training set that gets used to build the Gaussian mixture model.
-- Annotations are ordered by the difference in means between known and novel instead of by their standard deviation.
-- Removed the training set quality score threshold.
-- Now uses 2 gaussians by default for the negative model.
-- Num bad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.
-- Model plots are now generated much faster.
-- Stricter threshold for determining model convergence.
-- All VQSR integration tests change because of these changes to the model.
-- Add test for downsampling of training data.
For reads with high MQs (greater than max byte) the MQ was being treated as negative and failing
the min MQ filter.
Added unit test.
Delivers PT#61567540.
His code was excessively clipping reads because it was looking at their cigar string instead of just
the read length. This meant that it was basically impossible to call large deletions in UG even with
perfect evidence in the reads (as reported by Craig D).
Integration tests change because (IMO after looking at sites in IGV) reads with indels similar to the one
being genotyped used to be given too much likelihood and now give less.
Added unit tests for new methods.
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.
Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
while not converged:
* AC = [external AC] + [sample AC]
* Prior = DirichletMultinomial[AC]
* Posteriors = [sample GL + Prior]
* sample AC = MLEAC(Posteriors)
This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase.
Fully unit tested.
Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
There was already a note in the code about how wrong the implementation was.
The bad code was causing a single-node graph to get cleaned up into nothing when pruning tails.
Delivers PT #61069820.
Motivation:
The API was different between the regular PairHMM and the FPGA-implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to use account for this. The goal is to change the API to be the same
for all implementations, and make it easier to access.
PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotpes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.
CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.
LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods
Commented out the (unused) system for finding best N haplotypes for genotyping.
PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
-- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
Making the usage more clear since the parameter is being used over and over to define baited
regions. Updated the headers accordingly and made it more readable.
Quick fix the missing column header in the QualifyMissingIntervals
report.
Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.
* add a length of the overlaping interval metric as per CSER request
* standardized the distance metrics to be positive when fully overlapping and the longest off-target tail (as a negative number) when not overlapping
* add gatkdocs to the tool (finally!)
* add a new column to do what I have been doing manually for every project, understand why we got no usable coverage in that interval
* add unit tests -- this tool is now public, we need tests.
* slightly better docs -- in an effort to produce better docs for this tool
most people don't care about excessive coverage (unless you're very
particular about your analysis). Therefore the best possible default
value for this is Integer.maxValue so it doesn't get in the way.
Itemized Changes:
* change maximumCoverage threshold to Integer.maxValue
[delivers #57353620]
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block
-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
PairHMMSyntheticBenchmark and PairHMMEmpirical benchmark were written to test the banded pairHMM, and were archived along with it. I returned them to the test directory for use in benchmarking the ArrayLoglessPairHMM. I commented out references to the banded pairHMM (which was left in archive), rather than removing those references entirely.
Renamed PairHMMEmpiricalBenchmark to PairHMMBandedEmpiricalBenchmark and returned it to the archive. It has a few problems for use as a general benchmark, including initializing the HMM too frequently and doing too much setup work in the 'time' method. However, since the size selection and debug printing are useful for testing the banded implementation, I decided to keep it as-is and archive it alongside with the other banded pairHMM classes. I did fix one bug that was causing the selectWorkingData function to return prematurely. As a result, the benchmark was only evaluating 4-40 pairHMM calls instead of the desired "maxRecords".
I wrote a new PairHMMEmpiricalBenchmark that simply works through a list of data, with setup work and hmm-initialization moved to its own function. This involved writing a new data read-in function in PairHMMTestData. The original was not maintaining the input data in order, the end result of which would be an over-estimate of how much caching we are able to do. The new benchmark class more closely mirrors real-world operation over large data.
It might be cleaner to fix some of the issues with the BandedEmpiricalBenchmark and use one read-in function. However, this would involve more extensive changes to:
PairHMMBandedEmpiricalBenchmark
PairHMMTestData
BandedLoglessPairHMMUnitTest
I decided against this as the banded benchmark and unit test are archived.
Returned Logless Caching implementation to the default in all cases. Changing to the array version will await performance benchmarking
Refactored many pieces of functionality in ArrayLoglessPairHMM into their own methods.
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.
Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.
This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.
We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)
Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
-This was a dependency of the test suite, but not the GATK proper,
which caused problems when running the test suite on the packaged
GATK jar at release time
-Use GATKVCFUtils.readVCF() instead
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
There is now a command-line option to set the model to use in the HC.
Incorporated Ryan's current (unmerged) branch in for most of these changes.
Because small changes to the math can have drastic effects, I decided not to let users tweak
the calculations themselves. Instead they can select either NONE, CONSERVATIVE (the default),
or AGGRESSIVE.
Note that any base insertion/deletion qualities from BQSR are still used.
Also, note that the repeat unit x repeat length approach gave very poor results against the KB,
so it is not included as an option here.
- Make -rod required
- Document that contaminationFile is currently not functional with HC
- Document liftover process more clearly
- Document VariantEval combinations of ST and VE that are incompatible
- Added a caveat about using MVLR from HC and UG.
- Added caveat about not using -mte with -nt
- Clarified masking options
- Fixed docs based on Erics comments
-- When provided, this argument causes us to only emit the selected samples into the VCF. No INFO field annotations (AC for example) or other features are modified. It's current primary use is for efficiently evaluating joint calling.
-- Add integration test for onlyEmitSamples
-- The previous approach in VQSR was to build a GMM with the same max. number of Gaussians for the positive and negative models. However, we usually have many more positive sites than negative, so we'd prefer to use a more detailed GMM for the positive model and a less well defined model using few sites for the negative model.
-- Now the maxGaussians argument only applies to the positive model
-- This update builds a GMM for the negative model with a default 4 max gaussians (though this can be controlled via command line parameter)
-- Removes the percentBadVariants argument. The only way to control how many variants are included in the negative model is with minNumBad
-- Reduced the minNumBad argument default to 1000 from 2500
-- Update MD5s for VQSR. md5s changed significantly due to underlying changes in the default GMM model. Only sites with NEGATIVE_TRAINING_LABELs and the resulting VQSLOD are different, as expected.
-- minNumBad is now numBad
-- Plot all negative training points as well, since this significantly changes our view of the GMM PDF
-- In the case where there's some variation to assembly and evaluate but the resulting haplotypes don't result in any called variants, the reference model would exception out with "java.lang.IllegalArgumentException: calledHaplotypes must contain the refHaplotype". Now we detect this case and emit the standard no variation output.
-- [delivers #54625060]
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.
Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting *tons* of re-compute.
Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
1. Removing old legacy code that was capping the positional depth for reduced reads to 127.
Unfortunately this cap affectively performs biased down-sampling and throws off e.g. FS numbers.
Added end to end unit test that depth counts in RR can be higher than max byte.
Some md5s change in the RR tests because depths are now (correctly) no longer capped at 127.
2. Down-sampling in ReduceReads was not safe as it could remove het compressed consensus reads.
Refactored it so that it can only remove non-consensus reads.
Now only filtered reads are unstranded. All consensus reads have strand, so that we
emit 2 consensus reads in general now: one for each strand.
This involved some refactoring of the sliding window which cleaned it up a lot.
Also included is a bug fix:
insertions downstream of a variant region weren't triggering a stop to the compression.
So, compromise solution is to go back to having biallelic PLs but emit a new FORMAT field, called APL, which has the 10 values, but all other statistics and regular PLs are computed as before.
Note that integration test had to be disabled, as the BCF2 codec apparently doesn't support writing into genotype fields other than PL,DP,AD,GQ,FT and GT.
Problem
-------
Qualify Missing Intervals only accepted GATK formatted interval files for it's coding sequence and bait parameters.
Solution
-------
There is no reason for such limitation, I erased all the code that did the parsing and used IntervalUtils to parse it (therefore, now it handles any type of interval file that the GATK can handle).
ps: Also added an average depth column to the output
- Added integration test to show that providing a contamination value and providing same value via a file results in the same VCF
- overrode default contamination value in test
1. Some minor refactorings and claenup (e.g. removing unused imports) throughout.
2. Updates to the KB assessment functionality:
a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.
3. Make the HC consistent in how it treats the pruning factor. As part of this I removed and archived
the DeBruijn assembler.
4. Improvements to the likelihoods for the HC
a. We now include a "tristate" correction in the PairHMM (just like we do with UG). Basically, we need
to divide e by 3 because the observed base could have come from any of the non-observed alleles.
b. We now correct overlapping read pairs. Note that the fragments are not merged (which we know is
dangerous). Rather, the overlapping bases are just down-weighted so that their quals are not more
than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
turned into Q0s for now.
c. We no longer run contamination removal by default in the UG or HC. The exome tends to have real
sites with off kilter allele balances and we occasionally lose them to contamination removal.
5. Improved the dangling tail merging implementation.
-- Assembly graph building now returns an object that describes whether the graph was successfully built and has variation, was succesfully built but didn't have variation, or truly failed in construction. Fixing an annoying bug where you'd prefectly assembly the sequence into the reference graph, but then return a null graph because of this, and you'd increase your kmer because it null was also used to indicate assembly failure
--
-- Output format looks like:
20 10026072 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026073 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026074 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,121
20 10026075 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,119
20 10026076 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026077 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:3,0:3:9:0,9,120
20 10026078 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:5,0:5:15:0,15,217
20 10026079 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,240
20 10026080 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,268
20 10026081 . T <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,267
We use a symbolic allele to indicate that the site is hom-ref, and because we have an ALT allele we can provide AD and PL field values. Currently these are calculated as ref vs. any non-ref value (mismatch or insertion) but doesn't yet account properly for alignment uncertainty.
-- Can we enabled for single samples with --emitRefConfidence (-ERC).
-- This is accomplished by realigning the each read to its most likley haplotype, and then evaluting the resulting pileups over the active region interval. The realignment is done by the HaplotypeBAMWriter, which now has a generalized interface that lets us provide a ReadDestination object so we can capture the realigned reads
-- Provide access to the more raw LocusIteratorByState constructor so we can more easily make them programmatically without constructing lots of misc. GATK data structures. Moved the NO_DOWNSAMPLING constant from LIBSDownsamplingInfo to LocusIteratorByState so clients can use it without making LIBSDownsamplingInfo a public class.
-- Includes GVCF writer
-- Add 1 mb of WEx data to private/testdata
-- Integration tests for reference model output for WGS and WEx data
-- Emit GQ block information into VCF header for GVCF mode
-- OutputMode from StandardCallerArgumentCollection moved to UnifiedArgumentCollection as its no longer relevant for HC
-- Control max indel size for the reference confidence model from the command line. Increase default to 10
-- Don't use out_mode in HaplotypeCallerComplexAndSymbolicVariantsIntegrationTest
-- Unittests for ReferenceConfidenceModel
-- Unittests for new MathUtils functions
-- The previous code would adapter clip before reverting soft clips, so because we only clip the adapter when it's actually aligned (i.e., not in the soft clips) we were actually not removing bases in the adapter unless at least 1 bp of the adapter was aligned to the reference. Terrible.
-- Removed the broken logic of determining whether a read adaptor is too long.
-- Doesn't require isProperPairFlag to be set for a read to be adapter clipped
-- Update integration tests for new adapter clipping code
I "fixed" this once before but instead of testing with unit tests I used integration tests.
Bad decision.
The proper fix is in now, with a bonafide unit test included.
Previous fixes and tests only covered trailing soft-clips. Now that up front
hard-clipping is working properly though, we were failing on those in the tool.
Added a patch for this as well as a separate test independent of the soft-clips
to make sure that it's working properly.
This time we don't accidentally drop reads (phew), but this bug does cause us not to
update the alignment start of the mate. Fixed and added unit test to cover it.
-- Added experimental LikelihoodRankSum, which required slightly more detailed access to the information managed by the base class, so added an overloaded getElementForRead also provides access to the MostLikelyAllele class
-- Added base class default implementation of getElementForPileupElement() which returns null, indicating that the pileup version isn't supported.
-- Added @Override to many of the RankSum classes for safety's sake
-- Updates to GeneralCallingPipeline: annotate sites with dbSNP IDs,
-- R script to assess the value of annotations for VQSR
-- The VR, when the model is bad, may evaluate log10sumlog10 where some of the values in the vector are NaN. This case is now trapped in VR and handled as previously -- indicating that the model has failed and evaluation continues.
-- Currently we don't support writing a BAM file from the haplotype caller when nct is enabled. Check in initialize if this is the case, and throw a UserException
-- Previous version emitted command lines that look like:
##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."
the new version provides additional information on when the GATK was run and the GATK version in a nicer format:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">
-- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">
-- Removed the ProtectedEngineFeaturesIntegrationTest
-- Actual unit tests for these features!
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table
Implementation:
* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
Implemtation details:
* Added tool class *.AnalyzeCovariates
* Added convenient addAll method to Utils to be able to add elements of an array.
* Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple imput recalibration report are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
-- Changed default HMM model.
-- Removed check.
-- Changed md5's: PL's in the high 100s change by a point or two due to new implementation.
-- Resulting performance improvement is about 30 to 50% less runtime when using -glm INDEL.
-- numPruningSamples allows one to specify that the minPruning factor must be met by this many samples for a path to be considered good (e.g. seen twice in three samples). By default this is just one sample.
-- adding unit test to test this new functionality
-- When doing cross-species comparisons and studying population history and ancient DNA data, having SOME measure of confidence is needed at every single site that doesn't depend on the reference base, even in a naive per-site SNP mode. Old versions of GATK provided GQ and some wrong PL values at reference sites but these were wrong. This commit addresses this need by adding a new UG command line argument, -allSitePLs, that, if enabled will:
a) Emit all 3 ALT snp alleles in the ALT column.
b) Emit all corresponding 10 PL values.
It's up to the user to process these PL values downstream to make sense of these. Note that, in order to follow VCF spec, the QUAL field in a reference call when there are non-null ALT alleles present will be zero, so QUAL will be useless and filtering will need to be done based on other fields.
-- Tweaks and fixes to processing pipelines for Reich lab.
1. Have the RMSMappingQuality annotation take into account the fact that reduced reads represent multiple reads.
2. The rank sume tests should not be using reduced reads (because they do not represent distinct observations).
3. Fixed a massive bug in the BaseQualityRankSumTest annotation! It was not using the base qualities but rather
the read likelihoods?!
Added a unit test for Rank Sum Tests to prove that the distributions are correctly getting assigned appropriate p-values.
Also, and just as importantly, the test shows that using reduced reads in the rank sum tests skews the results and
makes insignificant distributions look significant (so it can falsely cause the filtering of good sites).
Also included in this commit is a massive refactor of the RankSumTest class as requested by the reviewer.
-- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE. So if you had a A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field. This obviously would cause problems for indels, as so the tool was blowing up in this case. Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead. This works in general with any type of allele.
-- Here's an example output line from the previous and current versions:
old: 20 64150 rs7274499 C . 3041.68 BGL_RM_WAS_A AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88
new: 20 64062 . G . 100.39 BGL_SET_TO_MONOMORPHIC AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15
-- update MD5s to reflect these changes
-- [delivers #50847721]
-- Now table looks like:
Name VariantType AssessmentType Count
variant SNPS TRUE_POSITIVE 1220
variant SNPS FALSE_POSITIVE 0
variant SNPS FALSE_NEGATIVE 1
variant SNPS TRUE_NEGATIVE 150
variant SNPS CALLED_NOT_IN_DB_AT_ALL 0
variant SNPS HET_CONCORDANCE 100.00
variant SNPS HOMVAR_CONCORDANCE 99.63
variant INDELS TRUE_POSITIVE 273
variant INDELS FALSE_POSITIVE 0
variant INDELS FALSE_NEGATIVE 15
variant INDELS TRUE_NEGATIVE 79
variant INDELS CALLED_NOT_IN_DB_AT_ALL 2
variant INDELS HET_CONCORDANCE 98.67
variant INDELS HOMVAR_CONCORDANCE 89.58
-- Rewrite / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does it's best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics which is the convention in the KB. Add lots of unit tests for this functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:
20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6
Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
-- Merging overlapping fragments turns out to be a bad idea. In the case where you can safely merge the reads you only gain a small about of overlapping kmers, so the potential gains are relatively small. That's in contrast to the very large danger of merging reads inappropriately, such as when the reads only overlap in a repetitive region, and you artificially construct reads that look like the reference but actually may carry a larger true insertion w.r.t. the reference. Because this problem isn't limited to repetitive sequeuence, but in principle could occur in any sequence, it's just not safe to do this merging. Best to leave haplotype construction to the assembly graph.
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs
and don't let those pass through.
Added tests for this implementation that checks exact SW results and correct edges added.
Principle is simple: when coverage is deep enough, any single-base read error will look like a rare k-mer but correct sequence will be supported by many reads to correct sequences will look like common k-mers. So, algorithm has 3 main steps:
1. K-mer graph buildup.
For each read in an active region, a map from k-mers to the number of times they have been seen is built.
2. Building correction map.
All "rare" k-mers that are sparse (by default, seen only once), get mapped to k-mers that are good (by default, seen at least 20 times but this is a CL argument), and that lie within a given Hamming distance (by default, =1). This map can be empty (i.e. k-mers can be uncorrectable).
3. Correction proposal
For each constituent k-mer of each read, if this k-mer is rare and maps to a good k-mer, get differing base positions in k-mer and add these to a list of corrections for each base in each read. Then, correct read at positions where correction proposal is unanimous and non-empty.
The algorithm defaults are chosen to be very stringent and conservative in the correction: we only try to correct singleton k-mers, we only look for good k-mers lying at Hamming distance = 1 from them, and we only correct a base in read if all correction proposals are congruent.
By default, algorithm is disabled but can be enabled in HaplotypeCaller via the -readErrorCorrect CL option. However, at this point it's about 3x-10x more expensive so it needs to be optimized if it's to be used.
Ns are treated as wildcards in the PairHMM so creating haplotypes with Ns gives them artificial advantages over other ones.
This was the cause of at least one FN where there were Ns at a SNP position.
Problem:
The sequence graphs can get very complex and it's not enough just to test that any given read has non-unique kmers.
Reads with variants can have kmers that match unique regions of the reference, and this causes cycles in the final
sequence graph. Ultimately the problem is that kmers of 10/25 may not be large enough for these complex regions.
Solution:
We continue to try kmers of 10/25 but detect whether cycles exist; if so, we do not use them. If (and only if) we
can't get usable graphs from the 10/25 kmers, then we start iterating over larger kmers until we either can generate
a graph without cycles or attempt too many iterations.
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele. This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
The previous behavior is to process reads with N CIGAR operators as they are despite that many of the tools do not actually support such operator and results become unpredictible.
Now if the there is some read with the N operator, the engine returns a user exception. The error message indicates what is the problem (including the offending read and mapping position) and give a couple of alternatives that the user can take in order to move forward:
a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)
b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)
Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.
Implementation:
* Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class commont parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
- Memoized MathUtil's cumulative binomial probability function.
- Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
-- Created a new annotation DepthPerSampleHC that is by default on in the HaplotypeCaller
-- The depth for the HC is the sum of the informative alleles at this site. It's not perfect (as we cannot differentiate between reads that align over the event but aren't informative vs. those that aren't even close) but it's a pretty good proxy and it matches with the AD field (i.e., sum(AD) = DP).
-- Update MD5s
-- delivers [#48240601]
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
-- Ultimately this was caused by overly aggressive merging of CommonSuffixMerger. In the case where you have this graph:
ACT [ref source] -> C
G -> ACT -> C
we would merge into
G -> ACT -> C
which would linearlize into
GACTC
Causing us to add bases to the reference source node that couldn't be recovered. The solution was to ensure that CommonSuffixMerger only operates when all nodes to be merged aren't source nodes themselves.
-- Added a convenient argument to the haplotype caller (captureAssemblyFailureBAM) that will write out the exact reads to a BAM file that went into a failed assembly run (going to a file called AssemblyFailure.BAM). This can be used to rerun the haplotype caller to produce the exact error, which can be hard in regions of deep coverage where the downsampler state determines the exact reads going into assembly and therefore makes running with a sub-interval not reproduce the error
-- Did some misc. cleanup of code while debugging
-- [delivers #50917729]
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
-- The previous implementation attempted to be robust to this, but not all cases were handled properly. Added a helper function updateInde() that bounds up the update to be in the range of the indel array, and cleaned up logic of how the method works. The previous behavior was inconsistent across read fwd/rev stand, so that the indel cigars at the end of read were put at the start of reads if the reads were in the forward strand but not if they were in the reverse strand. Everything is now consistent, as can be seen in the symmetry of the unit tests:
tests.add(new Object[]{"1D3M", false, EventType.BASE_DELETION, new int[]{0,0,0}});
tests.add(new Object[]{"1M1D2M", false, EventType.BASE_DELETION, new int[]{1,0,0}});
tests.add(new Object[]{"2M1D1M", false, EventType.BASE_DELETION, new int[]{0,1,0}});
tests.add(new Object[]{"3M1D", false, EventType.BASE_DELETION, new int[]{0,0,1}});
tests.add(new Object[]{"1D3M", true, EventType.BASE_DELETION, new int[]{1,0,0}});
tests.add(new Object[]{"1M1D2M", true, EventType.BASE_DELETION, new int[]{0,1,0}});
tests.add(new Object[]{"2M1D1M", true, EventType.BASE_DELETION, new int[]{0,0,1}});
tests.add(new Object[]{"3M1D", true, EventType.BASE_DELETION, new int[]{0,0,0}});
tests.add(new Object[]{"4M1I", false, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
tests.add(new Object[]{"3M1I1M", false, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
tests.add(new Object[]{"2M1I2M", false, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
tests.add(new Object[]{"1M1I3M", false, EventType.BASE_INSERTION, new int[]{1,0,0,0,0}});
tests.add(new Object[]{"1I4M", false, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
tests.add(new Object[]{"4M1I", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,0}});
tests.add(new Object[]{"3M1I1M", true, EventType.BASE_INSERTION, new int[]{0,0,0,0,1}});
tests.add(new Object[]{"2M1I2M", true, EventType.BASE_INSERTION, new int[]{0,0,0,1,0}});
tests.add(new Object[]{"1M1I3M", true, EventType.BASE_INSERTION, new int[]{0,0,1,0,0}});
tests.add(new Object[]{"1I4M", true, EventType.BASE_INSERTION, new int[]{0,1,0,0,0}});
-- delivers #50445353
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not
clips should be ignored. Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
since low dcov values can result in problematic downsampling artifacts for locus-based
traversals.
-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
controls the number of reads per alignment start position, and even a dcov value of 1 might
be safe/desirable in some circumstances.
-Also reorganize the global downsampling defaults so that they are specified as annotations
to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
DownsamplingMethod class.
-The default downsampling settings have not been changed: they are still -dcov 1000
for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.