* bitset could legitimately be in an unfinished state but we were trying to access it without finalizing.
* added --cancer_mode argument per Mark's suggestion to force the user to explicitly enable multi-sample mode.
* tests were easiest to implement as integration tests (this was a really complicated case).
Problem
-------
The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp). This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.
Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers. The algorithm operates as follows:
identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
for each base in the sequence from the starting vertex kmer:
look at the existing outgoing nodes of current vertex V. If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
If no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue
If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue
This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well with even very short kmers.
This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:
-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler
Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
* This is emerging now because BWA-MEM produces lots of reads that are not primary alignments
* The ConstrainedMateFixingManager class used by IndelRealigner was mis-adjusting SAM flags because it
was getting confused by these secondary alignments
* Added unit test to cover this case
Only try to clip adaptors when both reads of the pair are on opposite strands
-- Read pairs that have unusual alignments, such as two reads both oriented like:
<-----
<-----
where previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs. This caused us to clip extra good bases from reads.
-- Update MD5s due change in adaptor clipping, which add some coverage in some places
Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.
- Updated MD5's after looking at diffs to verify that the change is what I expected.
-Changes in Java 7 related to comparators / sorting produce a large number
of innocuous differences in our test output. Updating expectations now
that we've moved to using Java 7 internally.
-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
intermittent failures.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly. Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.
The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly. Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).
Also:
1. counts are now maintained as ints whenever possible. Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
-- Added check to see if read spans beyond reference window MINUS padding and event length. This guarantees that read will always be contained in haplotype.
-- Changed md5's that happen when long reads from old 454 data have their likelihoods changed because of the extra base clipping.
-- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts. Updated the hardClip operation to handle reduce reads, and added a unit test to make sure this works properly. Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read
-- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously
Use case:
The default AF priors used (infinite sites model, neutral variation) is appropriate in the case where the reference allele is ancestral, and the called allele is a derived allele.
Most of the times this is true but in several population studies and in ancient DNA analyses this might introduce reference biases, and in some other cases it's hard to ascertain what the ancestral allele is (normally requiring to look up homologous chimp sequence).
Specifying no prior is one solution, but this may introduce a lot of artifactual het calls in shallower coverage regions.
With this option, users can specify what the prior for each AC should be according to their needs, subject to the restrictions documented in the code and in GATK docs.
-- Updated ancient DNA single sample calling script with filtering options and other cleanups.
-- Added integration test. Removed old -noPrior syntax.
-Do not throw an exception when parsing snpEff output files
generated by not-officially-supported versions of snpEff,
PROVIDED that snpEff was run with -o gatk
-Requested by the snpEff author
-Relevant integration tests updated/expanded
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.
Added lots of unit tests for new functionality.
-- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences. If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference. Updated the code to first clip the adapter sequences from the incoming fragments
-- Update MD5s
1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly
got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites
use binomial prob and at high coverage sites revert to using the old base proportions. Then we get the
best of both worlds. As a note, coverage refers to just the individual base counts and not the entire pileup.
2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had
the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We
now use a PriorityQueue instead of a TreeSet to allow for such cases.
3. Each consensus keeps track of its own number of softclipped bases. There was no reason that that number
should be shared between them.
4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now. Don't lose that
information. Maybe we'll decide to change this in the future, but for now we are conservative.
5. Also implemented various small performance optimizations based on profiling.
Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.
Calling everything statistics was very confusing. Diagnose Targets stratifies the data three ways: Interval, Sample and Locus. Each stratification then has it's own set of metrics (plugin system) to calculate -- LocusMetric, SampleMetric, IntervalMetric.
Metrics are generalized by the Metric interface. (for generic access)
Stratifications are generalized by the AbstractStratification abstract class. (to aggressively limit code duplication)
-- In case there are no informative bases in a pileup but pileup isn't empty (like when all bases have Q < min base quality) the GLs were still computed (but were all zeros) and fed to the exact model. Now, mimic case of diploid Gl computation where GLs are only added if # good bases > 0
-- I believe general case where only non-informative GLs are fed into AF calc model is broken and yields bogus QUAL, will investigate separately.
* Make most classes final, others package local
* Move to diagnostics.diagnosetargets package
* Aggregate statistics and walker classes on the same package for simplified visibility.
* Make status list a LinkedList instead of a HashSet
A plugin enabled implementation of DiagnoseTargets
Summarized Changes:
-------------------
* move argument collection into Thresholder object
* make thresholder object private member of all statistics classes
* rework the logic of the mate pairing thresholds
* update unit and integration tests to reflect the new behavior
* Implements Locus Statistic plugins
* Extend Locus Statistic plugins to determine sample status
* Export all common plugin functionality into utility class
* Update tests accordingly
[fixes#48465557]
* remove interval statistic low_median_coverage -- it is already captured by low coverage and coverage gaps.
* add gatkdocs to all the parameters
* clean up the logic on callable status a bit (still need to be re-worked into a plugin system)
* update integration tests
This is not really feasible with the current mandate of this walker. We would have to traverse by reference and that would make the runtime much higher, and we are not really interested in the status 99% of the time anyway. There are other walkers that can report this, and just this, status more cheaply.
[fixes#48442663]
Problem
-------
Diagnose targets is outputting both LOW_MEDIAN_COVERAGE and NO_READS when no reads are covering the interval
Solution
--------
Only allow low median coverage check if there are reads
[fixes#48442675]
Problem
-------
Diagnose targets was skipping intervals when they were not covered by any reads.
Solution
--------
Rework the interval iteration logic to output all intervals as they're skipped over by the traversal, as well as adding a loop on traversal done to finish outputting intervals past the coverage of teh BAM file.
Summarized Changes
------------------
* Outputs all intervals it iterates over, even if uncovered
* Outputs leftover intervals in the end of the traversal
* Updated integration tests
[fixes#47813825]
-- The problem is that the common suffix splitter could eliminate the reference source vertex when there's an incoming node that contains all of the reference source vertex bases and then some additional prefix bases. In this case we'd eliminate the reference source vertex. Fixed by checking for this condition and aborting the simplification
-- Update MD5s, including minor improvements
-- Reduce the min read length to 10 bp in the filterNonPassingReads in the HC. Now that we filter out reads before genotyping, we have to be more tolerant of shorter, but informative, reads, in order to avoid a few FNs in shallow read data
-- Reduce the min usable base qual to 8 by default in the HC. In regions with low coverage we sometimes throw out our only informative kmers because we required a contiguous run of bases with >= 16 QUAL. This is a bit too aggressive of a requirement, so I lowered it to 8.
-- Together with the previous commit this results in a significant improvement in the sensitivity and specificity of the caller
NA12878 MEM chr20:10-11
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
branch SNPS 1216 0 2 194 0
branch INDELS 312 2 13 71 7
master SNPS 1214 0 4 194 1
master INDELS 309 2 16 71 10
-- Update MD5s in the integration tests to reflect these two new changes
* Moved redundant code out of UGEngine
* Added overloaded methods that assume p=0.5 for speed efficiency
* Added unit test for the binomialCumulativeProbability method
The Problem:
Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
regions in an exome) causes RR (with default settings) to consider it a variant region. This
seriously hurts compression performance.
The Solution:
1. We now use a probabilistic model for determining whether we can create a consensus (in other
words, whether we can error correct a site) instead of the old ratio threshold. We calculate
the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
2. We also allow het compression globally, not just at known sites. So if we cannot create a
consensus at a given site then we try to perform het compression; and if we cannot perform het
compression that we just don't reduce the variant region. This way very wonky regions stay
uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
locus get het compressed.
Details:
1. -minvar is now deprecated in favor of -min_pvalue.
2. Added integration test for bad pvalue input.
3. -known argument still works to force het compression only at known sites; if it's not included
then we allow het compression anywhere. Added unit tests for this.
4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
Before finalizing het compression, we now check for insertions or other variant regions (usually due
to multi-allelics) which can render a region incompressible (and we back out if we find one). We
were checking for excessive softclips before, but now we add these tests too.
5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
out instead of backing out all of them.
6. We no longer create a mini read at the stop of the variant window for het compression. Instead, we
allow it to be part of the next global consensus.
7. The coverage test is no longer run systematically on all integration tests because the quals test
supercedes it. The systematic quals test is now much stricter in order to catch bugs and edge cases
(very useful!).
8. Each consensus (both the normal and filtered) keep track of their own mapping qualities (before the MQ
for a consensus was affected by good and bad bases/reads).
9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
10.Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
with insertions from a header.
Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples. This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes. The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects. After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to use not hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model. It's worse than the current version in both single and multiple samples:
1000G EUR samples:
10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
per_sample SNPS 50 0 5 8 1
per_sample INDELS 6 0 7 2 1
pooled SNPS 49 0 6 8 1
pooled INDELS 5 0 8 2 1
100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes
Name VariantType TRUE_POSITIVE FALSE_POSITIVE FALSE_NEGATIVE TRUE_NEGATIVE CALLED_NOT_IN_DB_AT_ALL
per_sample SNPS 144 0 22 28 1
per_sample INDELS 28 1 16 9 11
pooled SNPS 143 0 23 28 1
pooled INDELS 27 1 17 9 11
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
-- VariantRecalibrator now emits plots with denormlized values (original values) instead of their normalized (x - mu / sigma) which helps to understand the distribution of values that are good and bad
-- It's useful to know which sites have been used in the training of the model. The recal_file emitted by VR now contains VCF info field annotations labeling each site that was used in the positive or negative training models with POSITIVE_TRAINING_SITE and/or NEGATIVE_TRAINING_SITE
-- Update MD5s, which all changed now that the recal file and the resulting applied vcfs all have these pos / neg labels