-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele. This matching algorithm determines how rsID are added back into the VariantContext we want to annotate, and as well determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
The previous behavior is to process reads with N CIGAR operators as they are despite that many of the tools do not actually support such operator and results become unpredictible.
Now if the there is some read with the N operator, the engine returns a user exception. The error message indicates what is the problem (including the offending read and mapping position) and give a couple of alternatives that the user can take in order to move forward:
a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)
b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)
Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.
Implementation:
* Added filterReadsWithMCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class commont parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
- Memoized MathUtil's cumulative binomial probability function.
- Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
-- This occurred because we were reverting reads with soft clips that would produce reads with negative (or 0) alignment starts. From such reads we could end up with adaptor starts that were negative and that would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which would revert and adaptor clip reads).
-- We now hard clip away bases soft clipped reverted bases that fall before the 1-based contig start in revertSoftClippedBases.
-- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString)
-- Added unit tests for reverting soft clipped bases that create a read before the contig
-- [delivers #50892431]
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
-- Although the original bug report was about SplitSamFile it actually was an engine wide error. The two places in the that provide compression to the BAM write now check the validity of the compress argument via a static method in ReadUtils
-- delivers #49531009
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
1) Add in checks for input parameters in MathUtils method. I was careful to use the bottom-level methods whenever possible, so that parameters don't needlessly go through multiple checks (so for instance, the parameters n and k for a binomial aren't checked on log10binomial, but rather in the log10binomialcoefficient subroutine).
This addresses JIRA GSA-767
Unit tests pass (we'll let bamboo deal with the integrations)
2) Address reviewer comments (change UserExceptions to IllegalArgumentExceptions).
3) .isWellFormedDouble() tests for infinity and not strictly positive infinity. Allow negative-infinity values for log10sumlog10 (as these just correspond to p=0).
After these commits, unit and integration tests now pass, and GSA-767 is done.
rebase and fix conflict:
public/java/src/org/broadinstitute/sting/utils/MathUtils.java
-- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 *before* the BQSR filters them out with MappingQualityUnassignedFilter.
-- delivers #50222251
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not
clips should be ignored. Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
-ReadShardBalancer was printing out an extra "Loading BAM index data for next contig"
message at traversal end, which was confusing users and making the GATK look stupid.
Suppress the extraneous message, and reword the log messages to be less confusing.
-Improve log message output when initializing the shard iterator in GenomeAnalysisEngine.
Don't mention BAMs when the are none, and say "Preparing for traversal" rather than
mentioning the meaningless-for-users concept of "shard strategy"
-These log messages are needed because the operations they surround might take a
while under some circumstances, and the user should know that the GATK is actively
doing something rather than being hung.
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
since low dcov values can result in problematic downsampling artifacts for locus-based
traversals.
-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
controls the number of reads per alignment start position, and even a dcov value of 1 might
be safe/desirable in some circumstances.
-Also reorganize the global downsampling defaults so that they are specified as annotations
to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
DownsamplingMethod class.
-The default downsampling settings have not been changed: they are still -dcov 1000
for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
BandedHMM
---------
-- An implementation of a linear runtime, linear memory usage banded logless PairHMM. Thought about 50% faster than current PairHMM, this implementation will be superceded by the GraphHMM when it becomes available. The implementation is being archived for future reference
Useful infrastructure changes
-----------------------------
-- Split PairHMM into a N2MemoryPairHMM that allows smarter implementation to not allocate the double[][] matrices if they don't want, which was previously occurring in the base class PairHMM
-- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for using in unit or performance testing. Added example of 100kb of data to private/testdata. Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
Don't map class to counts in the ReadMetrics (necessitating 2 HashMap lookups for every increment).
Instead, wrap the ReadFilters with a counting version and then set those counts only when updating global metrics.
-- Add() call had a misplaced map.put call, so that we were always putting the result of get() back into the map, when what we really intended was to only put the value back in if the original get() resulted in a null and so initialized the result
-- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element. New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
-- The previous version of PerReadAlleleLikelihoodMap only stored the alleles in an ArrayList, and used ArrayList.contains() to determine if an allele was already present in the map. This is very slow with many alleles. Now keeps both the ArrayList (for get() performance) and a Set of alleles for contains().
1. Don't clone the dataSource's metrics object (because then the engine won't continue to get updated counts)
2. Use the dataSource's metrics object in the CountingFilteringIterator and not the first shard's object!
3. Synchronize ReadMetrics.incrementMetrics to prevent race conditions.
Also:
* Make sure users realize that the read counts are approximate in the print outs.
* Removed a lot of unused cruft from the metrics object while I was in there.
* Added test to make sure that the ReadMetrics read count does not overflow ints.
* Added unit tests for traversal metrics (reads, loci, and active region traversals); these test counts of reads and records.
-- [Delivers #49876703]
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!
- Converted my old GATKBAMIndexText (within PileupWalkerIntegrationTest) to use a dataProvider
- Added two integration tests to test -outputInsertLength option
-- The previous implementation of the maxRuntime would require us to wait until all of the work was completed within a shard, which can be a substantial amount of work in the case of a locus walker with 16kb shards.
-- This implementation ensures that we exit from the traversal very soon after the max runtime is exceeded, without completely all of our work within the shard. This is done by updating all of the traversal engines to return false for hasNext() in the nano scheduled input provider. So as soon as the timeout is exceeeded, we stop generating additional data to process, and we only have to wait until the currently executing data processing unit (locus, read, active region) completes.
-- In order to implement this timeout efficiently at this fine scale, the progress meter now lives in the genome analysis engine, and the exceedsTimeout() call in the engine looks at a periodically updated runtime variable in the meter. This variable contains the elapsed runtime of the engine, but is updated by the progress meter daemon thread so that the engine doesn't call System.nanotime() in each cycle of the engine, which would be very expense. Instead we basically wait for the daemon to update this variable, and so our precision of timing out is limited by the update frequency of the daemon, which is on the order of every few hundred milliseconds, totally fine for a timeout.
-- Added integration tests to ensure that subshard timeouts are working properly
-- Previously we used the LocusShardBalancer for the haplotype caller, which meant that TraverseActiveRegions saw its shards grouped in chunks of 16kb bits on the genome. These locus shards are useful when you want to use the HierarchicalMicroScheduler, as they provide fine-grained accessed to the underlying BAM, but they have two major drawbacks (1) we have to fairly frequently reset our state in TAR to handle moving between shard boundaries and (2) with the nano scheduled TAR we end up blocking at the end of each shard while our threads all finish processing.
-- This commit changes the system over to using an ActiveRegionShardBalancers, that combines all of the shard data for a single contig into a single combined shard. This ensures that TAR, and by extensions the HaplotypeCaller, gets all of the data on a single contig together so the the NanoSchedule runs efficiently instead of blocking over and over at shard boundaries. This simple change allows us to scale efficiently to around 8 threads in the nano scheduler:
-- See https://www.dropbox.com/s/k7f280pd2zt0lyh/hc_nano_linear_scale.pdf
-- See https://www.dropbox.com/s/fflpnan802m2906/hc_nano_log_scale.pdf
-- Misc. changes throughout the codebase so we Use the ActiveRegionShardBalancer where appropriate.
-- Added unit tests for ActiveRegionShardBalancer to confirm it does the merging as expected.
-- Fix bad toString in FilePointer
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
-- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
-- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
* This version inherits from the original SW implementation so it can use the same matrix creation method.
* A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
right thing for indels at the edges of the alignments.
* Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
* Lots of systematic testing added for this implementation.
* NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
-Diff engine output is now included in the actual exception message thrown as a
result of an MD5 mismatch, which allows it to be conveniently viewed on the
main page of a build in Bamboo.
Minor Additional Improvements:
-WalkerTestSpec now auto-detects test class name via new JVMUtils.getCallingClass()
method, and the test class name is now included as a regular part of integration
test output for each test.
-Fix race condition in MD5DB.ensureMd5DbDirectory()
-integrationtests dir is now cleaned by "ant clean"
GSA-915 #resolve
Problem
-------
The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp). This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.
Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers. The algorithm operates as follows:
identify all non-unique kmers of size K among all reads and the reference
for each sequence (ref and read):
find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if non exist
for each base in the sequence from the starting vertex kmer:
look at the existing outgoing nodes of current vertex V. If the base in sequence matches the suffix of outgoing vertex N, read the sequence to N, and continue
If no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue
If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue
This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well with even very short kmers.
This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:
-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path have < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that only depends only the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from the each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term effects the probability that a read originated from the reference haploytype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler
Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles with synthetic reads intervals from 10-11 mbs with only the reference sequence as well as artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added generic interface to called not surpisingly SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
Only try to clip adaptors when both reads of the pair are on opposite strands
-- Read pairs that have unusual alignments, such as two reads both oriented like:
<-----
<-----
where previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs. This caused us to clip extra good bases from reads.
-- Update MD5s due change in adaptor clipping, which add some coverage in some places
-- Previous version would trim down 2M2D2M into 2M if you asked for the first 2 bases, but this can result in incorrect alignment of the bases to the reference as the bases no longer span the full reference interval expected. Fixed and added unit tests
Output didn't "mix-up" the genotypes, it outputed the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.
- Updated MD5's after looking at diffs to verify that the change is what I expected.
-Changes in Java 7 related to comparators / sorting produce a large number
of innocuous differences in our test output. Updating expectations now
that we've moved to using Java 7 internally.
-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
intermittent failures.
This class, being unused, was no longer getting packaged into the
GATK release jar by bcel, and so attempting to run its unit test
on the release jar was producing an error.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly. Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.
The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly. Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).
Also:
1. counts are now maintained as ints whenever possible. Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.