There are a few pipeline test classes that do not run Queue, but are
classified as pipeline tests because they submit farm jobs. Make these
unconventional pipeline tests respect the pipeline test dry run setting.
Previous fixes and tests only covered trailing soft-clips. Now that up front
hard-clipping is working properly though, we were failing on those in the tool.
Added a patch for this as well as a separate test independent of the soft-clips
to make sure that it's working properly.
-- Previous version emitted command lines that look like:
##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] ..."
the new version provides additional information on when the GATK was run and the GATK version in a nicer format:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[private/testdata/reduced.readNotFullySpanningDeletion.bam] read_buffer_size=null phone_home=AWS ...">
-- Additionally, the command line options are emitted sequentially in the file, so you can see a running record of how a VCF was produced, such as this example from the integration test:
##GATKCommandLine=<ID=HaplotypeCaller,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:09:01 EDT 2013",Epoch=1371740941197,CommandLineOptions="lots of stuff">
##GATKCommandLine=<ID=SelectVariants,Version=2.5-206-gbc7be2b,Date="Thu Jun 20 11:16:23 EDT 2013",Epoch=1371741383277,CommandLineOptions="lots of stuff">
-- Removed the ProtectedEngineFeaturesIntegrationTest
-- Actual unit tests for these features!
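As a rough illustration of the new header layout, the following sketch assembles one such line. The class and method names are hypothetical and are not the engine's actual code; only the field order and quoting are taken from the example lines above.

```java
import java.util.Date;

/** Hypothetical helper; only the output layout follows the examples above. */
final class CommandLineHeaderSketch {
    /** Builds one ##GATKCommandLine header line for a single tool invocation. */
    static String headerLine(String toolId, String version, Date runDate, String options) {
        return "##GATKCommandLine=<ID=" + toolId
                + ",Version=" + version
                + ",Date=\"" + runDate + "\""          // Date.toString() uses the JVM's default time zone
                + ",Epoch=" + runDate.getTime()         // milliseconds since the Unix epoch
                + ",CommandLineOptions=\"" + options + "\">";
    }
}
```

Calling headerLine once per tool, in the order the tools were run, would give the running record shown above.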
Improved AnalyzeCovariates (AC) integration test.
Renamed AC test files ending with .grp to .table
Implementation:
* Removed RECAL_PDF/CSV_FILE from RecalibrationArgumentCollection (RAC). Updated rest of the code accordingly.
* Fixed BQSRIntegrationTest to work with new changes
Implementation details:
* Added tool class *.AnalyzeCovariates
* Added convenient addAll method to Utils to be able to add elements of an array.
* Added parameter comparison methods to RecalibrationArgumentCollection class in order to verify that multiple input recalibration reports are compatible and comparable.
* Modified the BQSR.R script to handle up to 3 different recalibration tables (-BQSR, -before and -after) and removed some irrelevant arguments (or argument values) from the output.
* Added an integration test class.
-Collapses zero-length and repeated cigar elements, neither of which
can necessarily be handled correctly by downstream code (like LIBS).
-Consolidation is done before read filters, because not all read filters
behave correctly with non-consolidated cigars.
-Examined other uses of consolidateCigar() throughout the GATK, and
found them to not be redundant with the new engine-level consolidation
(they're all on artificially-created cigars in the HaplotypeCaller
and SmithWaterman classes)
-Improved comments in SAMDataSource.applyDecoratingIterators()
-Updated MD5s; differences were examined and found to be innocuous
-Two tests: -Unit test for ReadFormattingIterator
-Integration test for correct handling of zero-length
cigar elements by the GATK engine as a whole
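A minimal sketch of the consolidation described above, operating on plain CIGAR strings. This is an illustration only: the class name is hypothetical, and the engine itself works on SAM-JDK Cigar objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical string-based stand-in for engine-level cigar consolidation. */
final class CigarConsolidationSketch {
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHPX=])");

    /** Drops zero-length elements and merges adjacent elements with the same operator. */
    static String consolidate(String cigar) {
        List<Integer> lengths = new ArrayList<>();
        List<Character> ops = new ArrayList<>();
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            int length = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            if (length == 0) {
                continue; // zero-length element: skip it entirely
            }
            int last = ops.size() - 1;
            if (last >= 0 && ops.get(last) == op) {
                lengths.set(last, lengths.get(last) + length); // repeated operator: merge
            } else {
                lengths.add(length);
                ops.add(op);
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ops.size(); i++) {
            out.append(lengths.get(i)).append(ops.get(i));
        }
        return out.toString();
    }
}
```

Note that removing a zero-length element can make its neighbours mergeable, so 2M0I3M becomes 5M rather than 2M3M.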
-This argument was completely redundant with the engine-level -dfrac
argument.
-Could produce unintended consequences if used in conjunction with
engine-level downsampling arguments.
-- It was being applied in the wrong order (after the first call to the underlying MalformedReadFilter) so if your first read was malformed you'd blow up there instead of being fixed properly. Added integration tests to ensure this continues to work.
-- [delivers #49538319]
-- VariantContextWriterStorage was gzipping the intermediate files that would be merged in, but the mergeInto function couldn't read those outputs, and we'd throw a very strange error. Now temporary VCFs aren't compressed, even if the final VCF is. Added an integration test to ensure this behavior works going forward.
-- [delivers #47399279]
-- Previous version created FILTERs for each possible alt allele when that site was set to monomorphic by BEAGLE. So if you had an A/C SNP in the original file and beagle thought it was AC=0, then you'd get a record with BGL_RM_WAS_A in the FILTER field. This obviously would cause problems for indels, and so the tool was blowing up in this case. Now beagle sets the filter field to BGL_SET_TO_MONOMORPHIC and sets the info field annotation OriginalAltAllele to A instead. This works in general with any type of allele.
-- Here's an example output line from the previous and current versions:
old: 20 64150 rs7274499 C . 3041.68 BGL_RM_WAS_A AN=566;DB;DP=1069;Dels=0.00;HRun=0;HaplotypeScore=238.33;LOD=3.5783;MQ=83.74;MQ0=0;NumGenotypesChanged=1;OQ=1949.35;QD=10.95;SB=-6918.88
new: 20 64062 . G . 100.39 BGL_SET_TO_MONOMORPHIC AN=566;DP=1108;Dels=0.00;HRun=2;HaplotypeScore=221.59;LOD=-0.5051;MQ=85.69;MQ0=0;NumGenotypesChanged=1;OQ=189.66;OriginalAltAllele=A;QD=15.81;SB=-6087.15
-- update MD5s to reflect these changes
-- [delivers #50847721]
-WalkerTest now deletes *.idx files on exit
-ArtificialBAMBuilder now deletes *.bai files on exit
-VariantsToBinaryPed walker now deletes its temp files on exit
Problem:
Classes in com.sun.javadoc.* are non-standard. Since we can't depend on their availability for
all users, the GATK proper should not have any runtime dependencies on this package.
Solution:
-Isolate com.sun.javadoc.* dependencies in a DocletUtils class for use only by doclets. The
only users who need to run our doclets are those who compile from source, and they
should be competent enough to figure out how to resolve a missing com.sun.* dependency.
-HelpUtils now contains no com.sun.javadoc.* dependencies and can be safely used by walkers/other
tools.
-Added comments with instructions on when it is safe to use DocletUtils vs. HelpUtils
[delivers #51450385]
[delivers #50387199]
-- Now table looks like:
Name VariantType AssessmentType Count
variant SNPS TRUE_POSITIVE 1220
variant SNPS FALSE_POSITIVE 0
variant SNPS FALSE_NEGATIVE 1
variant SNPS TRUE_NEGATIVE 150
variant SNPS CALLED_NOT_IN_DB_AT_ALL 0
variant SNPS HET_CONCORDANCE 100.00
variant SNPS HOMVAR_CONCORDANCE 99.63
variant INDELS TRUE_POSITIVE 273
variant INDELS FALSE_POSITIVE 0
variant INDELS FALSE_NEGATIVE 15
variant INDELS TRUE_NEGATIVE 79
variant INDELS CALLED_NOT_IN_DB_AT_ALL 2
variant INDELS HET_CONCORDANCE 98.67
variant INDELS HOMVAR_CONCORDANCE 89.58
-- Rewrote / refactored parts of subsetDiploidAlleles in GATKVariantContextUtils to have a BEST_MATCH assignment method that does its best to simply match the genotype after subsetting to a set of alleles. So if the original GT was A/B and you subset to A/B it remains A/B, but if you subset to A/C you get A/A. This means that het-alt B/C genotypes become A/B and A/C when subsetting to bi-allelics, which is the convention in the KB. Added lots of unit tests for these functions (from 0 previously)
-- BadSites in Assessment now emits TP sites with discordant genotypes with the type GENOTYPE_DISCORDANCE and tags the expected genotype in the info field as ExpectedGenotype, such as this record:
20 10769255 . A ATGTG 165.73 . ExpectedGenotype=HOM_VAR;SupportingCallsets=ebanks,depristo,CEUTrio_best_practices;WHY=GENOTYPE_DISCORDANCE GT:AD:DP:GQ:PL 0/1:1,9:10:6:360,0,6
Indicating that the call was a HET but the expected result was HOM_VAR
-- Forbid subsetting of diploid genotypes to just a single allele.
-- Added subsetToRef as a separate specific function. Use that in the DiploidExactAFCalc in the case that you need to reduce yourself to ref only. Preserves DP in the genotype field when this is possible, so a few integration tests have changed for the UG
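The BEST_MATCH rule described above can be sketched as follows, with alleles represented as plain strings. The class and method are hypothetical, and the sketch assumes that any genotype allele missing from the kept set is replaced by the reference allele, with reference alleles listed first; the real subsetDiploidAlleles also handles likelihoods and other genotype fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of BEST_MATCH genotype assignment after allele subsetting. */
final class BestMatchSubsetSketch {
    /**
     * Keeps each genotype allele that survives the subsetting and replaces every
     * other allele with the reference allele. Reference alleles are emitted first.
     */
    static List<String> bestMatch(List<String> genotype, Set<String> keptAlleles, String ref) {
        List<String> result = new ArrayList<>();
        List<String> alts = new ArrayList<>();
        for (String allele : genotype) {
            if (allele.equals(ref) || !keptAlleles.contains(allele)) {
                result.add(ref);   // dropped (or already reference) allele becomes ref
            } else {
                alts.add(allele);  // surviving alternate allele is kept as-is
            }
        }
        result.addAll(alts);
        return result;
    }
}
```

With reference allele A, an A/B genotype subset to {A, C} yields A/A, and a B/C genotype subset to {A, B} yields A/B.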
Problem:
-Downsamplers were treating reduced reads the same as normal reads,
with occasionally catastrophic results on variant calling when an
entire reduced read happened to get eliminated.
Solution:
-Since reduced reads lack the information we need to do position-based
downsampling on them, best available option for now is to simply
exempt all reduced reads from elimination during downsampling.
Details:
-Add generic capability of exempting items from elimination to
the Downsampler interface via new doNotDiscardItem() method.
Default inherited version of this method exempts all reduced reads
(or objects encapsulating reduced reads) from elimination.
-Switch from interfaces to abstract classes to facilitate this change,
and do some minor refactoring of the Downsampler interface (push
implementation of some methods into the abstract classes, improve
names of the confusing clear() and reset() methods).
-Rewrite TAROrderedReadCache. This class was incorrectly relying
on the ReservoirDownsampler to preserve the relative ordering of
items in some circumstances, which was behavior not guaranteed by
the API and only happened to work due to implementation details
which no longer apply. Restructured this class around the assumption
that the ReservoirDownsampler will not preserve relative ordering
at all.
-Add disclaimer to description of -dcov argument explaining that
coverage targets are approximate goals that will not always be
precisely met.
-Unit tests for all individual downsamplers to verify that reduced
reads are exempted from elimination
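The exemption hook can be sketched roughly as below. Everything except the doNotDiscardItem() name is hypothetical: the read type, the abstract class shape, and the toy "keep first N" downsampler are simplified stand-ins, not the actual Downsampler classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a read; only the reduced-read flag matters here. */
final class SketchRead {
    final String name;
    final boolean reduced;
    SketchRead(String name, boolean reduced) { this.name = name; this.reduced = reduced; }
}

/** Simplified abstract base class carrying the default exemption rule. */
abstract class SketchDownsampler {
    /** Default rule: reduced reads are never eliminated. */
    protected boolean doNotDiscardItem(SketchRead read) { return read.reduced; }
    abstract void submit(SketchRead read);
    abstract List<SketchRead> consumeFinalizedItems();
}

/** Toy downsampler: keeps at most maxKept non-exempt reads, first come first kept. */
final class KeepFirstNDownsampler extends SketchDownsampler {
    private final int maxKept;
    private int nonExemptKept = 0;
    private final List<SketchRead> kept = new ArrayList<>();

    KeepFirstNDownsampler(int maxKept) { this.maxKept = maxKept; }

    @Override void submit(SketchRead read) {
        if (doNotDiscardItem(read)) {
            kept.add(read);              // exempt items bypass the target entirely
        } else if (nonExemptKept < maxKept) {
            kept.add(read);
            nonExemptKept++;
        }                                // otherwise the read is discarded
    }

    @Override List<SketchRead> consumeFinalizedItems() {
        List<SketchRead> out = new ArrayList<>(kept);
        kept.clear();
        nonExemptKept = 0;
        return out;
    }
}
```

Because exempt reads do not count against the target, the output can exceed the requested coverage, which is consistent with the note above that coverage targets are approximate.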
We now run Smith-Waterman on the dangling tail against the corresponding reference tail.
If we can generate a reasonable, low entropy alignment then we trigger the merge to the
reference path; otherwise we abort. Also, we put in a check for low-complexity of graphs
and don't let those pass through.
Added tests for this implementation that checks exact SW results and correct edges added.
-- Reuse infrastructure for RODs for reads to implement general IntervalReferenceOrderedView so that both TraverseReads and TraverseActiveRegions can use the same underlying infrastructure
-- TraverseActiveRegions now provides a meaningful RefMetaDataTracker to ActiveRegionWalker.map
-- Cleanup misc. code as it came up
-- Resolves GSA-808: Write general utility code to do rsID allele matching, hook up to UG and HC
-- Variants will be considered matching if they have the same reference allele and at least 1 common alternative allele. This matching algorithm determines how rsIDs are added back into the VariantContext we want to annotate, as well as determining the overlap FLAG attribute field.
-- Updated VariantAnnotator and VariantsToVCF to use this class, removing its old stale implementation
-- Added unit tests for this VariantOverlapAnnotator class
-- Removed GATKVCFUtils.rsIDOfFirstRealVariant as this is now better to use VariantOverlapAnnotator
-- Now requires strict allele matching, without any option to just use site annotation.
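The matching rule stated above (same reference allele and at least one common alternative allele) can be sketched as a small predicate. Alleles are plain strings here and the class name is hypothetical; the actual VariantOverlapAnnotator works on VariantContext objects.

```java
import java.util.Collection;
import java.util.Collections;

/** Hypothetical illustration of the strict allele-matching rule. */
final class AlleleOverlapSketch {
    /** True when the reference alleles are equal and the alt allele sets share at least one allele. */
    static boolean matches(String ref1, Collection<String> alts1,
                           String ref2, Collection<String> alts2) {
        return ref1.equals(ref2) && !Collections.disjoint(alts1, alts2);
    }
}
```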
The previous behavior was to process reads with N CIGAR operators as they are, even though many of the tools do not actually support this operator and results become unpredictable.
Now if there is a read with the N operator, the engine returns a user exception. The error message indicates what the problem is (including the offending read and mapping position) and gives a couple of alternatives that the user can take in order to move forward:
a) ask for those reads to be filtered out (with --filter_reads_with_N_cigar or -filterRNC)
b) keep them in as before (with -U ALLOW_N_CIGAR_READS or -U ALL)
Notice that (b) does not have any effect if (a) is enacted; i.e. filtering overrides ignoring.
Implementation:
* Added filterReadsWithNCigar argument to MalformedReadFilter with the corresponding changes in the code to get it to work.
* Added ALLOW_N_CIGAR_READS unsafe flag so that N cigar containing reads can be processed as they are if that is what the user wants.
* Added ReadFilterTest class as a common parent for ReadFilter test cases.
* Refactor ReadGroupBlackListFilterUnitTest to extend ReadFilterTest and push up some functionality to that class.
* Modified MalformedReadFilterUnitTest to extend ReadFilterTest and to test the new filter functionality.
* Added AllowNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALLOW_N_CIGAR_READS flag is used.
* Added UnsafeNCigarMalformedReadFilterUnittest to check on the behavior when the unsafe ALL flag is used.
* Updated a broken test case in UnifiedGenotyperIntegrationTest resulting from the new behavior.
* Updated EngineFeaturesIntegrationTest testdata to be compliant with new behavior
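The precedence between options (a) and (b) can be sketched as a small decision function. The class, enum, and parameter names are hypothetical and the cigar is a plain string; the sketch only encodes the rule that filtering overrides the unsafe allow flag, and that with neither option a read containing N triggers a user error.

```java
/** Hypothetical illustration of how reads with N cigar operators are handled. */
final class NCigarPolicySketch {
    enum Outcome { FILTER_OUT, KEEP, USER_ERROR }

    static Outcome decide(String cigar, boolean filterReadsWithNCigar, boolean allowNCigarReads) {
        if (cigar.indexOf('N') < 0) {
            return Outcome.KEEP;        // no N operator: nothing special to do
        }
        if (filterReadsWithNCigar) {
            return Outcome.FILTER_OUT;  // (a) wins even if (b) is also set
        }
        if (allowNCigarReads) {
            return Outcome.KEEP;        // (b) process the read as before
        }
        return Outcome.USER_ERROR;      // default: report the offending read to the user
    }
}
```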
- Memoized MathUtil's cumulative binomial probability function.
- Reduced the default size of the read name map in reduced reads and handle its resets more efficiently.
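A minimal sketch of memoizing a cumulative binomial probability, as mentioned above. The class, the cache key, and the fixed success probability of 0.5 are assumptions made for illustration, not the MathUtils implementation, and the naive coefficient arithmetic is only suitable for modest n.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical memoized cumulative binomial probability with p = 0.5 (assumed). */
final class MemoizedBinomialSketch {
    private static final Map<String, Double> CACHE = new HashMap<>();

    /** P(kStart <= X <= kEnd) for X ~ Binomial(n, 0.5); results are cached per argument triple. */
    static synchronized double cumulativeProbability(int n, int kStart, int kEnd) {
        if (n < 0 || kStart < 0 || kEnd < kStart || kEnd > n) {
            throw new IllegalArgumentException("need 0 <= kStart <= kEnd <= n");
        }
        String key = n + ":" + kStart + ":" + kEnd;
        Double cached = CACHE.get(key);
        if (cached != null) {
            return cached;               // cache hit: no recomputation
        }
        double coefficient = 1.0;        // C(n, 0)
        double sum = 0.0;
        for (int k = 0; k <= kEnd; k++) {
            if (k > 0) {
                coefficient = coefficient * (n - k + 1) / k; // C(n, k) from C(n, k - 1)
            }
            if (k >= kStart) {
                sum += coefficient;
            }
        }
        double result = sum * Math.pow(0.5, n);
        CACHE.put(key, result);
        return result;
    }

    static synchronized int cacheSize() { return CACHE.size(); }
}
```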
-- In the case where we have multiple potential alternative alleles *and* we weren't calling all of them (so that n potential values < n called) we could end up trimming the alleles down which would result in the mismatch between the PerReadAlleleLikelihoodMap alleles and the VariantContext trimmed alleles.
-- Fixed by doing two things (1) moving the trimming code after the annotation call and (2) updating AD annotation to check that the alleles in the VariantContext and the PerReadAlleleLikelihoodMap are concordant, which will stop us from degenerating in the future.
-- delivers [#50897077]
-- This occurred because we were reverting reads with soft clips that would produce reads with negative (or 0) alignment starts. From such reads we could end up with adaptor starts that were negative and that would ultimately produce the "Only one of refStart or refStop must be < 0, not both" error in the FragmentUtils merging code (which would revert and adaptor clip reads).
-- We now hard clip away reverted soft-clipped bases that fall before the 1-based contig start in revertSoftClippedBases.
-- Replace buggy cigarFromString with proper SAM-JDK call TextCigarCodec.getSingleton().decode(cigarString)
-- Added unit tests for reverting soft clipped bases that create a read before the contig
-- [delivers #50892431]
-- Ultimately this was caused by an underlying bug in the reverting of soft clipped bases in the read clipper. The read clipper would fail to properly set the alignment start for reads that were 100% clipped before reverting, such as 10H2S5H => 10H2M5H. This has been fixed and unit tested.
-- Update 1 ReduceReads MD5, which was due to cases where we were clipping away all of the MATCH part of the read, leaving a cigar like 50H11S and the revert soft clips was failing to properly revert the bases.
-- delivers #50655421
-- Although the original bug report was about SplitSamFile, it was actually an engine-wide error. The two places that provide compression to the BAM writer now check the validity of the compress argument via a static method in ReadUtils
-- delivers #49531009
-- We now inject the given alleles into the reference haplotype and add them to the graph.
-- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode.
-- This unifies how Smith-Waterman is performed between discovery and GGA modes.
-- Misc minor cleanup in several places.
1) Add in checks for input parameters in MathUtils method. I was careful to use the bottom-level methods whenever possible, so that parameters don't needlessly go through multiple checks (so for instance, the parameters n and k for a binomial aren't checked on log10binomial, but rather in the log10binomialcoefficient subroutine).
This addresses JIRA GSA-767
Unit tests pass (we'll let bamboo deal with the integrations)
2) Address reviewer comments (change UserExceptions to IllegalArgumentExceptions).
3) .isWellFormedDouble() tests for infinity and not strictly positive infinity. Allow negative-infinity values for log10sumlog10 (as these just correspond to p=0).
After these commits, unit and integration tests now pass, and GSA-767 is done.
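To illustrate the kind of input checking described in points 1) to 3), here is a sketch of a log10 sum-of-logs routine that rejects NaN and positive infinity with an IllegalArgumentException but accepts negative infinity (p = 0). The class is hypothetical and is not the MathUtils code itself.

```java
/** Hypothetical illustration of parameter checking in a log10 sum-of-logs routine. */
final class Log10SumSketch {
    /** Returns log10(sum_i 10^log10Values[i]); negative-infinity entries contribute p = 0. */
    static double log10SumLog10(double[] log10Values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : log10Values) {
            if (Double.isNaN(v) || v == Double.POSITIVE_INFINITY) {
                throw new IllegalArgumentException("log10 value must not be NaN or +Infinity: " + v);
            }
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return Double.NEGATIVE_INFINITY; // every term is p = 0 (or the array is empty)
        }
        double sum = 0.0;
        for (double v : log10Values) {
            sum += Math.pow(10.0, v - max);  // shift by the max for numerical stability
        }
        return max + Math.log10(sum);
    }
}
```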
rebase and fix conflict:
public/java/src/org/broadinstitute/sting/utils/MathUtils.java
-- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 *before* the BQSR filters them out with MappingQualityUnassignedFilter.
-- delivers #50222251
The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but
ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not
clips should be ignored. Also rebased so that return type is no longer a Pair.
Added unit test to cover this situation.
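The effect of the new boolean argument can be sketched on plain CIGAR strings. This is a simplification under stated assumptions: the class is hypothetical, the parameter name ignoreClipOps is invented for the example, and the sketch returns a boolean rather than whatever the refactored ReadUtils method actually returns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical string-based illustration of the clip-aware leading-insertion check. */
final class LeadingInsertionSketch {
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)([MIDNSHPX=])");

    /**
     * True if the first cigar element is an insertion. When ignoreClipOps is true,
     * leading hard and soft clips are skipped before looking for the insertion.
     */
    static boolean readStartsWithInsertion(String cigar, boolean ignoreClipOps) {
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            char op = m.group(2).charAt(0);
            if (op == 'I') {
                return true;
            }
            boolean isClip = (op == 'H' || op == 'S');
            if (!(ignoreClipOps && isClip)) {
                return false; // first relevant element is not an insertion
            }
        }
        return false;
    }
}
```

For the cigar 2S3I10M the answer is true when clips are ignored and false when they are not, which is the distinction the fix relies on.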
-ReadShardBalancer was printing out an extra "Loading BAM index data for next contig"
message at traversal end, which was confusing users and making the GATK look stupid.
Suppress the extraneous message, and reword the log messages to be less confusing.
-Improve log message output when initializing the shard iterator in GenomeAnalysisEngine.
Don't mention BAMs when there are none, and say "Preparing for traversal" rather than
mentioning the meaningless-for-users concept of "shard strategy"
-These log messages are needed because the operations they surround might take a
while under some circumstances, and the user should know that the GATK is actively
doing something rather than being hung.
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
since low dcov values can result in problematic downsampling artifacts for locus-based
traversals.
-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
controls the number of reads per alignment start position, and even a dcov value of 1 might
be safe/desirable in some circumstances.
-Also reorganize the global downsampling defaults so that they are specified as annotations
to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
DownsamplingMethod class.
-The default downsampling settings have not been changed: they are still -dcov 1000
for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
BandedHMM
---------
-- An implementation of a linear runtime, linear memory usage banded logless PairHMM. Though about 50% faster than the current PairHMM, this implementation will be superseded by the GraphHMM when it becomes available. The implementation is being archived for future reference
Useful infrastructure changes
-----------------------------
-- Split PairHMM into a N2MemoryPairHMM that allows smarter implementations to avoid allocating the double[][] matrices if they don't need them, which was previously occurring in the base class PairHMM
-- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for using in unit or performance testing. Added example of 100kb of data to private/testdata. Can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
Don't map class to counts in the ReadMetrics (necessitating 2 HashMap lookups for every increment).
Instead, wrap the ReadFilters with a counting version and then set those counts only when updating global metrics.
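The wrapping idea can be sketched with a generic predicate standing in for a ReadFilter. The class name and the use of java.util.function.Predicate are assumptions for the example; the point is that each filter carries its own counter, so incrementing needs no map lookup.

```java
import java.util.function.Predicate;

/** Hypothetical counting wrapper: test() returning true means "filter this item out". */
final class CountingFilterSketch<T> implements Predicate<T> {
    private final Predicate<T> delegate;
    private long filteredCount = 0;

    CountingFilterSketch(Predicate<T> delegate) { this.delegate = delegate; }

    @Override public boolean test(T item) {
        boolean filteredOut = delegate.test(item);
        if (filteredOut) {
            filteredCount++;   // plain field increment, no HashMap lookups
        }
        return filteredOut;
    }

    /** Read once when the global metrics are updated. */
    long getFilteredCount() { return filteredCount; }
}
```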
-- Add() call had a misplaced map.put call, so that we were always putting the result of get() back into the map, when what we really intended was to only put the value back in if the original get() resulted in a null and so initialized the result
-- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element. New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
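The set-plus-iterator removal described above can be sketched generically as follows. The class is hypothetical and works on any element type; the real method lives on ActiveRegion and also updates the read span during the same pass.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of removing reads via a Set lookup and a single list pass. */
final class RemoveAllSketch {
    /** Removes from reads every element contained in readsToRemove; returns how many were removed. */
    static <T> int removeAll(List<T> reads, Set<T> readsToRemove) {
        int removed = 0;
        for (Iterator<T> it = reads.iterator(); it.hasNext(); ) {
            if (readsToRemove.contains(it.next())) { // membership test on the Set, no indexOf() scan
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```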
-- The previous version of PerReadAlleleLikelihoodMap only stored the alleles in an ArrayList, and used ArrayList.contains() to determine if an allele was already present in the map. This is very slow with many alleles. Now keeps both the ArrayList (for get() performance) and a Set of alleles for contains().
1. Don't clone the dataSource's metrics object (because then the engine won't continue to get updated counts)
2. Use the dataSource's metrics object in the CountingFilteringIterator and not the first shard's object!
3. Synchronize ReadMetrics.incrementMetrics to prevent race conditions.
Also:
* Make sure users realize that the read counts are approximate in the print outs.
* Removed a lot of unused cruft from the metrics object while I was in there.
* Added test to make sure that the ReadMetrics read count does not overflow ints.
* Added unit tests for traversal metrics (reads, loci, and active region traversals); these test counts of reads and records.
-- [Delivers #49876703]
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!
- Converted my old GATKBAMIndexText (within PileupWalkerIntegrationTest) to use a dataProvider
- Added two integration tests to test -outputInsertLength option
-- The previous implementation of the maxRuntime would require us to wait until all of the work was completed within a shard, which can be a substantial amount of work in the case of a locus walker with 16kb shards.
-- This implementation ensures that we exit from the traversal very soon after the max runtime is exceeded, without completing all of our work within the shard. This is done by updating all of the traversal engines to return false for hasNext() in the nano scheduled input provider. So as soon as the timeout is exceeded, we stop generating additional data to process, and we only have to wait until the currently executing data processing unit (locus, read, active region) completes.
-- In order to implement this timeout efficiently at this fine scale, the progress meter now lives in the genome analysis engine, and the exceedsTimeout() call in the engine looks at a periodically updated runtime variable in the meter. This variable contains the elapsed runtime of the engine, but is updated by the progress meter daemon thread so that the engine doesn't call System.nanoTime() in each cycle of the engine, which would be very expensive. Instead we basically wait for the daemon to update this variable, and so our precision of timing out is limited by the update frequency of the daemon, which is on the order of every few hundred milliseconds, totally fine for a timeout.
-- Added integration tests to ensure that subshard timeouts are working properly
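A rough sketch of the timing scheme described above: a volatile elapsed-time field that a background thread refreshes, and a cheap check that the traversal loop can call on every cycle. The class and method names other than exceedsTimeout() are hypothetical, and the clock value is passed in explicitly so the behavior is easy to follow.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of a daemon-updated runtime variable used for timeouts. */
final class RuntimeMeterSketch {
    private final long startNanos;
    private final long maxRuntimeNanos;
    private volatile long elapsedNanos = 0; // written by the daemon, read by the engine

    RuntimeMeterSketch(long startNanos, long maxRuntime, TimeUnit unit) {
        this.startNanos = startNanos;
        this.maxRuntimeNanos = unit.toNanos(maxRuntime);
    }

    /** Called periodically by the progress-meter daemon thread, not by the traversal loop. */
    void update(long nowNanos) {
        elapsedNanos = nowNanos - startNanos;
    }

    /** Cheap per-cycle check: a single volatile read, no clock call. */
    boolean exceedsTimeout() {
        return elapsedNanos > maxRuntimeNanos;
    }
}
```

In this sketch a daemon thread would call update(System.nanoTime()) every few hundred milliseconds, and an input iterator's hasNext() would return false once exceedsTimeout() is true.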
-- Previously we used the LocusShardBalancer for the haplotype caller, which meant that TraverseActiveRegions saw its shards grouped in chunks of 16kb bits on the genome. These locus shards are useful when you want to use the HierarchicalMicroScheduler, as they provide fine-grained access to the underlying BAM, but they have two major drawbacks: (1) we have to fairly frequently reset our state in TAR to handle moving between shard boundaries and (2) with the nano scheduled TAR we end up blocking at the end of each shard while our threads all finish processing.
-- This commit changes the system over to using an ActiveRegionShardBalancer, which combines all of the shard data for a single contig into a single combined shard. This ensures that TAR, and by extension the HaplotypeCaller, gets all of the data on a single contig together so that the NanoScheduler runs efficiently instead of blocking over and over at shard boundaries. This simple change allows us to scale efficiently to around 8 threads in the nano scheduler:
-- See https://www.dropbox.com/s/k7f280pd2zt0lyh/hc_nano_linear_scale.pdf
-- See https://www.dropbox.com/s/fflpnan802m2906/hc_nano_log_scale.pdf
-- Misc. changes throughout the codebase so we use the ActiveRegionShardBalancer where appropriate.
-- Added unit tests for ActiveRegionShardBalancer to confirm it does the merging as expected.
-- Fix bad toString in FilePointer
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made HaplotypeCaller NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
-- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
-- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
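The ID counter fix can be sketched as below. The class is a hypothetical stand-in for the vertex classes named above; the only point is that getAndIncrement() on a shared AtomicInteger hands out each ID exactly once, even when constructors run concurrently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical vertex with a thread-safe, class-wide ID counter. */
class VertexIdSketch {
    private static final AtomicInteger NEXT_ID = new AtomicInteger(0);

    final int id;
    final String sequence;

    VertexIdSketch(String sequence) {
        this.sequence = sequence;
        this.id = NEXT_ID.getAndIncrement(); // atomic read-and-bump, so no two vertices share an id
    }
}
```

With a plain static int and id = nextId++, two threads could read the same value before either wrote the increment back, which is the race described above.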
* This version inherits from the original SW implementation so it can use the same matrix creation method.
* A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
right thing for indels at the edges of the alignments.
* Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
* Lots of systematic testing added for this implementation.
* NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
-Diff engine output is now included in the actual exception message thrown as a
result of an MD5 mismatch, which allows it to be conveniently viewed on the
main page of a build in Bamboo.
Minor Additional Improvements:
-WalkerTestSpec now auto-detects test class name via new JVMUtils.getCallingClass()
method, and the test class name is now included as a regular part of integration
test output for each test.
-Fix race condition in MD5DB.ensureMd5DbDirectory()
-integrationtests dir is now cleaned by "ant clean"
GSA-915 #resolve
Problem
-------
The DeBruijn assembler was too slow. The cause of the slowness was the need to construct many kmer graphs (from max read length in the interval to 11 kmer, in increments of 6 bp). This need to build many kmer graphs was because the assembler (1) needed long kmers to assemble through regions where a shorter kmer was non-unique in the reference, as we couldn't split cycles in the reference (2) shorter kmers were needed to be sensitive to differences from the reference near the edge of reads, which would be lost often when there was chain of kmers of longer length that started before and after the variant.
Solution
--------
The read threading assembler uses a fixed kmer, in this implementation by default two graphs with 10 and 25 kmers. The algorithm operates as follows:
  identify all non-unique kmers of size K among all reads and the reference
  for each sequence (ref and read):
    find a unique starting position of the sequence in the graph by matching to a unique kmer, or starting a new source node if none exists
    for each base in the sequence from the starting vertex kmer:
      look at the existing outgoing nodes of current vertex V. If the base in sequence matches the suffix of outgoing vertex N, thread the sequence to N, and continue
      If no matching next vertex exists, find a unique vertex with kmer K. If one exists, merge the sequence into this vertex, and continue
      If a merge vertex cannot be found, create a new vertex (note this vertex may have a kmer identical to another in the graph, if it is not unique) and thread the sequence to this vertex, and continue
This algorithm has a key property: it can robustly use a very short kmer without introducing cycles, as we will create paths through the graph through regions that aren't unique w.r.t. the sequence at the given kmer size. This allows us to assemble well with even very short kmers.
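The first step of the algorithm, identifying non-unique kmers, can be sketched as follows. This is an illustration under an explicit assumption: a kmer is treated as non-unique when it occurs more than once within a single sequence (read or reference), and the per-sequence results are pooled. The class name is hypothetical and sequences are plain strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of the non-unique kmer scan that precedes threading. */
final class NonUniqueKmerSketch {
    /** Returns every kmer of size k that appears at least twice within any one sequence. */
    static Set<String> nonUniqueKmers(List<String> sequences, int k) {
        Set<String> nonUnique = new HashSet<>();
        for (String seq : sequences) {
            Set<String> seenInThisSequence = new HashSet<>();
            for (int i = 0; i + k <= seq.length(); i++) {
                String kmer = seq.substring(i, i + k);
                if (!seenInThisSequence.add(kmer)) {
                    nonUnique.add(kmer); // second occurrence within the same sequence
                }
            }
        }
        return nonUnique;
    }
}
```

Kmers in the returned set would not be usable as unique anchors when choosing a starting position or a merge vertex in the later steps.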
This commit includes many critical changes to the haplotype caller to make it fast, sensitive, and accurate on deep and shallow WGS and exomes, the key changes are highlighted below:
-- The ReadThreading assembler keeps track of the maximum edge multiplicity per sample in the graph, so that we prune per sample, not across all samples. This change is essential to operate effectively when there are many deep samples (i.e., 100 exomes)
-- A new pruning algorithm that will only prune linear paths where the maximum edge weight among all edges in the path is < pruningFactor. This makes pruning more robust when you have a long chain of bases that have high multiplicity at the start but only barely make it back into the main path in the graph.
-- We now do a global SmithWaterman to compute the cigar of a Path, instead of the previous bubble-based SmithWaterman optimization. This change is essential for us to get good variants from our paths when the kmer size is small. It also ensures that we produce a cigar from a path that depends only on the sequence of bases in the path, unlike the previous approach which would depend on both the bases and the way the path was decomposed into vertices, which depended on the kmer size we used.
-- Removed MergeHeadlessIncomingSources, which was introducing problems in the graphs in some cases, and just isn't the safest operation. Since we build a kmer graph of size 10, this operation is no longer necessary as it required a perfect match of 10 bp to merge anyway.
-- The old DebruijnAssembler is still available with a command line option
-- The number of paths we take forward from each assembly graph is now capped at a factor per sample, so that we allow 128 paths for a single sample up to 10 x nSamples as necessary. This is an essential change to make the system work well for large numbers of samples.
-- Add a global mismapping parameter to the HC likelihood calculation: The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say like 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against the reference that this (and any) read could contribute against reference is Q30. -- Controllable via a command line argument, defaulting to Q60 rate. Results from 20:10-11 mb for branch are consistent with the previous behavior, but this does help in cases where you have rare very divergent haplotypes
-- Reduced ActiveRegionExtension from 200 bp to 100 bp, which is a performance win and the large extension is largely unnecessary with the short kmers used with the read threading assembler
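The arithmetic behind the global mismapping parameter can be sketched as a floor on per-read likelihoods. This is an assumption about how such a cap could be applied, not the actual LikelihoodCalculationEngine code: each read's log10 likelihood for a haplotype is prevented from falling more than the mismapping rate below that read's best haplotype likelihood. The class and method names are hypothetical.

```java
/** Hypothetical illustration of capping per-read evidence with a global mismapping rate. */
final class MismappingCapSketch {
    /**
     * Floors a read's log10 likelihood for one haplotype at
     * (best log10 likelihood of that read) + log10(mismapping rate).
     */
    static double capLog10Likelihood(double log10Likelihood,
                                     double bestLog10Likelihood,
                                     double phredScaledGlobalReadMismappingRate) {
        double log10MismappingRate = -phredScaledGlobalReadMismappingRate / 10.0; // Q30 -> -3.0
        return Math.max(log10Likelihood, bestLog10Likelihood + log10MismappingRate);
    }
}
```

In the example above, five Q30 mismatches put the reference 15 log10 units below the read's best haplotype; with the parameter at Q30 that gap is reduced to 3 log10 units, that is, Q30 of evidence.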
Infrastructure changes / improvements
-------------------------------------
-- Refactored BaseGraph to take a subclass of BaseEdge, so that we can use a MultiSampleEdge in the ReadThreadingAssembler
-- Refactored DeBruijnAssembler, moving common functionality into LocalAssemblyEngine, which now more directly manages the subclasses, requiring them to only implement a assemble() method that takes ref and reads and provides a List<SeqGraph>, which the LocalAssemblyEngine takes forward to compute haplotypes and other downstream operations. This allows us to have only a limited amount of code that differentiates the Debruijn and ReadThreading assemblers
-- Refactored active region trimming code into ActiveRegionTrimmer class
-- Cleaned up the arguments in HaplotypeCaller, reorganizing them and making arguments @Hidden and @Advanced as appropriate. Renamed several arguments now that the read threading assembler is the default
-- LocalAssemblyEngineUnitTest reads in the reference sequence from b37, and assembles intervals from 10-11 mb using synthetic reads built from only the reference sequence as well as from artificial snps, deletions, and insertions.
-- Misc. updates to Smith Waterman code. Added a generic interface called, not surprisingly, SmithWaterman, making it easier to have alternative implementations.
-- Many many more unit tests throughout the entire assembler, and in random utilities
Only try to clip adaptors when both reads of the pair are on opposite strands
-- Read pairs that have unusual alignments, such as two reads both oriented like:
<-----
<-----
were previously having their adaptors clipped as though the standard calculation of the insert size was meaningful, which it is not for such oddly oriented pairs. This caused us to clip extra good bases from reads.
-- Update MD5s due to the change in adaptor clipping, which adds some coverage in some places
-- Previous version would trim down 2M2D2M into 2M if you asked for the first 2 bases, but this can result in incorrect alignment of the bases to the reference as the bases no longer span the full reference interval expected. Fixed and added unit tests
Output didn't "mix-up" the genotypes, it output the same HET vs HET (e.g.) 3 times rather than the combinations of HET vs {HET, HOM, HOM_REF}, etc.
This was only a problem in the text, _not_ the actual numbers, which were outputted correctly.
- Updated MD5's after looking at diffs to verify that the change is what I expected.
-Changes in Java 7 related to comparators / sorting produce a large number
of innocuous differences in our test output. Updating expectations now
that we've moved to using Java 7 internally.
-Also incorporate Eric's fix to the GATKSAMRecordUnitTest to prevent
intermittent failures.
This class, being unused, was no longer getting packaged into the
GATK release jar by bcel, and so attempting to run its unit test
on the release jar was producing an error.
RR counts are represented as offsets from the first count, but that wasn't being done
correctly when counts are adjusted on the fly. Also, we were triggering the expensive
conversion and writing to binary tags even when we weren't going to write the read
to disk.
The code has been updated so that unconverted counts are passed to the GATKSAMRecord
and it knows how to encode the tag correctly. Also, there are now methods to write
to the reduced counts array without forcing the conversion (and methods that do force
the conversion).
Also:
1. counts are now maintained as ints whenever possible. Only the GATKSAMRecord knows
about the internal encoding.
2. as discussed in meetings today, we updated the encoding so that it can now handle
a range of values that extends to 255 instead of 127 (and is backwards compatible).
3. tests have been moved from SyntheticReadUnitTest to GATKSAMRecordUnitTest accordingly.
-- The previous version of the read clipping operations wouldn't modify the reduced reads counts, so hardClipToRegion would result in a read with, say, 50 bp of sequence and base qualities but 250 bp of reduced read counts. Updated the hardClip operation to handle reduced reads, and added a unit test to make sure this works properly. Also had to update GATKSAMRecord.emptyRead() to set the reduced count to new byte[0] if the template read is a reduced read
-- Update md5s, where the new code recovers a TP variant with count 2 that was missed previously
-Do not throw an exception when parsing snpEff output files
generated by not-officially-supported versions of snpEff,
PROVIDED that snpEff was run with -o gatk
-Requested by the snpEff author
-Relevant integration tests updated/expanded
Note that this works only in the case of pileups (i.e. coming from UG);
allele-biased down-sampling for RR just cannot work for haplotypes.
Added lots of unit tests for new functionality.
-- The previous version was unclipping soft clipped bases, and these were sometimes adaptor sequences. If the two reads successfully merged, we'd lose all of the information necessary to remove the adaptor, producing a very high quality read that matched reference. Updated the code to first clip the adapter sequences from the incoming fragments
-- Update MD5s
-Acquire file locks in a background thread with a timeout of 30 seconds,
and throw a UserException if a lock acquisition call times out
* should solve the locking issue for most people provided they
RETRY failed farm jobs
* since we use NON-BLOCKING lock acquisition calls, any call that
takes longer than a second or two indicates a problem with the
underlying OS file lock support
* use daemon threads so that stuck lock acquisition tasks don't
prevent the JVM from exiting
-Disable both auto-index creation and file locking for integration tests
via a hidden GATK argument --disable_auto_index_creation_and_locking_when_reading_rods
* argument not safe for general use, since it allows reading from
an index file without first acquiring a lock
* this is fine for the test suite, since all index files already
exist for test files (or if they don't, they should!)
-Added missing indices for files in private/testdata
-Had to delete most of RMDTrackBuilderUnitTest, since it mostly tested auto-index
creation, which we can't test with locking disabled, but I replaced the deleted
tests with some tests of my own.
-Unit test for FSLockWithShared to test the timeout feature
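The lock acquisition scheme described above (non-blocking call, background daemon thread, timeout) might look roughly like the sketch below. It is not the FSLockWithShared implementation: the class and method names are hypothetical, and a plain IOException stands in for the UserException mentioned in the commit.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedLockSketch {
    // Daemon threads, so a stuck lock call cannot keep the JVM from exiting.
    private static final ThreadFactory DAEMON_FACTORY = r -> {
        Thread t = new Thread(r, "file-lock-acquisition");
        t.setDaemon(true);
        return t;
    };

    /**
     * Returns the lock, or null if the non-blocking call reports that the lock
     * is held elsewhere. Throws IOException if the call itself does not return in time.
     */
    static FileLock tryLockWithTimeout(FileChannel channel, boolean shared,
                                       long timeout, TimeUnit unit) throws IOException {
        ExecutorService executor = Executors.newSingleThreadExecutor(DAEMON_FACTORY);
        Callable<FileLock> acquire = () -> channel.tryLock(0L, Long.MAX_VALUE, shared);
        Future<FileLock> task = executor.submit(acquire);
        try {
            return task.get(timeout, unit);
        } catch (TimeoutException e) {
            task.cancel(true);
            throw new IOException("Timed out after " + timeout + " " + unit
                    + " waiting for a file lock; retrying the job may help", e);
        } catch (ExecutionException e) {
            throw new IOException("File lock acquisition failed", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for a file lock", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Because tryLock is non-blocking, a healthy file system should answer almost immediately; the timeout only fires when the underlying OS lock support is misbehaving, which is the case the commit is guarding against.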
1. Using cumulative binomial probability was not working at high coverage sites (because p-values quickly
got out of hand) so instead we use a hybrid system for determining significance: at low coverage sites
use binomial prob and at high coverage sites revert to using the old base proportions. Then we get the
best of both worlds. As a note, coverage refers to just the individual base counts and not the entire pileup.
2. Reads were getting lost because of the comparator being used in the SlidingWindow. When read pairs had
the same alignment end position the 2nd one encountered would get dropped (but added to the header!). We
now use a PriorityQueue instead of a TreeSet to allow for such cases.
3. Each consensus keeps track of its own number of softclipped bases. There was no reason that that number
should be shared between them.
4. We output consensus filtered (i.e. low MQ) reads whenever they are present for now. Don't lose that
information. Maybe we'll decide to change this in the future, but for now we are conservative.
5. Also implemented various small performance optimizations based on profiling.
Added unit tests to cover these changes; systematic assessment now tests against low MQ reads too.
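Item 2 above comes down to a general property of the Java collections: a TreeSet treats two elements as duplicates whenever its comparator returns 0, while a PriorityQueue keeps both. A minimal sketch, using a hypothetical stand-in Read type ordered by alignment end (the real SlidingWindow comparator is not shown in this log):

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeSet;

final class AlignmentEndOrderingSketch {
    /** Hypothetical stand-in for a read: a name plus its alignment end. */
    static final class Read {
        final String name;
        final int alignmentEnd;

        Read(String name, int alignmentEnd) {
            this.name = name;
            this.alignmentEnd = alignmentEnd;
        }
    }

    static final Comparator<Read> BY_ALIGNMENT_END =
            Comparator.comparingInt((Read r) -> r.alignmentEnd);

    /** A TreeSet silently drops any read that compares equal to one already present. */
    static int sizeAfterAddingToTreeSet(Read... reads) {
        TreeSet<Read> set = new TreeSet<>(BY_ALIGNMENT_END);
        for (Read r : reads) {
            set.add(r);
        }
        return set.size();
    }

    /** A PriorityQueue keeps every read, including ties on alignment end. */
    static int sizeAfterAddingToPriorityQueue(Read... reads) {
        PriorityQueue<Read> queue = new PriorityQueue<>(BY_ALIGNMENT_END);
        for (Read r : reads) {
            queue.add(r);
        }
        return queue.size();
    }
}
```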
-- Now that this function is used in the core of LIBS it needed some basic optimizations, which are now complete and pass all unit tests.
-- Added caliper benchmark for AlignmentUtils to assess performance (showing new version is 3x-10x faster)
-- Remove unused import in ReadStateManager
* Moved redundant code out of UGEngine
* Added overloaded methods that assume p=0.5 for speed efficiency
* Added unit test for the binomialCumulativeProbability method
The Problem:
Exomes seem to be more prone to base errors and one error in 20x coverage (or below, like most
regions in an exome) causes RR (with default settings) to consider it a variant region. This
seriously hurts compression performance.
The Solution:
1. We now use a probabilistic model for determining whether we can create a consensus (in other
words, whether we can error correct a site) instead of the old ratio threshold. We calculate
the cumulative binomial probability of seeing the given ratio and trigger consensus creation if
that pvalue is lower than the provided threshold (0.01 by default, so rather conservative).
2. We also allow het compression globally, not just at known sites. So if we cannot create a
consensus at a given site then we try to perform het compression; and if we cannot perform het
compression then we just don't reduce the variant region. This way very wonky regions stay
uncompressed, regions with one errorful read get fully compressed, and regions with one errorful
locus get het compressed.
Details:
1. -minvar is now deprecated in favor of -min_pvalue.
2. Added integration test for bad pvalue input.
3. -known argument still works to force het compression only at known sites; if it's not included
then we allow het compression anywhere. Added unit tests for this.
4. This commit includes fixes to het compression problems that were revealed by systematic qual testing.
Before finalizing het compression, we now check for insertions or other variant regions (usually due
to multi-allelics) which can render a region incompressible (and we back out if we find one). We
were checking for excessive softclips before, but now we add these tests too.
5. We now allow het compression on some but not all of the 4 consensus reads: if creating one of the
consensuses is not possible (e.g. because of excessive softclips) then we just back that one consensus
out instead of backing out all of them.
6. We no longer create a mini read at the stop of the variant window for het compression. Instead, we
allow it to be part of the next global consensus.
7. The coverage test is no longer run systematically on all integration tests because the quals test
supersedes it. The systematic quals test is now much stricter in order to catch bugs and edge cases
(very useful!).
8. Each consensus (both the normal and filtered) keeps track of its own mapping qualities (previously the MQ
for a consensus was affected by both good and bad bases/reads).
9. We now completely ignore low quality bases, unless they are the only bases present in a pileup.
This way we preserve the span of reads across a region (needed for assembly). Min base qual moved to Q15.
10. Fixed long-standing bug where sliding window didn't do the right thing when removing reads that start
with insertions from a header.
Note that this commit must come serially before the next commit in which I am refactoring the binomial prob
code in MathUtils (which is failing and slow).
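The p-value test from "The Solution" above can be sketched as follows. The commit does not spell out the exact statistic, so this assumes a one-sided cumulative probability of seeing at most the observed number of minority bases under p=0.5; the class and method names are hypothetical and this is not the MathUtils code.

```java
final class ConsensusPValueSketch {
    /**
     * P(X <= k) for X ~ Binomial(n, 0.5), summed in real space.
     * Only suitable for modest n: Math.pow(0.5, n) underflows at very high coverage.
     */
    static double binomialCumulativeProbability(int n, int k) {
        double term = Math.pow(0.5, n); // P(X = 0)
        double sum = 0.0;
        for (int i = 0; i <= k && i <= n; i++) {
            sum += term;
            term = term * (n - i) / (i + 1); // P(X = i + 1) derived from P(X = i)
        }
        return Math.min(1.0, sum);
    }

    /** Treat the minority bases as errors (and build a consensus) only if they are unlikely for a het. */
    static boolean canCreateConsensus(int totalBases, int minorityBases, double minPValue) {
        return binomialCumulativeProbability(totalBases, minorityBases) < minPValue;
    }
}
```

Under these assumptions, one discordant base at 20x gives a p-value of 21/2^20 (about 2e-5), well below the 0.01 default, so the site is error corrected; five discordant bases at 20x give about 0.021, so no consensus is created there.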
-- The previous algorithm would compute the likelihood of each haplotype pooled across samples. This has a tendency to select "consensus" haplotypes that are reasonably good across all samples, while missing the true haplotypes that each sample likes. The new algorithm computes instead the most likely pair of haplotypes among all haplotypes for each sample independently, contributing 1 vote to each haplotype it selects. After all N samples have been run, we sort the haplotypes by their counts, and take 2 * nSample + 1 haplotypes or maxHaplotypesInPopulation, whichever is smaller.
-- After discussing with Mauricio our view is that the algorithmic complexity of this approach is no worse than the previous approach, so it should be equivalently fast.
-- One potential improvement is to not use hard counts for the haplotypes, but this would radically complicate the current algorithm so it wasn't selected.
-- For an example of a specific problem caused by this, see https://jira.broadinstitute.org/browse/GSA-871.
-- Remove old pooled likelihood model. It's worse than the current version in both single and multiple samples:
1000G EUR samples:
10Kb
per sample: 7.17 minutes
pooled: 7.36 minutes
Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS         50             0               5               8              1
per_sample  INDELS       6              0               7               2              1
pooled      SNPS         49             0               6               8              1
pooled      INDELS       5              0               8               2              1
100 kb
per sample: 140.00 minutes
pooled: 145.27 minutes
Name        VariantType  TRUE_POSITIVE  FALSE_POSITIVE  FALSE_NEGATIVE  TRUE_NEGATIVE  CALLED_NOT_IN_DB_AT_ALL
per_sample  SNPS         144            0               22              28             1
per_sample  INDELS       28             1               16              9              11
pooled      SNPS         143            0               23              28             1
pooled      INDELS       27             1               17              9              11
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I private/testdata/AFR.structural.indels.bam -L 20:8187565-8187800 -L 20:18670537-18670730 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 8 seconds
haplotypes from pools: 8 seconds
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -I /Users/depristo/Desktop/broadLocal/localData/phaseIII.4x.100kb.bam -L 20:10,000,000-10,001,000 -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -o /dev/null -debug
haplotypes from samples: 173.32 seconds
haplotypes from pools: 167.12 seconds
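The per-sample voting scheme described in this commit can be sketched as below. This is a simplified illustration with hypothetical names, not the HaplotypeCaller code. It assumes the log10 likelihood of every haplotype pair has already been computed for each sample, that a homozygous best pair contributes a single vote, and that ties in vote count are broken by haplotype index.

```java
import java.util.ArrayList;
import java.util.List;

final class PerSampleHaplotypeVotingSketch {
    /**
     * pairLog10Lik[s][i][j] (with j >= i) is the log10 likelihood of sample s
     * carrying haplotypes i and j. Returns the selected haplotype indices, best first.
     */
    static List<Integer> select(double[][][] pairLog10Lik, int nHaplotypes,
                                int maxHaplotypesInPopulation) {
        final int[] votes = new int[nHaplotypes];
        for (double[][] sample : pairLog10Lik) {
            int bestI = 0;
            int bestJ = 0;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < nHaplotypes; i++) {
                for (int j = i; j < nHaplotypes; j++) {
                    if (sample[i][j] > best) {
                        best = sample[i][j];
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            votes[bestI]++;
            if (bestJ != bestI) {
                votes[bestJ]++; // one vote per distinct selected haplotype (assumption)
            }
        }

        List<Integer> order = new ArrayList<>();
        for (int h = 0; h < nHaplotypes; h++) {
            order.add(h);
        }
        // Stable sort: most votes first, original index order among ties.
        order.sort((a, b) -> Integer.compare(votes[b], votes[a]));

        int nSamples = pairLog10Lik.length;
        int limit = Math.min(Math.min(2 * nSamples + 1, maxHaplotypesInPopulation), nHaplotypes);
        return new ArrayList<>(order.subList(0, limit));
    }
}
```

The cost is one pass over the haplotype pairs per sample plus a sort over the haplotypes.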
-- Add pair cleaning feature. Reads in query-name sorted order are required and pairs need to appear consecutively, but if -cleanPairs option is set, a malformed pair where second read is missing is just skipped instead of erroring out.
-- Add integration tests
-- Move walker to public
The Problem
----------
Some read x haplotype pairs were getting very low likelihood when caching is on. Turning it off seemed to give the right result.
Solution
--------
The HaplotypeCaller only initializes the PairHMM once and then feeds it with a set of reads and haplotypes. The PairHMM always reuses the cached matrix when the previous haplotype length is the same as the current one. This is not valid when the read has changed. This commit adds another condition to zero the haplotype start index when the read changes.
Summarized Changes
------------------
* Added the recacheReadValue check to flush the matrix (hapStartIndex = 0)
* Updated related MD5's
Bamboo link: http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL9
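The caching rule described in this commit can be illustrated with the sketch below. It is a hypothetical helper, not the PairHMM code: it assumes the cached matrix can be reused up to the first base where two equal-length haplotypes differ, and that anything else (a new read, a different length, no previous haplotype) forces a full recompute from index 0.

```java
final class HapStartIndexSketch {
    /**
     * Returns the first haplotype position whose matrix columns must be recomputed.
     * 0 means "flush the cache and recompute everything".
     */
    static int hapStartIndex(byte[] previousHaplotype, byte[] currentHaplotype, boolean readChanged) {
        if (readChanged
                || previousHaplotype == null
                || previousHaplotype.length != currentHaplotype.length) {
            return 0;
        }
        for (int i = 0; i < currentHaplotype.length; i++) {
            if (previousHaplotype[i] != currentHaplotype[i]) {
                return i;
            }
        }
        return currentHaplotype.length; // identical haplotypes: nothing differs
    }
}
```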
Key improvement
---------------
-- The haplotype caller was producing unstable calls when comparing the following two haplotypes:
ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
in which the alt and ref haplotypes differ in having an indel at both the start and end of the bubble. The previous parameter values used in the Path algorithm were set so that such haplotype comparisons would result in either the above alignment or the following alignment depending on exactly how many GA units were present in the bubble.
ref: ACAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
alt: TGTGTGTGTGTGTGACAGAGAGAGAGAGAGAGAGAGAGAGAGAGA
The number of elements could vary depending on how the graph was built, and resulted in real differences in the calls between BWA mem and BWA-SW calls. I added a few unit tests for this case, and found a set of SW parameter values with lower gap-extension penalties that significantly favor the first alignment, which is the right thing to do, as we really don't mind large indels in the haplotypes relative to having lots of mismatches.
-- Expanded the unit tests in both SW and KBestPaths to look at complex events like this, and to check as well somewhat systematically that we are finding many types of expected mutational events.
-- Verified that this change doesn't alter our calls on 20:10,000,000-11,000,000 at all
General code cleanup
--------------------
-- Move Smith-Waterman to its own package in utils
-- Refactored out SWParameters class in SWPairwiseAlignment, and made constructors take either a named parameter set or a Parameter object directly. Deprecated old call to inline constants. This makes it easier to group all of the SW parameters into a single object for callers
-- Update users of SW code to use new Parameter class
-- Also moved haplotype bam writers to protected so they can use the Path SW parameter, which is protected
-- Removed the storage of the SW scoring matrix in SWPairwiseAligner by default. Only the SWPairwiseAlignmentMain test program needs this, so added a gross protected static variable that enables its storage
-- Ensure that BQSR works properly for an Ion Torrent BAM. (Added integration test and bam)
-- Improve the error message when an unknown platform is found (integration test added)
-- When the alignments are sufficiently far apart from each other, all the scores in the SW matrix could be negative, which screwed up the max score calculation since it started at zero.
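The max-score problem in the last item is the classic pitfall of seeding a maximum search with zero. A minimal sketch with hypothetical names (not the Smith-Waterman code itself) that seeds the search with a real score instead:

```java
final class MaxScoreSketch {
    /** Returns the index of the largest score, correct even when every score is negative. */
    static int indexOfMax(int[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("no scores to search");
        }
        int bestIndex = 0;
        int best = scores[0]; // seed with an actual score rather than 0
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > best) {
                best = scores[i];
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

Had the search started from 0, an all-negative row such as {-7, -3, -9} would never update the running maximum and no valid cell would be selected.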
Problem:
--------
Print Reads was running out of disk space when using the -BQSR option even for small bam files
Solution:
---------
Configure setupWriter to expect pre sorted reads
-- Add a maximum per sample and overall maximum number of reads held in memory by the ART at any one time. Does this in a new TAROrderedReadCache data structure that uses a reservoir downsampler to limit the total number of reads to a constant amount. This constant is set to be by default 3000 reads * nSamples to a global maximum of 1M reads, all controlled via the ActiveRegionTraversalParameters annotation.
-- Added an integration test and associated excessively covered BAM excessiveCoverage.1.121484835.bam (private/testdata) that checks that the system is operating correctly.
-- #resolves GSA-921
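For readers unfamiliar with reservoir downsampling, the idea behind the read cap can be sketched as below. This is textbook reservoir sampling (Algorithm R) with hypothetical names, not the TAROrderedReadCache or ReservoirDownsampler code; only the 3000 * nSamples and 1M figures come from the commit message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class ReservoirSketch<T> {
    private final int capacity;
    private final Random random;
    private final List<T> reservoir = new ArrayList<>();
    private long seen = 0;

    ReservoirSketch(int capacity, long seed) {
        this.capacity = capacity;
        this.random = new Random(seed);
    }

    /** Default cap described above: 3000 reads per sample, up to 1M reads overall. */
    static int defaultCapacity(int nSamples) {
        return (int) Math.min(3000L * nSamples, 1_000_000L);
    }

    /** Every item seen so far ends up in the reservoir with equal probability. */
    void add(T item) {
        seen++;
        if (reservoir.size() < capacity) {
            reservoir.add(item);
            return;
        }
        long slot = (long) (random.nextDouble() * seen); // uniform in [0, seen)
        if (slot < capacity) {
            reservoir.set((int) slot, item);
        }
    }

    int size() {
        return reservoir.size();
    }
}
```

Memory stays bounded by the capacity no matter how deep the coverage is, which is the property the excessive-coverage integration test exercises.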
-- This method provides client with the current number of elements, without having to retrieve the underlying list<T>. Added unit tests for LevelingDownsampler and ReservoirDownsampler as these are the only two complex ones. All of the others are trivially obviously correct.
-- The function getReducedCounts() was returning the undecoded reduced read tag, which looks like [10, 5, -1, -5] when the depths were [10, 15, 9, 5]. The only function that actually gave the real counts was getReducedCount(int i) which did the proper decoding. Now GATKSAMRecord decodes the tag into the proper depths vector so that getReducedCounts() returns what one reasonably expects it to, and getReducedCount(i) merely looks up the value at i. Added unit test to ensure this behavior going forward.
-- Changed the name of setReducedCounts() to setReducedCountsTag as this function assumes that counts have already been encoded in the tag way.
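The offset encoding in the example above ([10, 5, -1, -5] for depths [10, 15, 9, 5]) can be sketched as follows. The helper names are hypothetical and this is not the GATKSAMRecord code: it treats the tag as plain signed bytes and ignores any range handling the real encoding performs.

```java
final class ReducedCountsSketch {
    /** tag[0] is the first depth; every later entry is an offset from tag[0]. */
    static int[] decode(byte[] tag) {
        int[] depths = new int[tag.length];
        if (tag.length == 0) {
            return depths;
        }
        depths[0] = tag[0];
        for (int i = 1; i < tag.length; i++) {
            depths[i] = tag[0] + tag[i];
        }
        return depths;
    }

    /** Inverse of decode; assumes every value and offset fits in a signed byte. */
    static byte[] encode(int[] depths) {
        byte[] tag = new byte[depths.length];
        if (depths.length == 0) {
            return tag;
        }
        tag[0] = (byte) depths[0];
        for (int i = 1; i < depths.length; i++) {
            tag[i] = (byte) (depths[i] - depths[0]);
        }
        return tag;
    }
}
```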
-- The previous likelihood calculation proceeds as normal, but after each read has been evaluated against each haplotype we go through the read / allele / likelihoods map and eliminate all reads that have poor fit to any of the haplotypes. This functionality stops us from making a particular type of error in the HC, where we have a haplotype that's very far from the reference allele but not the right true haplotype. All of the reads that are slightly closer to this FP haplotype than the reference previously generated enormous likelihoods in favor of this FP haplotype because they were closer to it than the reference, even if each read had many mismatches w.r.t. the FP haplotype (and so the FP haplotype was a bad model for the true underlying haplotype).
-- Trims down active regions and associated reads and haplotypes to a smaller interval based on the events actually in the haplotypes within the original active region (without extension). Radically speeds up calculations when using large active region extensions. The ActiveRegion.trim algorithm does the best job it can of trimming an active region down to a requested interval while ensuring the resulting active region has a region (and extension) no bigger than the original while spanning as much of the requested extent as possible. The trimming results in an active region that is a subset of the previous active region based on the position and types of variants found among the haplotypes
-- Retire error corrector, archive old code and repurpose subsystem into a general kmer counter. The previous error corrector was just broken (conceptually) and was disabled by default in the engine. Now turning on error correction throws a UserException. Old part of the error corrector that counts kmers was extracted and put into KMerCounter.java
-- Add final simplify graph call after we prune away the non-reference paths in DeBruijnAssembler
-- These events always occur on the very edge of the haplotypes, and are intrinsically dodgy. So instead of emitting them and then potentially having to deal with merging real basepair events into them we just no longer emit those events.
-- Moved R^2 LD haplotype merging system to the utils.haplotype package
-- New LD merging only enabled with HC argument.
-- EventExtractor and EventExtractorUnitTest refactors so we can test the block substitution code without having to enable it via a static variable
-- A few misc. bug fixes in LDMerger itself
-- Refactoring of Haplotype event splitting and merging code
-- Renamed EventExtractor to EventMap
-- EventMap has a static method that computes the event maps among n haplotypes
-- Refactor Haplotype score and base comparators into their own classes and unit tested them
-- Refactored R^2 based LD merging code into its own class HaplotypeR2Calculator and unit tested much of it.
-- LDMerger now uses the HaplotypeR2Calculator, which cleans up the code a bunch and allowed me to easily test that code with a MockHaplotypeR2Calculator. For those who haven't seen this testing idiom, have a look; it's very useful
-- New algorithm uses a likelihood-ratio test to compute the probability that only the phased haplotypes exist in the population.
-- Fixed fundamental bug in the way the previous R^2 implementation worked
-- Optimizations for HaplotypeLDCalculator: only compute the per sample per haplotype summed likelihoods once, regardless of how many calls there are
-- Previous version would enter infinite loop if it merged two events but the second event had other low likelihood events in other haplotypes that didn't get removed. Now when events are removed they are removed from all event maps, regardless of whether the haplotypes carry both events
-- Bugfixes for EventMap in the HaplotypeCaller as well. Previous version was overly restrictive, requiring that the first event to be made into a block substitution was a snp. In some cases we need to merge an insertion with a deletion, such as when the cigar is 10M2I3D4M. The new code supports this. UnitTested and documented as well. LDMerger handles the case where merging two alleles results in a no-op event. Merging CA/C + A/AA -> CAA/CAA -> no op. Handles this case by removing the two events. UnitTested
-- Turn off debugging output for the LDMerger in the HaplotypeCaller unless -debug was enabled
-- This new version does a much more specific test (that's actually right). Here's the new algorithm:
* Compute probability that two variants are in phase with each other and that no
* compound hets exist in the population.
*
* Implemented as a likelihood ratio test of the hypothesis:
*
* x11 and x22 are the only haplotypes in the populations
*
* vs.
*
* all four haplotype combinations (x11, x12, x21, and x22) all exist in the population.
*
* Now, since we have to have both variants in the population, we exclude the x11 & x11 state. So the
* p of having just x11 and x22 is P(x11 & x22) + p(x22 & x22).
*
* Alternatively, we might have any configuration that gives us both 1 and 2 alts, which are:
*
* - P(x11 & x12 & x21) -- we have hom-ref and both hets
* - P(x22 & x12 & x21) -- we have hom-alt and both hets
* - P(x22 & x12) -- one haplotype is 22 and the other is het 12
* - P(x22 & x21) -- one haplotype is 22 and the other is het 21
-- This fixes edge case bugs where non-consolidated cigars are causing problems in users of the Haplotype object. Input arguments are now checked (let's see if we blow up)
Problem:
--------
PairHMM was generating positive likelihoods (even after the re-work of the model)
Solution:
---------
The caching indices were never re-initializing the initial conditions in the first position of the deletion matrix. Also the match matrix was being wrongly initialized (there is not necessarily a match in the first position). This commit fixes both issues on both the Logless and the Log10 versions of the PairHMM.
Summarized Changes:
------------------
* Redesign the matrices to have only 1 col/row of padding instead of 2.
* PairHMM class now owns the caching of the haplotype (keeps track of last haplotypes, and decides where the caching should start)
* Initial condition (in the deletionMatrix) is now updated every time the haplotypes differ in length (this was wrong in the previous version)
* Adjust the prior and probability matrices to be one based (logless)
* Update Log10PairHMM to work with prior and probability matrices as well
* Move prior and probability matrices to parent class
* Move and rename padded lengths to parent class to simplify interface and prevent off by one errors in new implementations
* Simple cleanup of PairHMMUnitTest class for a little speedup
* Updated HC and UG integration test MD5's because of the new initialization (without enforcing match on first base).
* Create static indices for the transition probabilities (for better readability)
[fixes#47399227]
* As reported here: http://gatkforums.broadinstitute.org/discussion/comment/4270#Comment_4270
* This was a commit into the variant.jar; the changes here are a rev of that jar and handling of errors in VF
* Added integration test to confirm failure with User Error
* Removed illegal header line in KB test VCF that was causing related tests to fail.
* Very trivial, but I happened to see this code and it drove me nuts so I felt compelled to refactor it.
* Instead of iterating over keys in map to get the values, just iterate over the values...
-- When consecutive intervals were within the bandpass filter size the ActiveRegion traversal engine would create
duplicate active regions.
-- Now when flushing the activity profile after we jump to a new interval we remove the extra states which are outside
of the current interval.
-- Added integration test which ensures that the output VCF contains no duplicate records. Was failing test before this commit.
-A UserException is now thrown if either the fai or dict file for the
reference does not exist, with pointers to instructions for creating
these files.
-Gets rid of problematic file locking that was causing intermittent
errors on our farm.
-Integration tests to verify that correct exceptions are thrown in
the case of a missing fai / dict file.
GSA-866 #resolve
-The algorithm for finding the intersection of two sets of intervals
relies on the sortedness of the intervals within each set, but the engine
was not sorting the intervals before attempting to find the intersection.
-The result was that if one or both interval lists was unsorted / lexicographically
sorted, we would often fail to find the intersection correctly.
-Now the IntervalBinding sorts all sets of intervals before returning them,
solving the problem.
-Added an integration test for this case.
GSA-909 #resolve
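The dependence on sortedness can be seen in a minimal two-pointer intersection such as the sketch below. It is not the IntervalBinding code: the names are hypothetical, intervals are plain {start, end} pairs on a single contig, and each input set is assumed to contain no overlapping intervals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class IntervalIntersectionSketch {
    /** Intervals are {start, end}, inclusive. Inputs may arrive in any order. */
    static List<int[]> intersect(int[][] a, int[][] b) {
        int[][] x = sortedCopy(a);
        int[][] y = sortedCopy(b);
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        // The sweep only advances forward, so it is correct only on sorted input.
        while (i < x.length && j < y.length) {
            int start = Math.max(x[i][0], y[j][0]);
            int end = Math.min(x[i][1], y[j][1]);
            if (start <= end) {
                out.add(new int[]{start, end});
            }
            if (x[i][1] < y[j][1]) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }

    private static int[][] sortedCopy(int[][] intervals) {
        int[][] copy = intervals.clone();
        Arrays.sort(copy, Comparator.comparingInt((int[] iv) -> iv[0]));
        return copy;
    }
}
```

Without the sortedCopy step, an input such as {100-200, 10-20} intersected with {15-120} would skip past 10-20 and miss the 15-20 overlap, which is the kind of failure described above.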
-- Sometimes it's desirable to specify a set of "good" regions and filter out other stuff (like say an alignability mask or a "good regions" mask). But by default, the -mask argument in VF will only filter sites inside a particular mask. New argument -filterNotInMask will reverse default logic and filter outside of a given mask.
-- Added integration test, and made sure we also test with a BED rod.
The Problem:
------------
The SAM spec does not allow multiple @PG tags with the same id. Our @PG tag writing routines were allowing that to happen with the boolean parameter "keep_all_pg_records".
How this fixes it:
------------------
This commit removes that option from all the utility functions and cleans up the code around the classes that used these methods off-spec.
Summarized changes:
-------------------
* Remove keep_all_pg_records option from setupWriter utility methods in Util
* Update all walkers to now replace the last @PG tag of the same walker (if it already exists)
* Cleanup NWaySamFileWriter now that it doesn't need to keep track of the keep_all_pg_records variable
* Simplify the multiple implementations to setupWriter
Bamboo:
-------
http://gsabamboo.broadinstitute.org/browse/GSAUNSTABLE-PARALLEL31
Issue Tracker:
--------------
[fixes 47100885]
-- Corrected logic to pick biallelic vc to left align.
-- Added integration test to make sure this feature is tested and feature to trim bases is also tested.
The current implementation of the PairHMM had issues with the probabilities and the state machines. Probabilities were not adding up to one because:
# Initial conditions were not being set properly
# Emission probabilities in the last row were not adding up to 1
The following commit fixes both by
# averaging all potential start locations (giving an equal prior to the state machine in its first iteration -- allowing the read to start its alignment anywhere in the haplotype with equal probability)
# discounting all paths that end in deletions by not adding the last row of the deletion matrix and summing over all paths ending in matches and insertions (this saves us from a fourth matrix to represent the end state)
Summarized changes:
* Fix LoglessCachingPairHMM and Log10PairHMM according to the new algorithm
* Refactor probabilities check to throw exception if we ever encounter probabilities greater than 1.
* Rename LoglessCachingPairHMM to LoglessPairHMM (this is the default implementation in the HC now)
* Rename matrices to matchMatrix, insertionMatrix and deletionMatrix for clarity
* Rename metric lengths to read and haplotype lengths for clarity
* Rename private methods to initializePriors (distance) and initializeProbabilities (constants) for clarity
* Eliminate first row constants (because they're not used anyway!) and directly assign initial conditions in the deletionMatrix
* Remove unnecessary parameters from updateCell()
* Fix the expected probabilities coming from the exact model in PairHMMUnitTest
* Neatify PairHMM class (removed unused methods) and PairHMMUnitTest (removed unused variables)
* Update MD5s: Probabilities have changed according to the new PairHMM model and as expected HC and UG integration tests have new MD5s.
[fix 47164949]
-- Added ability to trim common bases in front of indels before left-aligning. Otherwise, records may not be left-aligned if they have common bases, as they will be mistaken for complex records.
-- Added ability to split multiallelic records and then left align them, otherwise we miss a lot of good left-alignable indels.
-- Motivated by this, renamed walker to LeftAlignAndTrimVariants.
-- Code refactoring, cleanup and bring up to latest coding standards.
-- Added unit testing to make sure left alignment is performed correctly for all offsets.
-- Changed phase 3 HC script to new syntax. Add command line options, more memory and reduce alt alleles because jobs keep crashing.
Currently, the multi-allelic test is covering the following case:
Eval A T,C
Comp A C
reciprocate this so that the reverse can be covered.
Eval A C
Comp A T,C
And furthermore, modify ConcordanceMetrics to more properly handle the situation where multiple alternate alleles are available in the comp. It was possible for an eval C/C sample to match a comp T/T sample, so long as the C allele were also present in at least one other comp sample.
This comes from the fact that "truth" reference alleles can be paired with *any* allele also present in the truth VCF, while truth het/hom var sites are restricted to having to match only the alleles present in the genotype. The reason that truth ref alleles are special case is as follows, imagine:
Eval: A G,T 0/0 2/0 2/2 1/1
Comp: A C,T 0/0 1/0 0/0 0/0
Even though the alt allele of the comp is a C, the assessment of genotypes should be as follows:
Sample1: ref called ref
Sample2: alleles don't match (the alt allele of the comp was not assessed in eval)
Sample3: ref called hom-var
Sample4: alleles don't match (the alt allele of the eval was not assessed in comp)
Before this change, Sample2 was evaluated as "het called het" (as the T allele in eval happens to also be in the comp record, just not in the comp sample). Thus: apply current
logic to comp hom-refs, and the more restrictive logic ("you have to match an allele in the comp genotype") when the comp is not reference.
Also in this commit, major refactoring and testing for MathUtils. A large number of methods were not used at all in the codebase; these methods were removed:
- dotProduct(several types). logDotProduct is used extensively, but not the real-space version.
- vectorSum
- array shuffle, random subset
- countOccurances (general forms, the char form is used in the codebase)
- getNMaxElements
- array permutation
- sorted array permutation
- compare floats
- sum() (for integer arrays and lists).
The final keyword was added extensively throughout MathUtils.
The ratio() and percentage() methods were revised to error out with non-positive denominators, except in the case of 0/0, which returns 0.0 for ratio and 0.0% for percentage. Random sampling code was updated to make use of the cleaner implementations of generating permutations in MathUtils (allowing the array permutation code to be retired).
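A minimal sketch of the revised ratio()/percentage() behavior, as described above. The parameter types, the numeric return type for percentage, and the exception class are assumptions made for illustration; the real MathUtils signatures may differ.

```java
/** Illustrative sketch only; signatures and exception type are assumed, not taken from MathUtils. */
final class RatioSketch {

    /** Returns numerator/denominator, 0.0 for 0/0, and errors out for any other non-positive denominator. */
    static double ratio(final long numerator, final long denominator) {
        if (denominator > 0) {
            return (double) numerator / denominator;
        }
        if (numerator == 0 && denominator == 0) {
            return 0.0; // the one tolerated special case
        }
        throw new IllegalArgumentException(
                "Non-positive denominator: " + numerator + "/" + denominator);
    }

    /** Same rules as ratio(), expressed as a percentage (0/0 gives 0.0). */
    static double percentage(final long part, final long total) {
        return 100.0 * ratio(part, total);
    }
}
```

Under these assumptions, ratio(0, 0) returns 0.0 while ratio(1, 0) throws, so a silently wrong denominator surfaces immediately.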
The PaperGenotyper still made use of one of these array methods; since it was the only walker that did, the method was migrated into the genotyper itself.
In addition, more extensive tests were added for
- logBinomialCoefficient (Newton's identity should always hold)
- logFactorial
- log10sumlog10 and its approximation
All unit tests pass
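As an illustration of the kind of check listed above for logBinomialCoefficient, here is a sketch that assumes "Newton's identity" refers to the binomial expansion (1+z)^m = sum over k of C(m,k) z^k. All helpers are written from scratch for this example and are stand-ins, not the MathUtils implementations.

```java
/** Stand-alone sketch of a binomial-expansion check in log10 space; helpers are NOT the MathUtils code. */
final class BinomialIdentitySketch {

    /** log10(n!) computed by direct summation. */
    static double log10Factorial(final int n) {
        double sum = 0.0;
        for (int i = 2; i <= n; i++) {
            sum += Math.log10(i);
        }
        return sum;
    }

    /** log10 of "n choose k". */
    static double log10BinomialCoefficient(final int n, final int k) {
        return log10Factorial(n) - log10Factorial(k) - log10Factorial(n - k);
    }

    /** log10 of the sum of 10^v over all v, stabilized by factoring out the maximum. */
    static double log10SumLog10(final double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (final double v : values) {
            max = Math.max(max, v);
        }
        double sum = 0.0;
        for (final double v : values) {
            sum += Math.pow(10.0, v - max);
        }
        return max + Math.log10(sum);
    }

    /** Checks log10(sum_k C(m,k) z^k) against m * log10(1 + z); requires z > 0. */
    static boolean identityHolds(final int m, final double z, final double tolerance) {
        final double[] terms = new double[m + 1];
        for (int k = 0; k <= m; k++) {
            terms[k] = log10BinomialCoefficient(m, k) + k * Math.log10(z);
        }
        final double expected = m * Math.log10(1.0 + z);
        return Math.abs(log10SumLog10(terms) - expected) < tolerance;
    }
}
```

A check of this form exercises the binomial coefficient and the log-space summation together, since an error in either one breaks the identity.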
-- This new functionality allows the client to make decisions about how to handle non-informative reads, rather than having a single enforced constant that isn't really appropriate for all users. The previous functionality is maintained now and used by all of the updated pieces of code, except the BAM writers, which now emit reads to display to their best allele, regardless of whether this is particularly informative or not. That way you can see all of your data realigned to the new HC structure, rather than just those that are specifically informative.
-- This all makes me concerned that the informative thresholding isn't appropriately used in the annotations themselves. There are many cases where nearby variation makes specific reads non-informative about one event because they are not informative about a second. For example, suppose you have two SNPs A/B and C/D that are in the same active region but separated by more than the read length of the reads. All reads would be non-informative, since each read spans only a single event and so none provides information about the full combination of 4 haplotypes. In this case our annotations will all fall apart, returning their default values. Added a JIRA to address this (should be discussed in group meeting).
* It is now cleaner and easier to test; added tests for newly implemented methods.
* Many fixes to the logic to make it work
* The most important change was that after triggering het compression we actually need to back it out if it
creates reads that incorporate too many softclips at any one position (because they get unclipped).
* There was also an off-by-one error in the general code that only manifested itself with het compression.
* Removed support for creating a het consensus around deletions (which was broken anyways).
* Mauricio gave his blessing for this.
* Het compression now works only against known sites (with -known argument).
* The user can pass in one or more VCFs with known SNPs (other variants are ignored).
* If no known SNPs are provided het compression will automatically be disabled.
* Added SAM tag to stranded (i.e. het compressed) reduced reads to distinguish their
strandedness from normal reduced reads.
* GATKSAMRecord now checks for this tag when determining whether or not the read is stranded.
* This allows us to update the FisherStrand annotation to count het compressed reduced reads
towards the FS calculation.
* [It would have been nice to mark the normal reads as unstranded but then we wouldn't be
backwards compatible.]
* Updated integration tests accordingly with new het compressed bams (both for RR and UG).
* In the process of fixing the FS annotation I noticed that SpanningDeletions wasn't handling
RR properly, so I fixed it too.
* Also, the test in the UG engine for determining whether there are too many overlapping
deletions is updated to handle RR.
* I added a special hook in the RR integration tests to additionally run the systematic
coverage checking tool I wrote earlier.
* AssessReducedCoverage is now run against all RR integration tests to ensure coverage is
not lost from original to reduced bam.
* This helped uncover a huge bug in the MultiSampleCompressor where it would drop reads
from all but 1 sample (now fixed).
* AssessReducedCoverage moved from private to protected for packaging reasons.
* #resolve GSA-639
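The known-sites gating described in the list above (only SNPs are used, and het compression is disabled when none are supplied) might be sketched as follows. The Variant class, the length-based SNP test, and all method names are hypothetical simplifications, not ReduceReads code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of gating het compression on known SNPs; not the ReduceReads implementation. */
final class KnownSnpSketch {

    /** Minimal stand-in for a VCF record. */
    static final class Variant {
        final int position;
        final String ref;
        final List<String> alts;

        Variant(final int position, final String ref, final String... alts) {
            this.position = position;
            this.ref = ref;
            this.alts = new ArrayList<>(Arrays.asList(alts));
        }
    }

    /** Simplified SNP test: single-base ref and at least one alt, all single-base. */
    static boolean isSnp(final Variant v) {
        if (v.ref.length() != 1 || v.alts.isEmpty()) {
            return false;
        }
        for (final String alt : v.alts) {
            if (alt.length() != 1) {
                return false;
            }
        }
        return true;
    }

    /** Collects positions of known SNPs; every other variant type is ignored. */
    static Set<Integer> knownSnpPositions(final List<Variant> known) {
        final Set<Integer> positions = new TreeSet<>();
        for (final Variant v : known) {
            if (isSnp(v)) {
                positions.add(v.position);
            }
        }
        return positions;
    }

    /** Het compression is only enabled when at least one known SNP was provided. */
    static boolean hetCompressionEnabled(final Set<Integer> snpPositions) {
        return !snpPositions.isEmpty();
    }
}
```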
At this point, this commit encompasses most of what is needed for het compression to go live.
There are still a few TODO items that I want to get in before the 2.5 release, but I will save
those for a separate branch because as it is I feel bad for the person who needs to review all
these changes (sorry, Mauricio).
-- added calls to representativeCount() of the pileup instead of using ++
-- renamed CallableLoci integration test
-- added integration test for reduce read support on callable loci
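To illustrate why representativeCount() replaces ++ when counting over a pileup: a reduced read can stand in for several original reads, so each pileup element should contribute its representative count rather than 1. The Element class below is a hypothetical stand-in for the real pileup element type.

```java
import java.util.List;

/** Sketch of depth counting over a pileup that may contain reduced reads; types are hypothetical. */
final class DepthSketch {

    /** Stand-in for a pileup element; a normal read has a representative count of 1. */
    static final class Element {
        private final int representativeCount;

        Element(final int representativeCount) {
            this.representativeCount = representativeCount;
        }

        int representativeCount() {
            return representativeCount;
        }
    }

    /** Sums representative counts instead of incrementing once per element. */
    static int depth(final List<Element> pileup) {
        int depth = 0;
        for (final Element e : pileup) {
            depth += e.representativeCount(); // was effectively: depth++
        }
        return depth;
    }
}
```

With one normal read and one reduced read representing five originals, this gives a depth of 6, where a plain ++ would report 2.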
-- Previously we tried to include lots of these low mapping quality reads in the assembly and calling, but we were effectively just filtering them out anyway, at enormous computational expense and with much larger memory requirements. The new version simply uses a read filter to remove them up front. This causes no major problems -- at least, none that don't have other underlying causes -- when compared against 10-11mb of the KB
-- Update MD5s to reflect changes due to no longer including mmq < 20 by default
-- DeBruijnAssemblerUnitTest and AlignmentUtilsUnitTest were both in DEBUG = true mode (bad!)
-- Remove the maxHaplotypesToConsider feature of HC as it's not useful