Commit Graph

535 Commits (65d31ba4adfecb5cfa7efbb4e30e60c7a7975c71)

Author SHA1 Message Date
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00
David Roazen 8b29030467 Change default downsampling coverage target for the HaplotypeCaller to 250
-was previously set to 30, which seems far too aggressive given that with
 ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any
 pileup returned by LIBS

-250 is a more conservative default used by the UG

-can adjust down/up later based on further experiments (GSA-699 will
 remain open)

-verified with Ryan that all integration test differences are either
 innocent or represent an improvement

GSA-699
2013-02-26 09:33:25 -05:00
Ryan Poplin 89e2943dd1 The maximum kmer length is derived from the reads.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
2013-02-25 14:40:25 -05:00
Ryan Poplin 6a639c8ffc Replace Smith-Waterman alignment with the bubble traversal.
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
2013-02-22 15:42:16 -05:00
Mauricio Carneiro e3f01673e1 Implementation of the find and diagnose Queue script
-- Added 'uncovered intervals' output for FindCoveredIntervals
-- updated scala script to make use of it.
2013-02-22 10:19:01 -05:00
Ryan Poplin 62e14f5b58 Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores. 2013-02-21 14:34:17 -05:00
Eric Banks 6996a953a8 Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample).
These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons):
* Don't unnecessarily overload Allele.getBases() in the Haplotype class.
  * Haplotype.getBases() was calling clone() on the byte array.
* Added a constructor to Allele (and Haplotype) that takes in an Allele as input.
  * It makes a copy of he given allele without having to go through the validation of the bases (since the Allele has already been validated).
  * Rev'ed the variant jar accordingly.

For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.
2013-02-21 10:14:11 -05:00
Eric Banks 551d33686c Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761
Reduce memory footprint of SyntheticRead by replacing several Lists with...
2013-02-20 04:49:07 -08:00
Eric Banks 9dfdb9528b Merge pull request #49 from broadinstitute/gda_hidden_ug_args
Hide arguments related to reference sample operation in UG - for interna...
2013-02-19 16:18:32 -08:00
Eric Banks 0055a6f1cd Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774
Fix to the Indel Realigner bug described in GSA-774
2013-02-19 16:16:48 -08:00
Guillermo del Angel 5a0a9bc488 Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished. 2013-02-19 19:06:42 -05:00
Mauricio Carneiro 371ea2f24c Fixed IndelRealigner reference length bug (GSA-774)
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads

GSA-774 #resolve
2013-02-19 16:00:36 -05:00
Alec Wysoker ab75e053da Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static
class that contains the attributes that were scattered across the several Lists.
2013-02-19 15:33:33 -05:00
Ryan Poplin c025e84c8b Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping.
-- Updated a few integration tests for HC, UG, and UG general ploidy
2013-02-19 14:09:24 -05:00
Mark DePristo be45edeff2 ActivityProfile and ActiveRegions respects engine interval boundaries
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet.  Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
2013-02-18 10:40:25 -05:00
Ryan Poplin b7e9c342c7 Reducing the size of the reference padding in the HaplotypeCaller. 2013-02-17 11:09:00 -05:00
Mark DePristo 73a363b166 Update MD5s due to new QualityUtils calculations
-- Increase the allowed runtime of one UG integration test
-- The GGA indels mode runs two UG commands, and was barely under the 10 minute limit before.  Some updates can push this right over the edge.  Increased limit
-- CalibrateGenotypeLikelihoods runs on a small data set now, so it's faster
-- Updating MD5s due to more correct quality utils.  DuplicatesWalkers quality estimates have changed.  One UG test has different FS and rank sum tests because the conversion to phred scores are slightly (second decimal place) different
2013-02-16 07:31:38 -08:00
Mark DePristo 3231031c1a Bugfix for FisherStrand
-- FisherStrand pValues can sum to slightly greater than 1.0, so they need to be capped to convert to a Phred-scaled quality score
2013-02-16 07:31:38 -08:00
Mark DePristo 9a29d6d4be Fix an catastrophic bug (WoW!) in the reference calculation of the UG
-- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and was as a side effect other utils converted this to a meaningless 0.0.  This is all because there wasn't a unit test.
-- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.
2013-02-16 07:31:38 -08:00
Mark DePristo 9e28d1e347 Cleanup and unit tests for QualityUtils
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value.  Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual.  Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils.  Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
2013-02-16 07:31:37 -08:00
MauricioCarneiro d80b99143f Merge pull request #37 from broadinstitute/rp_left_alignment_hc_contract_GSA-771 2013-02-15 08:32:45 -08:00
MauricioCarneiro 1dd284a5bb Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720
PrintReads writes a header when used with -BQSR
2013-02-15 07:18:28 -08:00
MauricioCarneiro b58a0eca6b Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62
Refactored GATKDocs categories some more ( GSATDG-62 )
2013-02-14 22:35:07 -08:00
Tad Jordan 6cb80591e3 PrintReads writes a header when used with -BQSR 2013-02-14 22:19:14 -05:00
Guillermo del Angel b18f216033 Updated md5's from BiasedDownsamplerIntegrationTest that changed due to changes in HaplotypeCaller - changing HashMaps to LinkedHashMaps changed ordering of reads presented to BiasedDownSampler which changed reads chosen, thereby marginally changing PL's and some site info. 2013-02-14 20:18:49 -05:00
Ryan Poplin 871c8b3866 No need to consider haplotypes which Smith-Waterman aligns off the end of the large padded reference. 2013-02-14 11:18:10 -05:00
Geraldine Van der Auwera 6208742f7c Refactored GATKDocs categories some more ( GSATDG-62 )
-- Renamed ValidatePileup to CheckPileup since validation is reserved word
-- Renamed AlignmentValidation to CheckAlignment (same as above)
-- Refactored category definitions to use constants defined in HelpConstants
-- Fixed a couple of minor typos and an example error
-- Reorganized the GATKDocs index template to use supercategories
-- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)
2013-02-13 16:49:18 -05:00
depristo 357d196dad Merge pull request #32 from broadinstitute/yf_per-sample-downsampling_GSA_765
Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.
2013-02-13 10:08:11 -08:00
Yossi Farjoun 6d12e5a54f Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.
- got md5s from a interim version that does not have the per-sample downsampling hookedup
- added an integration test that forces the result from flat-downsampling to equal that which results from an equivalent flat contamination file
2013-02-13 12:49:39 -05:00
Guillermo del Angel 4308b27f8c Fixed non-determinism in HaplotypeCaller and some UG calls -
-- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads.
-- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast)
-- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller  (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling).
-- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test
-- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).
2013-02-12 15:43:29 -05:00
Geraldine Van der Auwera dff5ef562b Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details)
-- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools
-- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities
-- Reworded some category names and descriptions to be more explicit and user-friendly
2013-02-12 13:36:15 -05:00
Ryan Poplin 3f2f837b6a Optimization to ReadPosRankSumTest: Don't do the work of parsing through the cigar string for non-informative reads. 2013-02-11 11:36:09 -05:00
Mark DePristo b4417dff5b Updating MD5s due to changes in HMM
-- New HMM has two impacts on MD5s.  First, all indel calls with UG and all calls by HC no longer have the HaplotypeScore computed.  This is for the good, especially given the computational cost of this annotationa and unclear value for HC.  Second, the BaseQualityRankSum values are changing by tiny amounts because of the changes in the HMM likelihoods.
-- Disabled three tests from Yossi that cause strange MD5 differences with calls for HC, created a JIRA for him to enable and fix
-- Disabled the non-deterministic GGA test.  Assigned JIRA to Guillermo
-- With this push I expect all integration tests to pass
2013-02-09 19:19:28 -05:00
Mark DePristo 35139cf990 HaplotypeScore only annotates SNPs
-- The new HMM new edge conditions the likelihoods are offset by log10(n possible starts) so the results don't really mean "fits the haplotype well" any longer.  This results in grossly inflated HaplotypeScores for indels and with the HaplotypeCaller.  So I'm simply not going to emit this annotation value any longer for indels and for the HC
2013-02-09 19:19:28 -05:00
Mark DePristo e40d83f00e Final version of PairHMMs with correct edge conditions
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses.  This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length.  I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10.  This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM.  All unit tests from the original HMM implementations pass now
-- Added locs of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes.  Fixed bug.  Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit.  Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum.  This involved moving some initialize() code into the computeLikelihoods function.  That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
2013-02-09 19:19:22 -05:00
Mark DePristo 09595cdeb9 Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments 2013-02-09 13:06:54 -05:00
Mark DePristo 2d802e17a4 Delete the CachingPairHMM 2013-02-09 13:06:54 -05:00
Mark DePristo 7dcafe8b81 Preliminary version of LoglessCachingPairHMM that avoids positive likelihoods
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
2013-02-09 13:06:54 -05:00
Mark DePristo 7fb620dce7 Generalize and fixup ContigComparator
-- Now uses a SAMSequenceDictionary to do the comparison of contigs (which is the right way to do it)
-- Added unit tests
2013-02-09 09:52:13 -05:00
Mauricio Carneiro d004bfbe6f walker to calculate per base coverage distribution
-- Base distribution optionally includes deletions
-- Implemented an optional filtered coverage distribution option
-- Integration tests added for every feature of the traversal

This walker is specially fast for the task due to the ability to calculate uncovered bases without having to visit the loci. This capability should be made generic in the future for the advantage of DiagnoseTargets and DepthOfCoverage.
GSATDG-45 #resolve
2013-02-07 16:33:05 -05:00
Mauricio Carneiro 5f49c95cc1 Added distance across contigs calculation to GenomeLocs
-- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader)
-- unit test added
GSATDG-45
2013-02-07 16:31:41 -05:00
depristo cd4aec177a Merge pull request #20 from broadinstitute/aw_reduceread_perf_1_GSA-761
Aw reduceread perf 1 gsa 761
2013-02-07 12:11:05 -08:00
Eric Banks 9826192854 Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class!
* AlignmentUtils.getMismatchCount()
* AlignmentUtils.calcAlignmentByteArrayOffset()
* AlignmentUtils.readToAlignmentByteArray().
* AlignmentUtils.leftAlignIndel()
2013-02-07 13:04:24 -05:00
Alec Wysoker e88bc753aa Replace with map.containsKey followed by map.get with map.get followed by null check. 2013-02-07 11:58:41 -05:00
Alec Wysoker 72e496d6f3 Eliminate unnecessary zeroing out of primitive arrays immediately after new. 2013-02-07 11:57:43 -05:00
Eric Banks 481982202d Fixing the failing RR integration tests.
* After consulting Tim/David/Mauricio we determined that the md5 changes were due to different encodings of binary arrays in samjdk
   * However, it made no functional difference to the results (confirmed by Eric) so we agreed to update md5s
 * Also, the header of one of the test bams was malformed but old picard jar didn't perform checks so it only started failing now
   * Fixed the bam
2013-02-06 12:40:56 -05:00
Mark DePristo 59df329776 Fast path for biallelic variants in IndependentAllelesDiploidExactAFCalc
-- If the VariantContext is a bi-allelic variant already, don't split up the VC (it doesn't do anything) and then combine it back together.  This saves us a lot of work on average
-- Be more protective of calls to AFCalc with a VariantContext that might only have ref allele, throwing an exception
2013-02-06 10:34:09 -05:00
eitanbanks 584899329c Merge pull request #13 from broadinstitute/dr_variant_migration_GSA-692
Replace org.broadinstitute.variant with jar built from the Picard repo
2013-02-06 07:22:30 -08:00
Eric Banks 562f2406d7 Added check that BaseRecalibrator is not being run on a reduced bam.
- Throws user exception if it is.
 - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
 - Added test to check this is working.
 - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
 - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
 - Added comment about not changing the program record name, as per reviewer comments.
 - Removed unused variable.
2013-02-06 10:14:27 -05:00
Eric Banks 4e5ff3d6f1 Bug fix for NPE in HC with --dbsnp argument.
- I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null.
 - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.
2013-02-05 21:59:19 -05:00