Commit Graph

774 Commits (d8ca4d3e6df72de74b3e8c85d28c762feb94b45e)

Author SHA1 Message Date
Ryan Poplin 0cf5d30dac Bug fix in assembly for edge case in which the extendPartialHaplotype function was filling in deletions in the middle of haplotypes. 2013-03-15 14:20:25 -04:00
Ryan Poplin b8991f5e98 Fix for edge case bug of trying to create insertions/deletions on the edge of contigs.
-- Added integration test using MT that previously failed
2013-03-15 12:32:13 -04:00
Mark DePristo 2d35065238 QualityByDepth remaps QD values > 40 to a gaussian around 30
-- This is a temporarily fix / hack to deal with the very high QD values that are generated by the haplotype caller when nearby events occur within reads.  In that case, the QUAL field can be many fold higher than normal, and results in an inflated QD value.  This hack projects such high QD values back into the good range (as these are good variants in general) so they aren't filtered away by VQSR.
-- The long-term solution to this problem is to move the HaplotypeCaller to the full bubble calling algorithm
-- Update md5s
2013-03-14 16:09:41 -04:00
droazen 0fd9f0e77c Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837
Fixed the logic of the @Output annotation and its interaction with 'required'
2013-03-14 12:52:00 -07:00
Ryan Poplin 38914384d1 Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the simplified stats for AssessNA12878. 2013-03-14 14:44:18 -04:00
Geraldine Van der Auwera 61349ecefa Cleaned up annotations
- Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private
  - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental
  - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message
  - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives
  - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden
  - Removed unnecessary check in AverageAltAlleleLength
2013-03-14 14:26:48 -04:00
Eric Banks 7cab709a88 Fixed the logic of the @Output annotation and its interaction with 'required'.
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:

I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
  * The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided.
  * The logic for @Output is now:
    * if required==true then -o MUST be provided or a User Error is generated.
    * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
      * this is the default behavior (i.e. @Output with no modifiers).
    * if required==false and defaultToStdout==false then the output object is null.
      * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).

  * I have updated walkers so that previous behavior has been maintained (as best I could).
    * In general, all @Outputs with default long/short names have required=false.
    * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
  * I added unit tests for @Output changes with David's help (thanks!).
  * #resolve GSA-837
2013-03-14 11:58:51 -04:00
Mark DePristo b5b63eaac7 New GATKSAMRecord concept of a strandless read, update to FS
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads.  Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands.  This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called.  Added new GATKSAMRecord method setReducedCounts() that does the right thing.  Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation.  Differences are just minor updates to the FS
2013-03-13 11:16:36 -04:00
MauricioCarneiro 4403e3572a Merge pull request #94 from broadinstitute/gg_gatkdoc_docfixes_GSATDG-111 2013-03-12 13:02:35 -07:00
MauricioCarneiro 3a16ba04d4 Merge pull request #97 from broadinstitute/eb_refactor_sliding_window
Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug
2013-03-12 12:27:26 -07:00
Geraldine Van der Auwera f972963918 Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
2013-03-12 10:57:14 -04:00
Eric Banks 05e69b6294 Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug.
* Allow RR to write its BAM to stdout by setting required=true for @Output.
  * Fixed bug in sliding window where a break in coverage after a long stretch without
     a variant region was causing a doubling of all the reads before the break.
  * Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods.
  * Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().
2013-03-12 09:06:55 -04:00
Ryan Poplin c96fbcb995 Use the indel heterozygosity prior when calling indels with the HC 2013-03-11 14:12:43 -04:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
Eric Banks 3759d9dd67 Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine.
* ReadTransformers can say they must be first, must be last, or don't care.
  * By default, none of the existing ones care about ordering except BQSR (must be first).
    * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
  * The engine now orders the read transformers up front before applying iterators.
  * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
  * Added unit tests.
2013-03-06 12:38:59 -05:00
Eric Banks 78721ee09b Added new walker to split MNPs into their allelic primitives (SNPs).
* Can be extended to complex alleles at some point.
  * Currently only works for bi-allelics (documented).
  * Added unit and integration tests.
2013-03-05 23:16:42 -05:00
Eric Banks bbbaf9ad20 Revert push from stable (I forgot that pushing from stable overwrites current unstable changes) 2013-03-05 09:06:02 -05:00
Eric Banks a037423225 Merged bug fix from Stable into Unstable 2013-03-05 09:03:48 -05:00
Eric Banks 7e1bfd6a7c Included an accidental change from unstable into the previous push 2013-03-05 09:03:31 -05:00
Eric Banks bd4e4f4ee3 Merged bug fix from Stable into Unstable 2013-03-04 23:24:44 -05:00
Eric Banks b715218bfe Fix for mismatching indel quals erro: need to adjust for softclips just like we do for bases and normal quals. 2013-03-04 23:23:18 -05:00
Ryan Poplin ce7554e9d6 Merged bug fix from Stable into Unstable 2013-03-04 12:36:04 -05:00
Ryan Poplin 0697594778 Active regions that don't contain any usable reads should just be skipped over instead of throwing an IllegalStateException. 2013-03-04 12:35:40 -05:00
Mark DePristo 42d3919ca4 Expanded functionality for writing BAMs from HaplotypeCaller
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment.  Refactored out the big main and supplementary static methods.  Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode.  This test only ensures that the code runs and doesn't exception out.  It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype.  Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made.  Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
2013-03-03 12:07:29 -05:00
David Roazen c5c99c8339 Split long-running integration test classes into multiple classes
This is to facilitate the current experiment with class-level test
suite parallelism. It's our hope that with these changes, we can get
the runtime of the integration test suite down to 20 minutes or so.

-UnifiedGenotyper tests: these divided nicely into logical categories
 that also happened to distribute the runtime fairly evenly

-UnifiedGenotyperPloidy: these had to be divided arbitrarily into two
 classes in order to halve the runtime

-HaplotypeCaller: turns out that the tests for complex and symbolic
 variants make up half the runtime here, so merely moving these into
 a separate class was sufficient

-BiasedDownsampling: most of these tests use excessively large intervals
 that likely can't be reduced without defeating the goals of the tests. I'm
 disabling these tests for now until they can either be redesigned to use smaller
 intervals around the variants of interest, or refactored into unit tests
 (creating a JIRA for Yossi for this task)
2013-03-01 13:55:23 -05:00
depristo cac3f80c64 Merge pull request #73 from broadinstitute/eb_remove_nested_hashmap_GSA-732
Replace uses of NestedHashMap with NestedIntegerArray.
2013-02-28 05:19:56 -08:00
Eric Banks d2904cb636 Update docs for RTC. 2013-02-27 14:56:44 -05:00
Eric Banks 69b8173535 Replace uses of NestedHashMap with NestedIntegerArray.
* Removed from codebase NestedHashMap since it is unused and untested.
 * Integration tests change because the BQSR CSV is now sorted automatically.
 * Resolves GSA-732
2013-02-27 14:03:39 -05:00
Alec Wysoker c8368ae2a5 Eliminate 7-element arrays in BaseCounts and BaseAndQualsCount and replace with in-line primitive attributes. This is ugly but reduces heap overhead, and changes are localized. When used in conjunction with Mauricio's FastUtil changes it saves and additional 9% or so of execution time. 2013-02-27 12:49:56 -05:00
David Roazen 6466463d5a Merged bug fix from Stable into Unstable 2013-02-26 21:54:54 -05:00
David Roazen 12a3d7ecad Fix licenses on files modified in 2.4-1 2013-02-26 21:53:17 -05:00
David Roazen a53b4a7521 Merged bug fix from Stable into Unstable 2013-02-26 21:41:13 -05:00
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00
depristo 93205154b5 Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776
Eb fix pairhmm unittest gsa 776
2013-02-26 11:56:58 -08:00
Eric Banks 734353e9df Merge pull request #60 from broadinstitute/mc_fastutil_GSATDG-83
Brought all of ReduceReads to fastutils
2013-02-26 11:56:41 -08:00
David Roazen 8b29030467 Change default downsampling coverage target for the HaplotypeCaller to 250
-was previously set to 30, which seems far too aggressive given that with
 ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any
 pileup returned by LIBS

-250 is a more conservative default used by the UG

-can adjust down/up later based on further experiments (GSA-699 will
 remain open)

-verified with Ryan that all integration test differences are either
 innocent or represent an improvement

GSA-699
2013-02-26 09:33:25 -05:00
Eric Banks 396b7e0933 Fixed the intermittent PairHMM unit test failure.
The issue here is that the OptimizedLikelihoodTestProvider uses the same basic underlying class as the
BasicLikelihoodTestProvider and we were using the BasicTestProvider functionality to pull out tests of
that class; so if the optimized tests were run first we were unintentionally running those same tests
again with the basic ones (but expecting different results).
2013-02-25 15:05:13 -05:00
Eric Banks 7519484a38 Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods. 2013-02-25 15:04:23 -05:00
Ryan Poplin 89e2943dd1 The maximum kmer length is derived from the reads.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
2013-02-25 14:40:25 -05:00
Mauricio Carneiro 0ff3343282 Addressing Eric's comments
-- added @param docs to the new variables
-- made all variables final
-- switched to string builder instead of String for performance.

GSATDG-83
2013-02-25 13:33:47 -05:00
Mauricio Carneiro 9e5a31b595 Brought all of ReduceReads to fastutils
-- Added unit tests to ReduceReads name compression
-- Updated reduce reads walker for unit testing

GSATDG-83
2013-02-23 22:53:23 -05:00
Ryan Poplin 6a639c8ffc Replace Smith-Waterman alignment with the bubble traversal.
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
2013-02-22 15:42:16 -05:00
Mauricio Carneiro e3f01673e1 Implementation of the find and diagnose Queue script
-- Added 'uncovered intervals' output for FindCoveredIntervals
-- updated scala script to make use of it.
2013-02-22 10:19:01 -05:00
Ryan Poplin 62e14f5b58 Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores. 2013-02-21 14:34:17 -05:00
Eric Banks 6996a953a8 Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample).
These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons):
* Don't unnecessarily overload Allele.getBases() in the Haplotype class.
  * Haplotype.getBases() was calling clone() on the byte array.
* Added a constructor to Allele (and Haplotype) that takes in an Allele as input.
  * It makes a copy of he given allele without having to go through the validation of the bases (since the Allele has already been validated).
  * Rev'ed the variant jar accordingly.

For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.
2013-02-21 10:14:11 -05:00
Eric Banks 551d33686c Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761
Reduce memory footprint of SyntheticRead by replacing several Lists with...
2013-02-20 04:49:07 -08:00
Eric Banks 9dfdb9528b Merge pull request #49 from broadinstitute/gda_hidden_ug_args
Hide arguments related to reference sample operation in UG - for interna...
2013-02-19 16:18:32 -08:00
Eric Banks 0055a6f1cd Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774
Fix to the Indel Realigner bug described in GSA-774
2013-02-19 16:16:48 -08:00
Guillermo del Angel 5a0a9bc488 Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished. 2013-02-19 19:06:42 -05:00
Mauricio Carneiro 371ea2f24c Fixed IndelRealigner reference length bug (GSA-774)
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads

GSA-774 #resolve
2013-02-19 16:00:36 -05:00
Alec Wysoker ab75e053da Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static
class that contains the attributes that were scattered across the several Lists.
2013-02-19 15:33:33 -05:00
Ryan Poplin c025e84c8b Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping.
-- Updated a few integration tests for HC, UG, and UG general ploidy
2013-02-19 14:09:24 -05:00
Mark DePristo be45edeff2 ActivityProfile and ActiveRegions respects engine interval boundaries
-- Active regions are created as normal, but they are split and trimmed to the engine intervals when added to the traversal, if there are intervals present.
-- UnitTests for ActiveRegion.splitAndTrimToIntervals
-- GenomeLocSortedSet.getOverlapping uses binary search to efficiently in ~ log N time find overlapping intervals
-- UnitTesting overlap function in GenomeLocSortedSet
-- Discovered fundamental implementation bug in that adding genome locs out of order (elements on 20 then on 19) produces an invalid GenomeLocSortedSet.  Created a JIRA to address this: https://jira.broadinstitute.org/browse/GSA-775
-- Constructor that takes a collection of genome locs now sorts its input and merges overlapping intervals
-- Added docs for the constructors in GLSS
-- Update HaplotypeCaller MD5s, which change because ActiveRegions are now restricted to the engine intervals, which changes slightly the regions in the tests and so the reads in the regions, and thus the md5s
-- GenomeAnalysisEngineUnitTest needs to provide non-null genome loc parser
2013-02-18 10:40:25 -05:00
Ryan Poplin b7e9c342c7 Reducing the size of the reference padding in the HaplotypeCaller. 2013-02-17 11:09:00 -05:00
Mark DePristo 73a363b166 Update MD5s due to new QualityUtils calculations
-- Increase the allowed runtime of one UG integration test
-- The GGA indels mode runs two UG commands, and was barely under the 10 minute limit before.  Some updates can push this right over the edge.  Increased limit
-- CalibrateGenotypeLikelihoods runs on a small data set now, so it's faster
-- Updating MD5s due to more correct quality utils.  DuplicatesWalkers quality estimates have changed.  One UG test has different FS and rank sum tests because the conversion to phred scores are slightly (second decimal place) different
2013-02-16 07:31:38 -08:00
Mark DePristo 3231031c1a Bugfix for FisherStrand
-- FisherStrand pValues can sum to slightly greater than 1.0, so they need to be capped to convert to a Phred-scaled quality score
2013-02-16 07:31:38 -08:00
Mark DePristo 9a29d6d4be Fix an catastrophic bug (WoW!) in the reference calculation of the UG
-- The UG was using MathUtils binomial probability backward, so that the estimated confidence was always NaN, and was as a side effect other utils converted this to a meaningless 0.0.  This is all because there wasn't a unit test.
-- I've fixed the calculation, so it's now log10 based, uses robust MathUtils and QualityUtils functions to compute probabilities, and added a unit test.
2013-02-16 07:31:38 -08:00
Mark DePristo 9e28d1e347 Cleanup and unit tests for QualityUtils
-- Fixed a few conversion bugs with edge case quals (ones that were very high)
-- Fixed a critical bug in the conversion of quals that was causing near capped quals to fall below their actual value.  Will undoubtedly need to fix md5s
-- More precise prob -> qual calculations for very high confidence events in phredScaleCorrectRate, trueProbToQual, and errorProbToQual.  Very likely to improve accuracy of many calculations in the GATK
-- Added errorProbToQual and trueProbToQual calculations that accept an integer cap, and perform the (tricky) conversion from int to byte correctly.
-- Full docs and unit tests for phredScaleCorrectRate and phredScaleErrorRate.
-- Renamed probToQual to trueProbToQual
-- Added goodProbability and log10OneMinusX to MathUtils
-- Went through the GATK and cleaned up many uses of QualityUtils
-- Cleanup constants in QualityUtils
-- Added full docs for all of the constants
-- Rename MAX_QUAL_SCORE to MAX_SAM_QUAL_SCORE for clarity
-- Moved MAX_GATK_USABLE_Q_SCORE to RecalDatum, as it's s BQSR specific feature
-- Convert uses of QualityUtils.errorProbToQual(1-x) to QualityUtils.trueProbToQual(x)
-- Cleanup duplicate quality score routines in MathUtils.  Moved and renamed MathUtils.log10ProbabilityToPhredScale => QualityUtils.phredScaleLog10ErrorRate. Removed 3 routines from MathUtils, and remapped their usages into the better routines in QualityUtils
2013-02-16 07:31:37 -08:00
MauricioCarneiro d80b99143f Merge pull request #37 from broadinstitute/rp_left_alignment_hc_contract_GSA-771 2013-02-15 08:32:45 -08:00
MauricioCarneiro 1dd284a5bb Merge pull request #39 from broadinstitute/tj_printreads_tag_for_bqsr_GSA-720
PrintReads writes a header when used with -BQSR
2013-02-15 07:18:28 -08:00
MauricioCarneiro b58a0eca6b Merge pull request #33 from broadinstitute/gg_more_gatkdocs_tweaks_GSATDG-62
Refactored GATKDocs categories some more ( GSATDG-62 )
2013-02-14 22:35:07 -08:00
Tad Jordan 6cb80591e3 PrintReads writes a header when used with -BQSR 2013-02-14 22:19:14 -05:00
Guillermo del Angel b18f216033 Updated md5's from BiasedDownsamplerIntegrationTest that changed due to changes in HaplotypeCaller - changing HashMaps to LinkedHashMaps changed ordering of reads presented to BiasedDownSampler which changed reads chosen, thereby marginally changing PL's and some site info. 2013-02-14 20:18:49 -05:00
Ryan Poplin 871c8b3866 No need to consider haplotypes which Smith-Waterman aligns off the end of the large padded reference. 2013-02-14 11:18:10 -05:00
Geraldine Van der Auwera 6208742f7c Refactored GATKDocs categories some more ( GSATDG-62 )
-- Renamed ValidatePileup to CheckPileup since validation is reserved word
-- Renamed AlignmentValidation to CheckAlignment (same as above)
-- Refactored category definitions to use constants defined in HelpConstants
-- Fixed a couple of minor typos and an example error
-- Reorganized the GATKDocs index template to use supercategories
-- Refactored integration tests for renamed walkers (my earlier refactoring had screwed them up or not carried over)
2013-02-13 16:49:18 -05:00
depristo 357d196dad Merge pull request #32 from broadinstitute/yf_per-sample-downsampling_GSA_765
Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.
2013-02-13 10:08:11 -08:00
Yossi Farjoun 6d12e5a54f Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.
- got md5s from a interim version that does not have the per-sample downsampling hookedup
- added an integration test that forces the result from flat-downsampling to equal that which results from an equivalent flat contamination file
2013-02-13 12:49:39 -05:00
Guillermo del Angel 4308b27f8c Fixed non-determinism in HaplotypeCaller and some UG calls -
-- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads.
-- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast)
-- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller  (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling).
-- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test
-- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).
2013-02-12 15:43:29 -05:00
Geraldine Van der Auwera dff5ef562b Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details)
-- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools
-- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities
-- Reworded some category names and descriptions to be more explicit and user-friendly
2013-02-12 13:36:15 -05:00
Ryan Poplin 3f2f837b6a Optimization to ReadPosRankSumTest: Don't do the work of parsing through the cigar string for non-informative reads. 2013-02-11 11:36:09 -05:00
Mark DePristo b4417dff5b Updating MD5s due to changes in HMM
-- New HMM has two impacts on MD5s.  First, all indel calls with UG and all calls by HC no longer have the HaplotypeScore computed.  This is for the good, especially given the computational cost of this annotationa and unclear value for HC.  Second, the BaseQualityRankSum values are changing by tiny amounts because of the changes in the HMM likelihoods.
-- Disabled three tests from Yossi that cause strange MD5 differences with calls for HC, created a JIRA for him to enable and fix
-- Disabled the non-deterministic GGA test.  Assigned JIRA to Guillermo
-- With this push I expect all integration tests to pass
2013-02-09 19:19:28 -05:00
Mark DePristo 35139cf990 HaplotypeScore only annotates SNPs
-- The new HMM new edge conditions the likelihoods are offset by log10(n possible starts) so the results don't really mean "fits the haplotype well" any longer.  This results in grossly inflated HaplotypeScores for indels and with the HaplotypeCaller.  So I'm simply not going to emit this annotation value any longer for indels and for the HC
2013-02-09 19:19:28 -05:00
Mark DePristo e40d83f00e Final version of PairHMMs with correct edge conditions
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses.  This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length.  I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10.  This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM.  All unit tests from the original HMM implementations pass now
-- Added locs of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes.  Fixed bug.  Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit.  Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now compute constants, if they haven't been yet initialized, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum.  This involved moving some initialize() code into the computeLikelihoods function.  That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
2013-02-09 19:19:22 -05:00
Mark DePristo 09595cdeb9 Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments 2013-02-09 13:06:54 -05:00
Mark DePristo 2d802e17a4 Delete the CachingPairHMM 2013-02-09 13:06:54 -05:00
Mark DePristo 7dcafe8b81 Preliminary version of LoglessCachingPairHMM that avoids positive likelihoods
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
2013-02-09 13:06:54 -05:00
Mark DePristo 7fb620dce7 Generalize and fixup ContigComparator
-- Now uses a SAMSequenceDictionary to do the comparison of contigs (which is the right way to do it)
-- Added unit tests
2013-02-09 09:52:13 -05:00
Mauricio Carneiro d004bfbe6f walker to calculate per base coverage distribution
-- Base distribution optionally includes deletions
-- Implemented an optional filtered coverage distribution option
-- Integration tests added for every feature of the traversal

This walker is specially fast for the task due to the ability to calculate uncovered bases without having to visit the loci. This capability should be made generic in the future for the advantage of DiagnoseTargets and DepthOfCoverage.
GSATDG-45 #resolve
2013-02-07 16:33:05 -05:00
Mauricio Carneiro 5f49c95cc1 Added distance across contigs calculation to GenomeLocs
-- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader)
-- unit test added
GSATDG-45
2013-02-07 16:31:41 -05:00
depristo cd4aec177a Merge pull request #20 from broadinstitute/aw_reduceread_perf_1_GSA-761
Aw reduceread perf 1 gsa 761
2013-02-07 12:11:05 -08:00
Eric Banks 9826192854 Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class!
* AlignmentUtils.getMismatchCount()
* AlignmentUtils.calcAlignmentByteArrayOffset()
* AlignmentUtils.readToAlignmentByteArray().
* AlignmentUtils.leftAlignIndel()
2013-02-07 13:04:24 -05:00
Alec Wysoker e88bc753aa Replace with map.containsKey followed by map.get with map.get followed by null check. 2013-02-07 11:58:41 -05:00
Alec Wysoker 72e496d6f3 Eliminate unnecessary zeroing out of primitive arrays immediately after new. 2013-02-07 11:57:43 -05:00
Eric Banks 481982202d Fixing the failing RR integration tests.
* After consulting Tim/David/Mauricio we determined that the md5 changes were due to different encodings of binary arrays in samjdk
   * However, it made no functional difference to the results (confirmed by Eric) so we agreed to update md5s
 * Also, the header of one of the test bams was malformed but old picard jar didn't perform checks so it only started failing now
   * Fixed the bam
2013-02-06 12:40:56 -05:00
Mark DePristo 59df329776 Fast path for biallelic variants in IndependentAllelesDiploidExactAFCalc
-- If the VariantContext is a bi-allelic variant already, don't split up the VC (it doesn't do anything) and then combine it back together.  This saves us a lot of work on average
-- Be more protective of calls to AFCalc with a VariantContext that might only have ref allele, throwing an exception
2013-02-06 10:34:09 -05:00
eitanbanks 584899329c Merge pull request #13 from broadinstitute/dr_variant_migration_GSA-692
Replace org.broadinstitute.variant with jar built from the Picard repo
2013-02-06 07:22:30 -08:00
Eric Banks 562f2406d7 Added check that BaseRecalibrator is not being run on a reduced bam.
- Throws user exception if it is.
 - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
 - Added test to check this is working.
 - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
 - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
 - Added comment about not changing the program record name, as per reviewer comments.
 - Removed unused variable.
2013-02-06 10:14:27 -05:00
Eric Banks 4e5ff3d6f1 Bug fix for NPE in HC with --dbsnp argument.
- I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null.
 - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.
2013-02-05 21:59:19 -05:00
Eric Banks e7c35a907f Fixes to BQSR for the --maximum_cycle_value argument.
- It's now written into the recal report so that it can be used in the PrintReads step.
  - Note that we also now write the --deletions_default_quality value which accidentally wasn't being written before!
  - Added tests to make sure that the value of the --maximum_cycle_value is being used properly by PR with -BQSR.
(This is my last non-branch commit; all future pushes will follow new GATK practices)
2013-02-05 17:38:03 -05:00
David Roazen e7e76ed76e Replace org.broadinstitute.variant with jar built from the Picard repo
The migration of org.broadinstitute.variant into the Picard repo is
complete. This commit deletes the org.broadinstitute.variant sources
from our repo and replaces it with a jar built from a checkout of the
latest Picard-public svn revision.
2013-02-05 17:24:25 -05:00
Ryan Poplin cb2dd470b6 Moving the random number generator over to using GenomeAnalysisEngine.getRandomGenerator in the logless versus exact pair hmm unit test. We don't believe this will fix the problem with the non-deterministic test failures but it will give us more information the next time it fails. 2013-02-05 12:56:20 -05:00
MauricioCarneiro 050c4794a5 Merge pull request #11 from yfarjoun/per_sample2
-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @A...
2013-02-05 08:04:29 -08:00
Eric Banks 23c6aee236 Added in some basic unit tests for polyploid consensus creation in RR.
- Uncovered small bug in the fix that I added yesterday, which is now fixed properly.
- Uncovered massive general bug: polyploid consensus is totally busted for deletions (because of call to read.getReadBases()[readPos]).
  - Need to consult Mauricio on what to do here (are we supporting het compression for deletions?  (Insertions are definitely not supported)
2013-02-05 10:35:45 -05:00
Yossi Farjoun de03f17be4 -Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @Advanced option to the StandardCallerArgumentCollection, a file which should
contain two columns, Sample (String) and Fraction (Double) that form the Sample-Fraction map for the per-sample AlleleBiasedDownsampling.
-Integration tests to UnifiedGenotyper (Using artificially contaminated BAMs created from a mixure of two broadly concented samples) were added
-includes throwing an exception in HC if called using per-sample contamination file (not implemented); tested in a new integration test.
-(Note: HaplotypeCaller already has "Flat" contamination--using the same fraction for all samples--what it doesn't have is
   _per-sample_ AlleleBiasedDownsampling, which is what has been added here to the UnifiedGenotyper.
-New class: DefaultHashMap (a Defaulting HashMap...) and new function: loadContaminationFile (which reads a Sample-Fraction file and returns a map).
-Unit tests to the new class and function are provided.
-Added tests to see that malformed contamination files are found and that spaces and tabs are now read properly.
-Merged the integration tests that pertain to biased downsampling, whether HaplotypeCaller or unifiedGenotyper, into a new IntegrationTest class.
2013-02-04 18:24:36 -05:00
Eric Banks 70f3997a38 More RR tests and fixes.
* Fixed implementation of polyploid (het) compression in RR.
  * The test for a usable site was all wrong.  Worked out details with Mauricio to get it right.
  * Added comprehensive unit tests in HeaderElement class to make sure this is done right.
  * Still need to add tests for the actual polyploid compression.
  * No longer allow non-diploid het compression; I don't want to test/handle it, do you?
* Added nearly full coverage of tests for the BaseCounts class.
2013-02-04 15:55:15 -05:00
Ryan Poplin 79ef41e7b1 Added some docs, unit test, and contracts to SimpleDeBruijnAssembler.
-- Testing that cycles in the reference graph fail graph construction appropriately.
-- Minor bug fix in assembly with reduced reads.

Added some docs and contracts to SimpleDeBruijnAssembler

Added a unit test to SimpleDeBruijnAssembler
2013-02-04 15:17:22 -05:00
Geraldine Van der Auwera 43e3a040b6 Updated UnifiedGenotyper GATKDoc (note on ploidy model) 2013-02-04 14:18:56 -05:00
Chris Hartl 41a030f4b7 Apparently I'm a failure at rebasing...there should have been only one commit message to write. But whatever, here it is again:
Part 1 of Variant Annotator Unit tests: PerReadAlleleLikelihoodMap

 - Added contract enforcement for public methods
 - Refactored the conversion from read -> (allele -> likelihood) to allele -> list[read] into its own method
 - added method documentation for non getters/setters
 - finals, finals everywhere
 - Add in a unit test for the PerReadAlleleLikelihoodMap. Complete coverage except for .clear() and a method that is a straight call into a separately-tested utility class.
2013-02-04 14:16:28 -05:00
Ryan Poplin d9fd89ecaa Somehow these md5 updates got lost in my previous git rebase disaster. Sorry for the trouble. 2013-02-04 13:26:18 -05:00
Eric Banks 2d518f3063 More RR-related updates and tests.
- ReduceReads by default now sets up-front ReadWalker downsampling to 40x per start position.
   - This is the value I used in my tests with Picard to show that memory issues pretty much disappeared.
   - This should hopefully take care of the memory issues being reported on the forum.
- Added javadocs to SlidingWindow (the main RR class) to follow GATK conventions.
- Added more unit tests to increase coverage of BaseCounts class.
- Added more unit tests to test I/D operators in the SlidingWindow class.
2013-02-04 12:57:43 -05:00
Guillermo del Angel 971ded341b Swap java Random generator for GATK one to ensure test determinism 2013-02-04 10:57:34 -05:00
Guillermo del Angel f31bf37a6f First step in better BQSR unit tests for covariates (not done yet): more test coverage in basic covariates, test logging several read groups/read lengths and more combinations simultaneously.
Add basic Javadocs headers for PerReadAlleleLikehoodMap.
2013-02-03 15:31:30 -05:00
Eric Banks 03df5e6ee6 - Added more comprehensive tests for consensus creation to RR. Still need to add tests for I/D ops.
- Added RR qual correctness tests (note that this is a case where we don't add code coverage but still need to test critical infrastructure).
- Also added minor cleanup of BaseUtils
2013-02-01 15:37:19 -05:00
Ryan Poplin 2fee000dba Adding unit tests for KBestPaths class and fixing edge case bugs. 2013-02-01 13:51:31 -05:00
David Roazen c6581e4953 Update MD5s to reflect version number change in the BAM header
I've confirmed via a script that all of these differences only
involve the version number bump in the BAM headers and nothing
else:

< @HD   VN:1.0  GO:none SO:coordinate
---
> @HD   VN:1.4  GO:none SO:coordinate
2013-02-01 13:51:31 -05:00
Guillermo del Angel a520058ef6 Add option to specify maximum STR length to RepeatCovariates from command line to ease testing 2013-02-01 13:51:31 -05:00
Mark DePristo 22f7fe0d52 Expanded unit tests for AlignmentUtils
-- Added JIRA entries for the remaining capabilities to be fixed up and unit tested
2013-02-01 13:51:31 -05:00
Ryan Poplin ac033ce41a Intermediate commit of new bubble assembly graph traversal algorithm for the HaplotypeCaller. Adding functionality for a path from an assembly graph to calculate its own cigar string from each of the bubbles instead of doing a massive Smith-Waterman alignment between the path's full base composition and the reference. 2013-01-31 11:32:19 -05:00
Ryan Poplin 495bca3d1a Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-31 10:12:26 -05:00
Ryan Poplin ca6968d038 Use base List and Map types in the GenotypingEngineUnitTest. 2013-01-31 10:12:18 -05:00
Eric Banks 75ceddf9e5 Adding new unit tests for RR. These tests took a frustratingly long time to get to pass, but now we have a framework for
testing the adding of reads into the SlidingWindow plus consensus creation.  Will flesh these out more after I take care of
some other items on my plate.
2013-01-31 09:46:38 -05:00
Ryan Poplin bb29bd7df7 Use base List and Map types in the HaplotypeCaller when possible. 2013-01-30 17:09:27 -05:00
Ryan Poplin 5f4a063def Breaking up my massive commits into smaller pieces that I can successfully merge and digest. This one enables downsampling in the HaplotypeCaller (by lowering the default dcov to 20) and removes my long-standing, temporary region-based downsampling. 2013-01-30 16:14:07 -05:00
David Roazen 591df2be44 Move additional VariantContext utility methods back to the GATK
Thanks to Eric for his feedback
2013-01-30 13:58:17 -05:00
Ryan Poplin ff8ba03249 Updating BQSR integration test md5s to reflect the updates to the hierarchicalBayesianQualityEstimate function 2013-01-30 13:30:18 -05:00
Ryan Poplin 85dabd321f Adding unit tests for hierarchicalBayesianQualityEstimate function 2013-01-30 13:26:07 -05:00
Ryan Poplin 07fe3dd1ef Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-30 13:19:24 -05:00
David Roazen 9985f82a7a Move BaseUtils back to the GATK by request, along with associated utility methods 2013-01-30 13:09:44 -05:00
Ryan Poplin 2967776458 The Empirical quality column in the recalibration report can't be compared in the BQSRGatherer because the value is calculated using the Bayesian estimate with different priors. This value should never be used from a recalibration report anyway except during plotting. 2013-01-30 12:28:14 -05:00
Eric Banks d067c7f136 Resolving merge conflicts 2013-01-30 10:47:59 -05:00
Eric Banks 9025567cb8 Refactoring the SimpleGenomeLoc into the now public utility UnvalidatingGenomeLoc and the RR-specific FinishedGenomeLoc.
Moved the merging utility methods into GenomeLoc and moved the unit tests around accordingly.
2013-01-30 10:45:29 -05:00
Mark DePristo 4852c7404e GenomeLocs are already comparable, so I'm removing the less complete GenomeLocComparator class and updating ReduceReads and CompressionStash to use built-in comparator 2013-01-30 10:12:27 -05:00
Ryan Poplin 59311aeea2 Getting back null values from the tables is perfectly reasonable if those covariates don't appear in your table. Need to handle them gracefully. 2013-01-30 10:06:14 -05:00
Ryan Poplin e7d7d70247 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-30 10:01:06 -05:00
Mark DePristo 92c5635e19 Cleanup, document, and unit test ActiveRegion
-- All functions tested.  In the testing / review I discovered several bugs in the ActiveRegion routines that manipulate reads.  New version should be correct
-- Enforce correct ordering of supporting states in constructor
-- Enforce read ordering when adding reads to an active region in add
-- Fix bug in HaplotypeCaller map with new updating read spans.  Now get the full span before clipping down reads in map, so that variants are correctly placed w.r.t. the full reference sequence
-- Encapsulate isActive field with an accessor function
-- Make sure that all state lists are unmodifiable, and that the docs are clear about this
-- ActiveRegion equalsExceptReads is for testing only, so make it package protected
-- ActiveRegion.hardClipToRegion must resort reads as they can become out of order
-- Previous version of HC clipped reads but, due to clipping, these reads could no longer overlap the active region.  The old version of HC kept these reads, while the enforced contracts on the ActiveRegion detected this was a problem and those reads are removed.  Has a minor impact on PLs and RankSumTest values
-- Updating HaplotypeCaller MD5s to reflect changes to ActiveRegions read inclusion policy
2013-01-30 09:47:12 -05:00
Mauricio Carneiro 3d9a83c759 BaseCoverageDistributions should be 'by reference'
otherwise we miss all the 0 coverage spots.
2013-01-29 22:37:44 -05:00
Mauricio Carneiro 29fd536c28 Updating licenses manually
Please check that your commit hook is properly pointing at ../../private/shell/pre-commit

Conflicts:
	public/java/test/org/broadinstitute/variant/VariantBaseTest.java
2013-01-29 17:27:53 -05:00
David Roazen a536e1da84 Move some VCF/VariantContext methods back to the GATK based on feedback
-Moved some of the more specialized / complex VariantContext and VCF utility
 methods back to the GATK.

-Due to this re-shuffling, was able to return things like the Pair class back
 to the GATK as well.
2013-01-29 16:56:55 -05:00
Eric Banks e4ec899a87 First pass at adding unit tests for the RR framework: I have added 3 tests and all 3 uncovered RR bugs!
One of the fixes was critical: SlidingWindow was not converting between global and relative positions correctly.
Besides not being correct, it was resulting in a massive slow down of the RR traversal.
That fix definitely breaks at least one of the integration tests, but it's not worth changing md5s now because I'll be
changing things all over RR for the next few days, so I am going to let that test fail indefinitely until I can confirm
general correctness of the tool.
2013-01-29 15:51:07 -05:00
Ryan Poplin cba89e98ad Refactoring the Bayesian empirical quality estimates to be in a single unit-testable function. 2013-01-29 15:50:46 -05:00
Guillermo del Angel 1d5b29e764 Unit tests for repeat covariates: generate 100 random reads consisting of tandem repeat units of random content and size, and check that covariates match expected values at all positions in reads.
Fixed corner case where value of covariate at border between 2 tandem repeats of different length/content wasn't consistent
2013-01-29 15:23:02 -05:00
Guillermo del Angel c11197e361 Refactored repeat covariates to eliminate duplicated code - now all inherit from basic RepeatCovariate abstract class. Comprehensive unit tests coming... 2013-01-29 10:10:24 -05:00
Ryan Poplin 35543b9cba updating BQSR integration test values for the PR half of BQSR. 2013-01-29 09:47:57 -05:00
Ryan Poplin bf25196a0b Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-28 22:33:13 -05:00
Ryan Poplin 1f254d29df Don't set the empirical quality when reading in the recal table because then we won't be using the new quality estimates for the prior since the value is cached. 2013-01-28 22:16:43 -05:00
Guillermo del Angel ff799cc79a Fixed bad merge 2013-01-28 20:04:25 -05:00
Guillermo del Angel 5995f01a01 Big intermediate commit (mostly so that I don't have to go again through merge/rebase hell) in expanding BQSR capabilities. Far from done yet:
a) Add option to stratify CalibrateGenotypeLikelihoods by repeat - will add integration test in next push.
b) Simulator to produce BAM files with given error profile - for now only given SNP/indel error rate can be given. A bad context can be specified and if such context is present then error rate is increased to given value.
c) Rewrote RepeatLength covariate to do the right thing - not fully working yet, work in progress.
d) Additional experimental covariates to log repeat unit and combined repeat unit+length. Needs code refactoring/testing
2013-01-28 19:55:46 -05:00
Ryan Poplin d665a8ba0c The Bayesian calculation of Qemp in the BQSR is now hierarchical. This fixes issues in which the covariate bins were very sparse and the prior estimate being used was the original quality score. This resulted in large correction factors for each covariate which breaks the equation. There is also now a new option, qlobalQScorePrior, which can be used to ignore the given (very high) quality scores and instead use this value as the prior. 2013-01-28 15:56:33 -05:00
Ryan Poplin aab160372a No need to sort the BQSR tables by default. 2013-01-28 11:26:01 -05:00
David Roazen f63f27aa13 org.broadinstitute.variant refactor, part 2
-removed sting dependencies from test classes
-removed org.apache.log4j dependency
-misc cleanup
2013-01-28 09:03:46 -05:00
Mauricio Carneiro 1aee8f205e Tool to calculate per base coverage distribution
GSATDG-29 #resolve
2013-01-27 23:38:46 -05:00
Mark DePristo 804caf7a45 HaplotypeCaller Optimization: return a inactive (p = 0.0) activity if the context has no bases in the pileup
-- Allows us to avoid doing a lot of misc. work to set up the genotype when we don't have any data to genotype.  Valuable in the case where we are passing through large regions without any data
2013-01-27 14:10:06 -05:00
Ami Levy-Moonshine b4447cdca2 In cases where one uses VariantContextUtils.GenotypeMergeType.REQUIRE_UNIQUE we used to verify that the samples names are unique in VariantContextUtils.simpleMerge for each VCs. It couse to a bug that was reported on the forum (when a VCs had 2 VC from the same sample).
Now we will check it only in CombineVariants.init using the headers. A new function was added to SamplesUtils with unitTests in CVunitTest.java.
2013-01-25 15:49:51 -05:00
Mark DePristo 3f95f39be3 Updating HC md5s for new cutting algorithm and default band pass filter parameters 2013-01-25 11:07:29 -05:00
Eric Banks f7b80116d6 Don't let users play with the different exact model implementations. 2013-01-25 10:52:02 -05:00
Eric Banks 6dd0e1ddd6 Pulled out the --regenotype functionality from SelectVariants into its own tool: RegenotypeVariants.
This allows us to move SelectVariants into the public suite of tools now.
2013-01-25 09:42:04 -05:00
Mark DePristo 592f90aaef ActivityProfile now cuts intelligently at the best local minimum when in a larger than max size active region
-- This new algorithm is essential to properly handle activity profiles that have many large active regions generated from lots of dense variant events.  The new algorithm passes unit tests and passes visualize visual inspection of both running on 1000G and NA12878
-- Misc. commenting of the code
-- Updated ActiveRegionExtension to include a min active region size
-- Renamed ActiveRegionExtension to ActiveRegionTraversalParameters, as it carries more than just the traversal extension now
2013-01-24 13:48:00 -05:00
Eric Banks 6790e103e0 Moving lots of walkers back from protected to public (along with several of the VA annotations).
Let's see whether Mauricio's automatic git hook really works!
2013-01-24 11:42:49 -05:00
Chris Hartl a3b98daf1a Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-23 14:49:34 -05:00
Chris Hartl 7fcfa4668c Since GenotypeConcordance is now a standalone walker, remove the old GenotypeConcordance evaluation module and the associated integration tests. 2013-01-23 14:47:23 -05:00
Mark DePristo 8026199e4c Updating md5s for CountReadsInActiveRegions and HaplotypeCaller to reflect new activity profile mechanics
-- In this process I discovered a few missed sites in the old code.  The new approach actually produces better HC results than the previous version.
2013-01-23 13:46:01 -05:00
Mark DePristo 8d9b0f1bd5 Restructure ActivityProfiler into root class ActivityProfile and derived class BandPassActivityProfile
-- Required before I jump in an redo the entire activity profile so it's can be run imcrementally
-- This restructuring makes the differences between the two functionalities clearer, as almost all of the functionality is in the base class. The only functionality provided by the BandPassActivityProfile is isolated to a finalizeProfile function overloaded from the base class.
-- Renamed ActivityProfileResult to ActivityProfileState, as this is a clearer indication of its actual functionality.  Almost all of the misc. walker changes are due to this name update
-- Code cleanup and docs for TraverseActiveRegions
-- Expanded unit tests for ActivityProfile and ActivityProfileState
2013-01-23 13:45:21 -05:00
Chris Hartl c500e1d8ac Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-22 15:31:30 -05:00
Chris Hartl d33c755aea Adding docs. 2013-01-22 15:29:33 -05:00
Chris Hartl 7060e01a8e Fix for broken unit test plus some minor changes to comments. Unit tests were broken by my pulling the site status utility function into the enum. Thankfully the unit tests caught my silly duplication of a line. 2013-01-22 15:14:41 -05:00
Mauricio Carneiro 7b8b064165 Last manual license update (hopefully)
if everyone updates their git hook accordingly, this will be the last time I have to manually run the script.

GSATDG-5
2013-01-18 16:13:07 -05:00
Ami Levy-Moonshine 0fb7b73107 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-18 15:03:42 -05:00
Ami Levy-Moonshine 826c29827b change the default VCFs gatherer of the GATK (not just the UG) 2013-01-18 15:03:12 -05:00
Eric Banks cac439bc5e Optimized the Allele Biased Downsampling: now it doesn't re-sort the pileup but just removes reads from the original one.
Added a small fix that slightly changed md5s.
2013-01-18 11:17:31 -05:00
Chris Hartl 08d2da9057 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-18 10:28:45 -05:00
Chris Hartl bf5748a538 Forgot to actually put in the md5. Also with the new change to record pairing and filtering, the multiple-records integration test changed: the indel records (T/TG | T/TGACA) are matched up (rather than left separate) resulting in properly identifying mismatching alleles, rather than HET-UNAVAILABLE and UNAVAILABLE-HET. Very nice. 2013-01-18 10:25:36 -05:00
Chris Hartl 91030e9afa Bugfix: records that get paired up during the resolution of multiple-records-per-site were not going into genotype-level filtering. Caught via testing.
Testing for moltenized output, and for genotype-level filtering. This tool is now fully functional. There are three todo items:

1) Docs
2) An additional output table that gives concordance proportions normalized by records in both eval and comp (not just total in eval or total in comp)
3) Code cleanup for table creation (putting a table together the way I do takes -way- too many lines of code)
2013-01-18 09:49:48 -05:00
Eric Banks 39c73a6cf5 1. Ryan and I noticed that the FisherStrand annotation was completely busted for indels with reduced reads; fixed.
2. While making the previous fix and unifying FS for SNPs and indels, I noticed that FS was slightly broken in the general case for indels too; fixed.
3. I also fixed a minor bug in the Allele Biased Downsampling code for reduced reads.
2013-01-18 03:35:48 -05:00
Eric Banks 6a903f2c23 I finally gave up on trying to get the Haplotype/Allele merging to work in the HaplotypeCaller.
I've resigned myself instead to create a mapping from Allele to Haplotype.  It's cheap so not a big deal, but really shouldn't be necessary.
Ryan and I are talking about refactoring for GATK2.5.
2013-01-18 01:21:08 -05:00
Eric Banks 6db3e473af Better error message for bad qual 2013-01-17 10:30:04 -05:00
Eric Banks 953592421b I think we got out of sync with the HC tests as we were clobbering each other's changes. Only differences here are to some RankSumTest values. 2013-01-17 09:19:21 -05:00
Eric Banks ded659232b Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 22:49:56 -05:00
Eric Banks a623cca89a Bug fix for HaplotypeCaller, as reported on the forum: when reduced reads didn't completely overlap a deletion call,
we were incorrectly trying to find the reference position of a base on the read that didn't exist.
Added integration test to cover this case.
2013-01-16 22:47:58 -05:00
Eric Banks dbb69a1e10 Need to use ints for quals in HaplotypeScore instead of bytes because of overflow (they are summed when haplotypes are combined) 2013-01-16 22:33:16 -05:00
Chris Hartl e15d4ad278 Addition of moltenize argument for moltenized tabular output. NRD/NRS not moltenized because there are only two columns. 2013-01-16 18:00:23 -05:00
Mark DePristo 3c476a92a2 Add dummy functionality (currently throws an error) to allow HC to include unmapped reads during assembly and calling 2013-01-16 16:25:36 -05:00
Eric Banks 4cf34ee9da Bug fix to FisherStrand: do not let it output INFINITY. This all needs to be unit tested, but that's coming on the horizon. 2013-01-16 15:35:04 -05:00
Mark DePristo 2a42b47e4a Massive expansion of ActiveRegionTraversal unit tests, resulting in several bugfixes to ART
-- UnitTests now include combinational tiling of reads within and spanning shard boundaries
-- ART now properly handles shard transitions, and does so efficiently without requiring hash sets or other collections of reads
-- Updating HC and CountReadsInActiveRegions integration tests
2013-01-16 15:30:00 -05:00
Eric Banks e47a389b26 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 14:59:11 -05:00
Eric Banks d18dbcbac1 Added tests for changing IUPAC bases to Ns, for failing on bad ref bases, and for the HaplotypeCaller not failing when running over a region with an IUPAC base.
Out of curiosity, why does Picard's IndexedFastaSequenceFile allow one to query for start position 0?  When doing so, that base is a line feed (-1 offset to the first base in the contig) which is an illegal base (and which caused me no end of trouble)...
2013-01-16 14:55:33 -05:00
Khalid Shakir 4ffb43079f Re-committing the following changes from Dec 18:
Refactored interval specific arguments out of GATKArgumentCollection into InvtervalArgumentCollection such that it can be used in other CommandLinePrograms.
Updated SelectHeaders to print out full interval arguments.
Added RemoteFile.createUrl(Date expiration) to enable creation of presigned URLs for download over http: or file:.
2013-01-16 12:43:15 -05:00
Eric Banks 445735a4a5 There was no reason to be sharing the Haplotype infrastructure between the HaplotypeCaller and the HaplotypeScore annotation since they were really looking for different things.
Separated them out, adding efficiencies for the HaplotypeScore version.
2013-01-16 11:10:13 -05:00
Eric Banks 392b5cbcdf The CachingIndexedFastaSequenceFile now automatically converts IUPAC bases to Ns and errors out on other non-standard bases.
This way walkers won't see anything except the standard bases plus Ns in the reference.
Added option to turn off this feature (to maintain backwards compatibility).

As part of this commit I cleaned up the BaseUtils code by adding a Base enum and removing all of the static indexes for
each of the bases.  This uncovered a bug in the way the DepthOfCoverage walker counts deletions (it was counting Ns instead!) that isn't covered by tests.  Fortunately that walker is being deprecated soon...
2013-01-16 10:22:43 -05:00
Eric Banks 4fb3e48099 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-16 00:13:38 -05:00
Eric Banks 0d282a7750 Bam writing from HaplotypeCaller seems to be working on all my test cases. Note that it's a hidden debugging option for now.
Please let me know if you notice any bad behavior with it.
2013-01-16 00:12:02 -05:00
Chris Hartl 327169b283 Refactor the method that identifies the site overlap type into the type enum class (so it can be used elsewhere potentially).
Completed todo item: for sites like

(eval)
20   12345   A    C
20   12345   A    AC

(comp)
20   12345   A    C
20   12345   A    ACCC

the records will be matched by the presence of a non-empty intersection of alleles. Any leftover records are then paired with an empty variant context (as though the call was unique). This has one somewhat counterintuitive feature, which is that normally

(eval)
20  12345  A   AC
(comp)
20  12345  A   ACCC

would be classified as 'ALLELES_DO_NOT_MATCH' (and not counted in genotype tables), in the presence of the SNP, they're counted as EVAL_ONLY and TRUTH_ONLY respectively.

+ integration test
2013-01-15 12:13:45 -05:00
Eric Banks d3baa4b8ca Have Haplotype extend the Allele class.
This way, we don't need to create a new Allele for every read/Haplotype pair to be placed in the PerReadAlleleLikelihoodMap (very inefficient).  Also, now we can easily get the Haplotype associated with the best allele for a given read.
2013-01-15 11:36:20 -05:00
Mark DePristo 3c37ea014b Retire original TraverseActiveRegion, leaving only the new optimized version
-- Required some updates to MD5s, which was unexpected, and will be sorted out later with more detailed unit tests
2013-01-15 10:24:45 -05:00
Eric Banks 94800771e3 1. Initial implementation of bam writing for the HaplotypeCaller with -bam argument; currently only assembled haplotypes are emitted.
2. Framework is set up in the VariantAnnotator for the HaplotypeCaller to be able to call in to annotate dbSNP plus comp RODs.  Until the HC uses meta data though, this won't work.
2013-01-15 10:19:18 -05:00
Chris Hartl 682c59ff04 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-14 13:27:34 -05:00
Chris Hartl 61bc334df1 Ensure output table formatting does not contain NaNs. For (0 eval ref calls)/(0 comp ref calls), set the proportion to 0.00.
Added integration tests (checked against manual tabulation)
2013-01-14 09:21:30 -05:00
Ryan Poplin a7fe334a3f calculating the md5s for the new tests. 2013-01-11 15:43:52 -05:00
Ryan Poplin 65afec2a53 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-11 15:22:52 -05:00
Mark DePristo 85b529cced Updating MD5s in HC and UG that changed due to new LIBS
-- Resolved what was clearly a bug in UG (GGA mode was returning a neighboring, equivalent indel site that wasn't in input list.  Not ideal)
-- Trivial read count differences in HC
2013-01-11 15:17:19 -05:00
Mark DePristo 8b83f4d6c7 Near final cleanup of PileupElement
-- All functions documented and unit tested
-- New constructor interface
-- Cleanup some uses of old / removed functionality
2013-01-11 15:17:17 -05:00
Mark DePristo fb9eb3d4ee PileupElement and LIBS cleanup
-- function to create pileup elements in AlignmentStateMachine and LIBS
-- Cleanup pileup element constructors, directing users to LIBS.createPileupFromRead() that really does the right thing
2013-01-11 15:17:17 -05:00
Mark DePristo cc1d259cac Implement get Length and Bases of OfImmediatelyFollowingIndel in PileupElement
-- Added unit tests for this behavior.  Updated users of this code
2013-01-11 15:17:17 -05:00
Mark DePristo 2c38310868 Create LIBS using new AlignmentStateMachine infrastructure
-- Optimizations to AlignmentStateMachine
-- Properly count deletions.  Added unit test for counting routines
-- AlignmentStateMachine.java is no longer recursive
-- Traversals now use new LIBS, not the old one
2013-01-11 15:17:17 -05:00
Mark DePristo b53286cc3c HaplotypeCaller mode to skip assembly and genotyping for performance testing
-- Added HCPerformance evaluation Qscript
-- Added some docs about one of the HC integration tests
-- HaplotypeCaller / ART performance evaluation script
2013-01-11 15:17:16 -05:00
Ryan Poplin e952296c10 Adding HC GGA integration test to cover duplicated input alleles. 2013-01-11 15:01:27 -05:00
Ryan Poplin 7f7f40f851 Adding additional HC GGA integration tests to cover more complicated input alleles. 2013-01-11 14:36:21 -05:00
Eric Banks 85baf71b39 Merged bug fix from Stable into Unstable 2013-01-11 11:05:27 -05:00
Eric Banks d78539774f Another RR bug: off by one error led to ArrayIndexOutOfBoundsException when working with multiple samples and the variant region ended 1 base after the end of the last read for a given sample. 2013-01-11 11:05:09 -05:00
Eric Banks 79b93f659c Merged bug fix from Stable into Unstable 2013-01-11 09:20:13 -05:00
Eric Banks 67fafbb625 Forgot an include 2013-01-11 09:19:46 -05:00
Eric Banks 6bf0cc32f9 When reducing multiple samples it is possible to try to close a region that for a given sample has no reads. Currently we'd NPE. Fixed. 2013-01-11 09:16:19 -05:00
Eric Banks e7906713d9 Moving some random walkers back to public as requested by Mark. Mauricio will the licenses get updated automatically? 2013-01-11 02:03:43 -05:00
Eric Banks 3a51823c2a Clean up imports 2013-01-10 23:35:01 -05:00
Eric Banks e4b7b1955c Forgot to add the note about length normalization to the QD docs 2013-01-10 23:34:06 -05:00
Eric Banks ff5ac986d8 Fix docs for QD 2013-01-10 23:31:46 -05:00
Mauricio Carneiro 2a4ccfe6fd Updated all JAVA file licenses accordingly
GSATDG-5
2013-01-10 17:06:41 -05:00
Mauricio Carneiro dd177b1714 Removing fully commented out varianteval evaluators
- Files were completely commmented out, and were screwing up my license script. Dont like them. Removed them.

GSATDG-5
2013-01-10 17:06:12 -05:00
Chris Hartl 80dec72c53 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-10 14:35:59 -05:00
Chris Hartl 31a5f88c4f Expanded unit tests to cover the Concordance Metrics class fairly uniformly. 2013-01-10 14:33:47 -05:00
Ryan Poplin 1a18947abf Adding new command line argument requested on the forum to control the maximum number of haplotypes that are sent forward for genotyping. In the presence of a large degree of heterozygosity the current algorithm breaks down and so this argument would need to be increased. 2013-01-09 15:54:02 -05:00
Ryan Poplin 487fb2afb4 Bug fix for the case of overlapping assembled and partially-assembled events created by the HC. Unfortunately the symbolic allele can't be combined with the indel allele because the reference basis will change. 2013-01-09 15:30:46 -05:00
Chris Hartl 6787f86803 Eliminate the import of DiploidGenotype, which switched public/private underneath me but for some reason didn't stop me from compiling... 2013-01-09 13:23:24 -05:00
Chris Hartl c1de92b511 Add in some todo items 2013-01-09 13:16:06 -05:00
Chris Hartl 8d126161e2 Merge branch 'master' of gsa2:/humgen/gsa-scr1/chartl/dev/unstable 2013-01-09 13:15:04 -05:00
Eric Banks 3a0dd4b175 Oops, I broke the build. NOW we shouldn't have any more public->protected dependancies. 2013-01-09 11:12:28 -05:00
Eric Banks a921b06e02 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-09 11:06:17 -05:00
Eric Banks 4fa439d89e Move some classes back to public because they are used in the engine. Move some test classes to protected. We should have no more public->protected dependancies now 2013-01-09 11:06:10 -05:00
Ryan Poplin 396bce1f28 Reverting this change until we can figure out the right thing to do here. 2013-01-09 10:51:30 -05:00
Eric Banks 676e79542a Bring CombineVariants back to public since it's used for SG. I needed to break ChromosomeCountConstants out of ChromosomeCounts to make this work. 2013-01-09 10:39:48 -05:00
Ryan Poplin c87ad8c0ef Bug fixes related to HC's GGA mode. Tracking just the artificial allele isn't sufficient when there are multiple GGA records that change the reference basis. Also, duplicated records screw up the tracking of merged alleles. 2013-01-09 10:00:46 -05:00
Chris Hartl ad7c2a08d4 Normalize by the event type counts, not the total genotype counts: more useful normalization. 2013-01-09 09:12:41 -05:00
Chris Hartl b56754606b Initial break-out of GenotypeConcordance as a standalone walker. Some basic functionality testing. Currently performs only a pairwise comparison, but is very careful about proper tabulation through the GenotypeType enum. 2013-01-09 00:34:07 -05:00
Eric Banks 264cc9e78d Resolve protected->public dependencies for BQSR by wrapping the BQSR-specific arguments in a new class.
Instead of the GATK Engine creating a new BaseRecalibrator (not clean), it just keeps track of the arguments (clean).

There are still some dependency issues, but it looks like they are related to Ami's code.  Need to look into it further.
2013-01-08 16:23:29 -05:00
Eric Banks ee7d85c6e6 Move around the DiploidGenotype classes (so it can be used by the GATKPaperGenotyper) 2013-01-08 15:53:11 -05:00
Eric Banks 0e2e672521 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-08 15:46:39 -05:00
Eric Banks f0bd1b5ae5 Okay, all public->protected dependencies are gone except for the BQSR arguments. I'll need to think through this but should be able to make that work too. 2013-01-08 15:46:32 -05:00
Tad Jordan 9cbb2b868f ErrorRatePerCycleIntegrationTest fix
-- sorting by row is required
2013-01-08 14:53:07 -05:00
Eric Banks b099e2b4ae Moving integration tests to protected 2013-01-08 09:34:08 -05:00
Eric Banks dfe4cf1301 When merging the PerReadAlleleLikelihoodMap classes, I forgot to initialize the underlying objects. This was causing the LargeScaleTests to fail. 2013-01-08 09:24:12 -05:00
Eric Banks 9e6c2afb28 Not sure why IntelliJ didn't add this for commit like the other dirs 2013-01-07 18:11:07 -05:00
Ami Levy-Moonshine 3787ee6de7 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-07 17:07:29 -05:00
Eric Banks 47d030a52d Oops, move the covariates over too 2013-01-07 15:47:25 -05:00
Eric Banks 35699a8376 Move bqsr utils to protected 2013-01-07 15:41:21 -05:00
Eric Banks a0219acfaa Collapse the PerReadAlleleLikelihoodMap classes into 1 now that Lite is gone 2013-01-07 14:55:21 -05:00
Eric Banks 35d9bd377c Moved (nearly) all Walkers from public to protected and removed GATKLite utils 2013-01-07 14:42:40 -05:00
Ryan Poplin 4f95f850b3 Bug fix in the HC's allele mapping for multi-allelic events. Using the allele alone as a key isn't sufficient because alleles change when the reference allele changes during VariantContextUtils.simpleMerge for multi-allelic events. 2013-01-07 11:05:44 -05:00
Ami Levy-Moonshine d3c2c97fb2 Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-01-06 23:35:47 -05:00
Ami Levy-Moonshine 81eef3aa37 merge development branchs of log-less HMM and FastGatherer to master 2013-01-06 23:01:58 -05:00
Eric Banks 52067f0549 Handle merge conflicts 2013-01-06 12:29:12 -05:00
Chris Hartl 41bc416b65 Remove AAL and update MD5s. 2013-01-04 16:46:14 -05:00
Eric Banks bce6fce58d Resolving merge conflicts after Mark's latest push 2013-01-04 14:46:39 -05:00
Eric Banks dd7f5e2be7 Hooking up the Bayesian estimate code for calculating Qemp in BQSR; various fixes after adding unit tests. 2013-01-04 14:43:11 -05:00
Mark DePristo bbdf9ee91b BQSR cleanup: merge Advanced and Standard recalibration engine into just the RecalibrationEngine
-- As we are no longer maintaining a public/protected system we need only have one RecalibrationEngine.
-- Misc. code cleanup and docs along the way
2013-01-04 11:39:24 -05:00
Mark DePristo 7df47418d8 BQSR optimization: make RecalibrationTables thread-local, and merge results in onTraversalDone
-- With the newer, faster BQSR, scaling was limited by the NestedIntegerArray.  The solution to this is to make the entire table thread-local, so that each nct thread has its own data and doesn't have any collisions.
-- Removed the previous partial solution of having a thread-local quality score table
-- Added a new argument -lowMemory
2013-01-04 11:39:24 -05:00
Chris Hartl 3753209584 One md5sum slipped past in the HC integration test. 2013-01-02 15:09:28 -05:00
Chris Hartl e1d09ab0db QD is now divided by the average length of the alternate allele (weighted by the allele count). The average length is stored in a related annotation, "AAL", which can be used to re-compute the "old" QD by simple multiplication. Integration tests *should* all pass. 2013-01-02 14:41:29 -05:00
Eric Banks 275575462f Protect against non-standard ref bases. Ryan, please review. 2012-12-26 15:46:21 -05:00
Mark DePristo 7bf1f67273 BQSR optimization: read group x quality score calibration table is thread-local
-- AdvancedRecalibrationEngine now uses a thread-local table for the quality score table, and in finalizeData merges these thread-local tables into the final table.  Radically reduces the contention for RecalDatum in this very highly used table
-- Refactored the utility function to combine two tables into RecalUtils, and created UnitTests for this function, as well as all of RecalibrationTables.  Updated combine in RecalibrationReport to use this table combiner function
-- Made several core functions in RecalDatum into final methods for performance
-- Added RecalibrationTestUtils, a home for recalibration testing utilities
2012-12-24 13:35:58 -05:00
Mark DePristo 0f0188ddb1 Optimization of BQSR
-- Created a ReadRecalibrationInfo class that holds all of the information (read, base quality vectors, error vectors) for a read for the call to updateDataForRead in RecalibrationEngine.  This object has a restrictive interface to just get information about specific qual and error values at offset and for event type.  This restrict allows us to avoid creating an vector of byte 45 for each read to represent BI and BD values not in the reads.  Shaves 5% of the runtime off the entire code.
-- Cleaned up code and added lots more docs
-- With this commit we no longer have much in the way of low-hanging fruit left in the optimization of BQSR.  95% of the runtime is spent in BAQing the read, and updating the RecalData in the NestedIntegerArrays.
2012-12-24 13:35:09 -05:00