Commit Graph

11850 Commits (357d196dadcefd6f025d019157a1d0b157f2b059)

Author SHA1 Message Date
depristo 357d196dad Merge pull request #32 from broadinstitute/yf_per-sample-downsampling_GSA_765
Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.
2013-02-13 10:08:11 -08:00
Yossi Farjoun 6d12e5a54f Fixed md5s for the per-sample downsampling IntegrationTests that were disabled.
- got md5s from an interim version that does not have the per-sample downsampling hooked up
- added an integration test that forces the result from flat-downsampling to equal that which results from an equivalent flat contamination file
2013-02-13 12:49:39 -05:00
depristo 961f2533a5 Merge pull request #29 from broadinstitute/gda_gga_hc_GSA-722
Gda gga hc gsa 722
2013-02-13 07:58:57 -08:00
Guillermo del Angel 4308b27f8c Fixed non-determinism in HaplotypeCaller and some UG calls -
-- HaplotypeCaller and PerReadAlleleLikelihoodMap should use LinkedHashMaps instead of plain HashMaps. That way the ordering when traversing alleles is maintained. If the JVM traverses HashMaps with random ordering, different reads (with same likelihood) may be removed by contamination checker, and different alleles may be picked if they have same likelihoods for all reads.
-- Put in some GATKDocs and contracts in HaplotypeCaller files (far from done, code is a beast)
-- Update md5's due to different order of iteration in LinkedHashMaps instead of HashMaps inside HaplotypeCaller  (due to change in PerReadAlleleLikelihoodMap that also slightly modifies reads chosen by per-read downsampling).
-- Reenabled testHaplotypeCallerMultiSampleGGAMultiAllelic test
-- Added some defensive argument checks into HaplotypeCaller public functions (not intended to be done yet).
2013-02-12 15:43:29 -05:00
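The ordering issue described in this commit can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the GATK sources; the sketch only shows that a LinkedHashMap iterates in insertion order, so a tie between equal likelihoods is always broken the same way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not GATK code.
class DeterministicTieBreak {
    // Returns the first key holding the maximal value, in the map's iteration order.
    static String firstBest(Map<String, Double> likelihoods) {
        String best = null;
        double bestValue = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : likelihoods.entrySet()) {
            if (e.getValue() > bestValue) { // strict comparison: ties keep the earlier key
                best = e.getKey();
                bestValue = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // LinkedHashMap iterates in insertion order, so the tie between
        // alleleT and alleleA is always resolved in favour of alleleT.
        Map<String, Double> m = new LinkedHashMap<>();
        m.put("alleleT", -1.0);
        m.put("alleleA", -1.0);
        m.put("alleleC", -3.0);
        System.out.println(firstBest(m));
        System.out.println(new ArrayList<>(m.keySet()));
    }
}
```

With a plain HashMap the iteration order is unspecified, so the same tie could be resolved differently if the order changes.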
depristo 38cea0a7ab Merge pull request #28 from broadinstitute/gg_reorganize_gatkdocs_categories_GSATDG-62
Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details)
2013-02-12 11:11:45 -08:00
Geraldine Van der Auwera dff5ef562b Reorganized walker categories in GATKDocs (@DocumentedGATKFeature details)
-- Sorted out contents of BAM Processing vs. Diagnostics & QC Tools
-- Moved two validation-related walkers from Diagnostics & QC to Validation Utilities
-- Reworded some category names and descriptions to be more explicit and user-friendly
2013-02-12 13:36:15 -05:00
depristo 59484dfae4 Merge pull request #27 from broadinstitute/rp_ranksumtest_optimization
Optimization to ReadPosRankSumTest: Don't do the work of parsing through...
2013-02-11 08:58:22 -08:00
Ryan Poplin 3f2f837b6a Optimization to ReadPosRankSumTest: Don't do the work of parsing through the cigar string for non-informative reads. 2013-02-11 11:36:09 -05:00
delangel f8e2153c71 Merge pull request #25 from broadinstitute/md_hmm_fail_GSA-751
Md hmm fail gsa 751
2013-02-09 16:43:50 -08:00
Mark DePristo b4417dff5b Updating MD5s due to changes in HMM
-- New HMM has two impacts on MD5s.  First, all indel calls with UG and all calls by HC no longer have the HaplotypeScore computed.  This is for the good, especially given the computational cost of this annotation and its unclear value for HC.  Second, the BaseQualityRankSum values are changing by tiny amounts because of the changes in the HMM likelihoods.
-- Disabled three tests from Yossi that cause strange MD5 differences with calls for HC, created a JIRA for him to enable and fix
-- Disabled the non-deterministic GGA test.  Assigned JIRA to Guillermo
-- With this push I expect all integration tests to pass
2013-02-09 19:19:28 -05:00
Mark DePristo 35139cf990 HaplotypeScore only annotates SNPs
-- With the new HMM edge conditions, the likelihoods are offset by log10(n possible starts), so the results don't really mean "fits the haplotype well" any longer.  This results in grossly inflated HaplotypeScores for indels and with the HaplotypeCaller.  So I'm simply not going to emit this annotation value any longer for indels and for the HC
2013-02-09 19:19:28 -05:00
Mark DePristo e40d83f00e Final version of PairHMMs with correct edge conditions
-- Uses 1/N for N potential start sites as the probability of starting at any one of the potential start sites
-- Add flag that says to use the original edge condition, respected by all subclasses.  This brings the new code back to the original state, but with all of the cleanup I've done
-- Only test configurations where the read length <= haplotype length.  I think this is actually the contract, but we'll talk about this tomorrow
-- Fix egregious bug with the myLog10SumLog10 function doing the exact opposite of the requested arguments, so that doExact really meant don't do exact
-- PairHMM now exposes computeReadLikelihoodGivenHaplotypeLog10 but subclasses must overload subComputeReadLikelihoodGivenHaplotypeLog10.  This protected function does the work, and the public function will do argument and result QC
-- Have to be more tolerant of reference (approximate) HMM.  All unit tests from the original HMM implementations pass now
-- Added lots of docs
-- Generalize unit tests with multiple equivalent matches of read to haplotype
-- Added runtime argument checking for initial and computeReadLikelihoodGivenHaplotypeLog10
-- Functions to dumpMatrices for debugging
-- Fix nasty bug (without original unit tests) in LoglessPairHMM
-- Max read and haplotype lengths only worked in previous code if they were exactly equal to the provided read and haplotype sizes.  Fixed bug.  Added unit test to ensure this doesn't break again.
-- Added dupString(string, n) method to Utils
-- Added TODOs for next commit.  Need to compute number of potential start sites not in initialize but in the calc routine since this number depends not on the max sizes but the actual read sizes
-- Unit tests for the hapStartIndex functionality of PairHMM
-- Moved computeFirstDifferingPosition to PairHMM, and added unit tests
-- Added extensive unit tests for the hapStartIndex functionality of computeReadLikelihoodGivenHaplotypeLog10
-- Still TODOs left in the code that I'll fix up
-- Logless now computes constants, if they haven't been initialized yet, even if you forgot to say so
-- General: the likelihood penalty for potential start sites is now properly computed against the actual read and reference bases, not the maximum.  This involved moving some initialize() code into the computeLikelihoods function.  That's ok because all of the potential log10 functions are actually going to cached versions, so the slowdown is minimal
-- Added some unit tests to ensure that common errors (providing haplotypes too long, reads too long, not initializing the HMM) are captured as errors
2013-02-09 19:19:22 -05:00
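The split between the public computeReadLikelihoodGivenHaplotypeLog10 and the protected subComputeReadLikelihoodGivenHaplotypeLog10 described in this commit can be sketched roughly as follows. Only those two method names come from the commit message; the class names, the constructor, the exact checks, and the toy subclass are assumptions. In particular, the toy subclass returns only the uniform start prior log10(1/N), and it assumes N = haplotype length - read length + 1, which the commit does not state.

```java
// Hypothetical sketch; class names and checks are illustrative, not the GATK sources.
abstract class SketchPairHMM {
    private final int maxReadLength;
    private final int maxHaplotypeLength;

    SketchPairHMM(int maxReadLength, int maxHaplotypeLength) {
        this.maxReadLength = maxReadLength;
        this.maxHaplotypeLength = maxHaplotypeLength;
    }

    // Public entry point: checks arguments, delegates, then checks the result.
    public final double computeReadLikelihoodGivenHaplotypeLog10(byte[] haplotype, byte[] read) {
        if (haplotype == null || read == null)
            throw new IllegalArgumentException("haplotype and read must be non-null");
        if (read.length > maxReadLength || haplotype.length > maxHaplotypeLength)
            throw new IllegalArgumentException("input exceeds the initialized maximum size");
        if (read.length > haplotype.length) // assumed contract: read length <= haplotype length
            throw new IllegalArgumentException("read is longer than haplotype");
        double result = subComputeReadLikelihoodGivenHaplotypeLog10(haplotype, read);
        if (Double.isNaN(result) || result > 0.0)
            throw new IllegalStateException("log10 likelihood must be a number <= 0: " + result);
        return result;
    }

    // Subclasses implement only the actual computation.
    protected abstract double subComputeReadLikelihoodGivenHaplotypeLog10(byte[] haplotype, byte[] read);

    public static void main(String[] args) {
        SketchPairHMM hmm = new UniformStartOnlyHMM(100, 100);
        System.out.println(hmm.computeReadLikelihoodGivenHaplotypeLog10("ACGT".getBytes(), "ACG".getBytes()));
    }
}

// Toy subclass: returns only the uniform start prior, log10(1/N).
class UniformStartOnlyHMM extends SketchPairHMM {
    UniformStartOnlyHMM(int maxReadLength, int maxHaplotypeLength) {
        super(maxReadLength, maxHaplotypeLength);
    }

    @Override
    protected double subComputeReadLikelihoodGivenHaplotypeLog10(byte[] haplotype, byte[] read) {
        int n = haplotype.length - read.length + 1; // assumed count of potential start sites
        return -Math.log10(n);                      // log10(1/N)
    }
}
```

Keeping the checks in one final public method means every subclass gets the same argument and result QC without repeating it.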
Mark DePristo 09595cdeb9 Remove ExactPairHMM and OriginalPairHMM, everyone just uses Log10PairHMM with appropriate arguments 2013-02-09 13:06:54 -05:00
Mark DePristo 2d802e17a4 Delete the CachingPairHMM 2013-02-09 13:06:54 -05:00
Mark DePristo 7dcafe8b81 Preliminary version of LoglessCachingPairHMM that avoids positive likelihoods
-- Would have been squashed but could not because of subsequent deletion of Caching and Exact/Original PairHMMs
-- Actual working unit tests for PairHMMUnitTest
-- Fixed incorrect logic in how I compared hmm results to the theoretical and exact results
-- PairHMM has protected variables used throughout the subclasses
2013-02-09 13:06:54 -05:00
Mauricio Carneiro b7593aeadc Removing the symlink from the private license file
We had identified this problem before, but Dropbox tricked me into pushing it again into the repo.
2013-02-09 12:57:44 -05:00
Mark DePristo ca76de0619 Move ProcessUtilsUnitTest to private 2013-02-09 12:34:45 -05:00
MauricioCarneiro f5e52b72ea Merge pull request #23 from broadinstitute/md_process_utils_unit_tests
UnitTests for ProcessUtils
2013-02-09 09:27:31 -08:00
MauricioCarneiro 3ff10ab277 Merge pull request #24 from broadinstitute/md_ngsplatform_unittests
Expand NGSPlatform to meet SAM 1.4 spec, with full unit tests
2013-02-09 09:27:03 -08:00
MauricioCarneiro 7dbdc4ea6a Merge pull request #22 from broadinstitute/md_better_contig_comparer
Generalize and fixup ContigComparator
2013-02-09 09:26:37 -08:00
Mark DePristo b127fc6a1a Expand NGSPlatform to meet SAM 1.4 spec, with full unit tests
-- Added CAPILLARY and HELICOS platforms as required by spec 1.4
-- Added extensive unit tests to ensure NGSPlatform functions work as expected.
-- Fixed some NPE bugs for reads that don't have RGs or PLs in their RG fields
2013-02-09 11:16:21 -05:00
Mark DePristo fc3307a97f UnitTests for ProcessUtils 2013-02-09 10:13:01 -05:00
Mark DePristo 7fb620dce7 Generalize and fixup ContigComparator
-- Now uses a SAMSequenceDictionary to do the comparison of contigs (which is the right way to do it)
-- Added unit tests
2013-02-09 09:52:13 -05:00
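Comparing contigs by their position in a sequence dictionary, rather than by name, might look like the sketch below. It uses a plain list of contig names as a stand-in for a SAMSequenceDictionary, and the class name is hypothetical; it is not the actual ContigComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: a List<String> plays the role of the sequence dictionary.
class DictionaryOrderComparator implements Comparator<String> {
    private final Map<String, Integer> indexByContig = new HashMap<>();

    DictionaryOrderComparator(List<String> contigsInDictionaryOrder) {
        for (int i = 0; i < contigsInDictionaryOrder.size(); i++) {
            indexByContig.put(contigsInDictionaryOrder.get(i), i);
        }
    }

    @Override
    public int compare(String a, String b) {
        return Integer.compare(indexOf(a), indexOf(b));
    }

    private int indexOf(String contig) {
        Integer index = indexByContig.get(contig);
        if (index == null) {
            throw new IllegalArgumentException("Contig not in dictionary: " + contig);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("chr1", "chr2", "chr10", "chrX");
        List<String> contigs = new ArrayList<>(Arrays.asList("chr10", "chrX", "chr2", "chr1"));
        contigs.sort(new DictionaryOrderComparator(dictionary));
        System.out.println(contigs); // dictionary order, not lexicographic order
    }
}
```

A lexicographic sort would place chr10 before chr2; ordering by dictionary index avoids that.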
Mark DePristo a3dc7dc5cb Extend AWS timeout for uploads of the GATK run reports to 30 seconds 2013-02-08 17:37:36 -05:00
depristo db5b5e3482 Merge pull request #21 from broadinstitute/mc_base_coverage_distribution_GSATDG-45
Mc base coverage distribution gsatdg 45
2013-02-08 09:45:14 -08:00
Mauricio Carneiro d004bfbe6f walker to calculate per base coverage distribution
-- Base distribution optionally includes deletions
-- Implemented an optional filtered coverage distribution option
-- Integration tests added for every feature of the traversal

This walker is especially fast for the task due to the ability to calculate uncovered bases without having to visit the loci. This capability should be made generic in the future for the advantage of DiagnoseTargets and DepthOfCoverage.
GSATDG-45 #resolve
2013-02-07 16:33:05 -05:00
Mauricio Carneiro 5f49c95cc1 Added distance across contigs calculation to GenomeLocs
-- distance across contigs is calculated given a sequence dictionary (from SAMFileHeader)
-- unit test added
GSATDG-45
2013-02-07 16:31:41 -05:00
depristo cd4aec177a Merge pull request #20 from broadinstitute/aw_reduceread_perf_1_GSA-761
Aw reduceread perf 1 gsa 761
2013-02-07 12:11:05 -08:00
depristo 8f9d317a52 Merge pull request #19 from broadinstitute/eb_add_alignment_utils_tests_GSA-735_GSA-736_GSA-737_GSA-738
Added contracts, docs, and tests for several methods in AlignmentUtils. ...
2013-02-07 12:10:35 -08:00
Eric Banks 9826192854 Added contracts, docs, and tests for several methods in AlignmentUtils. There are over 74K tests being run now for this class!
* AlignmentUtils.getMismatchCount()
* AlignmentUtils.calcAlignmentByteArrayOffset()
* AlignmentUtils.readToAlignmentByteArray().
* AlignmentUtils.leftAlignIndel()
2013-02-07 13:04:24 -05:00
Alec Wysoker e88bc753aa Replace map.containsKey followed by map.get with map.get followed by null check. 2013-02-07 11:58:41 -05:00
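The lookup pattern named in this commit can be sketched as below. The class, method names, and the counting use case are hypothetical. The two forms are equivalent only when the map never stores null values, which the sketch assumes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the two lookup styles; assumes the map holds no null values.
class SingleLookup {
    // Before: two hash lookups for a key that is present.
    static int countTwoLookups(Map<String, Integer> counts, String key) {
        if (counts.containsKey(key)) {
            return counts.get(key);
        }
        return 0;
    }

    // After: one hash lookup followed by a null check.
    static int countOneLookup(Map<String, Integer> counts, String key) {
        Integer value = counts.get(key);
        return value == null ? 0 : value;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("A", 3);
        System.out.println(countTwoLookups(counts, "A") == countOneLookup(counts, "A"));
        System.out.println(countTwoLookups(counts, "C") == countOneLookup(counts, "C"));
    }
}
```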
Alec Wysoker 72e496d6f3 Eliminate unnecessary zeroing out of primitive arrays immediately after new. 2013-02-07 11:57:43 -05:00
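The reason the zeroing is unnecessary is that Java initializes every element of a newly created primitive array to its default value (0 for numeric types). A minimal check, with hypothetical names:

```java
// Hypothetical illustration; not taken from the commit.
class ArrayDefaults {
    // Returns true if every element of the array is zero.
    static boolean allZero(int[] values) {
        for (int v : values) {
            if (v != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] fresh = new int[1024]; // no explicit fill needed
        System.out.println(allZero(fresh));
    }
}
```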
depristo cc7731d61f Merge pull request #18 from broadinstitute/eb_fix_RR_tests
Fixing the failing RR integration tests.
2013-02-06 10:02:13 -08:00
Eric Banks 481982202d Fixing the failing RR integration tests.
* After consulting Tim/David/Mauricio we determined that the md5 changes were due to different encodings of binary arrays in samjdk
   * However, it made no functional difference to the results (confirmed by Eric) so we agreed to update md5s
 * Also, the header of one of the test bams was malformed but old picard jar didn't perform checks so it only started failing now
   * Fixed the bam
2013-02-06 12:40:56 -05:00
depristo 462da7da8f Merge pull request #17 from broadinstitute/dr_variant_migration_cleanup
Minor build.xml cleanup post-variant-migration
2013-02-06 08:32:14 -08:00
David Roazen df142a389f Minor build.xml cleanup post-variant-migration
-Stop emitting our own (now empty) variant jar
-Correct BaseUtils package for the na12878kb jar
2013-02-06 11:16:52 -05:00
eitanbanks bd0349e570 Merge pull request #12 from broadinstitute/md_exact_fast_path_GSA-726
Fast path for biallelic variants in IndependentAllelesDiploidExactAFCalc
2013-02-06 07:43:02 -08:00
Mark DePristo 59df329776 Fast path for biallelic variants in IndependentAllelesDiploidExactAFCalc
-- If the VariantContext is a bi-allelic variant already, don't split up the VC (it doesn't do anything) and then combine it back together.  This saves us a lot of work on average
-- Be more protective of calls to AFCalc with a VariantContext that might only have ref allele, throwing an exception
2013-02-06 10:34:09 -05:00
eitanbanks 584899329c Merge pull request #13 from broadinstitute/dr_variant_migration_GSA-692
Replace org.broadinstitute.variant with jar built from the Picard repo
2013-02-06 07:22:30 -08:00
depristo bee127482b Merge pull request #16 from broadinstitute/eb_bqsr_fails_on_RR
Added check that BaseRecalibrator is not being run on a reduced bam.
2013-02-06 07:20:42 -08:00
Eric Banks 562f2406d7 Added check that BaseRecalibrator is not being run on a reduced bam.
- Throws user exception if it is.
 - Can be turned off with --allow_bqsr_on_reduced_bams_despite_repeated_warnings argument.
 - Added test to check this is working.
 - Added docs to BQSRReadTransformer explaining why this check is not performed on PrintReads end.
 - Added small bug fix to GenomeAnalysisEngine that I uncovered in this process.
 - Added comment about not changing the program record name, as per reviewer comments.
 - Removed unused variable.
2013-02-06 10:14:27 -05:00
depristo c677aa327c Merge pull request #15 from broadinstitute/eb_hapcaller_dbsnp_fix
Bug fix for NPE in HC with --dbsnp argument.
2013-02-06 04:29:23 -08:00
Eric Banks 4e5ff3d6f1 Bug fix for NPE in HC with --dbsnp argument.
- I had added the framework in the VA engine but should not have hooked it up to the HC yet since the RefMetaDataTracker is always null.
 - Added contracts and docs to the relevant methods in the VA engine so that this doesn't happen in the future.
2013-02-05 21:59:19 -05:00
Eric Banks e7c35a907f Fixes to BQSR for the --maximum_cycle_value argument.
- It's now written into the recal report so that it can be used in the PrintReads step.
  - Note that we also now write the --deletions_default_quality value which accidentally wasn't being written before!
  - Added tests to make sure that the value of the --maximum_cycle_value is being used properly by PR with -BQSR.
(This is my last non-branch commit; all future pushes will follow new GATK practices)
2013-02-05 17:38:03 -05:00
David Roazen e7e76ed76e Replace org.broadinstitute.variant with jar built from the Picard repo
The migration of org.broadinstitute.variant into the Picard repo is
complete. This commit deletes the org.broadinstitute.variant sources
from our repo and replaces it with a jar built from a checkout of the
latest Picard-public svn revision.
2013-02-05 17:24:25 -05:00
Ryan Poplin cb2dd470b6 Moving the random number generator over to using GenomeAnalysisEngine.getRandomGenerator in the logless versus exact pair hmm unit test. We don't believe this will fix the problem with the non-deterministic test failures but it will give us more information the next time it fails. 2013-02-05 12:56:20 -05:00
Mauricio Carneiro f6bc5be6b4 Fixing license on Yossi's file
Somebody needs to set up the license hook ;-)
2013-02-05 11:14:43 -05:00
MauricioCarneiro 050c4794a5 Merge pull request #11 from yfarjoun/per_sample2
-Added Per-Sample Contamination Removal to UnifiedGenotyper: Added an @A...
2013-02-05 08:04:29 -08:00
Eric Banks 00c98ff0cf Need to reset the static counter before tests are run or else we won't be deterministic.
Also need to give credit where credit is due: David was right that this was not a non-deterministic Bamboo failure...
2013-02-05 10:41:46 -05:00
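Why a static counter makes test output depend on execution order can be shown with a small sketch. The commit does not name the counter it resets, so every name below is hypothetical.

```java
// Hypothetical example; the commit does not name the counter it resets.
class StaticCounterDemo {
    private static int counter = 0;

    // Generates names that depend on how many were generated before.
    static String nextName() {
        return "consensus" + (counter++);
    }

    // Called before each test so that generated names do not depend on test order.
    static void reset() {
        counter = 0;
    }

    public static void main(String[] args) {
        reset();
        System.out.println(nextName());
        System.out.println(nextName());
        reset(); // without this, the next name would continue from the previous count
        System.out.println(nextName());
    }
}
```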
Eric Banks 23c6aee236 Added in some basic unit tests for polyploid consensus creation in RR.
- Uncovered small bug in the fix that I added yesterday, which is now fixed properly.
- Uncovered massive general bug: polyploid consensus is totally busted for deletions (because of call to read.getReadBases()[readPos]).
  - Need to consult Mauricio on what to do here (are we supporting het compression for deletions?  Insertions are definitely not supported)
2013-02-05 10:35:45 -05:00