12936 Commits (7cd304fb41d9e7bfc0bf41edc74b87f1ef943373)

Author SHA1 Message Date
Eric Banks 7ed5344f8b Merge pull request #447 from broadinstitute/dr_segregate_kb_tests
Separate tests that access the knowledge base from other tests
2013-12-06 08:43:07 -08:00
David Roazen 10dc038a24 Separate tests that access the knowledge base from other tests
The tests that access the knowledge base are interfering with the basic
ability to run the unit/integration test suite to completion -- these
few tests often take hours to complete.

Created a new class of test ("KnowledgeBaseTest") that runs separately
from the unit/integration test suite, with corresponding build target.
A new bamboo plan will be set up to run these tests independently so
that they don't interfere with unit/integration testing.

With this change, plus the recent changes to the parallel test runner,
unit/integration test suite runtime should be back down to ~30 minutes
on average.
2013-12-06 11:31:35 -05:00
Eric Banks 1a0e140ab5 Merge pull request #445 from broadinstitute/dr_rev_picard_for_2.8
Rev picard, sam-jdk, tribble, and variant jars to 1.104.1628
2013-12-05 15:03:27 -08:00
Eric Banks 32cca883fc Merge pull request #444 from broadinstitute/dr_parallel_test_runner_adjustments
Tweak parallel test runner in attempt to decrease spurious failures
2013-12-05 15:02:39 -08:00
David Roazen 47ea3c3b22 Tweak parallel test runner in attempt to decrease spurious failures
-Run with -W 240 to give tests more time to complete and hopefully
 stop jobs from getting killed with TERM_RUNLIMIT

-Switch to /humgen/gsa-hpprojects for test working directories, since
 /broad/hptmp has been unacceptably slow lately

     Time to create test working directory, 12/5/13:
     /broad/hptmp: 19 minutes
     /humgen/gsa-hpprojects: 4 minutes
2013-12-05 13:49:37 -05:00
David Roazen 0e65296efb Rev picard, sam-jdk, tribble, and variant jars to 1.104.1628
-update VariantFiltration to work with new Lazy wrapper around the
 JexlEngine in VariantContextUtils
2013-12-05 12:45:32 -05:00
Eric Banks 6d2fcd2df9 Merge pull request #443 from broadinstitute/eb_better_doc_for_minpruning
Added docs for the minPruning argument in the HC
2013-12-05 08:52:56 -08:00
Eric Banks e022db4690 Added docs for the minPruning argument in the HC 2013-12-05 11:50:56 -05:00
Eric Banks 623aaa0d6f Merge pull request #442 from broadinstitute/gg_fixdoc_deletions
Fixed documentation for -deletions argument in the UAC
2013-12-04 17:37:18 -08:00
Geraldine Van der Auwera 3ab2f4edb2 Fixed documentation for -deletions argument in the UAC 2013-12-04 19:55:24 -05:00
amilev 0d94019bd6 Merge pull request #434 from broadinstitute/mc_dt_gccontent
Add GC Content to DiagnoseTargets
2013-12-04 09:42:26 -08:00
Eric Banks 41a0aecb07 Merge pull request #441 from broadinstitute/jt_gvcf_idx_user_error
Jt gvcf idx user error
2013-12-03 21:54:11 -08:00
Joel Thibault 5fe0531b4d Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy 2013-12-03 23:12:14 -05:00
Joel Thibault 8571a641bf Add @Advanced to variant_index_type and variant_index_parameter 2013-12-03 23:12:14 -05:00
Mauricio Carneiro 701ede2817 Add GC Content to DiagnoseTargets 2013-12-03 23:04:40 -05:00
droazen 61b50a02b1 Merge pull request #431 from broadinstitute/jt_custom_vcf_idx
Add engine options to override the default VCF/BCF indexing strategy
2013-12-03 19:32:36 -08:00
Joel Thibault fd0a02e52e New VCF engine arguments to specify an alternate IndexCreator
- CatVariants updates to use custom VCF indices
- Scala scripts for VCF index testing
2013-12-03 13:31:02 -05:00
Joel Thibault 42f78bdb3a Add a class-based DataProvider 2013-12-03 13:31:01 -05:00
Joel Thibault cd3ee2ae7e whitespace 2013-12-03 13:31:01 -05:00
Joel Thibault ed6f069191 Rev Picard 1.102.1595 2013-12-03 13:31:01 -05:00
Eric Banks d90b295570 Merge pull request #440 from broadinstitute/eb_selection_should_keep_pls
Eb selection should keep pls
2013-12-03 07:08:01 -08:00
Eric Banks cb2f228f5a Archiving SelectVariantsFromMongo since it has started to diverge from SelectVariants 2013-12-03 09:23:16 -05:00
Eric Banks 6bee6a1b53 Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles.
Previously, we would strip out the PLs and AD values since they were no longer accurate.  However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.

Now the PLs and AD get correctly selected down.

While I was in there I also refactored some related code in subsetDiploidAlleles().  There were no real changes there - I just
broke it out into smaller chunks as per our best practices.

Added unit tests and updated integration tests.
Addressed reviews.
2013-12-03 09:23:03 -05:00
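For the SelectVariants change above (6bee6a1b53), here is a rough sketch of what selecting PLs and AD "down" can look like for a diploid record that has lost alternate alleles. It assumes the standard VCF genotype ordering, where genotype (j, k) with j <= k sits at index k*(k+1)/2 + j, and it assumes the subset PLs are shifted so the smallest is 0 again. The class and method names are hypothetical; this is not the SelectVariants code itself.

```java
import java.util.Arrays;

class PlSubsetSketch {
    // Index of diploid genotype (j, k), j <= k, in VCF PL ordering.
    static int plIndex(int j, int k) {
        return k * (k + 1) / 2 + j;
    }

    // keep: original allele indices to retain, ascending, starting with 0 (the reference).
    static int[] subsetPLs(int[] pls, int[] keep) {
        int n = keep.length;
        int[] out = new int[n * (n + 1) / 2];
        int min = Integer.MAX_VALUE;
        for (int k = 0; k < n; k++) {
            for (int j = 0; j <= k; j++) {
                // Copy the PL of the genotype formed by the two retained alleles.
                int value = pls[plIndex(keep[j], keep[k])];
                out[plIndex(j, k)] = value;
                min = Math.min(min, value);
            }
        }
        // Assumed renormalization: shift so the most likely genotype has PL 0.
        for (int i = 0; i < out.length; i++) {
            out[i] -= min;
        }
        return out;
    }

    // AD has one entry per allele, so subsetting is a direct pick.
    static int[] subsetAD(int[] ad, int[] keep) {
        int[] out = new int[keep.length];
        for (int i = 0; i < keep.length; i++) {
            out[i] = ad[keep[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pls = {50, 20, 60, 10, 0, 40}; // 0/0, 0/1, 1/1, 0/2, 1/2, 2/2
        int[] ad = {12, 3, 9};
        int[] keep = {0, 2};                 // drop the first alternate allele
        System.out.println(Arrays.toString(subsetPLs(pls, keep))); // [40, 0, 30]
        System.out.println(Arrays.toString(subsetAD(ad, keep)));   // [12, 9]
    }
}
```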
Valentin Ruano Rubio b1073fb17b Merge pull request #439 from broadinstitute/vrr_graphLikelihoods2
Adding Graph-based likelihoods calculation
2013-12-02 18:54:15 -08:00
Valentin Ruano-Rubio 0f99778a59 Adding Graph-based likelihood ratio calculation to HC
To activate this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelerate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

Indicates how to merge haplotypes produced using different kmerSizes.
Only has an effect when used in combination with (--likelihoodCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the largest kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the largest kmerSize with haplotypes assembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSizes are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to support
   local PairHMM calculations (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: new PairHMM implementation using some heuristics to speed up partial PairHMM calculations.

 * GraphBasedLikelihoodCalculationEngine: delegates the execution of the graph-based likelihood approach
     to GraphBasedLikelihoodCalculationEngineInstance.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active region; implements the graph traversals
     that calculate the likelihoods using the graph as a scaffold.

 * HaplotypeGraph: haplotype threading graph built from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information about how a read maps onto the HaplotypeGraph, which is
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed problems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unnecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed problem with HaplotypeGraph unit test

2013-12-02 19:37:19 -05:00
Valentin Ruano-Rubio 00116609e4 Archive addition as a result of the work on adding Graph-based likelihood ratio calculation to HC. 2013-12-02 19:33:14 -05:00
Eric Banks 84ddfb41b5 Merge pull request #438 from broadinstitute/rp_vqsr_num_bad_stability_fixes_and_runtime_optimizations
Various VQSR optimizations in runtime and stability.
2013-12-02 08:37:37 -08:00
Ryan Poplin 6a922e7aca Merge pull request #435 from broadinstitute/eb_fix_ug_bug_for_long_deletions
Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads.
2013-12-02 08:09:23 -08:00
Ryan Poplin b57054c63c Various VQSR optimizations in both runtime and accuracy.
-- For very large whole genome datasets with over 2M variants overlapping the training data, randomly downsample the training set that gets used to build the Gaussian mixture model.
-- Annotations are ordered by the difference in means between known and novel instead of by their standard deviation.
-- Removed the training set quality score threshold.
-- Now uses 2 Gaussians by default for the negative model.
-- Num bad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.
-- Model plots are now generated much faster.
-- Stricter threshold for determining model convergence.
-- All VQSR integration tests change because of these changes to the model.
-- Add test for downsampling of training data.
2013-11-29 13:04:46 -05:00
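The random downsampling of the training set described in b57054c63c can be pictured roughly as below. This is a minimal sketch assuming a uniform random subset with a fixed seed; the class name, method name, and the way the size cap is passed in are hypothetical, and the actual VQSR code may select variants differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class TrainingDownsampleSketch {
    // Returns the input unchanged when it is small enough,
    // otherwise a uniformly random subset of exactly maxSize elements.
    static <T> List<T> downsample(List<T> training, int maxSize, long seed) {
        if (training.size() <= maxSize) {
            return training;
        }
        List<T> copy = new ArrayList<>(training); // leave the caller's list untouched
        Collections.shuffle(copy, new Random(seed));
        return new ArrayList<>(copy.subList(0, maxSize));
    }

    public static void main(String[] args) {
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            training.add(i);
        }
        // A cap of 4 stands in for the "over 2M variants" threshold in the commit message.
        System.out.println(downsample(training, 4, 42L).size()); // 4
    }
}
```

A fixed seed keeps the subset reproducible between runs, which matters when integration tests compare model output.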
Eric Banks 6d43952ccc Merge pull request #436 from broadinstitute/eb_fix_mq_byte_issue_in_RR
Bug fix for RR: stop (incorrectly) pulling the MQ out of the SAMRecord as a byte instead of an int.
2013-11-27 19:25:36 -08:00
Eric Banks df6499e58c Bug fix for RR: stop (incorrectly) pulling the MQ out of the SAMRecord as a byte instead of an int.
For reads with high MQs (greater than max byte) the MQ was being treated as negative and failing
the min MQ filter.

Added unit test.

Delivers PT#61567540.
2013-11-27 18:55:03 -05:00
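The failure mode fixed in df6499e58c comes from Java's signed 8-bit byte: any value above 127 wraps to a negative number when narrowed. The sketch below reproduces the effect with hypothetical method names and an illustrative minimum MQ of 20; it is not the ReduceReads code.

```java
class MappingQualityByteSketch {
    // Buggy variant: narrowing to byte wraps values above 127 to negatives.
    static boolean passesMinMqAsByte(int mappingQuality, int minMappingQuality) {
        byte truncated = (byte) mappingQuality;
        return truncated >= minMappingQuality;
    }

    // Fixed variant: keep the mapping quality as an int.
    static boolean passesMinMqAsInt(int mappingQuality, int minMappingQuality) {
        return mappingQuality >= minMappingQuality;
    }

    public static void main(String[] args) {
        System.out.println((byte) 255);                 // -1
        System.out.println(passesMinMqAsByte(255, 20)); // false
        System.out.println(passesMinMqAsInt(255, 20));  // true
    }
}
```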
Eric Banks 51d1a26725 Bug fix for something Guillermo added to UG before he left to support calling indels from reduced reads.
His code was excessively clipping reads because it was looking at their cigar string instead of just
the read length.  This meant that it was basically impossible to call large deletions in UG even with
perfect evidence in the reads (as reported by Craig D).

Integration tests change because (IMO after looking at sites in IGV) reads with indels similar to the one
being genotyped used to be given too much likelihood and now give less.

Added unit tests for new methods.
2013-11-27 13:54:39 -05:00
Eric Banks 42bf83cdc8 Merge pull request #433 from broadinstitute/eb_finish_chartl_likelihood_posteriors
Introducing the latest-and-greatest in genotyping: CalculatePosteriors.
2013-11-27 10:53:05 -08:00
Chris Hartl 1f777c4898 Introducing the latest-and-greatest in genotyping: CalculatePosteriors.
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.

Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
 while not converged:
  * AC = [external AC] + [sample AC]
  * Prior = DirichletMultinomial[AC]
  * Posteriors = [sample GL + Prior]
  * sample AC = MLEAC(Posteriors)

This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common use case.

Fully unit tested.

Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
2013-11-27 13:00:45 -05:00
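To make the model in 1f777c4898 concrete, here is a minimal sketch for the biallelic, ploidy=2 case. It assumes the panel allele counts are used directly as Dirichlet parameters (so both must be positive; a pseudocount would be needed for an unobserved allele), treats the genotype as two draws from the resulting Dirichlet-Multinomial, and multiplies that prior by the sample's log10 genotype likelihoods. The class and method names are hypothetical and this is not the CalculatePosteriors implementation.

```java
class DirichletMultinomialSketch {
    // Prior over {hom-ref, het, hom-alt} for two draws from a
    // Dirichlet-Multinomial with parameters (refCount, altCount).
    static double[] genotypePrior(double refCount, double altCount) {
        double total = refCount + altCount;
        double denom = total * (total + 1.0);
        return new double[] {
            refCount * (refCount + 1.0) / denom,
            2.0 * refCount * altCount / denom,
            altCount * (altCount + 1.0) / denom
        };
    }

    // Posterior is proportional to likelihood times prior, then normalized to sum to 1.
    static double[] posteriors(double[] log10Likelihoods, double refCount, double altCount) {
        double[] prior = genotypePrior(refCount, altCount);
        double maxGl = Math.max(log10Likelihoods[0],
                Math.max(log10Likelihoods[1], log10Likelihoods[2]));
        double[] post = new double[3];
        double sum = 0.0;
        for (int i = 0; i < 3; i++) {
            // Subtract the max before exponentiating to avoid underflow.
            post[i] = Math.pow(10.0, log10Likelihoods[i] - maxGl) * prior[i];
            sum += post[i];
        }
        for (int i = 0; i < 3; i++) {
            post[i] /= sum;
        }
        return post;
    }

    public static void main(String[] args) {
        // Flat likelihoods: the posterior simply reproduces the panel prior.
        double[] post = posteriors(new double[] {0.0, 0.0, 0.0}, 3.0, 1.0);
        System.out.printf("%.3f %.3f %.3f%n", post[0], post[1], post[2]);
    }
}
```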
Eric Banks 6d0d555925 Merge pull request #432 from broadinstitute/gg_fixSplitSamFile
Set SAMFileWriter to create index in ReadUtils to fix SplitSamFile issue
2013-11-26 13:01:22 -08:00
Geraldine Van der Auwera 429582589f Set SAMFileWriter to create index in ReadUtils to fix SplitSamFile issue 2013-11-26 15:54:47 -05:00
Eric Banks 5da901a8b1 Merge pull request #430 from broadinstitute/gg_patch_queueextensions
Patched Queue extensions lacking a main class definition
2013-11-25 08:09:42 -08:00
Ryan Poplin abf70dc071 Merge pull request #428 from broadinstitute/eb_various_HC_bug_fixes
Eb various hc bug fixes
2013-11-25 06:36:31 -08:00
Geraldine Van der Auwera 25bc6e64ae Patched Queue extensions lacking a main class definition 2013-11-22 14:57:09 -05:00
Eric Banks 0fac4fb3b6 Make the reference model calculation work with reduced reads.
It's just a matter of using PileupElement.getRepresentativeCount() instead of '++'.
2013-11-21 10:53:33 -05:00
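The difference between '++' and getRepresentativeCount() in 0fac4fb3b6 can be sketched as follows. The Element interface is a hypothetical stand-in for PileupElement, and the sketch assumes that an element from a reduced read reports how many original reads it represents while an ordinary element reports 1.

```java
import java.util.List;

class RepresentativeCountSketch {
    // Hypothetical stand-in for a pileup element.
    interface Element {
        int getRepresentativeCount();
    }

    // Old behavior: every element counts once, which undercounts reduced reads.
    static int depthByIncrement(List<Element> pileup) {
        int depth = 0;
        for (Element ignored : pileup) {
            depth++;
        }
        return depth;
    }

    // New behavior: each element contributes the number of reads it represents.
    static int depthByRepresentativeCount(List<Element> pileup) {
        int depth = 0;
        for (Element element : pileup) {
            depth += element.getRepresentativeCount();
        }
        return depth;
    }

    public static void main(String[] args) {
        // One ordinary element and one reduced element standing for 5 reads.
        List<Element> pileup = List.<Element>of(() -> 1, () -> 5);
        System.out.println(depthByIncrement(pileup));           // 2
        System.out.println(depthByRepresentativeCount(pileup)); // 6
    }
}
```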
Eric Banks adb77b406f Fixed poor implementation of isRefSource() and isRefSink() among others.
There was already a note in the code about how wrong the implementation was.
The bad code was causing a single-node graph to get cleaned up into nothing when pruning tails.
Delivers PT #61069820.
2013-11-21 10:53:27 -05:00
Eric Banks b5f3bf5f06 Merge pull request #427 from broadinstitute/gg_update_gatkdocs_template
Tweaked gatkdocs index template
2013-11-19 17:59:08 -08:00
Geraldine Van der Auwera b42ccdce11 Tweaked gatkdocs index template 2013-11-19 15:05:41 -05:00
amilev b603e6674d Merge pull request #425 from broadinstitute/ami_moleculo_project_changes
Ami moleculo project changes (add 'final's based on review)
2013-11-18 14:33:49 -08:00
Ami Levy-Moonshine e6ef37de1d Add an option to filter the read bases that are taken into account for the covered intervals. For that, two new arguments were added: minBaseQuality and minMappingQuality 2013-11-18 17:29:32 -05:00
Ami Levy-Moonshine 6ad841cec5 Rewrite ReadLengthDistribution to count the read lengths into a hash table first and produce a GATK report table only at the end.
Before that fix, the tool couldn't work with more than one RG.
- Address all review comments
2013-11-18 17:29:31 -05:00
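The count-first, report-last approach in 6ad841cec5 might look roughly like this. It is a sketch only: the class and method names are hypothetical, counts are keyed by read group and then by read length, and a plain tab-separated dump stands in for the GATK report table.

```java
import java.util.Map;
import java.util.TreeMap;

class ReadLengthHistogramSketch {
    // read group -> (read length -> number of reads)
    private final Map<String, Map<Integer, Integer>> counts = new TreeMap<>();

    // Called once per read while traversing; no table is built yet.
    void add(String readGroup, int readLength) {
        counts.computeIfAbsent(readGroup, rg -> new TreeMap<>())
              .merge(readLength, 1, Integer::sum);
    }

    int count(String readGroup, int readLength) {
        return counts.getOrDefault(readGroup, Map.of()).getOrDefault(readLength, 0);
    }

    // Called once at the end: one row per (read group, length) pair.
    String toTable() {
        StringBuilder table = new StringBuilder("readGroup\tlength\tcount\n");
        for (Map.Entry<String, Map<Integer, Integer>> group : counts.entrySet()) {
            for (Map.Entry<Integer, Integer> entry : group.getValue().entrySet()) {
                table.append(group.getKey()).append('\t')
                     .append(entry.getKey()).append('\t')
                     .append(entry.getValue()).append('\n');
            }
        }
        return table.toString();
    }

    public static void main(String[] args) {
        ReadLengthHistogramSketch histogram = new ReadLengthHistogramSketch();
        histogram.add("rg1", 101);
        histogram.add("rg1", 101);
        histogram.add("rg2", 76);
        System.out.print(histogram.toTable());
    }
}
```

Because nothing is written until every read has been seen, read groups can appear in any order and in any number.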
amilev cc85e373d0 Merge pull request #426 from broadinstitute/ami_fix_MoleculoPipeline_suffix_bug
fix a (ugly) weird error from last commit that changed all the scala fil...
2013-11-18 08:55:46 -08:00
Ami Levy-Moonshine 9c1023c933 fix an (ugly) weird error from last commit that changed all the scala files to end with MoleculoPipeline.scala 2013-11-18 11:44:24 -05:00
amilev 6448cc53f5 Merge pull request #422 from broadinstitute/ami-NPE-fix-NA12878
fix missed NPE error in AssessNA12878. Add a proper check and error mess...
2013-11-14 12:06:16 -08:00
Ami Levy-Moonshine 2ff0c23b53 fix missed NPE error in AssessNA12878. Add a check and a clear error message.
add unitTest for that case (when a file has genotypes but does not have NA12878 as a sample)
2013-11-14 14:59:47 -05:00