Commit Graph

983 Commits (c163e6d0d2b57863facd1dd280cfaeca18e8fae1)

Author SHA1 Message Date
Khalid Shakir f02ce6eca7 Added tests for cleaning up scattered .bai files, and using the log directory.
Re-added import java.io.File for BamGatherFunction.
Other cleanup to resolve scala syntax warnings from intellij.
Moved Example UG script from protected to public.
2014-02-26 02:11:28 +08:00
Eric Banks 0f30df0356 Stopgap procedure to rescue Fisher Strand for cases where there's lots of data.
This commit consists of 2 main changes:
1. When the strand table gets too large, we normalize it down to values that are more reasonable.
2. We don't include a particular sample's contribution unless the total ref and alt counts are at least 2 each;
this is a heuristic method for dealing only with hets.

MD5s change as expected.
Hopefully we'll have a more robust implementation for GATK 3.1.
2014-02-25 01:04:27 -05:00
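The two heuristics described in this commit can be sketched roughly as follows. This is a minimal illustration, not the GATK implementation: the function names, the `MAX_TABLE_SUM` value and the exact scaling rule are assumptions made for the example, since the commit message does not state them.

```python
# Hypothetical threshold: the commit does not say what size counts as "too large".
MAX_TABLE_SUM = 200


def normalize_table(table, max_sum=MAX_TABLE_SUM):
    """Scale a 2x2 strand table [[ref_fwd, ref_rev], [alt_fwd, alt_rev]]
    down proportionally when its total exceeds max_sum."""
    total = sum(sum(row) for row in table)
    if total <= max_sum:
        return [row[:] for row in table]
    # Integer arithmetic keeps the cells as counts (rounded down).
    return [[count * max_sum // total for count in row] for row in table]


def add_sample(table, sample_counts, min_count=2):
    """Add one sample's counts to the table, skipping samples whose total
    ref or total alt count is below min_count (the het-only heuristic)."""
    ref_total = sum(sample_counts[0])
    alt_total = sum(sample_counts[1])
    if ref_total < min_count or alt_total < min_count:
        return [row[:] for row in table]
    return [
        [t + s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(table, sample_counts)
    ]
```

Scaling every cell by the same factor keeps the strand proportions roughly intact while bounding the table size passed to the exact test.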
droazen e8ea9f58d3 Merge pull request #531 from broadinstitute/ks_build_patches
Build patches
2014-02-24 15:13:16 -05:00
Valentin Ruano-Rubio 0b3a70b8c1 Fix for a bug in (Assembly Graph) Routes.
The slicePrefix method functionality was broken.

Story:

https://www.pivotaltracker.com/story/show/64595624

Changes:

1. Fixed the bug.
2. Added unit test to check on the method functionality.
3. Added an integration test to verify that the bug has been fixed in a reproducible case from empirical data.
2014-02-24 10:54:39 -05:00
Khalid Shakir 7e516b294f Replaced local drmaa and Jama artifacts with versions from maven central.
Removed unused caliper binary from local repo.
2014-02-22 01:21:35 +08:00
Valentin Ruano-Rubio 463af7143f Activate reverse allele trimming in GVCF
Story:

https://www.pivotaltracker.com/s/projects/1007536

Changes:

1. HC's GenotypingEngine now invokes reverseAlleleTrimming on GVCF variant output lines.
2. GenotypeGVCFs also reverse trim after regenotyping as some alt. alleles are dropped (observed in real-data).
2014-02-20 03:17:24 -05:00
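Reverse allele trimming, as the term is generally used, removes bases shared at the end of every allele at a site. The sketch below shows the idea under simplifying assumptions: `reverse_trim` is a hypothetical name, alleles are plain base strings, and symbolic alleles such as <NON_REF> are not handled. It is not the GATK code.

```python
def reverse_trim(alleles):
    """Strip trailing bases common to all alleles, always leaving
    at least one base in each allele."""
    trimmed = list(alleles)
    while (all(len(a) > 1 for a in trimmed)
           and len({a[-1] for a in trimmed}) == 1):
        trimmed = [a[:-1] for a in trimmed]
    return trimmed
```

For instance, if dropping an alt allele leaves the reference `ACT` with a single alt `AT`, the shared trailing `T` is redundant and the pair can be reduced to `AC` and `A`.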
Eric Banks 53a7d5cbae Fixing a bug in the GVCF writer.
The writer was never resetting the pointer to the end of the last non-ref VariantContext that it saw.
This was fine except when it jumped to a new contig - and a lower position on that contig - where it
thought that it was still part of that previous non-ref VariantContext so wouldn't emit a reference
block.  Therefore, ref blocks were missing from the beginnings of all chromosomes (except chr1).

Added unit test to cover this case.
2014-02-20 02:33:43 -05:00
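A toy model of the bookkeeping error described in this commit, assuming the writer only needs to remember the contig and end position of the last non-ref record. The class and method names are hypothetical and do not correspond to the actual GVCF writer.

```python
class RefBlockTracker:
    """Remembers where the last non-ref VariantContext ended."""

    def __init__(self):
        self.contig = None
        self.last_nonref_end = -1

    def saw_nonref(self, contig, end):
        self.contig = contig
        self.last_nonref_end = end

    def inside_previous_nonref(self, contig, pos):
        # The fix: a position on a different contig can never be covered
        # by the previous non-ref record, so reset the pointer first.
        if contig != self.contig:
            self.contig = contig
            self.last_nonref_end = -1
        return pos <= self.last_nonref_end
```

Without the reset, a low position at the start of the next contig compares as "before the last non-ref end" and no reference block is emitted for it.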
Valentin Ruano-Rubio c167fb5fdf Fixing GenotypeGVCFs.
Bug uncovered by some untrimmed alleles in the single sample pipeline output.

Note, however, that this does not fix the untrimmed alleles in general.

Story:

https://www.pivotaltracker.com/story/show/65481104

Changes:

1. Fixed the bug itself.
2. Fixed non-working tests (silently skipped due to exception in dataProvider).
2014-02-19 14:20:39 -05:00
Ryan Poplin 43c20264b0 Initial commit of the random forest classifier. 2014-02-17 13:07:27 -05:00
droazen 688792c5b0 Merge pull request #520 from broadinstitute/jt_fix_failing_tests_post_maven
Fix for the Array Out of Bounds test error
2014-02-14 14:02:17 -05:00
Eric Banks 3724d4e5f3 Various small fixes for CalculateGenotypePosteriors based on feedback from guys in Ben Neale's group.
Note that this tool is still a work in progress and very experimental, so isn't 100% stable.  Most of
the features are untested (both by people and by unit/integration tests) because Chris Hartl implemented
it right before he left, and we're going to need to add tests at some point soon.  I added a first
integration test in this commit, but it's just a start.

The fixes include:

1. Stop having the genotyping code strip out AD values.  It doesn't make sense that it should do this so
I don't know why it was doing that at all.
Updated GenotypeGVCFs so that it doesn't need to manually recover them anymore.
This also helps CalculateGenotypePosteriors which was losing the AD values.
Updated code in LeftAlignAndTrimVariants to strip out PLs and AD, since it wasn't doing that before.
Updated the integration test for that walker to include such data.

2. Chris was calling Math.pow directly on the normalized posteriors which isn't safe.
Instead, the normalization routine itself can revert back to log scale in a safe manner so let's use it.
Also, renamed the variable to posteriorProbabilities (and not likelihoods).

3. Have CGP update the AC/AF/AN counts after fixing GTs.
2014-02-14 13:48:14 -05:00
Joel Thibault cb7ad01202 Re-enable the relevant tests 2014-02-14 12:34:08 -05:00
Joel Thibault c8a5007c85 Add a comment to the method where the error appears 2014-02-14 11:40:22 -05:00
Joel Thibault ec16439387 Clear the ReadCovariates keysCache before runs of individual Unit Tests
- normal runs have a constant covariate count, so this is not necessary
2014-02-14 10:41:28 -05:00
Eric Banks 7095a60c8e Merge pull request #516 from broadinstitute/dr_reenable_tests_failing_due_to_java_update
Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation
2014-02-13 21:05:18 -05:00
David Roazen 4b4b93ad1b Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation
After extensive detective work, Joel determined that these tests were failing
due to changes in the implementation of Math.pow() in newer versions of
Java 1.7.

All GSA members should ensure that they're using a JDK that is at least
as current as the one in the Java-1.7 dotkit on the Broad servers
(build 1.7.0_51-b13).
2014-02-12 16:08:16 -05:00
Joel Thibault cc9477aedb Minimal test for the multi-allelic reordering bug 2014-02-12 13:38:32 -05:00
Eric Banks 300b474c96 Several improvements to the single sample combining steps.
1. updated QualByDepth not to use AD-restricted depth if it is zero.
Added unit test for this change.

2. Fixed small bug in CombineGVCFs where spanning deletions were not being treated consistently throughout.
Added test for this situation.

3. Make sure GenotypeGVCFs puts in the required headers.
Updated test files to make sure this is covered.

4. Have GenotypeGVCFs propagate up the MLEAC/AF (which were getting clobbered out).
Tests updated to account for this.
2014-02-12 10:15:12 -05:00
David Roazen 95e1402d21 Add ability to run *KnowledgeBaseTests to maven
Run with: mvn verify -Dsting.knowledgebasetests.skipped=false
2014-02-11 14:08:24 -05:00
Eric Banks 303a60c8c6 Adding smarts to the QD annotation:
when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1.

Added new unit tests for QualByDepth.
2014-02-11 12:56:49 -05:00
Eric Banks 2e36dd9001 Refactoring of CombineGVCFs to make it run a lot faster.
Creating new VariantContexts each time we broke up a block was very expensive because we break up
blocks so often.  Also, calling into GATKVariantContextUtils.simpleMerge was really hurting performance.

MD5 changes because we no longer propagate any INFO fields (except for END) for reference blocks; the tests
have the now-unused BLOCK_SIZE field, which now gets dropped.
2014-02-11 03:18:52 -05:00
Eric Banks abef6cfcb6 Removing parameters that were incorrectly copied over from RegenotypeVariants. 2014-02-08 23:44:32 -05:00
Eric Banks 659a9f0e79 Removing the test for BLOCK_SIZE since we no longer emit it 2014-02-08 21:28:07 -05:00
Valentin Ruano-Rubio bf630abe88 Fixed nocall (./.) without PLs bug in GVCF output
Story:

https://www.pivotaltracker.com/story/show/65388246

Additional changes and notes:

1. The fix consists of forcing the output of all PLs by setting the standard flag for that, '-allSitePLs'.

2. BP_RESOLUTION was handled differently from GVCF in some aspects that should be common. That has been fixed.
2014-02-07 19:30:26 -05:00
Eric Banks d689f61005 Fixed up some of the genotype-level annotations being propagated in the single sample HC pipeline.
1. AD values now propagate up (they didn't before).
2. MIN_DP gets transferred over to DP and removed.
3. SB gets removed after FS is calculated.

Also, added a bunch of new integration tests for GenotypeGVCFs.
2014-02-07 12:47:54 -05:00
Eric Banks 67ed0d2403 The UG engine can return a null VC if there are tons of alt alleles, causing Tim's merge jobs to fail.
Pushing the null check up so that it doesn't error out in such cases.
2014-02-07 12:41:20 -05:00
Valentin Ruano-Rubio 4a3c8e68fa Fixed out of order non-variant gVCF entries when trimming is active.
Story:

https://www.pivotaltracker.com/story/show/65319564
2014-02-07 11:03:26 -05:00
Eric Banks eb463b505d Remove a whole bunch of unused annotations from gVCF output.
AC,AF,AN,FS,QD - they'll all be recomputed later.
BLOCK_SIZE and MIN_GQ were not necessary.

I also made the StrandBiasBySample annotation forced on when in gVCF mode.
It turns out that its output wasn't compatible with BCF so I patched it (and the variant jar too).
2014-02-07 08:49:36 -05:00
Eric Banks 2648219c42 Implementation of a hierarchical merger for gVCFs, called CombineGVCFs.
This tool will take any number of gVCFs and create a merged gVCF (as opposed to
GenotypeGVCFs which produces a standard VCF).

Added unit/integration tests and fixed up GATK docs.
2014-02-07 08:49:18 -05:00
Eric Banks 71b47a6148 Rename CombineReferenceCalculationVariants to GenotypeGVCFs 2014-02-06 15:46:19 -05:00
Khalid Shakir 3848159086 Added a set of serial tests to gatk/queue packages, which runs all tests under their package in one TestNG execution.
New properties to disable regenerating example resources artifact when each parallel test runs under packagetest.
Moved collection of packagetest parameters from shell scripts into maven profiles.
Fixed necessity of test-utils jar by removing incorrect dependenciesToScan element during packagetests.
When building picard libraries, run clean first.
Fixed tools jar dependency in picard pom.
Integration tests properly use the ant-bridge.sh test.debug.port variable, like unit tests.
2014-02-06 08:25:38 -05:00
Valentin Ruano Rubio 988e3b4890 Merge pull request #487 from broadinstitute/vrr_reference_model_with_trimming
Get gVCF to work without --dontTrimActiveRegions
2014-02-05 22:52:17 -05:00
Valentin Ruano-Rubio 98ffcf6833 Get gVCF to work without --dontTrimActiveRegions
Story:

https://www.pivotaltracker.com/story/show/65048706
https://www.pivotaltracker.com/story/show/65116908

Changes:

ActiveRegionTrimmer is now an argument collection, and it returns not only the trimmed-down active region but also the non-variant-containing flanking regions.
HaplotypeCaller code has been simplified significantly by pushing some functionality to other classes like ActiveRegion and AssemblyResultSet.

Fixed a problem with the way the trimming was done, which caused some gVCF non-variant records to have conservative 0,0,0 PLs.
2014-02-05 22:50:45 -05:00
Ryan Poplin 693bfac341 Bug fix for missing annotations in CombineReferenceCalculationVariants. They were being dropped in the handoff between engines in a couple of places.
-- Updated single sample pipeline test data using Valentin's files and re-enabled CRCV tests
2014-02-05 12:58:48 -05:00
Eric Banks 91bdf069d3 Some updates to CRCV.
1. Throw a user error when the input data for a given genotype does not contain PLs.
2. Add VCF header line for --dbsnp input
3. Need to check that the UG result is not null
4. Don't error out at positions with no gVCFs (which is possible when using a dbSNP rod)
2014-02-05 10:12:37 -05:00
Joel Thibault 9eaee8c73c Integration test for the -nt race condition corrupting AD and PL fields 2014-02-04 22:04:27 -05:00
David Roazen 1de7a27471 Disable an additional test that is runtime dependent on one of the temporarily-disabled tests 2014-02-04 16:07:58 -05:00
David Roazen 76086f30b7 Temporarily disable tests that started failing post-maven
Joel is working on these failures in a separate branch. Since
maven (currently! we're working on this..) won't run the whole
test suite to completion if there's a failure early on, we need
to temporarily disable these tests in order to allow group members
to run tests on their branches again.
2014-02-04 15:31:24 -05:00
Khalid Shakir 857e6e0d6f Bumped version to 2.8-SNAPSHOT, using new update_pom_versions.sh script. 2014-02-03 13:50:46 -05:00
Khalid Shakir 9ca3004fc3 Setting the test-utils' type to test-jar, such that the multi-module build uses testClasses instead of classes as a directory dependency. 2014-02-03 13:50:46 -05:00
Khalid Shakir de13f41fc3 One step closer to a proper test-utils artifact. Using the maven-jar-plugin to create a test classifier, excluding actual tests, until we can properly separate the classes into separate artifacts/modules. 2014-02-03 13:50:46 -05:00
Khalid Shakir caa76cdac4 Added maven pom.xmls for various artifacts. 2014-02-03 13:50:46 -05:00
Khalid Shakir 1e25a758f5 Moved files to maven directories.
Here are the git moved directories in case other files need to be moved during a merge:
  git-mv private/java/src/        private/gatk-private/src/main/java/
  git-mv private/R/scripts/       private/gatk-private/src/main/resources/
  git-mv private/java/test/       private/gatk-private/src/test/java/
  git-mv private/testdata/        private/gatk-private/src/test/resources/
  git-mv private/scala/qscript/   private/queue-private/src/main/qscripts/
  git-mv private/scala/src/       private/queue-private/src/main/scala/
  git-mv protected/java/src/      protected/gatk-protected/src/main/java/
  git-mv protected/java/test/     protected/gatk-protected/src/test/java/
  git-mv public/java/src/         public/gatk-framework/src/main/java/
  git-mv public/java/test/        public/gatk-framework/src/test/java/
  git-mv public/testdata/         public/gatk-framework/src/test/resources/
  git-mv public/scala/qscript/    public/queue-framework/src/main/qscripts/
  git-mv public/scala/src/        public/queue-framework/src/main/scala/
  git-mv public/scala/test/       public/queue-framework/src/test/scala/
2014-02-03 13:50:44 -05:00
Valentin Ruano-Rubio 89c4e57478 gVCF <NON_REF> in all VCF lines, including variant ones, when -ERC gVCF is requested.
Changes:
-------

  The <NON_REF> likelihood at variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for each read, it is the second-best likelihood among the reported alleles.

  When -ERC gVCF is requested, stand_conf_emit and stand_conf_call are forced to 0. Also, dontGenotype is set to false for consistency's sake.

  Integration test MD5 have been changed accordingly.

Additional fix:
--------------

  Especially after adding the <NON_REF> allele, though it also happened without it, QUAL values tended to go to 0 (a very large integer number in log10 scale) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that, combineGLs has been replaced by combineGLsPrecise, which uses the log-sum-exp trick.

  In just a few cases this change results in genotype changes in integration tests, but after double-checking with unit tests and comparing combineGLs against combineGLsPrecise in the affected integration tests, the previous GT calls were found to be either borderline cases and/or due to the underflow.
2014-01-30 11:23:33 -05:00
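The log-sum-exp trick mentioned in this commit can be illustrated in general terms as below. This is a generic sketch of the technique in log10 space, not the body of combineGLsPrecise; both function names are made up for the example.

```python
import math


def log10_sum_naive(log10_values):
    """Sum probabilities given in log10 space by exponentiating directly.
    Very negative inputs underflow to 0.0 before the sum is taken."""
    return math.log10(sum(10.0 ** v for v in log10_values))


def log10_sum_exp(log10_values):
    """Same sum via the log-sum-exp trick: factor out the maximum so the
    largest term becomes 10**0 == 1 and cannot underflow."""
    m = max(log10_values)
    if m == float("-inf"):
        return m
    return m + math.log10(sum(10.0 ** (v - m) for v in log10_values))
```

With inputs such as `[-400.0, -401.0]`, every term of the naive sum underflows to zero and the final logarithm fails, whereas the factored form only ever exponentiates differences relative to the maximum.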
Valentin Ruano-Rubio 748d2fdf92 Added integration test to verify that the bugs reported in Pivotal Tracker are no longer present 2014-01-26 23:29:31 -05:00
Valentin Ruano-Rubio 9e7bf75e89 Fix for the PairHMM transition probability miscalculation.
Problem:

The matchToMatch transition calculation was wrong, resulting in transition probabilities out of the Match state that summed to more than 1.

Reports:

https://www.pivotaltracker.com/s/projects/793457/stories/62471780
https://www.pivotaltracker.com/s/projects/793457/stories/61082450

Changes:

The transition matrix update code has been moved to a common place in PairHMMModel to consolidate its multiple copies.

The MatchToMatch transition calculation has been fixed and implemented in PairHMMModel.

Affected integration test MD5s have been updated. There were no differences in GT fields, and the differences examined always implied
small changes in likelihoods, which is what is expected.
2014-01-26 16:30:36 -05:00
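The constraint at stake is that the transitions leaving the Match state (to Match, to Insertion, to Deletion) form a probability distribution. A minimal sketch, assuming the gap-open probabilities come from Phred-scaled qualities; the function names are hypothetical and the commit message does not spell out the formula used in PairHMMModel.

```python
def phred_to_prob(qual):
    """Convert a Phred-scaled quality to an error probability."""
    return 10.0 ** (-qual / 10.0)


def match_transitions(ins_open_qual, del_open_qual):
    """Return (match->match, match->insertion, match->deletion).
    The match->match term is whatever probability mass remains, so the
    three outgoing transitions sum to 1 by construction."""
    p_ins = phred_to_prob(ins_open_qual)
    p_del = phred_to_prob(del_open_qual)
    p_match = 1.0 - (p_ins + p_del)
    return p_match, p_ins, p_del
```

Deriving the match-to-match term as the remainder is one simple way to guarantee normalization; any independent calculation of that term has to be checked against the other two.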
Ryan Poplin bdd06ebfc2 Merge pull request #478 from broadinstitute/eb_generalize_hc_values_as_args
Pulled out some hard-coded values from the read-threading and isActive c...
2014-01-21 09:01:54 -08:00
Eric Banks 9e858270d7 Moving this test up one level to where it actually belongs. 2014-01-19 02:33:11 -05:00
Eric Banks 64d5bf650e Pulled out some hard-coded values from the read-threading and isActive code of the HC, and made them into a single argument.
In unifying the arguments it was clear that the values were inconsistent throughout the code, so now there's a
single value that is intended to be more liberal in what it allows in (in an attempt to increase sensitivity).

Very little code actually changes here, but just about every md5 in the HC integration tests is different (as
expected).  Added another integration test for the new argument.

To be used by David R to test his per-branch QC framework: does this commit make the HC look better against the KB?
2014-01-19 01:15:13 -05:00
Eric Banks de56134579 Fixed up and refactored what seems to be a useful private tool to create simulated reads around a VCF.
It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now.
Since I thought that it might prove useful to others, I moved it to protected and added integration tests.

GERALDINE: NEW TOOL ALERT!
2014-01-15 13:49:31 -05:00