Commit Graph

969 Commits (c64131e9fd130c3e25ee0f2e7f046ae16ef79bc2)

Author SHA1 Message Date
Eric Banks 7095a60c8e Merge pull request #516 from broadinstitute/dr_reenable_tests_failing_due_to_java_update
Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation
2014-02-13 21:05:18 -05:00
David Roazen 4b4b93ad1b Re-enable tests that were failing post-maven due to changes in Java's Math.pow() implementation
After extensive detective work, Joel determined that these tests were failing
due to changes in the implementation of Math.pow() in newer versions of
Java 1.7.

All GSA members should ensure that they're using a JDK that is at least
as current as the one in the Java-1.7 dotkit on the Broad servers
(build 1.7.0_51-b13).
2014-02-12 16:08:16 -05:00
Joel Thibault cc9477aedb Minimal test for the multi-allelic reordering bug 2014-02-12 13:38:32 -05:00
Eric Banks 300b474c96 Several improvements to the single sample combining steps.
1. updated QualByDepth not to use AD-restricted depth if it is zero.
Added unit test this change.

2. Fixed small bug in CombineGVCFs where spanning deletions were not being treated consistently throughout.
Added test for this situation.

3. Make sure GenotypeGVCFs puts in the required headers.
Updated test files to make sure this is covered.

4. Have GenotypeGVCFs propagate up the MLEAC/AF (which were getting clobbered out).
Tests updated to account for this.
2014-02-12 10:15:12 -05:00
David Roazen 95e1402d21 Add ability to run *KnowledgeBaseTests to maven
Run with: mvn verify -Dsting.knowledgebasetests.skipped=false
2014-02-11 14:08:24 -05:00
Eric Banks 303a60c8c6 Adding smarts to the QD annotation:
when the AD annotation is present for a given genotype then we only use its depth for QD if the variant depth > 1.

Added new unit tests for QualByDepth.
2014-02-11 12:56:49 -05:00
Eric Banks 2e36dd9001 Refactoring of CombineGVCFs to make it run a lot faster.
Creating new VariantContexts each time we broke up a block was very expensive because we break up
blocks so often.  Also, calling into GATKVariantContextUtils.simpleMerge was really hurting performance.

MD5 changes because we no longer propogate any INFO fields (except for END) for reference blocks; the tests
have the now unused BLOCK_SIZE field that now get dropped.
2014-02-11 03:18:52 -05:00
Eric Banks abef6cfcb6 Removing parameters that were incorrectly copied over from RegenotypeVariants. 2014-02-08 23:44:32 -05:00
Eric Banks 659a9f0e79 Removing the test for BLOCK_SIZE since we no longer emit it 2014-02-08 21:28:07 -05:00
Valentin Ruano-Rubio bf630abe88 Fixed nocall (./.) without PLs bug in GVCF output
Story:

https://www.pivotaltracker.com/story/show/65388246

Additional changes and notes:

1. The fix consist in forcing the output of all PLs by setting the standard flag for that '-allSitePLs'.

2. BP_RESOLUTION was handled differently to GVCF in some aspect that should be common. That has been fixed.
2014-02-07 19:30:26 -05:00
Eric Banks d689f61005 Fixed up some of the genotype-level annotations being propogated in the single sample HC pipeline.
1. AD values now propogate up (they weren't before).
2. MIN_DP gets transferred over to DP and removed.
3. SB gets removed after FS is calculated.

Also, added a bunch of new integration tests for GenotypeGVCFs.
2014-02-07 12:47:54 -05:00
Eric Banks 67ed0d2403 The UG engine can return a null VC if there are tons of alt alleles, causing Tim's merge jobs to fail.
Pushing the null check up so that it doesn't error out in such cases.
2014-02-07 12:41:20 -05:00
Valentin Ruano-Rubio 4a3c8e68fa Fixed out of order non-variant gVCF entries when trimming is active.
Story:

https://www.pivotaltracker.com/story/show/65319564
2014-02-07 11:03:26 -05:00
Eric Banks eb463b505d Remove a whole bunch of unused annotations from gVCF output.
AC,AF,AN,FS,QD - they'll all be recomputed later.
BLOCK_SIZE and MIN_GQ were not necessary.

I also made the StrandBiasBySample annotation forced on when in gVCF mode.
It turns out that its output wasn't compatible with BCF so I patched it (and the variant jar too).
2014-02-07 08:49:36 -05:00
Eric Banks 2648219c42 Implementation of a hierarchical merger for gVCFs, called CombineGVCFs.
This tool will take any number of gVCFs and create a merged gVCF (as opposed to
GenotypeGVCFs which produces a standard VCF).

Added unit/integration tests and fixed up GATK docs.
2014-02-07 08:49:18 -05:00
Eric Banks 71b47a6148 Rename CombineReferenceCalculationVariants to GenotypeGVCFs 2014-02-06 15:46:19 -05:00
Khalid Shakir 3848159086 Added a set of serial tests to gatk/queue packages, which runs all tests under their package in one TestNG execution.
New properties to disable regenerating example resources artifact when each parallel test runs under packagetest.
Moved collection of packagetest parameters from shell scripts into maven profiles.
Fixed necessity of test-utils jar by removing incorrect dependenciesToScan element during packagetests.
When building picard libraries, run clean first.
Fixed tools jar dependency in picard pom.
Integration tests properly use the ant-bridge.sh test.debug.port variable, like unit tests.
2014-02-06 08:25:38 -05:00
Valentin Ruano Rubio 988e3b4890 Merge pull request #487 from broadinstitute/vrr_reference_model_with_trimming
Get gVCF to work without --dontTrimActiveRegions
2014-02-05 22:52:17 -05:00
Valentin Ruano-Rubio 98ffcf6833 Get gVCF to work without --dontTrimActiveRegions
Story:

https://www.pivotaltracker.com/story/show/65048706
https://www.pivotaltracker.com/story/show/65116908

Changes:

ActiveRegionTrimmer in now an argument collection and it returns not only the trimmed down active region but also the non-variant containing flanking regions
HaplotypeCaller code has been simplified significantly pushing some functionality two other classes like ActiveRegion and AssemblyResultSet.

Fixed a problem with the way the trimming was done causing some gVCF non-variant records no have conservative 0,0,0 PLs
2014-02-05 22:50:45 -05:00
Ryan Poplin 693bfac341 Bug fix for missing annotations in CombineReferenceCalculationVariants. They were being dropped in the handoff between engines in a couple of places.
-- Updated single sample pipeline test data using Valentin's files and re-enabled CRCV tests
2014-02-05 12:58:48 -05:00
Eric Banks 91bdf069d3 Some updates to CRCV.
1. Throw a user error when the input data for a given genotype does not contain PLs.
2. Add VCF header line for --dbsnp input
3. Need to check that the UG result is not null
4. Don't error out at positions with no gVCFs (which is possible when using a dbSNP rod)
2014-02-05 10:12:37 -05:00
Joel Thibault 9eaee8c73c Integration test for the -nt race condition corrupting AD and PL fields 2014-02-04 22:04:27 -05:00
David Roazen 1de7a27471 Disable an additional test that is runtime dependent on one of the temporarily-disabled tests 2014-02-04 16:07:58 -05:00
David Roazen 76086f30b7 Temporarily disable tests that started failing post-maven
Joel is working on these failures in a separate branch. Since
maven (currently! we're working on this..) won't run the whole
test suite to completion if there's a failure early on, we need
to temporarily disable these tests in order to allow group members
to run tests on their branches again.
2014-02-04 15:31:24 -05:00
Khalid Shakir 857e6e0d6f Bumped version to 2.8-SNAPSHOT, using new update_pom_versions.sh script. 2014-02-03 13:50:46 -05:00
Khalid Shakir 9ca3004fc3 Setting the test-utils' type to test-jar, such that the multi-module build uses testClasses instead of classes as a directory dependency. 2014-02-03 13:50:46 -05:00
Khalid Shakir de13f41fc3 One step closer to a proper test-utils artifact. Using the maven-jar-plugin to create a test classifer, excluding actual tests, until we can properly separate the classes into separate artifacts/modules. 2014-02-03 13:50:46 -05:00
Khalid Shakir caa76cdac4 Added maven pom.xmls for various artifacts. 2014-02-03 13:50:46 -05:00
Khalid Shakir 1e25a758f5 Moved files to maven directories.
Here are the git moved directories in case other files need to be moved during a merge:
  git-mv private/java/src/        private/gatk-private/src/main/java/
  git-mv private/R/scripts/       private/gatk-private/src/main/resources/
  git-mv private/java/test/       private/gatk-private/src/test/java/
  git-mv private/testdata/        private/gatk-private/src/test/resources/
  git-mv private/scala/qscript/   private/queue-private/src/main/qscripts/
  git-mv private/scala/src/       private/queue-private/src/main/scala/
  git-mv protected/java/src/      protected/gatk-protected/src/main/java/
  git-mv protected/java/test/     protected/gatk-protected/src/test/java/
  git-mv public/java/src/         public/gatk-framework/src/main/java/
  git-mv public/java/test/        public/gatk-framework/src/test/java/
  git-mv public/testdata/         public/gatk-framework/src/test/resources/
  git-mv public/scala/qscript/    public/queue-framework/src/main/qscripts/
  git-mv public/scala/src/        public/queue-framework/src/main/scala/
  git-mv public/scala/test/       public/queue-framework/src/test/scala/
2014-02-03 13:50:44 -05:00
Valentin Ruano-Rubio 89c4e57478 gVCF <NON_REF> in all vcf lines including variant ones when –ERC gVCF is requested.
Changes:
-------

  <NON_REF> likelihood in variant sites is calculated as the maximum possible likelihood for an unseen alternative allele: for reach read is calculated as the second best likelihood amongst the reported alleles.

  When –ERC gVCF, stand_conf_emit and stand_conf_call are forcefully set to 0. Also dontGenotype is set to false for consistency sake.

  Integration test MD5 have been changed accordingly.

Additional fix:
--------------

  Specially after adding the <NON_REF> allele, but also happened without that, QUAL values tend to go to 0 (very large integer number in log 10) due to underflow when combining GLs (GenotypingEngine.combineGLs). To fix that combineGLs has been substituted by combineGLsPrecise that uses the log-sum-exp trick.

  In just a few cases this change results in genotype changes in integration tests but after double-checking using unit-test and difference between combineGLs and combineGLsPrecise in the affected integration test, the previous GT calls were either border-line cases and or due to the underflow.
2014-01-30 11:23:33 -05:00
Valentin Ruano-Rubio 748d2fdf92 Added Integration test to verify the bugs are not there anymore as reported in pivotracker 2014-01-26 23:29:31 -05:00
Valentin Ruano-Rubio 9e7bf75e89 Fix for the PairHMM transition probability miscalculation.
Problem:

matchToMatch transition calculation was wrong resulting in transition probabilites coming out of the Match state that added more than 1.

Reports:

https://www.pivotaltracker.com/s/projects/793457/stories/62471780
https://www.pivotaltracker.com/s/projects/793457/stories/61082450

Changes:

The transition matrix update code has been moved to a common place in PairHMMModel to dry out its multiple copies.

MatchToMatch transtion calculation has been fixed and implemented in PairHMMModel.

Affected integration test md5 have been updated, there were no differences in GT fields and example differences always implied
small changes in likelihoods that is what is expected.
2014-01-26 16:30:36 -05:00
Ryan Poplin bdd06ebfc2 Merge pull request #478 from broadinstitute/eb_generalize_hc_values_as_args
Pulled out some hard-coded values from the read-threading and isActive c...
2014-01-21 09:01:54 -08:00
Eric Banks 9e858270d7 Moving this test up one level to where it actually belongs. 2014-01-19 02:33:11 -05:00
Eric Banks 64d5bf650e Pulled out some hard-coded values from the read-threading and isActive code of the HC, and made them into a single argument.
In unifying the arguments it was clear that the values were inconsistent throughout the code, so now there's a
single value that is intended to be more liberal in what it allows in (in an attempt to increase sensitivity).

Very little code actually changes here, but just about every md5 in the HC integration tests are different (as
expected).  Added another integration test for the new argument.

To be used by David R to test his per-branch QC framework: does this commit make the HC look better against the KB?
2014-01-19 01:15:13 -05:00
Eric Banks de56134579 Fixed up and refactored what seems to be a useful private tool to create simulated reads around a VCF.
It didn't completely work before (it was hard-coded for a particular long-lost data set) but it should work now.
Since I thought that it might prove useful to others, I moved it to protected and added integration tests.

GERALDINE: NEW TOOL ALERT!
2014-01-15 13:49:31 -05:00
Eric Banks 9f1ab0087a Added in a check for what would be an empty allele after trimming. 2014-01-15 11:04:19 -05:00
Ryan Poplin 201ad398ac Merge pull request #473 from broadinstitute/eb_fix_qd_indel_normalization
The QD normalization for indels was busted and is now fixed.
2014-01-14 08:56:19 -08:00
Eric Banks e4fdc5ac44 Merge pull request #474 from broadinstitute/eb_fix_haplotype_resolver_PT63333488
Fixing the Haplotype Resolver so that it doesn't complain about missing header lines
2014-01-14 07:36:53 -08:00
Eric Banks fd511d12a2 Fixing the Haplotype Resolver so that it doesn't complain about missing header lines.
The code comments very clearly state that INFO fields shouldn't be propagated into the output,
but someone must have accidentally changed it afterwards.  This is just a simple one-line fix
to make sure the code adhered to the comments.

Delivers #63333488.
2014-01-13 22:47:43 -05:00
Geraldine Van der Auwera 8fcad6680b Assorted fixes and improvements to gatkdocs
-Added docs for ERC mode in HC
 -Move RecalibrationPerformance walker since to private since it is experimental and unsupported
 -Updated VR docs and restored percentBad/numBad (but @Hidden) to enable deprecation alert if users try to use them
 -Improved error msg for conflict between per-interval aggregation and -nt
 -Minor clean up in exception docs
 -Added Toy Walkers category for devs and dev supercat (to build out docs for developers)
 -Added more detailed info to GenotypeConcordance doc based on Chris forum post
 -Added system to include min/max argument values in gatkdocs (build gatkdocs with 'ant gatkdocs' to test it, see engine and DoC args for in situ examples)
 -Added tentative min/max argument annotations to DepthOfCoverage and CommandLineGATK arguments (and improved docs while at it)
 -Added gotoDev annotation to GATKDocumentedFeature to track who is the go-to person in GSA for questions & issues about specific walkers/tools (now discreetly indicated in each gatkdoc)
2014-01-13 17:46:22 -05:00
Eric Banks c7e08965d0 The QD normalization for indels was busted and is now fixed.
It is true that indels of length > 1 have higher QUALS than those of length = 1.  But for the HC those
QUALS are not that much higher, and it doesn't continue scaling up as the indels get larger.  So we no
longer normalize by indel length (which massively over-penalizes larger events and effectively drops their
QD to 0).

For the UG the previous normalization also wasn't perfect.  Now we divide the indel length by a factor
of 3 to make sure that QD is consistent over the range of indel lengths.

Integration tests change because QD is different for indels.
Also, got permission from Valentin to archive a failing test that no longer applies.

Thanks to Kurt on the GATK forum for pointing this all out.
2014-01-13 15:23:36 -05:00
droazen 7cd304fb41 Merge pull request #470 from broadinstitute/mf_new_RBP
Mf new rbp
2014-01-13 08:46:27 -08:00
MauricioCarneiro 50cd6781b3 Merge pull request #465 from broadinstitute/eb_improvements_to_ref_confidence_merger
Improvements to ref confidence merger
2014-01-08 10:51:01 -08:00
Eric Banks f172c349f6 Adding the functionality to enable users to input a file of VCFs for -V.
To do this I have added a RodBindingCollection which can represent either a VCF or a
file of VCFs.  Note that e.g. SelectVariants allows a list of RodBindingCollections so
that one can intermix VCFs and VCF lists.

For VariantContext tags with a list, by default the tags for the -V argument are applied
unless overridden by the individual line.  In other words, any given line can have either
one token (the file path) or two tokens (the new tags and the file path).  For example:
foo.vcf
VCF,name=bar bar.vcf

Note that a VCF list file name must end with '.list'.

Added this functionality to CombineVariants, CombineReferenceCalculationVariants, and VariantRecalibrator.
2014-01-08 00:45:00 -05:00
Eric Banks c133909d32 Fixed edge condition in the realigner where a realigned read can sometimes get partially aligned off the end of the contig.
Now we ignore such reads (which is much easier than trying to figure out when to soft-clip).
Added unit test.
2014-01-08 00:37:28 -05:00
Menachem Fromer e33d3dafc6 Add documentation for RBP, and also update the MD5 for the tests now that the output uses HP tags instead of '|', which is now reserved for trio-based phasing 2014-01-03 12:04:47 -05:00
Menachem Fromer d1275651ae Merge remote-tracking branch 'origin/master' into mf_new_RBP 2014-01-03 01:13:40 -05:00
Ryan Poplin 856c1f87c1 Allow for additional input data to be used in the VQSR for clustering but don't carry it forward into the output VCF file.
-- New -a argument in the VQSR for specifying additional data to be used in the clustering
-- New NA12878KB walker which creates ROC curves by partitioning the data along VQSLOD and calculating how many KB TP/FP's are called.
2014-01-02 14:46:04 -05:00
amilev f81a38f596 Merge pull request #446 from broadinstitute/ami-RNAseq-tools
Write a new tool for spliting reads that have N cigar string.
2014-01-01 21:06:25 -08:00