Commit Graph

1209 Commits (75c17bbd6fbe800a46168c6313e45334ae3b4e22)

Author SHA1 Message Date
Geraldine Van der Auwera 3ba94b987c Minor documentation clarifications 2014-10-22 17:54:11 -04:00
rpoplin 0f89d1a362 Merge pull request #755 from broadinstitute/sc_Annotation_Docs_73647570
Improvements to documentation of variant annotations
2014-10-22 13:41:00 -04:00
Sheila Chandran b3c5ed4414 Improvements to documentation of variant annotations
- Added or modified explanations for majority of variant annotations
	- Generalized NBaseCount to include all tech platforms (not just SOLiD)
2014-10-21 18:20:04 -04:00
Geraldine Van der Auwera 895b8c5931 Minor fix for missing INFO key definition in VCF header 2014-10-21 16:50:37 -04:00
rpoplin c4fcd70a88 Merge pull request #754 from broadinstitute/rhl_variant_array_exception
Do not process a variant if it is too large (> readLength), and log an e...
2014-10-21 12:01:52 -04:00
rpoplin bcf6be0b08 Merge pull request #753 from broadinstitute/ldg_HCzeroDepth
Fix GenotypeGVCF bugs in -allSites mode
2014-10-21 12:00:04 -04:00
Laura Gauthier 2b848ad859 Variants that become hom-ref after regenotyping in GenotypeGVCFs are now getting output in -allSites mode. 2014-10-21 08:21:53 -04:00
Laura Gauthier 5465e4484e For GenotypeGVCFs -allSites mode, make genotypes no-call if depth is zero. 2014-10-21 08:21:43 -04:00
Ron Levine 239151ac7b Do not process a variant if it is too large (> readLength), and log an error
remove final keyword before refMap and altMap, constructHaplotype() changes their values

return ArtificialHaplotype from constructHaplotype instead of passing as an argument

Add logic so arraycopy does not throw an IndexOutOfBoundsException, add test for a long insert
2014-10-20 15:51:32 -04:00
Phillip Dexheimer b348ce8f25 Added -disableOptimizations argument to HaplotypeCaller.
* This argument is intended to be used in conjunction with -bamout, and disables early-exit optimizations to allow reference regions to be contained in the output bam
  * Also forcibly includes the reference haplotype in the set of haplotypes given to the BAMWriter
  * Made -dontTrimActiveRegions visible, as it is likely also desirable in this use case
  * Addresses PT 77731660
2014-10-16 21:11:20 -04:00
Laura Gauthier 0f08065ebc Throw UserException if input VCFs have duplicate samples but no genotypemergeoption is specified 2014-10-15 16:03:10 -04:00
Laura Gauthier 81482138ca Decrease interval on CGP integration test to reduce test execution time 2014-10-15 11:28:27 -04:00
Geraldine Van der Auwera e7e8052f84 Updated license information
- Updated license files (private/protected) for version, address and a couple of legal clauses
- Updated license snippet throughout the codebase
2014-10-14 17:10:12 -04:00
Ron Levine 36c27155af Made the threshold for the probability of a state being active a command line argument
remove TODO comment after activeProbThreshold

recover static ACTIVE_PROB_THRESHOLD for unit tests

Add min/max values for active_probability_threshold parameter

Move activeProbThreshold parameter to GATKArgumentCollection

define ACTIVE_PROB_THRESHOLD in unit tests

add construction of argCollection in ctor

Move arguments from GATKArgumentCollection to ActiveRegionWalker

Throw exception if threshold < 0 or > 1 in ActivityProfile ctor

max propagation distance parameter to ActiveRegionWalker for ActivityProfile

Use polymorphic getMaxProbPropagationDistance() so BandPassActivityProfile computes the correct region size cutoff

Get the maxProbPropagationDistance from the super class's method, instead of directly, this is safer

Removed extraneous command line imports and make maxProbPropagationDistance a hidden argument

remove limit check for activeProbThreshold, not necessary because the check is made when input as a command line arg

Remove extra 'region' in the doxygen param description for maxProbPropagationDistance
2014-10-10 10:36:02 -04:00
Ron Levine 645d418015 Changed hardcoded downsampling max/min coverage values to parameters
Rename parameters using camel case and add to integration test

Correct documentation for maxReadsInRegionPerSample and minReadsPerAlignmentStart

Change the argument--minReadsPerAlignmentStart in the integration test from 50 to 5

'each genomic location' only pertains to minReadsPerAlignmentStart, not maxReadsInRegionPerSample
2014-10-09 17:09:26 -04:00
Valentin Ruano-Rubio a3ad6f63bd Reduce execution time of various integration tests
Story:

    https://www.pivotaltracker.com/story/show/79461912
2014-09-30 13:28:55 -04:00
rpoplin 329bd081b7 Merge pull request #736 from broadinstitute/rhl_remove_line
removed an unneeded import that broke maven
2014-09-29 15:03:55 -04:00
Ron Levine 1c9d60c9a0 removed an unneeded import that broke maven 2014-09-29 12:57:33 -04:00
Valentin Ruano-Rubio 311b6815b3 Fixed the QUAL calculation of the EXACT_INDEPENDENT.
The QUAL value calculated by this Exact AF Calculator is severely underestimated when
there is more than one alternative allele (non-biallelic sites). The reason is
that the QUAL was roughly calculated by adding the QUALs resulting from each alternative
allele vs. all other alleles, reference and alts, collapsed. This is ok for MLEAC
calculations but not for QUAL.

Now, for calculating the QUAL we collapse all the alternatives into one. This change
improves sensitivity at the cost of additional false positives, but this is naturally expected.
The resulting QUAL column is much closer to the one returned by the reference implementation.

Story:

  https://www.pivotaltracker.com/story/show/75926368.

Changes:

  Changed the QUAL calculation as described above.
  Updated MD5s.

Fixed MD5s
2014-09-29 11:04:52 -04:00
Valentin Ruano-Rubio 0e52b8ba5a Fixed MLEAC and QUAL inaccuracy in GeneralPloidyExactAFCalculator.
The problem was that the MLE table calculation aborted "unlikely"
genotype combinations too aggressively.

This also uncovered another bug where GeneralPloidyExactAFCalculation
makes a slightly different use of StateTracker
as compared to DiploidExactAFCalculation. We have changed StateTracker
generalizing it to be able to work with both using code behaviors.

Story:
-----

  * https://www.pivotaltracker.com/story/show/78920568

Changes:
-------

  * Fixes in GeneralPloidyExactAFCalculator.
  * Needed changes in StateTracker API and its consequences in DiploidExactAFCalculation.
  * Updated affected integration tests' MD5s after fixing the GeneralPloidyExactAF.
2014-09-23 15:40:54 -04:00
Valentin Ruano-Rubio f6cb83d476 Renamed AFCalc to AFCalculator for a better class naming 2014-09-12 14:59:58 -04:00
Valentin Ruano-Rubio 95b45443ae Updated test according to changes in the AF calculator framework.
Changes:
-------

* Updated current unit and integration test to use the new API components.
* Added unit tests for new classes AFPriorProvider and AFCalculatorProviders.
* Added integration test for mixed ploidy GenotypeGVCFs and CombineGVCFs
2014-09-12 14:59:47 -04:00
Valentin Ruano-Rubio 3cdeab6e9e GenotypingEngines and walkers now use AFCalc(ulator) providers rather than instantiating their own (fixed) calculators directly.
Changes:
-------

* GenotypingEngine now uses an AFCalc provider instead of
  its own thread-local, one-time initialized, fixed
  AF calculator.

* All walkers that use a GenotypingEngine now pass
  the appropriate AF calculator provider. For now most
  just use a fixed calculator (FixedAFCalculatorProvider),
  except GenotypeGVCFs, which can now cope with a
  mixture of ploidies by failing over to a general-ploidy
  calculator when the preferred implementation is not
  able to handle a site's analysis.
2014-09-12 14:25:09 -04:00
Valentin Ruano-Rubio 935bd1394b AFCalculatorProvider components to allow for dynamic instantiation of different AFCalc(ulators) to cope with
dynamic ploidy and max-alt-allele counts (the latter not used for now).
2014-09-12 14:23:45 -04:00
Valentin Ruano-Rubio ce8e93fa51 Made the AF prior probability distribution dynamic with respect
to the total ploidy (ploidy added across samples).

Changes:
--------

* Instead of calculating a fixed log10 prior array with a fixed
   total likelihood, we use a new component, the AFPriorProvider,
   to generate the priors for different total ploidies on
   demand; these are cached, however, so there is no unnecessary
   recomputation involved.
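
The on-demand, cached lookup described above can be sketched roughly as follows. This is only an illustration, not the actual AFPriorProvider: the class names CachingPriorProvider and FlatPriorProvider, the method names, and the flat prior are all hypothetical stand-ins, and the sketch is not thread-safe.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: builds a log10 prior array per total ploidy on first
// request and reuses it afterwards. Not the real AFPriorProvider API.
abstract class CachingPriorProvider {
    // Cache keyed by total ploidy (ploidy added across samples).
    private final Map<Integer, double[]> cache = new HashMap<>();

    // Returns the cached array for this total ploidy, building it only once.
    final double[] forTotalPloidy(final int totalPloidy) {
        if (totalPloidy < 0) {
            throw new IllegalArgumentException("total ploidy cannot be negative");
        }
        return cache.computeIfAbsent(totalPloidy, this::buildPriors);
    }

    // Subclasses decide how the priors are actually computed.
    protected abstract double[] buildPriors(int totalPloidy);
}

// Illustrative subclass only: a flat prior over allele counts 0..totalPloidy.
final class FlatPriorProvider extends CachingPriorProvider {
    @Override
    protected double[] buildPriors(final int totalPloidy) {
        final double[] log10Priors = new double[totalPloidy + 1];
        Arrays.fill(log10Priors, -Math.log10(totalPloidy + 1));
        return log10Priors;
    }
}
```

Repeated calls with the same total ploidy return the same array instance, so callers in this sketch must treat the result as read-only.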
2014-09-12 14:23:37 -04:00
Valentin Ruano-Rubio 31e58ae4ec Refactored AFCalc to remove unnecessary capability limits, allowing it to deal
with mixed ploidies and max-alt-allele number changes dynamically.

Changes:
--------

* Moved the AFCalcFactory.Calculation enum into a top-level class,
    AFCalculatorImplementation.
* Gave more responsibilities to the enum, like resolving the constructor
    method once per implementation and the best-model selection algorithm.
* Removed test-code only fields and methods from AFCalc; just used to perform
    unit-testing and not any actual functionality of this component.
* Removed the fixed ploidy constraint of GeneralPloidyExactAFCalc
    implementation... now can deal with mixed ploidies that may change
    per site and sample.
* Removed the fixed maxAltAllele restriction by allowing resizing of
    the stateTracker structures.
* Due to the previous two points, calls to the AFCalc object are now passed
    the default ploidy to assume in case some genotype in the input
    VC does not have it, and the max-alt-allele.
* Also due to those changes, removed the now totally useless 3 int
    parameters from all AFCalc constructors.
* Cleaned up the code a bit, removing components and methods that are no longer used.
2014-09-12 14:17:36 -04:00
Ryan Poplin 48252897b4 Added ignore all filters options to VQSR walkers 2014-09-11 15:11:41 -04:00
Eric Banks 31cea25c36 Merge pull request #730 from broadinstitute/eb_inbreeding_coeff_unit_test
Cleaned up and fleshed out unit tests for the Inbreeding Coefficient annotation class
2014-09-10 09:32:49 -04:00
Eric Banks 5e490362ca Cleaned up and fleshed out unit tests for the Inbreeding Coefficient annotation class. 2014-09-08 11:40:39 -04:00
Eric Banks cc175bad40 Improve the accuracy of dangling head merging in the HC assembler.
Dangling head merging (like with tails) is now enabled by default.
The --recoverDanglingHeads argument is now deprecated so that users know not to use it anymore.
We now also allow the user to set the minimum branch length for merging.  This will be different
for exomes and RNA (see below).

The other changes in the code itself:
1. We no longer allow an arbitrarily large number of mismatches in the dangling head for merging
2. The max number of mismatches allowed in a dangling head is proportional to the kmer size

There will be a difference in the RNA calling pipeline.  Instead of invoking '--recoverDanglingHeads'
the user will instead want to use '--minDanglingBranchLength 0'.

Below are the knowledgebase results of the master branch vs. this one.

For NA12878 DNA Exome:

master  SNPS         TRUE_POSITIVE                                36722
master  SNPS         CALLED_NOT_IN_DB_AT_ALL                       2699
master  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE        292
master  SNPS         FALSE_POSITIVE_SITE_IS_FP                       70

branch  SNPS         TRUE_POSITIVE                                36867
branch  SNPS         CALLED_NOT_IN_DB_AT_ALL                       2952
branch  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE        387
branch  SNPS         FALSE_POSITIVE_SITE_IS_FP                       94

As I discussed with Ryan in person, there are a good number of FPs that are called in the new
code, but they nearly all have bad strand bias and should be easily filtered by VQSR.
Note that there is no change for indels.

For NA12878 RNA from Ami:

master  SNPS         TRUE_POSITIVE                                11055
master  SNPS         CALLED_NOT_IN_DB_AT_ALL                        831
master  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE         44
master  SNPS         FALSE_POSITIVE_SITE_IS_FP                       96

branch  SNPS         TRUE_POSITIVE                                11113
branch  SNPS         CALLED_NOT_IN_DB_AT_ALL                        874
branch  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE         47
branch  SNPS         FALSE_POSITIVE_SITE_IS_FP                       92

Again, there's basically no change for indels.
2014-09-07 08:55:59 -04:00
Phillip Dexheimer a35f5b8685 Moved arguments controlling options in output files into the engine
* Arguments involved are --no_cmdline_in_header, --sites_only, and --bcf for VCF files and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files
 * PT 52740563
 * Removed ReadUtils.createSAMFileWriterWithCompression(), replaced with ReadUtils.createSAMFileWriter(), which applies all appropriate engine-level arguments
 * Replaced hard-coded field names in ArgumentDefinitionField (Queue extension generator) with a Reflections-based lookup that will fail noisily during extension generation if there's an error
2014-09-05 21:18:11 -04:00
droazen 5c4a3eb89c Merge pull request #727 from broadinstitute/ks_gatk_queue_package_test_updates
Various fixes for package tests.
2014-09-05 10:17:32 -04:00
Ryan Poplin a45acdfb89 StrandOddsRatio is now a standard annotation. 2014-09-05 08:33:37 -04:00
Khalid Shakir 376592f423 Various fixes for package tests.
Explicitly including gatk/queue test-jar artifacts in package test classpaths.
SelectVariantsIntegrationTest#testInvalidJexl now resets the JexlEngine silent flag that VariantFiltration.initialize() toggles.
External example no longer tries to unpack nonexistent gatk artifact jars during package tests.
2014-09-04 15:30:31 -04:00
Ryan Poplin 1b809268d5 fixing a few small typos in the HaplotypeCaller and related classes 2014-09-04 14:48:27 -04:00
droazen 5c087a6e1f Merge pull request #724 from broadinstitute/ks_remove_test_qscript_symbolic_links
Removed symlink creation for tests and qscripts
2014-09-04 09:10:54 -04:00
Eric Banks 538537dbf1 Merge pull request #718 from broadinstitute/mf_rbp_fix
Fix MNP merging code to work with explicit HP phase representation
2014-09-02 20:39:22 -04:00
Eric Banks 01e725cd1a Merge pull request #723 from broadinstitute/eb_fix_rna_splitting_PT77878554
Make sure that the OverhangFixingManager (used for splitting RNA reads) ...
2014-09-02 20:39:01 -04:00
Menachem Fromer 10f9001738 Fix MNP merging code to work with explicit HP phase representation 2014-09-02 17:25:08 -04:00
Eric Banks ff91ab8ba2 Make sure that the OverhangFixingManager (used for splitting RNA reads) handles unmapped reads. 2014-09-02 16:56:17 -04:00
Valentin Ruano Rubio c7925f6e5c Merge pull request #719 from broadinstitute/vrr_generalize_ploidy_in_genotype_gvcfs
Adds support for omniploidy to GenotypeGVCFs and CombineGVCFs.
2014-09-02 16:51:02 -04:00
Valentin Ruano-Rubio d363725b4b Adds support for omniploidy to GenotypeGVCFs and CombineGVCFs.
Same changes fixed the problem for GenotypeGVCFs and CombineGVCFs.

Stories:

  - https://www.pivotaltracker.com/story/show/77626044
  - https://www.pivotaltracker.com/story/show/77626854

Changes:

  - Generalized the code for the merging in GATKVariantContextUtils to cope
    with ploidy != 2.
  - GenotypeGVCFs now checks that the input's ploidy conforms to the '-ploidy'
    argument.
  - Moved out Reference Confidence VC merging code from GATKVariantContextUtils
    so that we can keep new code in protected.

Caveats:

  - GenotypeGVCFs can only deal with input files that have the same ploidy in
    all positions; the one that the user MUST indicate in the -ploidy argument
    (if different from the default 2).
  - CombineGVCFs won't necessarily complain if it is passed mixed ploidy
    inputs but you won't be able to genotype the result with GenotypeGVCFs.

Test:

   - Removed deprecated unit tests for GATKVariantContextUtils.
   - Moved unit-tests regarding GVCF merging from GATKVariantContextUtilsUnitTest
     to ReferenceConfidenceVariantContextUtilsUnitTest.
   - Added unit test for new code for mapping genotype indices between allele
     index encoding in GenotypeLikelihoodCalculator.
   - GenotypeGVCFs and CombineGVCFs original integration tests are unaffected
     by the change.
   - Added tetraploid run integration tests to check on non-diploid execution
     of GenotypeGVCFs and CombineGVCFs.
2014-09-02 15:06:47 -04:00
Khalid Shakir fcb0eca203 Now passing in the path to the GATK directory to tests.
Changed tests and scripts to use gatkdir full path instead of relative testdata/qscripts symbolic links.
Although symlinks not created, left the symlink deletion script execution with a comment about future removal.
Re-enabled example UG pipeline queue test.
Replaced all hardcoded strings of {public,private}/testdata with BaseTest variables.
Refactored temp list creation method from ListFileUtilsUnitTest to BaseTest.createTempListFile.
Removed list files with hardcoded paths, now using createTempListFile instead with private test dir variable.
2014-09-02 01:40:59 +08:00
Khalid Shakir 2d28972c88 The 'after' files are @Input files and committed in git, so don't delete them after tests. 2014-08-30 03:04:54 +08:00
Eric Banks 5b087c9897 Changed the functionality of the physical phasing in the HC: now hom vars are output as 0|1.
We do this for technical reasons, mostly because we don't genotype in the HC anymore; it's all
done downstream by GenotypeGVCFs so we can't be sure that the genotype will be hom var.  Also,
there are steps in the downstream pipeline where genotypes can change, so assuming anything in
the HC is a bad idea, and if we have phasing info in the het state, we want to propagate that forward.

Now, PGT tag fixing happens downstream in GenotypeGVCFs.
While I was in there I also cleaned up the code a bit and fixed a bug where annotation was happening
before genotype creation when using the --includeNonVariantSites argument.

Added tests accordingly.
2014-08-25 21:40:14 -04:00
Valentin Ruano-Rubio 6dc5cf0be0 Fixes some mismerged md5 updates from a previous merge into master 2014-08-24 20:47:07 -04:00
Eric Banks 9009c1e996 Merge pull request #715 from broadinstitute/vrr_disable_physical_phasing_for_nondiploid_hc
Disable physical phasing for non-diploid HC calling.
2014-08-23 20:58:51 -04:00
Valentin Ruano-Rubio 6695aeafd9 Disable physical phasing for non-diploid HC calling.
Story:

    https://www.pivotaltracker.com/story/show/77452256

Changes:

    If ploidy != 2, disable physical phasing and log an info message to let the user know.

Tests:

    Change md5s affected by this change.
2014-08-23 10:52:07 -04:00
Phillip Dexheimer 931890915f Add the --sample_name argument to HaplotypeCaller
* This is a shortcut for people who have multi-sample BAMs but would like to use GVCF mode.  Rather than creating single-sample BAMs with PrintReads, one could use the --sample_name argument to HaplotypeCaller to specify the single sample to make calls on
 * Completes PT 73075482
2014-08-22 23:22:03 -04:00
Valentin Ruano-Rubio fc5ce4b662 Created the stand-alone AC and AF annotation AlleleCountBySample
Story:

  https://www.pivotaltracker.com/story/show/77250524

Changes:

  - Remove the annotating code in GeneralPloidyExactAFCalc (GPEAFC) class.
  - Added the asAlleleList to GenotypeAlleleCounts class and get (GPEAFC) to use that instead of implementing its own (nicer and more reusable code).
  - Removed the explicit addition of AlleleCountBySample fields to the VCF header by the walker initialize
  - Added utility methods in Utils to wrap an int[] array into a List<Integer>, and a double[] array into a List<Double> efficiently.
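
One way such a wrapper can avoid copying is to expose the primitive array through a read-only AbstractList view. The following is a minimal sketch under that assumption, not the code in Utils; the class name PrimitiveArrayViews and its nested classes are hypothetical.

```java
import java.util.AbstractList;
import java.util.List;
import java.util.RandomAccess;

// Hypothetical sketch of copy-free list views over primitive arrays.
final class PrimitiveArrayViews {
    private PrimitiveArrayViews() {}

    // Wraps an int[] as a read-only List<Integer>; elements are boxed on access.
    static List<Integer> asList(final int[] values) {
        if (values == null) {
            throw new IllegalArgumentException("the input array cannot be null");
        }
        return new IntArrayView(values);
    }

    // Wraps a double[] as a read-only List<Double>; elements are boxed on access.
    static List<Double> asList(final double[] values) {
        if (values == null) {
            throw new IllegalArgumentException("the input array cannot be null");
        }
        return new DoubleArrayView(values);
    }

    private static final class IntArrayView extends AbstractList<Integer> implements RandomAccess {
        private final int[] values;
        IntArrayView(final int[] values) { this.values = values; }
        @Override public Integer get(final int index) { return values[index]; }
        @Override public int size() { return values.length; }
    }

    private static final class DoubleArrayView extends AbstractList<Double> implements RandomAccess {
        private final double[] values;
        DoubleArrayView(final double[] values) { this.values = values; }
        @Override public Double get(final int index) { return values[index]; }
        @Override public int size() { return values.length; }
    }
}
```

Because the view shares the backing array, later writes to the array are visible through the list; mutating calls on the list itself (add, set, remove) throw UnsupportedOperationException, which is AbstractList's default behavior.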

Test:

  - Added unit-testing for asAlleleList in GenotypeAlleleCountsUnitTest (within testFirst and testNext).
  - Added unit-testing for new methods in Utils : asList(int[]) and asList(double[])
  - Changed UG General Ploidy test to add explicitly those annotations.
  - Non-trivial changes in integration tests involving non-diploid runs (namely haploid and tetraploid) as they are no longer showing
    those annotations, so the MD5s have been changed accordingly.
2014-08-22 20:33:25 -04:00