Commit Graph

1463 Commits (e34ec0fbbba2dc1d424449ed30dc3b53e4ce6f6f)

Author SHA1 Message Date
Phillip Dexheimer b73e9d506a Added GATKVCFConstants and GATKVCFHeaderLines to consolidate the GATK-specific VCF annotations
* Removed unused annotations (CCC and HWP)
 * Renamed one of the two GC annotations to "IGC" (for Interval GC)
 * Revved picard & htsjdk (GATK constants are now removed from htsjdk)
 * PT 82046038
2015-01-13 21:32:09 -05:00
Laura Gauthier 6b2bd5ed09 Address user-reported bug featuring "trio" family with two children, one parent
Add test to cover case with family of one parent, two children
2015-01-13 18:35:44 -05:00
Ryan Poplin 2e5f9db758 Raising per-sample limits on the number of reads in ART and HC.
-- Active Region Traversal was using per sample limits on the number of reads that were too low, especially now that we are running one sample at a time. This caused issues with high confidence variants being dropped in high coverage data.
-- HaplotypeCallerGVCFIntegrationTest PL/annotation changes due to using more reads in those tests
-- Removed a CountReadsInActiveRegionsIntegrationTest test for excessive coverage because the read coverage no longer goes over the limits in ART
2015-01-09 11:21:42 -05:00
rpoplin 03203e249e Merge pull request #792 from broadinstitute/rhl_pairhmm_log_stderr
Rhl pairhmm log stderr
2015-01-07 12:41:10 -05:00
Valentin Ruano-Rubio aae04b6122 Fixes explicit limitation of the maximum ploidy of the reference-confidence model
Story:
=====

 - https://www.pivotaltracker.com/story/show/83803796

Changes:
=======

  - From a fixed-maximum-ploidy indel RCM likelihood cache to a
    dynamically resizable one.
  - Used the occasion to remove an unused and deprecated method from ReferenceConfidenceModel

Testing:
=======

  - Added integration test to check on ploidies larger than the previous limit of 20.
2015-01-07 10:43:22 -05:00
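The change from a fixed-size cache to a resizable one in the commit above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the GATK code (which is Java): the class name `GrowableLikelihoodCache` and the `compute` callback are hypothetical, and the initial capacity of 20 only mirrors the "previous limit of 20" mentioned in the commit message.

```python
from typing import Callable, List, Optional


class GrowableLikelihoodCache:
    """Per-ploidy value cache that grows on demand (hypothetical sketch)."""

    def __init__(self, compute: Callable[[int], float], initial_capacity: int = 20) -> None:
        # compute(ploidy) stands in for whatever expensive calculation is cached.
        self._compute = compute
        self._values: List[Optional[float]] = [None] * (initial_capacity + 1)

    def get(self, ploidy: int) -> float:
        if ploidy < 0:
            raise ValueError("ploidy must be non-negative")
        if ploidy >= len(self._values):
            # Instead of failing above a hard limit, extend the backing list.
            new_size = max(ploidy + 1, 2 * len(self._values))
            self._values.extend([None] * (new_size - len(self._values)))
        cached = self._values[ploidy]
        if cached is None:
            # Fill the slot lazily the first time this ploidy is requested.
            cached = self._compute(ploidy)
            self._values[ploidy] = cached
        return cached
```

A request for a ploidy above the initial capacity simply enlarges the list, so no explicit maximum has to be configured up front.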
Ron Levine b4fda38922 Use logging system instead of stderr 2015-01-05 14:04:10 -05:00
Laura Gauthier 88b6f3aa50 Change []-type arrays to lists so argument parsing works in VCF header commandline output 2015-01-05 10:21:06 -05:00
rpoplin 3240b3538a Merge pull request #794 from broadinstitute/rhl_read_backed_phasing
Rhl read backed phasing
2015-01-05 09:47:25 -05:00
Ron Levine c6840124fe clean up, add final 2015-01-04 23:01:24 -05:00
Ron Levine 85dc703461 Add TestMergeIntoMNP() and TestReallyMergeIntoMNP() 2015-01-01 09:51:20 -05:00
Ron Levine bb94833750 Add more tests 2014-12-30 22:45:44 -05:00
Ron Levine 714d575e3b correct reference file name 2014-12-25 14:00:39 -05:00
Ron Levine a7fba5c209 restructure and add more tests 2014-12-25 13:57:54 -05:00
Ron Levine 64375f6341 Messages that were going to stdout now going to stderr
Make PairHMM outputs go to stderr instead of stdout

Change output from stdout to stderr in close()

Updated lib with output going to stderr
2014-12-23 11:03:29 -05:00
Ron Levine 069398ad46 Added more tests and documentation 2014-12-19 12:57:43 -05:00
Laura Gauthier a9694951d2 Add error handling for genotypes that are called but have no PLs 2014-12-18 15:03:20 -05:00
Geraldine Van der Auwera b0e615251b Updated VQSR tool docs 2014-12-18 12:59:37 -05:00
rpoplin 4a2ac38308 Merge pull request #790 from broadinstitute/rp_nsubtil_fix-snp-detection
BQSR bug fix from @nsubtil
2014-12-18 09:19:53 -05:00
Ron Levine 08790e1dab Fix multiallelic info field annotation for VariantAnnotator
Add multi-allele test for info field annotations

Fix to process all types of INFO annotations

roll back to previous version, removes INFO and FORMAT

Correct @return for VariantAnnotatorEngine.getNonReferenceAlleles()

Enhance comments and clean up multi-allelic logic, handle header info number = R

only parse counts of A & R

Add INFO for AC

update MD5

Performance enhancement, only parse multiallelic with a count A or R

Make argument final in getNonReferenceAlleles()

Code cleanup, add exceptions for bad expression/allele size mismatch and missing header info for an expression

Change exception to warning for expression value/number of alleles check

remove advertised exceptions
2014-12-17 22:21:00 -05:00
Ron Levine ba949389c5 matchHaplotypeAlleles() no longer calls alleleSegregationIsKnown(), added a TODO to investigate 2014-12-17 14:02:24 -05:00
Ryan Poplin d84970ff75 BQSR bug fix from @nsubtil
-- Ignore SNP matches that lie outside the clipped read window
-- This fixes an issue where GATK would skip the entire read if a SNP is entirely
contained within a sequencing adapter.
2014-12-17 10:04:37 -05:00
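The idea behind the BQSR fix above (ignore known-SNP matches outside the clipped read window instead of letting them affect the whole read) can be illustrated with a small sketch. The function name and the use of plain integer positions are assumptions made for this example; the actual GATK implementation is in Java and works on its own read and feature types.

```python
from typing import Iterable, List


def snps_inside_clipped_window(snp_positions: Iterable[int],
                               clip_start: int,
                               clip_end: int) -> List[int]:
    """Keep only known-SNP positions inside the inclusive window [clip_start, clip_end].

    Positions that fall in clipped sequence (for example, adapter) are dropped,
    so they can no longer influence how the rest of the read is treated.
    """
    if clip_start > clip_end:
        # Nothing of the read survives clipping, so no position can match.
        return []
    return [pos for pos in snp_positions if clip_start <= pos <= clip_end]
```

For a read whose clipped window is 100 to 200, a known SNP at position 95 (inside the adapter) is ignored, while SNPs at 100 and 150 are still considered.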
Ron Levine 56f8e4f9cf Add comments, alleleSegregationIsKnown() check is added to matchHaplotypeAlleles() 2014-12-17 03:25:26 -05:00
Laura Gauthier 011843c569 Fixed huge bug from 9895005a (CombineGVCFs used to stop after the first contig) 2014-12-16 12:43:32 -05:00
rpoplin bcc6b73e9b Merge pull request #786 from broadinstitute/pd_variantstotable_sma
Fix VariantsToTable output of FORMAT record lists when -SMA is specified
2014-12-16 10:37:22 -05:00
Valentin Ruano-Rubio 736a857e82 Fixing CombineGVCFs that writes out the wrong REF allele
Story:
=====

  - https://www.pivotaltracker.com/story/show/83259038

Changes:
=======

  - Made minimal changes to implement the fix after an arduous attempt to understand
    CombineGVCFs code.

Test:
====

  - Added an integration test to explicitly test for the bug.

  - Updated an md5, as the bug was actually affecting one of the existing
    integration tests.
2014-12-13 22:38:24 -05:00
Phillip Dexheimer 71bdfbe465 Fix VariantsToTable output of FORMAT record lists when -SMA is specified
* PT 84242218
 * Note that FORMAT fields behave the same as INFO fields - if the annotation has a count of A (one entry per Alt Allele), it is split across the multiple output lines.  Otherwise, the entire list is output with each field
2014-12-10 21:41:15 -05:00
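The splitting rule described in the note above (fields with a count of A are split across the output lines, all other lists are repeated whole) can be sketched as below. This is an illustrative Python sketch, not VariantsToTable itself: the function name, the dictionary-based record, and the `per_alt_keys` parameter are hypothetical stand-ins for the header's per-field count information.

```python
from typing import Dict, List, Set, Union

Value = Union[str, List[str]]


def split_multi_allelic(alts: List[str],
                        fields: Dict[str, Value],
                        per_alt_keys: Set[str]) -> List[Dict[str, str]]:
    """Produce one output row per alternate allele.

    Keys in per_alt_keys (count = A) contribute the single entry for that
    allele; any other list-valued field is written out in full on every row.
    """
    rows: List[Dict[str, str]] = []
    for i, alt in enumerate(alts):
        row: Dict[str, str] = {"ALT": alt}
        for key, value in fields.items():
            if isinstance(value, list):
                if key in per_alt_keys:
                    if len(value) != len(alts):
                        raise ValueError(f"{key}: expected one entry per alt allele")
                    row[key] = value[i]
                else:
                    row[key] = ",".join(value)
            else:
                row[key] = value
        rows.append(row)
    return rows
```

With alts `["A", "T"]`, an `AF` field of `["0.5", "0.25"]` marked as per-alt, and an `AD` field of `["10", "4", "2"]`, the first row gets `AF=0.5` and the second `AF=0.25`, while both rows carry `AD=10,4,2`.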
rpoplin bf2911d62c Merge pull request #783 from broadinstitute/pd_splitsamfile
Fix NPE in SplitSamFile
2014-12-08 09:39:03 -05:00
Valentin Ruano-Rubio 385186e11b Makes GQ of Hom-Ref Blocks in GVCF output to be consistent with PLs
Story:
-----

  - https://www.pivotaltracker.com/story/show/83800586

Changes:
-------

  - In GVCFWriter, GQ is now recalculated from the final PL array for the block.

Testing:
-------

  - Updated affected integration test md5s
2014-12-07 16:45:32 -05:00
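Recalculating GQ from a PL array, as the commit above describes, usually amounts to taking the gap between the two smallest PL values. The sketch below assumes that convention and a cap of 99; the function name and the constant `MAX_GQ` are illustrative, and the actual GVCFWriter code is Java and may differ in detail.

```python
from typing import Sequence

MAX_GQ = 99  # assumed cap on genotype quality


def gq_from_pls(pls: Sequence[int]) -> int:
    """Derive GQ as (second-smallest PL) - (smallest PL), capped at MAX_GQ."""
    if len(pls) < 2:
        raise ValueError("need at least two PL values")
    best, second = sorted(pls)[:2]
    return min(second - best, MAX_GQ)
```

Deriving GQ this way keeps it consistent with the PLs by construction: a block with PLs `[0, 30, 400]` gets GQ 30, and one with `[0, 120, 1500]` is capped at 99.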
Phillip Dexheimer a5dee8a42e Fix NPE in SplitSamFile
* PT 82892316
  * Added integration test
  * Fixed similar error in debug output of HC
2014-12-07 10:37:30 -05:00
Ron Levine c9175eeee8 Renamed PhasingUtilitiesUnitTest to PhasingUtilsUnitTest 2014-12-02 18:20:12 -05:00
Ron Levine b8f0f3fdd2 Add argument for loading the vector HMM library once 2014-12-02 10:13:56 -05:00
Ron Levine 386aeda022 Add HaplotypeCaller argument so integration tests can specify the hardware dependent PairHMM sub-implementation 2014-11-25 21:53:53 -05:00
Ron Levine 34241a62f6 Use a publicly accessible sequence file 2014-11-24 11:18:21 -05:00
Ron Levine 6ff698c556 Added HP and non-HP tests for matchHaplotypeAlleles(), added a nominal test for mergeIntoMNPvalidationCheck() 2014-11-24 11:08:04 -05:00
Ron Levine 61e1a3ecd1 Added the framework for testing the PhasingUtils methods matchHaplotypeAlleles() and reallyMergeIntoMNP() 2014-11-22 22:01:39 -05:00
Menachem Fromer 9b73c8a841 Fix MNP merging bugs 2014-11-21 06:42:51 -05:00
rpoplin 00027e1555 Merge pull request #774 from broadinstitute/ldg_makeSelectVariantsTrimAlleles
Add -trim argument to SelectVariants to trim alleles to minimal represen...
2014-11-13 13:58:13 -05:00
Ron Levine 67656bab23 Resolved conflict during rebasing
Add more logging to annotators, change loggers from info to warn

Add comments to testStrandBiasBySample()

Clarify comments in testStrandBiasBySample

remove logic for not processing an indel if strand bias (SB) was not computed

remove per variant warnings in annotate()

Log warnings if using the wrong annotator or missing a pedigree file

Log test failures once in annotate(), because HaplotypeCaller does not call initialize(). Avoid using exceptions

Fix so only log once in annotate(), Hardy-Weinberg does not require pedigree files, fix test MD5s so tests pass

Check if founderIds == null

Update MD5s from HaplotypeCaller integrations tests and clean up code

Change logic so SnpEff does not throw exceptions, change engine to utils in imports

Update test MD5s, return immediately if cannot annotate in SnpEff.initialization()

Post peer review, add more logging warnings

Update MD5 for testHaplotypeCallerMultiSampleComplex1, return null if PossibleDeNovo.annotate() is not called by VariantAnnotator
2014-11-12 02:45:49 -05:00
Laura Gauthier 783a4fd651 Change default behavior of SelectVariants to trim remaining alleles when samples are subset. -noTrim argument preserves original alleles. Add test for trimming. 2014-11-11 16:32:25 -05:00
Valentin Ruano-Rubio c5977e5c8f Correct wrong left-alignment of reads in HC bamout
Story:
-----

  https://www.pivotaltracker.com/story/show/80684230

Changes:
-------

  - Corrected the bug: AlignmentUtils#createReadAlignedToRef was
    not realigning against the reference but the best haplotype for
    the read.

Test:
----

  - Added integration test in HaplotypeCallerIntegrationTest to check
    that the bug has been fixed.
  - Fixed md5s modified by this change; these are caused by small
    changes in the state of the random-number generator and in read vs
    variant site overlap.
2014-11-10 10:09:58 -05:00
Laura Gauthier c09667a20d Fix bug in CombineGVCFs so now sample 2 variants occurring within sample 1 deletions get merged properly.
CombineGVCFs now outputs ref conf for the duration of deletions so that SNPs occurring in other samples aligned with those deletions will be genotyped correctly
2014-11-05 09:11:47 -05:00
Khalid Shakir 0092a0b9eb Faster builds, with updates to documentation generation.
Reading the multiple GATKText files as a single stream, especially with new top level target executable jar files pointing to a lib folder.
Don't dirty the build with a new GATKText.properties if input files are unmodified.
Stop warning on undocumented abstract classes.
Fixed ClassNotFoundException/NoClassDefFoundError by fixing ResourceBundleExtractorDoclet artifact.
Excluding Exceptions from documentation.
Removed custom log4j dependency from ResourceBundleExtractorDoclet.
Stop generating the dependency reduced pom during shade.
Stop regenerating gsalib when the files are already up to date.
Disabled mvn site generation from external-example.
2014-11-05 00:32:23 +08:00
Khalid Shakir 1cb4b99548 Added faster built executable, non-packaged jars.
Moved top level target symlinks to package jar files to under target/package.
Executable jar files are placed under target/executable with the new target[/lib] directories.
Under top level target, symlinks to *either* the package *or* the executable jars replace what was a symlink to the package jar path.
Allow disabling of the shade package.
ant-bridge.sh by default only builds executable jars, and doesn't package by default, as did the old ant build.xml.
Added a new package_path.sh utility script for other scripts to use instead of anything in the target folder.
2014-11-05 00:30:46 +08:00
Phillip Dexheimer 10f99cbe04 Added StrandAlleleCountsBySample annotation
This annotation outputs the number of reads supporting each allele, stratified by sample and read strand.
Addresses PT 76958712
2014-11-03 21:35:58 -05:00
Khalid Shakir 8b81031bf8 Disabling tests for Lsf706 specific functionality. 2014-11-04 01:31:18 +08:00
Phillip Dexheimer bcfd9ce19a Moved platform flow information into NGSPlatform
* Explicitly added a type for rarely used platforms
 * PT 81767718
2014-10-31 22:27:34 -04:00
rpoplin c84805c402 Merge pull request #768 from broadinstitute/pd_bcf_failures
Fix BCF writing when FORMAT annotations contain arrays
2014-10-31 15:30:56 -04:00
rpoplin eecb56e0ae Merge pull request #766 from broadinstitute/ldg_StrandBiasForMultiallelics
Calculate StrandBiasBySample using all alternate alleles as ref vs. any ...
2014-10-31 15:26:07 -04:00
Phillip Dexheimer fc67e50faa Revved Picard/htsjdk
Removed inefficient array->List conversion in AlleleCountBySample
2014-10-30 21:16:25 -04:00
Laura Gauthier bc7202fff7 Calculate StrandBiasBySample using all alternate alleles as ref vs. any alt 2014-10-30 11:52:06 -04:00
Khalid Shakir 5c9fe1a06d Split all imports of tools|engine from utils, and all tools from engine.
Second of two commits, modifying actual files.
2014-10-24 20:59:46 +08:00
Khalid Shakir bb7151192a Split all imports of tools|engine from utils, and all tools from engine.
First of two commits, renaming files only.
2014-10-24 20:59:45 +08:00
Geraldine Van der Auwera b69b256003 Update pom versions to mark the start of GATK 3.4 development 2014-10-23 22:31:44 -04:00
Geraldine Van der Auwera eee94ec81f Update pom versions for the 3.3 release 2014-10-23 22:25:17 -04:00
Geraldine Van der Auwera 3ba94b987c Minor documentation clarifications 2014-10-22 17:54:11 -04:00
rpoplin 0f89d1a362 Merge pull request #755 from broadinstitute/sc_Annotation_Docs_73647570
Improvements to documentation of variant annotations
2014-10-22 13:41:00 -04:00
Sheila Chandran b3c5ed4414 Improvements to documentation of variant annotations
- Added or modified explanations for majority of variant annotations
	- Generalized NBaseCount to include all tech platforms (not just SOLiD)
2014-10-21 18:20:04 -04:00
Geraldine Van der Auwera 895b8c5931 Minor fix for missing INFO key definition in VCF header 2014-10-21 16:50:37 -04:00
rpoplin c4fcd70a88 Merge pull request #754 from broadinstitute/rhl_variant_array_exception
Do not process a variant if it is too large (> readLength), and log an e...
2014-10-21 12:01:52 -04:00
rpoplin bcf6be0b08 Merge pull request #753 from broadinstitute/ldg_HCzeroDepth
Fix GenotypeGVCF bugs in -allSites mode
2014-10-21 12:00:04 -04:00
Laura Gauthier 2b848ad859 Variants that become hom-ref after regenotyping in GenotypeGVCFs are now getting output in -allSites mode. 2014-10-21 08:21:53 -04:00
Laura Gauthier 5465e4484e For GenotypeGVCFs -allSites mode, make genotypes no-call if depth is zero. 2014-10-21 08:21:43 -04:00
Ron Levine 239151ac7b Do not process a variant if it is too large (> readLength), and log an error
remove final keyword before refMap and altMap, constructHaplotype() changes their values

return ArtificialHaplotype from constructHaplotype instead of passing as an argument

Add logic so arraycopy does not throw an IndexOutOfBoundsException, add test for a long insert
2014-10-20 15:51:32 -04:00
Phillip Dexheimer b348ce8f25 Added -disableOptimizations argument to HaplotypeCaller.
* This argument is intended to be used in conjunction with -bamout, and disable early-exit optimizations to allow reference regions to be contained in the output bam
  * Also forcibly includes the reference haplotype in the set of haplotypes given to the BAMWriter
  * Made -dontTrimActiveRegions visible, as it is likely also desirable in this use case
  * Addresses PT 77731660
2014-10-16 21:11:20 -04:00
Laura Gauthier 0f08065ebc Throw UserException if input VCFs have duplicate samples but no genotypemergeoption is specified 2014-10-15 16:03:10 -04:00
Laura Gauthier 81482138ca Decrease interval on CGP integration test to reduce test execution time 2014-10-15 11:28:27 -04:00
Geraldine Van der Auwera e7e8052f84 Updated license information
- Updated license files (private/protected) for version, address and a couple of legal clauses
- Updated license snippet throughout the codebase
2014-10-14 17:10:12 -04:00
Ron Levine 36c27155af Made the threshold for the probability of a state being active a command line argument
remove TODO comment after activeProbThreshold

recover static ACTIVE_PROB_THRESHOLD for unit tests

Add min/max values for active_probability_threshold parameter

Move activeProbThreshold parameter to GATKArgumentCollection

define ACTIVE_PROB_THRESHOLD in unit tests

add construction of argCollection in ctor

Move arguments from GATKArgumentCollection to ActiveRegionWalker

Throw exception if threshold < 0 or > 1 in ActivityProfile ctor

max propagation distance parameter to ActiveRegionWalker for ActivityProfile

Use polymorphic getMaxProbPropagationDistance() so BandPassActivityProfile computes the correct region size cutoff

Get the maxProbPropagationDistance from the super class's method, instead of directly, this is safer

Removed extraneous command line imports and make maxProbPropagationDistance a hidden argument

remove limit check for activeProbThreshold, not necessary because the check is made when input as a command line arg

Remove extra 'region' in the doxygen param description for maxProbPropagationDistance
2014-10-10 10:36:02 -04:00
Ron Levine 645d418015 Changed hardcoded downsampling max/min coverage values to parameters
Rename parameters using camel case and add to integration test

Correct documentation for maxReadsInRegionPerSample and minReadsPerAlignmentStart

Change the argument--minReadsPerAlignmentStart in the integration test from 50 to 5

'each genomic location' only pertains to minReadsPerAlignmentStart, not maxReadsInRegionPerSample
2014-10-09 17:09:26 -04:00
Valentin Ruano-Rubio a3ad6f63bd Reduce execution time of various integration tests
Story:

    https://www.pivotaltracker.com/story/show/79461912
2014-09-30 13:28:55 -04:00
rpoplin 329bd081b7 Merge pull request #736 from broadinstitute/rhl_remove_line
removed an unneeded import that broke maven
2014-09-29 15:03:55 -04:00
Ron Levine 1c9d60c9a0 removed an unneeded import that broke maven 2014-09-29 12:57:33 -04:00
Valentin Ruano-Rubio 311b6815b3 Fixed the QUAL calculation of the EXACT_INDEPENDENT.
The QUAL value calculated by this Exact AF Calculator is greatly underestimated when
there is more than one alternative allele (non-biallelic sites). The reason is
that the QUAL was roughly calculated by adding the QUALs resulting from each alternative
allele vs all other alleles, reference and alts, collapsed. This is ok for MLEAC
calculations but not for QUAL.

Now, for calculating the QUAL we collapse all the alternatives into one. This change
improves sensitivity at the cost of additional false positives, but this is naturally expected.
The resulting QUAL column is much closer to the one returned by the reference implementation.

Story:

  https://www.pivotaltracker.com/story/show/75926368.

Changes:

  Changed the QUAL calculation as described above.
  Updated MD5s.

Fixed MD5s
2014-09-29 11:04:52 -04:00
Valentin Ruano-Rubio 0e52b8ba5a Fixed MLEAC and QUAL inaccuracy in GeneralPloidyExactAFCalculator.
The problem was that the MLE table calculation aborted "unlikely"
genotype combinations too aggressively.

This also uncovered another bug where GeneralPloidyExactAFCalculation
makes a slightly different use of StateTracker
as compared to DiploidExactAFCalculation. We have generalized StateTracker
so that it works with the behavior of both callers.

Story:
-----

  * https://www.pivotaltracker.com/story/show/78920568

Changes:
-------

  * Fixes in GeneralPloidyExactAFCalculator.
  * Needed changes in StateTracker API and its consequences in DiploidExactAFCalculation.
  * Updated affected integration tests' MD5s after fixing the GeneralPloidyExactAF.
2014-09-23 15:40:54 -04:00
Valentin Ruano-Rubio f6cb83d476 Renamed AFCalc to AFCalculator for a better class naming 2014-09-12 14:59:58 -04:00
Valentin Ruano-Rubio 95b45443ae Updated test according to changes in the AF calculator framework.
Changes:
-------

* Updated current unit and integration test to use the new API components.
* Added unit tests for new classes AFPriorProvider and AFCalculatorProviders.
* Added integration test for mixed ploidy GenotypeGVCFs and CombineGVCFs
2014-09-12 14:59:47 -04:00
Valentin Ruano-Rubio 3cdeab6e9e GenotypingEngines and walkers now use AFCalc(ulator) providers rather than instantiate their own (fixed) calculators directly.
Changes:
-------

* GenotypingEngine now uses an AFCalc provider instead of
  its own thread-local, one-time initialized, fixed
  AF calculator.

* All walkers that use a GenotypingEngine now pass
  the appropriate AF calculator provider. For now most
  just use a fixed calculator (FixedAFCalculatorProvider),
  except GenotypeGVCFs, which can now cope with a
  mixture of ploidies by failing over to a general-ploidy
  calculator when the preferred implementation is not
  able to handle a site's analysis.
2014-09-12 14:25:09 -04:00
Valentin Ruano-Rubio 935bd1394b AFCalculatorProvider components to allow for dynamic instantiation of different AFCalc(ulators) to cope with
dynamic ploidy and max-alt-allele counts (the latter not used for now).
2014-09-12 14:23:45 -04:00
Valentin Ruano-Rubio ce8e93fa51 Made the AF prior probability distribution dynamic with respect
to the total ploidy (ploidy added across samples).

Changes:
--------

* Instead of calculating a fixed log10 prior array with a fixed
   total likelihood we use a new component, the AFPriorProvider,
   to generate the priors for different total ploidies on
   demand; these are cached, however, so there is no unnecessary
   recomputation involved.
2014-09-12 14:23:37 -04:00
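Generating priors on demand per total ploidy and caching them, as the commit above describes for AFPriorProvider, could look like the following sketch. Everything concrete here is an assumption for illustration: the heterozygosity-based prior form (probability theta / i for allele count i, with the remainder assigned to allele count 0), the default value of 0.001, and the function name are not taken from the commit, and the real component is a Java class.

```python
import math
from functools import lru_cache
from typing import Tuple

HETEROZYGOSITY = 1e-3  # assumed default; not stated in the commit message


@lru_cache(maxsize=None)
def log10_af_priors(total_ploidy: int,
                    heterozygosity: float = HETEROZYGOSITY) -> Tuple[float, ...]:
    """Return log10 priors for allele counts 0..total_ploidy.

    Results are memoized per (total_ploidy, heterozygosity), so each total
    ploidy is computed once and then served from the cache.
    """
    if total_ploidy < 1:
        raise ValueError("total ploidy must be at least 1")
    # Assumed prior form: P(AC = i) = heterozygosity / i for i >= 1.
    non_ref = [heterozygosity / ac for ac in range(1, total_ploidy + 1)]
    ref = 1.0 - sum(non_ref)
    if ref <= 0.0:
        raise ValueError("heterozygosity too large for this total ploidy")
    return tuple(math.log10(p) for p in [ref] + non_ref)
```

Because the array length depends on the total ploidy, sites with different sample ploidies each get an array of the right size without any fixed upper bound.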
Valentin Ruano-Rubio 31e58ae4ec Refactored AFCalc to remove unnecessary capability limits, allowing it to deal
with mixed ploidies and max-alt-allele number changes dynamically.

Changes:
--------

* Moved the AFCalcFactory.Calculation enum into a top-level class,
    AFCalculatorImplementation.
* Gave more responsibilities to the enum, such as resolving the constructor
    method once per implementation and the best-model selection algorithm.
* Removed test-code only fields and methods from AFCalc; just used to perform
    unit-testing and not any actual functionality of this component.
* Removed the fixed ploidy constraint of GeneralPloidyExactAFCalc
    implementation... now can deal with mixed ploidies that may change
    per site and sample.
* Removed the fixed maxAltAllele restriction by allowing resizing of
    the stateTracker structures.
* Due to the previous two points, calls to the AFCalc object are now passed
    the default ploidy to assume in case some genotype in the input
    VC does not have it, and the max-alt-allele.
* Also due to those changes, removed the now totally useless 3 int
    parameters from all AFCalc constructors.
* Cleaned up the code a bit by removing components and methods that are no longer used.
2014-09-12 14:17:36 -04:00
Ryan Poplin 48252897b4 Added ignore all filters options to VQSR walkers 2014-09-11 15:11:41 -04:00
Eric Banks 31cea25c36 Merge pull request #730 from broadinstitute/eb_inbreeding_coeff_unit_test
Cleaned up and fleshed out unit tests for the Inbreeding Coefficient annotation class
2014-09-10 09:32:49 -04:00
Eric Banks 5e490362ca Cleaned up and fleshed out unit tests for the Inbreeding Coefficient annotation class. 2014-09-08 11:40:39 -04:00
Eric Banks cc175bad40 Improve the accuracy of dangling head merging in the HC assembler.
Dangling head merging (like with tails) is now enabled by default.
The --recoverDanglingHeads argument is now deprecated so that users know not to use it anymore.
We now also allow the user to set the minimum branch length for merging.  This will be different
for exomes and RNA (see below).

The other changes in the code itself:
1. We no longer allow an arbitrarily large number of mismatches in the dangling head for merging
2. The max number of mismatches allowed in a dangling head is proportional to the kmer size

There will be a difference in the RNA calling pipeline.  Instead of invoking '--recoverDanglingHeads'
the user will instead want to use '--minDanglingBranchLength 0'.

Below are the knowledgebase results of the master branch vs. this one.

For NA12878 DNA Exome:

master  SNPS         TRUE_POSITIVE                                36722
master  SNPS         CALLED_NOT_IN_DB_AT_ALL                       2699
master  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE        292
master  SNPS         FALSE_POSITIVE_SITE_IS_FP                       70

branch  SNPS         TRUE_POSITIVE                                36867
branch  SNPS         CALLED_NOT_IN_DB_AT_ALL                       2952
branch  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE        387
branch  SNPS         FALSE_POSITIVE_SITE_IS_FP                       94

As I discussed with Ryan in person, there are a good number of FPs that are called in the new
code, but they nearly all have bad strand bias and should be easily filtered by VQSR.
Note that there is no change for indels.

For NA12878 RNA from Ami:

master  SNPS         TRUE_POSITIVE                                11055
master  SNPS         CALLED_NOT_IN_DB_AT_ALL                        831
master  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE         44
master  SNPS         FALSE_POSITIVE_SITE_IS_FP                       96

branch  SNPS         TRUE_POSITIVE                                11113
branch  SNPS         CALLED_NOT_IN_DB_AT_ALL                        874
branch  SNPS         REASONABLE_FILTERS_WOULD_FILTER_FP_SITE         47
branch  SNPS         FALSE_POSITIVE_SITE_IS_FP                       92

Again, there's basically no change for indels.
2014-09-07 08:55:59 -04:00
Phillip Dexheimer a35f5b8685 Moved arguments controlling options in output files into the engine
* Arguments involved are --no_cmdline_in_header, --sites_only, and --bcf for VCF files and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files
 * PT 52740563
 * Removed ReadUtils.createSAMFileWriterWithCompression(), replaced with ReadUtils.createSAMFileWriter(), which applies all appropriate engine-level arguments
 * Replaced hard-coded field names in ArgumentDefinitionField (Queue extension generator) with a Reflections-based lookup that will fail noisily during extension generation if there's an error
2014-09-05 21:18:11 -04:00
droazen 5c4a3eb89c Merge pull request #727 from broadinstitute/ks_gatk_queue_package_test_updates
Various fixes for package tests.
2014-09-05 10:17:32 -04:00
Ryan Poplin a45acdfb89 StrandOddsRatio is now a standard annotation. 2014-09-05 08:33:37 -04:00
Khalid Shakir 376592f423 Various fixes for package tests.
Explicitly including gatk/queue test-jar artifacts in package test classpaths.
SelectVariantsIntegrationTest#testInvalidJexl now resets the JexlEngine silent flag that VariantFiltration.initialize() toggles.
External example no longer tries to unpack nonexistent gatk artifact jars during package tests.
2014-09-04 15:30:31 -04:00
Ryan Poplin 1b809268d5 fixing a few small typos in the HaplotypeCaller and related classes 2014-09-04 14:48:27 -04:00
droazen 5c087a6e1f Merge pull request #724 from broadinstitute/ks_remove_test_qscript_symbolic_links
Removed symlink creation for tests and qscripts
2014-09-04 09:10:54 -04:00
Eric Banks 538537dbf1 Merge pull request #718 from broadinstitute/mf_rbp_fix
Fix MNP merging code to work with explicit HP phase representation
2014-09-02 20:39:22 -04:00
Eric Banks 01e725cd1a Merge pull request #723 from broadinstitute/eb_fix_rna_splitting_PT77878554
Make sure that the OverhangFixingManager (used for splitting RNA reads) ...
2014-09-02 20:39:01 -04:00
Menachem Fromer 10f9001738 Fix MNP merging code to work with explicit HP phase representation 2014-09-02 17:25:08 -04:00
Eric Banks ff91ab8ba2 Make sure that the OverhangFixingManager (used for splitting RNA reads) handles unmapped reads. 2014-09-02 16:56:17 -04:00
Valentin Ruano Rubio c7925f6e5c Merge pull request #719 from broadinstitute/vrr_generalize_ploidy_in_genotype_gvcfs
Adds support for omniploidy to GenotypeGVCFs and CombineGVCFs.
2014-09-02 16:51:02 -04:00
Valentin Ruano-Rubio d363725b4b Adds support for omniploidy to GenotypeGVCFs and CombineGVCFs.
Same changes fixed the problem for GenotypeGVCFs and CombineGVCFs.

Stories:

  - https://www.pivotaltracker.com/story/show/77626044
  - https://www.pivotaltracker.com/story/show/77626854

Changes:

  - Generalized the code for the merging in GATKVariantContextUtils to cope
    with ploidy != 2.
  - GenotypeGVCFs now checks that the input's ploidy conforms to the '-ploidy'
    argument.
  - Moved out Reference Confidence VC merging code from GATKVariantContextUtils
    so that we can keep new code in protected.

Caveats:

  - GenotypeGVCFs can only deal with input files that have the same ploidy in
    all positions; the one that the user MUST indicate in the -ploidy argument
    (if different from the default 2).
  - CombineGVCFs won't necessarily complain if it's passed mixed ploidy
    inputs but you won't be able to genotype it with GenotypeGVCFs.

Test:

   - Removed deprecated unit tests for GATKVariantContextUtils.
   - Moved unit-tests regarding GVCF merging from GATKVariantContextUtilsUnitTest
     to ReferenceConfidenceVariantContextUtilsUnitTest.
   - Added unit test for new code for mapping genotype indices between allele
     index encoding in GenotypeLikelihoodCalculator.
   - GenotypeGVCFs and CombineGVCFs original integration test are unaffected
     by the change.
   - Added tetraploid run integration tests to check on non-diploid execution
     of GenotypeGVCFs and CombineGVCFs.
2014-09-02 15:06:47 -04:00
Khalid Shakir fcb0eca203 Now passing in the path to the GATK directory to tests.
Changed tests and scripts to use gatkdir full path instead of relative testdata/qscripts symbolic links.
Although symlinks not created, left the symlink deletion script execution with a comment about future removal.
Re-enabled example UG pipeline queue test.
Replaced all hardcoded strings of {public,private}/testdata with BaseTest variables.
Refactored temp list creation method from ListFileUtilsUnitTest to BaseTest.createTempListFile.
Removed list files with hardcoded paths, now using createTempListFile instead with private test dir variable.
2014-09-02 01:40:59 +08:00
Khalid Shakir 2d28972c88 The 'after' files are @Input files and committed in git, so don't delete them after tests. 2014-08-30 03:04:54 +08:00
Eric Banks 5b087c9897 Changed the functionality of the physical phasing in the HC: now hom vars are output as 0|1.
We do this for technical reasons, mostly because we don't genotype in the HC anymore; it's all
done downstream by GenotypeGVCFs so we can't be sure that the genotype will be hom var.  Also,
there are steps in the downstream pipeline where genotypes can change, so assuming anything in
the HC is a bad idea, and if we have phasing info in the het state, we want to propagate that forward.

Now, PGT tag fixing happens downstream in GenotypeGVCFs.
While I was in there I also cleaned up the code a bit and fixed a bug where annotation was happening
before genotype creation when using the --includeNonVariantSites argument.

Added tests accordingly.
2014-08-25 21:40:14 -04:00
Valentin Ruano-Rubio 6dc5cf0be0 Fixes some mismerged md5 updates from a previous merge into master 2014-08-24 20:47:07 -04:00
Eric Banks 9009c1e996 Merge pull request #715 from broadinstitute/vrr_disable_physical_phasing_for_nondiploid_hc
Disable physical phasing for non-diploid HC calling.
2014-08-23 20:58:51 -04:00
Valentin Ruano-Rubio 6695aeafd9 Disable physical phasing for non-diploid HC calling.
Story:

    https://www.pivotaltracker.com/story/show/77452256

Changes:

    If ploidy != 2, disable physical phasing and log an info message to let the user know.

Tests:

    Change md5s affected by this change.
2014-08-23 10:52:07 -04:00
Phillip Dexheimer 931890915f Add the --sample_name argument to HaplotypeCaller
* This is a shortcut for people who have multi-sample BAMs but would like to use GVCF mode.  Rather than creating single-sample BAMs with PrintReads, one could use the --sample_name argument to HaplotypeCaller to specify the single sample to make calls on
 * Completes PT 73075482
2014-08-22 23:22:03 -04:00
Valentin Ruano-Rubio fc5ce4b662 Created the stand-alone AC and AF annotation AlleleCountBySample
Story:

  https://www.pivotaltracker.com/story/show/77250524

Changes:

  - Remove the annotating code in GeneralPloidyExactAFCalc (GPEAFC) class.
  - Added asAlleleList to the GenotypeAlleleCounts class and got GPEAFC to use that instead of implementing its own (nicer and more reusable code).
  - Removed the explicit addition of AlleleCountBySample fields to the VCF header by the walker initialize
  - Added utility methods in Utils to wrap an int[] array into a List<Integer>, and a double[] array into a List<Double>, efficiently.

Test:

  - Added unit-testing for asAlleleList in GenotypeAlleleCountsUnitTest (within testFirst and testNext).
  - Added unit-testing for new methods in Utils : asList(int[]) and asList(double[])
  - Changed UG General Ploidy test to add explicitly those annotations.
  - Non-trivial changes in integration tests involving non-diploid runs (namely haploid and tetraploid) as they are not showing
    those annotations any longer, so the MD5s have been changed accordingly.
2014-08-22 20:33:25 -04:00
Eric Banks 36bdfa3918 Merge pull request #712 from broadinstitute/eb_physical_phasing_bug_PT77248992
Fixing bug in the physical phasing code, found by Valentin.
2014-08-21 15:25:51 -04:00
Eric Banks b1cb6196be Fixing bug in the physical phasing code, found by Valentin.
It turns out that there can be some really complex situations even with a single sample where
there are lots of unphasable hets around a hom.  Previously we were trying to phase each of the
hets against the hom, but that wasn't correct.  Instead we now detect that situation and don't
attempt to phase anything.
Added a unit test to cover this situation.
2014-08-21 15:24:09 -04:00
Laura Gauthier 9a5da41dd4 Add bells and whistles for Genotype Refinement Pipeline
New annotation for low- and high-confidence de novos (only annotates biallelics)
FamilyLikelihoodsUtils now add joint likelihood and joint posterior annotations
Restrict population priors based on discovered allele count to be valid for 10 or more samples.
2014-08-21 11:20:40 -04:00
Valentin Ruano-Rubio d31c5536aa Fixed the bug first by indicating the actual possible number of alternative alleles considering the extra <NON_REF> and second by resizing the StateTracker capacity when invoked by GeneralPloidyExactAFCalc deep within its implementation of computeLog10PNonRef, which is ultimately what gets rid of the exception.
Story:

  https://www.pivotaltracker.com/story/show/74471252
2014-08-20 14:42:42 -04:00
Laura Gauthier b512c7eac9 Refactor StrandBiasTest (using template method) and add warnings for when annotations may not be calculated successfully.
VariantAnnotator/FS behavior changes slightly: VA used to output zeros for FS if there was no strand bias info, now skips FS output (but will still show FS in header)
2014-08-20 08:18:53 -04:00
Valentin Ruano-Rubio 8d9a55ae60 Moving new omniploidy likelihood calculation classes to their final package (as far as this pull-request is concerned) in org.broadinstitute.gatk.tools.walkers.genotyper 2014-08-19 11:54:29 -04:00
Valentin Ruano-Rubio 611b7f25ea Adds unit-test and integration test for new omniploidy likelihood calculation components
Added md5 to HaplotypeCallerIntegrationTest.testHaplotypeCallerSingleSampleWithDbsnp
2014-08-19 11:53:19 -04:00
Valentin Ruano-Rubio 9ee9da36bb Generalize the calculation of the genotype likelihoods in HC to cope with haploid and multiploidy
Changes in several walkers to use the new sample and allele closed lists and the new GenotypingEngine constructor signatures

Rebase adoption of new calculation system in walkers
2014-08-19 11:53:06 -04:00
Valentin Ruano-Rubio f08dcbc160 Added the genotype likelihoods model interface and implementation for the random specimen sample from an infinite population with homogeneous ploidy across samples. 2014-08-19 11:50:13 -04:00
Valentin Ruano-Rubio 4f993e8dbe Added read-likelihoods array base structure to substitute existing Map-of-Map-of-Maps. 2014-08-19 11:50:12 -04:00
Valentin Ruano-Rubio 242cd0e58f Added genotype allele counts and likelihood calculator utilities for arbitrary ploidy and number of alleles 2014-08-19 11:50:12 -04:00
Valentin Ruano-Rubio b0a4cb9f0c Added closed sample and allele list data-structures and utility classes 2014-08-19 11:50:12 -04:00
Eric Banks d3f06024f8 Updated the physical phasing in the Haplotype Caller to address requests from ATGU.
1. It is now turned on by default
2. It now phases homozygous variants
3. Most importantly, it also phases variants that are always on opposite haplotypes

Changed the INFO keys to be PID and PGT, as described in the header.
2014-08-18 14:38:29 -04:00
Eric Banks 7e0c326e1c Merge pull request #706 from broadinstitute/vrr_reduce_hc_integration_test_time
Reduce intervals of integration tests in HaplotypeCallerIntegrationTest ...
2014-08-15 17:37:57 -04:00
Valentin Ruano-Rubio 2f79042dee Reduce intervals of integration tests in HaplotypeCallerIntegrationTest class
Story:

   https://www.pivotaltracker.com/story/show/74858854

Changes:

    Intervals have been shrunk so that the tests run in 15s or less.
2014-08-15 14:20:10 -04:00
Eric Banks eb84091702 Update the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection. 2014-08-14 15:34:09 -04:00
Ryan Poplin 3a9a78c785 Removing an assumption that ADs were in the same order if the number of alleles matched. This happens for example when one sample is C->T and another sample is C->G. 2014-08-13 13:26:40 -04:00
Eric Banks 27193c5048 Merge pull request #700 from broadinstitute/eb_phase_HC_variants_PT74816060
Initial implementation of functionality to add physical phasing informat...
2014-08-13 12:30:32 -04:00
Eric Banks 4512940e87 Initial implementation of functionality to add physical phasing information to the output of the HaplotypeCaller.
If any pair of variants occurs on all used haplotypes together, then we propagate that information into the gVCF.
Can be enabled with the --tryPhysicalPhasing argument.
2014-08-13 12:25:31 -04:00
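A minimal Python sketch (illustrative only, not GATK code) of the co-occurrence rule described in the commit above: two variants are treated as phased together when they are carried by exactly the same set of haplotypes. The function name `cooccurring_pairs` and the representation of a haplotype as a set of variant labels are assumptions made for this example.

```python
from itertools import combinations


def cooccurring_pairs(haplotypes):
    """Return pairs of variants carried by exactly the same haplotypes.

    `haplotypes` is a list of sets; each set holds the variant labels
    present on one haplotype (hypothetical representation).
    """
    variants = sorted(set().union(*haplotypes))
    pairs = []
    for a, b in combinations(variants, 2):
        # Indices of the haplotypes that carry each variant.
        carriers_a = {i for i, h in enumerate(haplotypes) if a in h}
        carriers_b = {i for i, h in enumerate(haplotypes) if b in h}
        if carriers_a == carriers_b:
            pairs.append((a, b))
    return pairs


# Variants A and B always travel together; C appears on only one haplotype.
example = [{"A", "B"}, {"A", "B", "C"}, set()]
print(cooccurring_pairs(example))
```

Comparing carrier sets by haplotype index, rather than comparing the haplotypes themselves, keeps the check correct when two haplotypes happen to carry identical variants.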
Valentin Ruano-Rubio b39508cd15 ReadLikelihoods class introduction final changes before merging
Stories:

        https://www.pivotaltracker.com/story/show/70222086
        https://www.pivotaltracker.com/story/show/67961652

Changes:

  Made some changes that I had missed to ensure that all PairHMM implementations use the same interface; as a consequence of the omission we were always running the standard PairHMM.
  Fixed some additional bugs detected when running it on full WGS single-sample and exome multi-sample data sets.
  Updated some integration test md5s.
2014-08-11 17:47:25 -04:00
Valentin Ruano-Rubio 9a9a68409e ReadLikelihoods class introduction final changes before merging
Stories:

        https://www.pivotaltracker.com/story/show/70222086
        https://www.pivotaltracker.com/story/show/67961652

Changes:

  Made some changes that I had missed to ensure that all PairHMM implementations use the same interface; as a consequence of the omission we were always running the standard PairHMM.
  Fixed some additional bugs detected when running it on full WGS single-sample and exome multi-sample data sets.
  Updated some integration test md5s.

Fixing GraphBased bugs with new master code
Fixed ReadLikelihoods.changeReads difficult to spot bug.
Changed PairHMM interface to fix a bug
Fixed missing changes for various PairHMM implementations to get them to use the new structure.
Fixed various bugs only detectable when running with full sample(s).
Believe to have fixed the lack of annotations in UG runs
Fixed integration test MD5s
Updating some md5s
Fixed yet another md5 probably left out by mistake
2014-08-11 17:46:28 -04:00
Valentin Ruano-Rubio 0b472f6bff Added new test to verify the functionality of ReadLikelihoods.java and its use in HC. Updated existing integration test md5s.
Stories:

    https://www.pivotaltracker.com/story/show/70222086
    https://www.pivotaltracker.com/story/show/67961652
2014-08-11 17:46:28 -04:00
Valentin Ruano-Rubio 2914ecb585 Change the Map-of-maps-of-maps for an array based implementation ReadLikelihoods to hold read likelihoods.
The array structure should be faster to populate and query (not properly benchmarked) and reduce memory footprint considerably.
    Nevertheless, removing the PairHMM factor (using likelihoodEngine Random) it only achieves a speed-up of 15% in some example WGS dataset,
    i.e. there are other bigger bottlenecks in the system. Bamboo tests also seem to run significantly faster with this change.

    Stories:

      https://www.pivotaltracker.com/story/show/70222086
      https://www.pivotaltracker.com/story/show/67961652

    Changes:

       - ReadLikelihoods added to substitute  Map<String,PerSampleReadLikelihoods>
       - Operations that involve changes in full sets of ReadLikelihoods have been moved into that class.
       - Simplified a bit the code that handles the downsampling of reads based on contamination

    Caveats:

       - We still keep Map<String,PerReadAlleleLikelihoodsMap> around to pass to annotators; didn't feel like changing the interface of so many public classes in this pull-request.
2014-08-11 17:46:28 -04:00
Ryan Poplin c56e493f98 Merge pull request #622 from broadinstitute/ldg_SORanalysis
Add StrandOddsRatio to default annotations produced by GenotypeGVCFs
2014-08-11 09:45:27 -04:00
Tim Fennell 5695f22da8 Changed the default GVCF Q Bands from 5,20,60 to be 1..60 by 1s, 60...90 by 10s and 99 in order to give finer resolution
for homref PLs and ADs at lower confidences and somewhat higher resolution at higher confidences.
2014-08-08 14:31:35 -04:00
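A small Python sketch (illustrative only, not GATK code) that enumerates the new default band boundaries as the commit above describes them. It assumes "1..60 by 1s, 60...90 by 10s and 99" means 1, 2, ..., 60, then 70, 80, 90, then 99; the function name `default_gq_bands` is hypothetical.

```python
def default_gq_bands():
    """Build the band boundaries under the reading stated above."""
    bands = list(range(1, 61))         # 1..60 in steps of 1
    bands += list(range(70, 91, 10))   # 70, 80, 90
    bands.append(99)
    return bands


old_bands = [5, 20, 60]
new_bands = default_gq_bands()
print(len(old_bands), len(new_bands))
```

Under that reading the default grows from 3 boundaries to 64, which is where the finer resolution at low confidence comes from.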
Laura Gauthier 35de598e4b Modify StrandOddsRatio calculation to take on lower values in cases where reference +/- reads are skewed but alt reads are not. Add SOR to default annotations produced by GenotypeGVCFs. Add jitter to minimum SOR values 2014-08-07 12:09:19 -04:00
Laura Gauthier f532f1f843 Fix nullPointerException 2014-08-07 10:13:17 -04:00
Laura Gauthier 74affcc077 Update inbreeding coefficient calculation to give a better estimate for multiallelic sites
Add unit test for compound het and for multiallelic hets
2014-08-07 08:12:47 -04:00
Eric Banks b9486f5b4d Merge pull request #693 from broadinstitute/ldg_SORfromHC
Allow SOR to be calculated from HC
2014-08-06 21:48:09 -04:00
Phillip Dexheimer 593663d9b6 Improved detection of missing argument values
In particular, it was possible to specify arguments for Files or Compound types without values
 Added a special "none" value for annotations, since a bare "-A" is no longer allowed
 Delivers PT 71792842 and 59360374
2014-08-05 20:31:31 -04:00
Laura Gauthier 5533199402 Allow SOR to be calculated from HC
Refactor StrandBiasTest classes
2014-08-01 20:47:58 -04:00
Ryan Poplin 63b3f7dfd3 Fixing typos in AnalyzeCovariates 2014-07-31 10:36:18 -04:00
Valentin Ruano-Rubio 750eb4b5a6 Add diploid only support message to HaplotypeCaller
Story:

  https://www.pivotaltracker.com/story/show/73440292

Changes:

  - Just add the conditional in HaplotypeCaller#initialize

Testing:

  - Nothing added, checked locally, trivial change that would eventually be removed anyway.
2014-07-29 17:05:36 -04:00
David Roazen 0798a4b768 Update pom versions to mark the start of GATK 3.3 development 2014-07-17 12:09:33 -04:00
David Roazen 323f22f852 Update pom versions for the 3.2 release 2014-07-17 12:06:22 -04:00
Eric Banks 98d88eb07e Fixed IndexOutOfBounds error associated with tail merging.
Don't expand out source nodes for tail merging, since that's a head merging action only.
This shows up as a bug only because we now allow merging tails against non-reference paths.
2014-07-17 12:04:22 -04:00
Geraldine Van der Auwera a6f632874b Various documentation improvements
- Edited intervals merging docs for correctness & clarity
- Edited VQSR arg docs and made mode required (+added -mode SNP to VQSR tests)
- Moved PaperGenotyper to Toy Walkers to declutter the actually useful docs
- Moved GenotypeGVCFs to Variant Discovery category and clarified a few points
- Clarified that the -resource argument depends on using the -V:tag format
- Clarified how the pcr indel model works
- Added caveat for -U ALLOW_N_CIGAR_READS
- Added MathJax support for displaying equations in GATKDocs
- Updated HC example commands and caveats
2014-07-14 12:03:03 -04:00
droazen db53d096c9 Merge pull request #684 from broadinstitute/ks_add_cofoja_to_gatk_packages
Added cofoja to the gatk packages for tests to pass.
2014-07-14 11:15:49 -04:00
Eric Banks ecefcb383d Disable the complex variant merging for now, as requested by ATGU 2014-07-11 17:27:40 -04:00
Khalid Shakir c7e357eb59 Added cofoja to the gatk packages for tests to pass. 2014-07-11 23:19:42 +08:00
droazen b8751ad598 Merge pull request #680 from broadinstitute/ldg_VQSRscript
Update VQSR Rnd BQSR  script generation code for compatibility with late...
2014-07-11 10:16:37 -04:00
Eric Banks 1d97b4a191 Improved tail merging: now tails can be merged to branches that are not entirely reference.
This is useful for e.g. cases where there are SNPs on insertions.  Before tails were forced to be merged
(incorrectly) only to a reference node, but now they can be merged to any path in the graph from which they
directly branch.

Also, I've transferred over Ryan's code to refuse to process kmer sizes for which the reference sequence
contains non-unique kmers.
2014-07-10 08:57:01 -04:00
Ryan Poplin 5eee065133 Merge pull request #674 from broadinstitute/rp_improve_genotyping
Improvements to genotyping accuracy.
2014-07-09 16:03:09 -04:00
Laura Gauthier 99026eb51b Update VQSR Rnd BQSR script generation code for compatibility with latest ggplot version. Update queueJobReport.R and public/gsalib/src/R/R/gsa.variantqc.utils.R also 2014-07-09 15:36:58 -04:00
Ryan Poplin 74a7674d70 Improvements to genotyping accuracy.
-- Global mismapping penalty was only applied to the reference haplotype. This led to problems with overlapping events, mostly STR haplotypes. Now the penalty is applied to every haplotype.
-- We subset the reads down to only those which overlap the event (after assembly based realignment) for likelihood calculations.
2014-07-09 13:11:07 -04:00
David Roazen 719e685759 Remove junit imports in the test suite 2014-07-09 12:09:27 -04:00
Eric Banks bad7865078 When converting a haplotype to a set of variants we now check for cases that are overly complex.
In these cases, where the alignment contains multiple indels, we output a single complex
variant instead of the multiple partial indels.

We also re-enable dangling tail recovery by default.
2014-07-01 14:18:59 -04:00
Ryan Poplin e14bff212d SB tables should be created even if the ref or alt columns have no counts. This is so that FS/SOR will still be calculated when the variant is extremely high or low frequency.
-- Removed long running HC integration test... sorry
2014-06-30 15:19:15 -04:00
Ryan Poplin 0127799cba Reads are now realigned to the most likely haplotype before being used by the annotations.
-- AD,DP will now correspond directly to the reads that were used to construct the PLs
-- RankSumTests, etc. will use the bases from the realigned reads instead of the original alignments
-- There is now no additional runtime cost to realign the reads when using bamout or GVCF mode
-- bamout mode no longer sets the mapping quality to zero for uninformative reads, instead the read will not be given an HC tag
2014-06-30 10:35:50 -04:00
Phillip Dexheimer 06d619e9aa Removed redundant SelectVariantsIntegrationTest, merged its only test into protected version 2014-06-24 18:59:59 -04:00
Eric Banks 2df2a153e6 Merge pull request #658 from broadinstitute/ldg_PbyTwithPriors
Updated CalculateGenotypePosteriors to compute genotype posteriors using...
2014-06-18 15:04:39 -04:00
Laura Gauthier 2356d5d63f Updated CalculateGenotypePosteriors to compute genotype posteriors using likelihoods from all members of the trio.
(Right now it only works if all members of the trio are called.)
Takes posteriors as input, defaulting to PLs
Added annotations for possible de novos for use in full genotype refinement pipeline
Added family priors to CGP integration test.
Changed CGP to use PP tag instead of GP tag because posteriors are Phred-scaled. Updated CGP integration test md5s to reflect change.
2014-06-18 11:17:15 -04:00
Phillip Dexheimer 2e78815055 Added missing arguments to GenotypeGVCFs
- New arguments are nda, hets, indelHeterozygosity, stand_call_conf, stand_emit_conf, ploidy, and maxAltAlleles
 - Addresses PT 70110918
 - To do this, moved those arguments out of the StandardCallerArgumentCollection into a new GenotypeCalculationArgumentCollection, which is now included as a member of SCAC
2014-06-16 08:10:54 -04:00
droazen 3079755b4c Merge pull request #646 from broadinstitute/ks_disable_distribution_with_private
Add maven -Pgsadev flag to build private jars only
2014-06-11 11:00:31 -04:00
Khalid Shakir f082572593 If passed -Pgsadev, don't build the distribution package. 2014-06-10 23:33:33 -04:00
Valentin Ruano Rubio db96891d4b Merge pull request #638 from broadinstitute/vrr_createTempFile_testfix
Changed File.createTempFile to BaseTest.createTempFile calls Test
2014-05-29 10:15:05 -04:00
Valentin Ruano-Rubio 07567fdae3 Removed debug code outputting files not removed after VM exits in ReadThreadingLikelihoodCalculationEngineUnitTest.
Notice however that this should not be the cause of recent problems as the code was deactivated.
2014-05-28 19:03:25 -04:00
Valentin Ruano-Rubio e0c221470c Changed File.createTempFile to BaseTest.createTempFile 2014-05-28 18:59:48 -04:00
EvolvedMicrobe ef7531d4a5 Merge pull request #640 from broadinstitute/IntegerSWImplementation
Change SmithWaterman to use integers instead of doubles.
2014-05-28 15:10:05 -04:00
Nigel Delaney cc45e62e8e Change SmithWaterman to use integers instead of doubles. 2014-05-28 13:13:14 -04:00
droazen ac52fa581a Merge pull request #644 from broadinstitute/ks_queue_test_temp_fix
Disabled ExampleUG Queue Tests, fixed internal extensions dependency.
2014-05-28 11:29:08 -04:00
Phillip Dexheimer c15e6fcc0e Refactored the static lookup arrays in MathUtils (log10Cache, log10FactorialCache, jacobianLogTable)
-They are now only computed when necessary
 -Log10Cache is dynamically resizable, either by calling get() on an out-of-range value or by calling ensureCacheContains
 -Log10FactorialCache and JacobianLogTable are initialized to a fixed size on first access and are not resizable
 -Addresses PT 69124396
2014-05-27 22:27:57 -04:00
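The lazily computed, resizable cache described in the commit above can be sketched in Python as follows. This is an illustration of the idea only, not the MathUtils implementation; the class and method names mirror the commit message, but their exact behaviour here is an assumption.

```python
import math


class Log10Cache:
    """Cache of log10(i) for i = 0..n that grows on demand."""

    def __init__(self):
        # Nothing beyond log10(0) is computed until a value is requested.
        self._values = [float("-inf")]

    def ensure_cache_contains(self, n):
        """Extend the cache so that indices 0..n are all present."""
        if n < 0:
            raise ValueError("n must be non-negative")
        for i in range(len(self._values), n + 1):
            self._values.append(math.log10(i))

    def get(self, n):
        """Return log10(n), growing the cache if n is out of range."""
        self.ensure_cache_contains(n)
        return self._values[n]

    def size(self):
        return len(self._values)


cache = Log10Cache()
print(cache.size())     # only the log10(0) entry so far
print(cache.get(1000))  # triggers a resize up to index 1000
print(cache.size())
```

Callers that know their upper bound in advance can call `ensure_cache_contains` once up front; everyone else pays for the growth on the first out-of-range `get`.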
Eric Banks b77589696e Merge pull request #643 from broadinstitute/rp_remove_hwp
Removing HWP from GenotypeSummaries because of integer overflow issues w...
2014-05-27 17:21:19 -04:00
Khalid Shakir 6c9e68ef41 Disabled ExampleUG Queue Tests, fixed internal extensions dependency.
EUG tests disabled due to new protected qscript directory path, post GATK artifact splitting.
2014-05-27 16:16:53 -04:00
David Roazen 74b51c5c7a Improve test suite tmp file cleanup
-Make BaseTest.createTempFile() mark any possible corresponding index files for deletion on exit

-Make WalkerTest mark shadow BCF files and auxiliary for deletion on exit

-Make VariantRecalibrationWalkersIntegrationTest mark PDF files for deletion on exit
2014-05-27 13:41:44 -04:00
Ryan Poplin b24cff780b Removing HWP from GenotypeSummaries because of integer overflow issues with 91K samples. Removing CCC because it is redundant. 2014-05-27 10:14:49 -04:00
Ryan Poplin ec7c4ea2ba Unfortunately dangling tail recovery is dangerous in exome data. Turning it off by default for now.
-- disabling HC+VA integration test because, as noted in the comments, it keeps switching PairHMM implementations and giving different results at a particular site used in that particular test
2014-05-23 14:33:44 -04:00
Valentin Ruano-Rubio 979ab0453e Moved GlobalEdgeGreedySWPairwiseAlignment to the archive 2014-05-23 01:48:48 -04:00
Valentin Ruano-Rubio 7c8a1ae892 Fix for SW to make double comparisons with a tolerance
Stories:

  - https://www.pivotaltracker.com/story/show/69577868

Changes:

  - Added an epsilon difference tolerance in weight comparisons.

Tests:

  - Added HaplotypeCallerIntegrationTest#testDifferentIndelLocationsDueToSWExactDoubleComparisonsFix
  - Updated md5 due to minor likelihood changes.
  - Disabled a test for PathUtils.calculateCigar since it does not work and it is unclear what is causing the error (needs original author input)
2014-05-23 01:48:48 -04:00
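To illustrate why exact comparisons of floating-point weights can change which alignment move wins, here is a minimal Python sketch (not the GATK Smith-Waterman code). The function name `argmax_with_tolerance` and the epsilon value `1e-9` are assumptions chosen for the example.

```python
def argmax_with_tolerance(scores, epsilon=1e-9):
    """Index of the best score; a later candidate wins only if it is
    better by more than epsilon, so near-ties keep the earliest index."""
    best = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[best] + epsilon:
            best = i
    return best


# 0.1 + 0.2 is not exactly 0.3 in binary floating point, so an exact
# comparison treats the second candidate as strictly better.
scores = [0.3, 0.1 + 0.2]
exact_choice = max(range(len(scores)), key=lambda i: scores[i])
tolerant_choice = argmax_with_tolerance(scores)
print(exact_choice, tolerant_choice)
```

With the tolerance, two weights that differ only by rounding noise are treated as a tie, and the tie is resolved by a fixed preference order rather than by the noise.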
Khalid Shakir b7e98bdae9 Fixed GATK docs artifact, moved protected ExampleUG tests. 2014-05-22 21:03:55 -04:00
Ryan Poplin 581843d994 Minor updates to HC docs. 2014-05-20 10:01:11 -04:00
Khalid Shakir 88d7e23c44 After talking with Mauricio and Karthik, updated MD5s and added a note about PairHMM causing test variability. 2014-05-19 17:36:41 -04:00
Karthik Gururaj 972a82d386 Changed 'sting' to 'gatk' in the VectorLoglessPairHMM classes and the
C++ code
2014-05-19 17:36:41 -04:00
Khalid Shakir 3939971d78 After renaming the packages, instead of updating the JNI library used for testing bwa, moving the classes to the archive.
NOTE: The migrated README.md has been added that will allow others to possibly resurrect this code as needed.
2014-05-19 17:36:41 -04:00
Khalid Shakir 2c854e554a Refactored maven directories and java packages replacing "sting" with "gatk".
To reduce merge conflicts, this commit modifies contents of files, while file renamings are in previous commit.
See previous commit message for list of changes.
2014-05-19 17:36:39 -04:00
Khalid Shakir 4e6d43d003 Refactored maven directories and java packages replacing "sting" with "gatk".
To reduce merge conflicts, this commit only renames files, while file modifications are in next commit.
Some updates/fixes here are actually included in the next commit.
= Maven updates
Moved artifacts to new package names:
* private/queue-private -> private/gatk-queue-private
* private/gatk-private -> private/gatk-tools-private
* public/gatk-package -> protected/gatk-package-distribution
* public/queue-package -> protected/gatk-queue-package-distribution
* protected/gatk-protected -> protected/gatk-tools-protected
* public/queue-framework -> public/gatk-queue
* public/gatk-framework -> public/gatk-tools-public
New poms for new artifacts and packages:
* private/gatk-package-internal
* private/gatk-queue-package-internal
* private/gatk-queue-extensions-internal
* protected/gatk-queue-extensions-distribution
* public/gatk-engine
Updated references to StingText.properties to GATKText.properties.
Updated ant-bridge.sh to use gatk.* properties instead of sting.*.
= Engine updates
Renaming files containing engine parts from o.b.gatk.tools to o.b.gatk.engine.
Changed package references from tools to engine for CommandLineGATK, GenomeAnalysisEngine, ReadMetrics, ReadProperties, and WalkerManager.
Changed package reference tools.phonehome to engine.phonehome.
Renamed classes *Sting* to *GATK*, such as ReviewedGATKException.
= Test updates
Moved gatk example resources.
Moved test engine files from tools to engine packages.
Moved resources for phonehome to proper package.
Moved test classes under o.b.gatk into packages:
* o.b.g.utils.{BaseTest,ExampleToCopyUnitTest,GATKTextReporter,MD5DB,MD5Mismatch,TestNGTestTransformer}
* o.b.g.engine.walkers.WalkerTest
Updated package names in DependencyAnalyzerOutputLoaderUnitTest's data.
= Queue updates
Moving queue scripts to location where generated extensions can be used.
Renamed *.q to *.scala, updating licenses previously missed by git hooks.
Moved queue extensions to new artifact gatk-queue-extensions.
Fixed import statements frequently merge-conflicting on FullProcessingPipeline.scala.
= BWA
Added README on how to obtain and include bwa as a library.
Updated libbwa build.
Fixed package names under bwa/java implementation.
Updated contents of BWCAligner native implementation.
= Other fixes
Don't duplicate the resource bundle entries by both unpacking *and* appending.
(partial fix) Staged engine and utils poms to build GATKText.properties, once Utils random generator dependency on GATK engine is fixed.
Re-enabled custom testng listeners/reporters and moved testng dependencies to the gatk-root.
Updated comments referencing Sting with GATK.
Moved a couple untangled classes from gatk-tools-public to gatk-utils and gatk-engine.
2014-05-19 16:43:47 -04:00
Khalid Shakir 67e44985b1 Java/Scala imports updated for new package names.
Fourth of four commits for picard/htsjdk package rename.
2014-05-08 19:13:31 +08:00
Laura Gauthier bf7b97393e Add ability to output to a file discordant loci and their respective genotypes for each sample 2014-05-07 10:12:45 -04:00
MauricioCarneiro f03a12263a Merge pull request #625 from broadinstitute/intel_updateCell_inlined
(Optional) Inlined the code from updateCell
2014-05-07 10:11:09 -04:00
Karthik Gururaj d9c489f928 Removed scary warning messages for VectorPairHMM 2014-05-06 10:59:24 -07:00
Karthik Gururaj fb8578ec8e Inlined the code from updateCell - helps Java JIT to detect hotspots and
produce good native code
2014-05-06 10:37:10 -07:00
Karthik Gururaj f6ea25b4d1 Parallel version of the JNI for the PairHMM
The JNI treats shared memory as critical memory and doesn't allow any
parallel reads or writes to it until the native code finishes. This is
not a problem *per se*; it is the right thing to do, but we need to
enable **-nct** when running the haplotype caller and with it have
multiple native PairHMM running for each map call.

Move to a copy based memory sharing where the JNI simply copies the
memory over to C++ and then has no blocked critical memory when running,
allowing -nct to work.

This version is slightly (almost unnoticeably) slower with -nct 1, but
scales better with -nct 2-4 (we haven't tested anything beyond that
because we know the GATK falls apart with higher levels of parallelism).

* Make VECTOR_LOGLESS_CACHING the default implementation for PairHMM.
* Changed version number in pom.xml under public/VectorPairHMM
* VectorPairHMM can now be compiled using gcc 4.8.x
* Modified define-* to get rid of gcc warnings for extra tokens after #undefs
* Added a Linux kernel version check for AVX - gcc's __builtin_cpu_supports function does not check whether the kernel supports AVX or not.
* Updated PairHMM profiling code to update and print numbers only in single-thread mode
* Edited README.md, pom.xml and Makefile for users to pass path to gcc 4.8.x if necessary
* Moved all cpuid inline assembly to single function
* Changed info message to clog from cinfo
* Modified version in pom.xml in VectorPairHMM from 3.1 to 3.2
* Deleted some unnecessary code
* Modified C++ sandbox to print per interval timing
2014-05-02 19:12:48 -04:00
Valentin Ruano-Rubio d563072282 Fix for CombineGVCFs and GenotypeGVCFs recurrent exception about missing PLs
Story:

  https://www.pivotaltracker.com/story/show/68220438

Changes:

   - PL-less input genotypes are now uncalled and so non-variant sites when combining GVCFs.
   - HC GVCF/BP_RESOLUTION Mode now outputs non-variant sites in sites covered by deletions.
   - Fixed existing tests

Test:

   - HaplotypeCallerGVCFIntegrationTest
   - ReferenceConfidenceModelUnitTest
   - CombineGVCFsIntegrationTest
2014-05-02 09:21:06 -04:00
Ryan Poplin 41d3069213 When we subset PLs because Alleles are removed during genotyping we also need to subset AD. 2014-04-28 15:52:26 -04:00
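A minimal Python sketch (illustrative only, not GATK code) of subsetting AD and PL together when alleles are dropped, for the diploid case. It relies on the VCF ordering of diploid genotype likelihoods, where genotype j/k with j <= k sits at index k*(k+1)/2 + j. The function names are hypothetical, and shifting the surviving PLs so that the minimum is 0 is an assumption of this sketch.

```python
def pl_index(j, k):
    """Position of diploid genotype j/k in a VCF PL array."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def subset_ad_and_pl(ad, pl, kept):
    """Keep only the alleles in `kept` (original indices, reference first)."""
    new_ad = [ad[i] for i in kept]
    new_pl = []
    # Walk the new genotypes in VCF order and look each one up by its
    # original allele indices.
    for k_new, k_old in enumerate(kept):
        for j_new in range(k_new + 1):
            new_pl.append(pl[pl_index(kept[j_new], k_old)])
    lowest = min(new_pl)
    return new_ad, [p - lowest for p in new_pl]


# Alleles: 0 = ref, 1 = alt1, 2 = alt2. PL order: 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
ad = [10, 5, 1]
pl = [20, 0, 50, 60, 70, 90]
print(subset_ad_and_pl(ad, pl, kept=[0, 1]))
print(subset_ad_and_pl(ad, pl, kept=[0, 2]))
```

Deriving both arrays from the same `kept` list is what keeps AD and PL describing the same alleles after subsetting.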
Ryan Poplin 06dbe74a23 Merge pull request #609 from kcibul/kc_cancersimreads
extended SimulateReadsForVariants to optionally use the AF field to indi...
2014-04-28 13:31:56 -04:00
Ami Levy-Moonshine 13dd755468 create a new read transformer that refactors NDN cigar elements to one N element.
story:
https://www.pivotaltracker.com/story/show/69648104

description:
This read transformer will refactor cigar strings that contain N-D-N elements to one N element (with total length of the three refactored elements).
This is intended primarily for users of RNA-Seq data handling programs such as TopHat2.
Currently we consider that the internal N-D-N motif is illegal and we error out when we encounter it. By refactoring the cigar string of
those specific reads, users of TopHat and other tools can circumvent this problem without affecting the rest of their dataset.

edit: address review comments - change the tool's name and change the tool to be a readTransformer instead of read filter
2014-04-28 11:29:00 -04:00
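The N-D-N collapse described in the commit above can be sketched on a CIGAR string in Python. This is an illustration only, not the GATK read transformer; the function name `merge_ndn` is hypothetical, and collapsing chained motifs (such as N-D-N-D-N) into a single N is an assumption of this sketch.

```python
import re


def merge_ndn(cigar):
    """Replace every N-D-N run with one N element of the summed length."""
    elements = [(int(length), op)
                for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    merged = []
    for element in elements:
        merged.append(element)
        # Collapse as soon as the last three elements spell N, D, N.
        if len(merged) >= 3 and [op for _, op in merged[-3:]] == ["N", "D", "N"]:
            total = sum(length for length, _ in merged[-3:])
            merged[-3:] = [(total, "N")]
    return "".join(f"{length}{op}" for length, op in merged)


print(merge_ndn("10M100N5D200N20M"))
print(merge_ndn("50M"))
```

Because the merged element has the total length of the three it replaces, the reference span of the read is unchanged.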
Ryan Poplin 221b999cb0 GenotypeGVCF was pulling the headers from all input rods including DBsnp. Now it pulls from just the input variant rods. 2014-04-25 13:16:28 -04:00
Laura Gauthier 9f3cbb2ef1 Improvements to CalculateGenotypePosteriors and CalibrateGenotypeLikelihoods
CalculateGenotypePosteriors now only computes posterior probs for SNP sites with SNP priors
(other sites have flat priors applied)

CalibrateGenotypeLikelihoods had originally applied HOM_REF/HET/HOM_VAR frequencies in callset as priors before empirical quality analysis. Now has option (-noPriors) to not apply/apply flat priors. Also takes in new external probabilities files, such as those generated by CGP, from which the genotype posterior probability qualities will be read.

Integration test was changed to account for new SNP-only behavior and default behavior to not use missing priors.

(Also, new numRefIfMissing is 0, which should only matter in cases using few samples when you probably don't want to be doing that anyway!)
2014-04-24 08:49:42 -04:00
Valentin Ruano-Rubio e610373169 Fixed integration test problems from previous premature merge 2014-04-20 17:11:51 -04:00
Valentin Ruano-Rubio 4e5850966a Reengineer engine constructors 2014-04-19 17:58:14 -04:00
Valentin Ruano-Rubio 7455ac9796 Addressed revisions 2014-04-19 16:48:48 -04:00
Kristian Cibulskis 6b9e38c8bb incorporated comments from review, made variables final, made AF parameter hidden, and added bounds checking to AF value 2014-04-16 19:29:25 -04:00
Kristian Cibulskis 7115cadbd8 extended SimulateReadsForVariants to optionally use the AF field to indicate allele fraction of the simulated event, useful in cancer and other variable ploidy use cases 2014-04-16 16:20:02 -04:00
Valentin Ruano-Rubio 08203b516e Disentangle UG and HC Genotyper engines.
Description:

  Transforms a delegation dependency from HC to UG genotyping engine into reuse by inheritance, where HC and UG engines inherit from a common superclass GenotyperEngine
  that implements the common parts. As a side-effect, some of the code is now clearer and redundant code has been removed.

  Changes have a few consequences for the end user. HC now has a few more user arguments, those that control the functionality that HC was borrowing directly from UGE.

     Added -ploidy argument although it is constrained to be 2 for now.
     Added -out_mode EMIT_ALL_SITES|EMIT_VARIANTS_ONLY ...
     Added -allSitePLs flag.

Stories:

   https://www.pivotaltracker.com/story/show/68017394

Changes:

   - Moved (HC's) GenotyperEngine to HaplotypeCallerGenotyperEngine (HCGE). Then created an engine superclass GenotypingEngine (GE) that contains the common parts between HCGE and the UG counterpart 'UnifiedGenotypingEngine' (UGE). Simplified the code and applied the template pattern to accommodate small differences in behaviour between both caller
   engines. (There is still room for improvement though).

   - Moved inner classes and enums to top-level components for various reasons including making them shorter and simpler names to refer to them.

   - Created a HomoSapiens class for human-specific constants; even if they are good defaults for most users, we need to clearly identify the human assumption across the code if we want to make
   GATK work with any species in general; i.e. any reference to HomoSapiens, except as a default value for a user argument, should smell.

   - Fixed a bug deep in the genotyping calculation: we were taking fixed values for snp and indel heterozygosity (the human defaults), ignoring user arguments.

   - Renamed GenotypingLikehooldCalculationCModel.Model to Gen.*Like.*Calc.*Model.Name; not a definitive solution though, as names are often used in conditionals that perhaps should be member methods of the
     GenLikeCalc classes.

   - Renamed LikelihoodCalculationEngine to ReadLikelihoodCalculationEngine to distinguish them clearly from Genotype likelihood calculation engines.

   - Changed copying by explicit argument listing to a clone/reflection solution for casting between genotyper argument collection classes.

   - Created GenotypeGivenAllelesUtils to collect methods needed nearly exclusively by the GGA mode.

Tests :

    - StandardCallerArgumentCollectionUnitTest (check copy by cloning/reflection).
    - All existing integration and unit tests for modified classes.
2014-04-13 03:09:55 -04:00
Khalid Shakir a6b0754990 After comments from @nh13, updated latest picard and setMateInfo call. 2014-04-08 15:22:45 -04:00
Khalid Shakir 3047d6ff32 BQSRGatherer handles missing read groups from some input files. [#68720468] 2014-04-08 23:58:54 +08:00
Eric Banks ad336375dc Merge pull request #590 from broadinstitute/vrr_validate_variants_unused_alleles_fix
Addresses issue with strict validation on GVCF files.
2014-04-07 22:10:49 -04:00
Valentin Ruano-Rubio 5afcc8e05f Change in the command line interface of ValidateVariants.
Following reviewers' comments, the command line interface has been simplified.
All extra strict validations are performed by default (as before) and the
user has to indicate which ones he/she does not want to use with --validationTypeToExclude.

Previously he/she could restrict validation to selected types with --validationType, but that option has been removed.

Stories:

    - https://www.pivotaltracker.com/story/show/68725164

Changes:

    - Removed validateType argument.
    - Improved documentation.
    - Added warning log messages on suspicious argument combinations.

Tests:

    - ValidateVariantsIntegrationTest#*
2014-04-07 16:27:11 -04:00
Ryan Poplin f058224b3e Adding GenotypeSummaries as INFO field annotations.
-- This is needed so the ref model pipeline can cut down to sites-only files without losing these useful statistics.
-- Added new unit test to test this info field annotation.
-- GenotypeGVCF integration tests change because new annotations are present in the output
2014-04-06 11:50:10 -04:00
MauricioCarneiro 84861fa10a Merge pull request #587 from broadinstitute/eb_actually_fail_on_reduced_bams
Make sure to fail in all cases where the BAM being used was created by ReduceReads.
2014-04-04 17:27:57 -04:00
Valentin Ruano-Rubio 18deeec6b0 Addresses issue with strict validation on GVCF files.
More concretely, Picard's strict VCF validation rejects alternative alleles that do not participate in any genotype call across samples.

This is an issue with GVCFs in the single-sample pipeline, where this is certainly expected with <NON_REF> and other relatively unlikely alleles.

To solve this issue we allow the user to exclude some of the strict validations using a new argument --validationTypeToExclude. In order to avoid the validation
issue with GVCF the user needs to add the following to the command line: '--validationTypeToExclude ALLELES'

Story:

    https://www.pivotaltracker.com/story/show/68725164

Changes:

    - Added validateTypeToExclude argument to ValidateVariants walker.
    - Implemented the selective exclusion of validation types.
    - Added new info and improved existing documentation of the ValidateVariants walker.

Tests:

    - ValidateVariantsIntegrationTest#testUnusedAlleleError
    - ValidateVariantsIntegrationTest#testUnusedAlleleFix
2014-04-04 14:37:10 -04:00
Eric Banks 7174f8cfeb IndelRealigner throws a user error when it encounters reads with I operators greater than the number of read bases.
Added test to ensure it works.
2014-04-03 18:16:24 -04:00
Eric Banks a3d55b3341 Make sure to fail in all cases where the BAM being used was created by ReduceReads.
In some cases, the program records were being removed from the BAM headers by the GATK engine
before we applied the check for reduced reads (so we did not fail appropriately).  Pushed up the
check to happen before the PG tags are modified and added a unit test to ensure it stays that way.
It turns out that some UG tests still used reduced bams so I switched to use different ones.

Based on reviewer feedback, made it more generic so that it's easy to add new unsupported tools.
2014-04-03 16:52:41 -04:00
Eric Banks 0b73573abc Slightly modifying the way to use the IUPAC ambiguity codes in the FastaAlternateReferenceMaker.
Previously it required you to create a single sample VCF and then to pass that in to the tool, but
Geraldine convinced me that this was a pain for users (because they usually have multi-sample VCFs).
Instead now you can pass in a multi-sample VCF and specify which sample's genotypes should be used
for the IUPAC encoding.  Therefore the argument changed from '--useIUPAC' to '--use_IUPAC_sample NA12878'.
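
A minimal sketch of the IUPAC encoding idea for a single diploid genotype. The two-base ambiguity codes are the standard IUPAC nucleotide codes; the class and method names are hypothetical and this is not the FastaAlternateReferenceMaker implementation.

```java
import java.util.Map;

// Hypothetical helper, for illustration only.
final class IupacSketch {
    // Standard IUPAC ambiguity codes for two-base combinations (keys sorted alphabetically).
    private static final Map<String, Character> HET_CODES = Map.of(
            "AC", 'M', "AG", 'R', "AT", 'W',
            "CG", 'S', "CT", 'Y', "GT", 'K');

    /** Returns the base itself for a hom genotype, or the ambiguity code for a het genotype. */
    static char encode(char allele1, char allele2) {
        char a = Character.toUpperCase(allele1);
        char b = Character.toUpperCase(allele2);
        if (a == b) {
            return a; // homozygous: nothing to encode
        }
        String key = a < b ? "" + a + b : "" + b + a;
        Character code = HET_CODES.get(key);
        if (code == null) {
            throw new IllegalArgumentException("Not a pair of A/C/G/T bases: " + key);
        }
        return code;
    }
}
```

For example, `IupacSketch.encode('A', 'G')` yields `R` and `IupacSketch.encode('T', 'C')` yields `Y`.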
2014-04-02 21:34:25 -04:00
Valentin Ruano-Rubio 84711b8e90 Fixed bug using GraphBased due to infinite likelihoods resulting from the calculation of the alignment cost of very long insertions or deletions (done in linear scale)
Stories:

  https://www.pivotaltracker.com/story/show/66263868

Bug:

  The problem was due to the way we were calculating the fixed penalty of a large deletion or insertion. In this case we calculate the alignment likelihood of the deleted or inserted portion
  of the read or haplotype as the penalty of that deletion/insertion without going through the full pair-hmm process. For large events this resulted in a 0
  in linear scale computations, which is transformed into an infinity in log scale.

Changes:

  - Changed to use log10 scale to calculate those penalties.
  - Minor addition of .gitignore to hide ./public/external-example/target which is generated by the building process.
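
The underflow described under "Bug" above can be reproduced with a small sketch. It assumes, purely for illustration, that the penalty is a product of identical per-base probabilities; the class and method names are hypothetical and the actual GATK penalty model is not shown in this commit message.

```java
// Hypothetical illustration of linear-scale underflow versus log10-scale accumulation.
final class Log10PenaltySketch {
    /** Multiplies in linear scale, then converts to log10. Underflows for long events. */
    static double linearThenLog10(double perBaseProbability, int eventLength) {
        double product = 1.0;
        for (int i = 0; i < eventLength; i++) {
            product *= perBaseProbability; // eventually becomes exactly 0.0
        }
        return Math.log10(product); // log10(0.0) is -Infinity
    }

    /** Accumulates directly in log10 scale, so the result stays finite. */
    static double log10Scale(double perBaseProbability, int eventLength) {
        return eventLength * Math.log10(perBaseProbability);
    }
}
```

With a per-base probability of 1e-4 and an event length of 100, the linear-scale product is smaller than the smallest positive double, so `linearThenLog10` returns negative infinity, while `log10Scale` returns a finite value of about -400.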
2014-04-01 16:14:52 -04:00
Eric Banks 821fbe7260 Merge pull request #582 from broadinstitute/vrr_hc_bugfixes_dangling_heads
Fix loss of key alternative haplotypes due to a change on threading star...
2014-03-31 10:42:08 -04:00
Joel Thibault 2049eb1658 Rev Picard 1.110.1763
- SamPairUtils migrated in Picard r1737
- Revert IndelRealigner changes made in commit 4f4b85
-- Those changes were based on Picard revision 1722 to net/sf/picard/sam/SamPairUtil.java
-- Picard revision 1723 reverts these changes, so we also revert to match
2014-03-30 09:33:57 -04:00
Valentin Ruano-Rubio 258b2bce28 Fix loss of key alternative haplotypes due to a change on threading start policy required when recovering dangling heads.
Story:

  - https://www.pivotaltracker.com/story/show/67601310

Change:

  - Unless recover-dangling-heads is active, the threading start location policy is the original one, i.e. just at already existing unique kmer vertices.

Tests:

  - HaplotypeCallerIntegrationTest#testMissingKeyAlternativeHaplotypesBugFix
2014-03-29 22:40:26 -04:00
Ryan Poplin 6566dd6ca9 Fix for dropping of reference sample depth in the DP annotation.
-- In the case of hierarchical merge we can't assume that we have only one genotype.
-- Removed use of deprecated VC annotation access functions.
2014-03-24 14:01:50 -04:00
Eric Banks 32a96e3ab3 Fix for reads that are all insertions (e.g. 50I) and causing the IndelRealigner to error out. 2014-03-21 15:01:34 -04:00
Eric Banks 7c8ce3cd6a Several improvements to GenotypeGVCFs: --includeNonVariantSites now actually works and we propagate AD to hom ref samples 2014-03-20 00:35:54 -04:00
Eric Banks 824983af1d Enable CombineGVCFs to process gVCFs that were created with basepair resolution. 2014-03-19 19:23:05 -04:00
Eric Banks 3b1c337401 Have CombineVariants throw a UserError when trying to combine GVCFs from the HaplotypeCaller.
Was previously throwing an IllegalArgumentException (in the wrong place in the code).
Error message tells users to use CombineGVCFs.
2014-03-19 19:11:40 -04:00
Valentin Ruano-Rubio 905b6066b2 Reduce runtime of very long integration test 2014-03-18 21:48:13 -04:00
David Roazen 2d8653f493 Update pom versions to mark the start of GATK 3.2 development 2014-03-18 01:18:59 -04:00
David Roazen a6a41c777c Update pom versions for 3.1 2014-03-18 01:09:29 -04:00
Alec Wysoker 0369f93b24 GATK changes to conform to Tribble refactoring as part of improving Tabix support in Tribble (among other things).
1. Enable on-the-fly indexing for vcf.gz.
2. Handle on-the-fly indexing where file to be indexed is not a regular file, thus index should not be created.
3. Add method setProgressLogger to all SAMFileWriter implementations.
4. Revved picard to 1.109.1722
5. IndelRealigner md5s change because the MC tag is added to records now.

Fixed up and signed off by ebanks.
2014-03-17 11:56:22 -04:00
Eric Banks 34c697bf12 Merge pull request #554 from broadinstitute/bh_SOR_new_annotation
Bh sor new annotation
2014-03-17 10:58:13 -04:00
Laura Gauthier 40c13d446a Added documentation category for CalculateGenotypePosteriors 2014-03-17 10:36:19 -04:00
Valentin Ruano-Rubio 2e964c59b4 Improved criteria to select best haplotypes out from the assembly graph.
Currently the best haplotypes are those that accumulate the largest ABSOLUTE edge *multiplicity* sum across their path in the assembly graph.

The edge *multiplicity* is equal to the number of reads that expand through that edge, i.e. have a kmer that uniquely maps to some vertex upstream from the edge and whose following base calls extend across that edge to vertices downstream from it.

Although higher multiplicities obviously correlate with haplotype probability, this criterion falls short in some regards, of which the most relevant is:

As it is evaluated in the condensed seq-graph (as opposed to uncompressed read-threading graphs), it is biased towards haplotypes that have more short-sequence vertices
  ( -> ATGC -> CA -> has worse score than -> A -> T -> G -> C -> C -> A ->). This is partly a result of how we modify the edge multiplicities when we merge vertices from a linear chain.

This pull-request addresses the problem by changing to a new scoring schema based on likelihood estimates:

Each haplotype's likelihood can be calculated as the product of the likelihoods of "taking" its edges in the assembly graph. The likelihood of "taking" an edge in the assembly
graph is calculated as its multiplicity divided by the sum of the multiplicities of the edges that share the same source vertex.
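
The scoring rule just described can be sketched as follows. This is an illustration under stated assumptions, not the KBestHaplotypeFinder code: the class and method names are hypothetical, the graph is reduced to plain arrays, and the product is accumulated in log10 scale (a choice made here to avoid underflow on long paths).

```java
// Hypothetical sketch of likelihood-based haplotype scoring.
final class HaplotypeScoreSketch {
    /** Likelihood of taking an edge: its multiplicity over the total leaving the same source vertex. */
    static double edgeLikelihood(int multiplicity, int[] outgoingMultiplicities) {
        long total = 0;
        for (int m : outgoingMultiplicities) {
            total += m;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("source vertex has no outgoing multiplicity");
        }
        return (double) multiplicity / total;
    }

    /**
     * log10 of the product of edge likelihoods along a path.
     * takenEdge[i] is the multiplicity of the edge taken at step i;
     * outgoingPerVertex[i] lists the multiplicities of all edges leaving that step's source vertex.
     */
    static double log10PathLikelihood(int[] takenEdge, int[][] outgoingPerVertex) {
        double sum = 0.0;
        for (int i = 0; i < takenEdge.length; i++) {
            sum += Math.log10(edgeLikelihood(takenEdge[i], outgoingPerVertex[i]));
        }
        return sum;
    }
}
```

For a path that takes an edge of multiplicity 30 at a vertex with outgoing edges {30, 10} and then an edge of multiplicity 20 at a vertex with outgoing edges {20, 20}, the likelihood is 0.75 * 0.5 = 0.375. A vertex with a single outgoing edge contributes a factor of 1, so splitting a linear chain into more vertices does not change the score.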

This pull-request addresses the following stories:

https://www.pivotaltracker.com/story/show/66691418
https://www.pivotaltracker.com/story/show/64319760

Change Summary:

1. Change to the new scoring schema.
2. Added a graph DOT printing code to KBestHaplotypeFinder in order to diagnose scoring.
3. Graph transformations have been modified in order to generate no 0-multiplicity edges. (Nevertheless the schema above should work with 0-multiplicity edges, assuming that they are in fact 0.5.)
2014-03-14 18:37:01 -04:00
Bertrand Haas 82108d110f New abstract class StrandBiasTest() with old sub-class FisherStrand() and new sub-class StrandOddsRatio(). The latter is a test based on the symmetric odds ratio, which is more appropriate than the Fisher exact test when the number of samples is large.
https://www.pivotaltracker.com/story/show/66087886
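
The commit message does not spell out the statistic. One common way to make an odds ratio symmetric is to add it to its reciprocal, which is sketched below for a 2x2 table of reference/alternate allele counts by strand. The class name, the method name and the pseudocount of 1 are assumptions made for this illustration and may differ from what StrandOddsRatio actually computes.

```java
// Hypothetical sketch of a symmetric odds ratio on a 2x2 strand table.
final class SymmetricOddsRatioSketch {
    /**
     * Returns R + 1/R, where R = (refFwd * altRev) / (refRev * altFwd).
     * A pseudocount of 1 is added to every cell (an assumption) to avoid division by zero.
     */
    static double symmetricOddsRatio(int refFwd, int refRev, int altFwd, int altRev) {
        double a = refFwd + 1.0;
        double b = refRev + 1.0;
        double c = altFwd + 1.0;
        double d = altRev + 1.0;
        double ratio = (a * d) / (b * c);
        return ratio + 1.0 / ratio; // minimum value 2.0 when there is no strand bias
    }
}
```

A balanced table such as (10, 10, 10, 10) gives the minimum value 2.0, and swapping the forward and reverse strands leaves the result unchanged, which is the symmetry property the name refers to.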
2014-03-14 18:33:21 -04:00
Eric Banks 7c7ff90266 Merge pull request #558 from broadinstitute/rp_vqsr_nondeterminism_fix
Fix for non-determinism in the VQSR with very large data sets
2014-03-12 14:35:51 -04:00
Eric Banks ffaf92f871 Added new functionality to the FastaAlternateReferenceMaker to have it output IUPAC codes for het sites.
Enable it with the new --useIUPAC argument.
Added both unit and integration tests for the new functionality - and fixed up the
existing tests once I was in there.
2014-03-12 14:31:57 -04:00
Ryan Poplin 907d1d6160 Fix for non-determinism in the VQSR with very large data sets 2014-03-12 10:25:12 -04:00
ldgauthier 4e74e77e74 Merge pull request #555 from broadinstitute/eb_add_option_to_CGVCFs_for_all_sites_GVCF
Added an option to CombineGVCFs to create basepair resolution gVCFs from...
2014-03-12 10:01:18 -04:00
David Roazen c67ced5f3b Emit a warning whenever the VectorLoglessPairHMM is used 2014-03-12 09:55:35 -04:00
Eric Banks d697e0144f Added an option to CombineGVCFs to create basepair resolution gVCFs from banded ones.
Use the --convertToBasePairResolution argument to enable this functionality.
2014-03-12 01:32:51 -04:00
Ryan Poplin 34d11fe40c Added the consensus mode used for the 1000 Genomes Project to the HaplotypeCaller.
-- All the provided alleles are added to the assembly graph as potential haplotypes but they aren't forcibly genotyped like in GGA mode.
-- Added integration test for this mode
2014-03-11 09:56:35 -04:00
droazen 8b53567dc7 Merge pull request #553 from broadinstitute/dr_rename_pipeline_tests
Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests
2014-03-10 21:36:45 -04:00
David Roazen 78562c14bb Rename existing PipelineTests to QueueTests to prepare for upcoming push of new pipeline tests
-These tests are really integration tests for Queue rather than generalized
 pipeline tests, so it makes sense to call them QueueTests.

-Rename test classes and maven build targets, and update shell scripts
 to reflect new naming.
2014-03-10 21:24:03 -04:00
David Roazen 7c34f05082 Merge remote-tracking branch 'origin/master' into intel 2014-03-10 14:07:36 -04:00
David Roazen 5a6aa54673 Revert "Update HaplotypeCaller and VariantAnnotator test MD5s"
This reverts commit 7faa44d576b06d7aef29562e82590a7855f216f4.
2014-03-10 14:06:51 -04:00
David Roazen e7d6db033b Revert "Revert "Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING""
This reverts commit c8a34749e631b92214a57bba162c6e0d849425f1.
2014-03-10 14:05:51 -04:00
David Roazen f070583f29 Update HaplotypeCaller and VariantAnnotator test MD5s
There are a few innocuous test failures on this branch --
updating MD5s after reviewing the differences in output
2014-03-07 10:54:27 -05:00
Karthik Gururaj 6e98e9e589 Removed g_haplotype* global variables in native code so that it works
with multi-threading in Java.
Modified VectorLoglessPairHMM.java so that jniInitializeRegion and
jniFinalizeRegion are empty
2014-03-06 22:08:35 -08:00
David Roazen 3f3df90412 Revert "Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING"
This reverts commit cef03f089fb3f131f3a77664b71feaec51a74cc8.
2014-03-06 10:15:35 -05:00
David Roazen 9df59bd8cc Update pom versions to mark the start of GATK 3.1 development 2014-03-06 00:05:58 -05:00
David Roazen 34edcb8ddf Update pom versions for the 3.0 release 2014-03-05 23:37:21 -05:00
David Roazen 53895e15cd Change default HaplotypeCaller PairHMM implementation back to LOGLESS_CACHING 2014-03-05 19:26:37 -05:00
Eric Banks d3de6413c9 Move warnings to debug logging status because they will definitely scare users 2014-03-05 15:05:21 -05:00
Karthik Gururaj 51b8ea5d59 Reset version 2014-03-05 11:19:08 -08:00
Karthik Gururaj b9afe800ae Merge correction 2014-03-05 10:06:45 -08:00
Karthik Gururaj 8fcbf9272c Merge branch 'intel_pairhmm' of /data/broad/gsa-unstable into intel_pairhmm
Conflicts:
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java
	public/VectorPairHMM/src/main/c++/Sandbox.java
2014-03-05 09:35:50 -08:00
Intel Repocontact d81116eb1d Added vectorized PairHMM implementation by Mohammad and Mustafa into the Maven build of GATK.
C++ code has PAPI calls for reading hardware counters

Followed Khalid's suggestion for packing libVectorLoglessCaching into
the jar file with Maven

Native library part of git repo

1. Renamed directory structure from public/c++/VectorPairHMM to
public/VectorPairHMM/src/main/c++ as per Khalid's suggestion
2. Use java.home in public/VectorPairHMM/pom.xml to pass environment
variable JRE_HOME to the make process. This is needed because the
Makefile needs to compile JNI code with the flag -I<JRE_HOME>/../include (among
others). Assuming that the Maven build process uses a JDK (and not just
a JRE), the variable java.home points to the JRE inside maven.
3. Dropped all pretense at cross-platform compatibility. Removed Mac
profile from pom.xml for VectorPairHMM

Moved JNI_README

1. Added the catch UnsatisfiedLinkError exception in
PairHMMLikelihoodCalculationEngine.java to fall back to LOGLESS_CACHING
in case the native library could not be loaded. Made
VECTOR_LOGLESS_CACHING as the default implementation.
2. Updated the README with Mauricio's comments
3. baseline.cc is used within the library - if the machine supports
neither AVX nor SSE4.1, the native library falls back to un-vectorized
C++ in baseline.cc.
4. pairhmm-1-base.cc: This is not part of the library, but is being
heavily used for debugging/profiling. Can I request that we keep it
there for now? In the next release, we can delete it from the
repository.
5. I agree with Mauricio about the ifdefs. I am sure you already know,
but just to reassure you the debug code is not compiled into the library
(because of the ifdefs) and will not affect performance.
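
The fallback described in item 1 above can be sketched as follows. This is a minimal illustration, not the PairHMMLikelihoodCalculationEngine code: the class and method names are hypothetical, and System.err stands in for the logger so that the example is self-contained. The two implementation names are the ones mentioned in this commit message.

```java
// Hypothetical sketch of falling back when a native library cannot be loaded.
final class PairHmmFallbackSketch {
    enum Implementation { VECTOR_LOGLESS_CACHING, LOGLESS_CACHING }

    /** Tries to load the native library; falls back to the pure-Java implementation on failure. */
    static Implementation choose(String nativeLibraryName) {
        try {
            System.loadLibrary(nativeLibraryName);
            return Implementation.VECTOR_LOGLESS_CACHING;
        } catch (UnsatisfiedLinkError e) {
            // The library is missing or not loadable on this platform.
            System.err.println("WARN: could not load native library '" + nativeLibraryName
                    + "'; falling back to LOGLESS_CACHING");
            return Implementation.LOGLESS_CACHING;
        }
    }
}
```

Calling `PairHmmFallbackSketch.choose` with a library name that is not on `java.library.path` returns `LOGLESS_CACHING` instead of letting the error propagate.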

1. Changed logger.info to logger.warn in PairHMMLikelihoodCalculationEngine.java
2. Committing the right set of files after rebase

Added public license text to all C++ files

Added license to Makefile

Add package info to Sandbox.java

Conflicts:
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/gatk/walkers/haplotypecaller/PairHMMLikelihoodCalculationEngine.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/DebugJNILoglessPairHMM.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/JNILoglessPairHMM.java
	protected/gatk-protected/src/main/java/org/broadinstitute/sting/utils/pairhmm/VectorLoglessPairHMM.java
	public/VectorPairHMM/src/main/c++/.gitignore
	public/VectorPairHMM/src/main/c++/LoadTimeInitializer.cc
	public/VectorPairHMM/src/main/c++/LoadTimeInitializer.h
	public/VectorPairHMM/src/main/c++/Makefile
	public/VectorPairHMM/src/main/c++/Sandbox.cc
	public/VectorPairHMM/src/main/c++/Sandbox.h
	public/VectorPairHMM/src/main/c++/Sandbox.java
	public/VectorPairHMM/src/main/c++/Sandbox_JNIHaplotypeDataHolderClass.h
	public/VectorPairHMM/src/main/c++/Sandbox_JNIReadDataHolderClass.h
	public/VectorPairHMM/src/main/c++/baseline.cc
	public/VectorPairHMM/src/main/c++/define-double.h
	public/VectorPairHMM/src/main/c++/define-float.h
	public/VectorPairHMM/src/main/c++/define-sse-double.h
	public/VectorPairHMM/src/main/c++/define-sse-float.h
	public/VectorPairHMM/src/main/c++/headers.h
	public/VectorPairHMM/src/main/c++/jnidebug.h
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.cc
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_DebugJNILoglessPairHMM.h
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.cc
	public/VectorPairHMM/src/main/c++/org_broadinstitute_sting_utils_pairhmm_VectorLoglessPairHMM.h
	public/VectorPairHMM/src/main/c++/pairhmm-template-kernel.cc
	public/VectorPairHMM/src/main/c++/pairhmm-template-main.cc
	public/VectorPairHMM/src/main/c++/run.sh
	public/VectorPairHMM/src/main/c++/shift_template.c
	public/VectorPairHMM/src/main/c++/utils.cc
	public/VectorPairHMM/src/main/c++/utils.h
	public/VectorPairHMM/src/main/c++/vector_function_prototypes.h
2014-03-05 09:30:29 -08:00
Laura Gauthier 43fdd38342 Add error handling to CalculateGenotypePosteriors to catch multiallelic variants with wrong number of ACs
-- throws UserException; added tests in PosteriorLikelihoodsUtilsUnitTests
Add error handling to CalculateGenotypePosteriors for cases where MLEAC>AN; add tests in PosteriorLikelihoodsUtilsUnitTests
Add unit tests to confirm that CalculateGenotypePosteriors has the ability to switch genotypes for four cases
2014-03-05 12:03:18 -05:00
Laura Gauthier 7f9f58dbd1 Added hidden flag to GenotypeConcordance to output sites of discordant genotypes (to System.out)
Revised ConcordanceMetrics tests to adapt to change
Added comments to PosteriorLikelihoodsUtils
2014-03-05 12:03:18 -05:00