Commit Graph

13375 Commits (9f3cbb2ef10fd1449d53906cbe7e458fdc57dca5)

Author SHA1 Message Date
Laura Gauthier 9f3cbb2ef1 Improvements to CalculateGenotypePosteriors and CalibrateGenotypeLikelihoods
CalculateGenotypePosteriors now only computes posterior probs for SNP sites with SNP priors
(other sites have flat priors applied)

CalibrateGenotypeLikelihoods originally applied the HOM_REF/HET/HOM_VAR frequencies in the callset as priors before empirical quality analysis. It now has an option (-noPriors) to skip those priors and apply flat priors instead. It also takes in new external probabilities files, such as those generated by CGP, from which the genotype posterior probability qualities will be read.

Integration test was changed to account for new SNP-only behavior and default behavior to not use missing priors.

(Also, new numRefIfMissing is 0, which should only matter in cases using few samples when you probably don't want to be doing that anyway!)
2014-04-24 08:49:42 -04:00
amilev 92a3aa35d5 Merge pull request #613 from broadinstitute/ami-RNAEdttingTool
create a new tool CountMutationTypes
2014-04-23 17:17:02 -04:00
Ami Levy-Moonshine 9e5333f1d1 create a new tool CountMutationTypes
The new tool takes a VCF file as input and creates a GATK report with the percentage of each mutation type (e.g. A->G, A->T...).
It allows the user to filter the sites that are counted based on JEXL or on the variant quals.
A user can also print 12 VCF files (one for each mutation type) with the VCF lines of the mutations that were counted.
2014-04-23 14:22:33 -04:00
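The tally described in commit 9e5333f1d1 can be illustrated with a minimal Python sketch. It is not the CountMutationTypes implementation: the function names are hypothetical, it reads plain tab-separated VCF text, and it counts only single-base substitutions between A, C, G and T (which gives the 12 mutation types mentioned above), ignoring the JEXL and qual filtering.

```python
from collections import Counter

BASES = "ACGT"


def count_mutation_types(vcf_lines):
    """Tally single-base REF->ALT substitutions from VCF text lines (hypothetical helper)."""
    counts = Counter()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        ref = fields[3].upper()
        for alt in fields[4].upper().split(","):
            # Keep only SNP-like records: one base changed into a different base.
            if len(ref) == 1 and len(alt) == 1 and ref in BASES and alt in BASES and ref != alt:
                counts[f"{ref}->{alt}"] += 1
    return counts


def mutation_percentages(counts):
    """Convert raw counts into percentages of all counted mutations."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mutation: 100.0 * n / total for mutation, n in counts.items()}
```

Calling `mutation_percentages(count_mutation_types(open(path)))` on a small VCF gives one percentage per observed mutation type; indels and multi-base alleles are simply skipped in this sketch.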
droazen 58c8b2dd84 Merge pull request #611 from broadinstitute/mm_otf_sample_rename_support_whitespacing_sample_names
Allow for whitespace in sample names when performing on-the-fly sample-renaming.
2014-04-22 13:01:15 -04:00
Michael McCowan 8290d3c8ac Allow for non-tab whitespace in sample names when performing on-the-fly sample-renaming. 2014-04-22 11:07:13 -04:00
Valentin Ruano Rubio d38835822e Merge pull request #612 from broadinstitute/vrr_integration_test_error_quickfix
Fixed integration test problems from previous premature merge
2014-04-20 18:40:22 -04:00
Valentin Ruano-Rubio e610373169 Fixed integration test problems from previous premature merge 2014-04-20 17:11:51 -04:00
MauricioCarneiro f03e5ffeb1 Merge pull request #604 from broadinstitute/vrr_hc_omniploidy_general_api
Disentangle UG and HC Genotyper engines.
2014-04-20 07:43:23 -04:00
Valentin Ruano-Rubio 4e5850966a Reengineer engine constructors 2014-04-19 17:58:14 -04:00
Valentin Ruano-Rubio 7455ac9796 Addressed revisions 2014-04-19 16:48:48 -04:00
Ryan Poplin a9a48f2459 Merge pull request #607 from broadinstitute/mm_bugfix_raise_mathutils_n_ceiling
Support more samples in math utilities.
2014-04-17 13:32:34 -04:00
jmthibault79 b840cf6b3f Merge pull request #610 from broadinstitute/jt_block_compressed_vcfs
Enable reading of other extensions for block-compressed VCFs
2014-04-17 12:32:49 -04:00
Joel Thibault 1ab50f4ba8 CatVariants now handles BCF and Block-Compressed VCF
[Delivers #67461500]
2014-04-17 12:31:38 -04:00
Joel Thibault 4c74319578 Update for Picard refactoring which improves block-compressed VCF reading
[Delivers #69215404]
2014-04-16 14:39:23 -04:00
Joel Thibault fd09cb7143 Rev Picard 1.111.1920 2014-04-16 14:39:19 -04:00
Joel Thibault f98df5c071 Integration test for the file extensions CatVariants should handle 2014-04-16 13:25:47 -04:00
Joel Thibault bdd7024d00 Integration test for block-compressed VCF reading 2014-04-16 13:09:40 -04:00
Joel Thibault ce770b032a Move execAndCheck() to ProcessController 2014-04-16 13:09:40 -04:00
Joel Thibault b197618d13 This comment is no longer true 2014-04-15 15:42:39 -04:00
MauricioCarneiro 34ece31f4a Merge pull request #605 from broadinstitute/ks_escape_dir_names
Quoting -out parameter during resource bundle creation
2014-04-15 05:56:35 -04:00
Khalid Shakir 218fe3875a Quoting -out parameter during resource bundle (StingText.properties) creation.
Fixes the case where the directory name contains parentheses, like "Dropbox (Broad Dropbox1)".
2014-04-15 17:06:49 +08:00
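Why the quoting in commit 218fe3875a matters can be shown with a small Python sketch. This is not the build code from the commit: the `create_bundle` tool name and the path are hypothetical, and `shlex.quote` stands in for whatever quoting the build actually uses.

```python
import shlex


def bundle_command(tool, out_path):
    """Build a shell command line with the -out value safely quoted."""
    return f"{tool} -out {shlex.quote(out_path)}"


out_path = "/Users/me/Dropbox (Broad Dropbox1)/StingText.properties"
command = bundle_command("create_bundle", out_path)

# Splitting the command the way a POSIX shell would yields the path as one argument,
# even though it contains spaces and parentheses.
assert shlex.split(command) == ["create_bundle", "-out", out_path]
```

Without the quoting, a shell would treat the space as an argument separator and the parentheses as subshell syntax.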
Mike f0732d386c Support more samples in math utilities.
- Amend `MathUtils`' constants so that they support calling on more than 70,000 samples (up to 100,000 instead).
2014-04-14 12:05:38 -04:00
Valentin Ruano-Rubio 08203b516e Disentangle UG and HC Genotyper engines.
Description:

  Transforms the delegation dependency of the HC genotyping engine on the UG one into reuse by inheritance, where the HC and UG engines inherit from a common superclass GenotyperEngine
  that implements the common parts. As a side effect, some of the code is now clearer and redundant code has been removed.

  The changes have a few consequences for the end user. HC now has a few more user arguments: those that control the functionality that HC was borrowing directly from UGE.

     Added -ploidy argument, although it is constrained to be 2 for now.
     Added -out_mode EMIT_ALL_SITES|EMIT_VARIANTS_ONLY ...
     Added -allSitePLs flag.

Stories:

   https://www.pivotaltracker.com/story/show/68017394

Changes:

   - Moved (HC's) GenotyperEngine to HaplotypeCallerGenotyperEngine (HCGE). Then created an engine superclass GenotypingEngine (GE) that contains the parts common to HCGE and the UG counterpart 'UnifiedGenotypingEngine' (UGE). Simplified the code and applied the template pattern to accommodate small differences in behaviour between the two caller
   engines. (There is still room for improvement though).

   - Moved inner classes and enums to top-level components for various reasons, including giving them shorter and simpler names to refer to.

   - Created a HomoSapiens class for Human-specific constants; even if they are good defaults for most users, we need to clearly identify the human assumption across the code if we want to make
   GATK work with any species in general; i.e. any reference to HomoSapiens, except as a default value for a user argument, should smell.

   - Fixed a bug deep in the genotyping calculation: we were taking fixed values for SNP and indel heterozygosity (the defaults for Human), ignoring user arguments.

   - Renamed GenotypingLikehooldCalculationCModel.Model to Gen.*Like.*Calc.*Model.Name; this is not a definitive solution, though, as the names are often used in conditionals that perhaps should be member methods of the
     GenLikeCalc classes.

   - Renamed LikelihoodCalculationEngine to ReadLikelihoodCalculationEngine to distinguish them clearly from Genotype likelihood calculation engines.

   - Changed copying by explicit argument listing to a clone/reflection solution for casting between the genotypers' argument collection classes.

   - Created GenotypeGivenAllelesUtils to collect methods needed nearly exclusively by the GGA mode.

Tests :

    - StandardCallerArgumentCollectionUnitTest (check copy by cloning/reflection).
    - All existing integration and unit tests for modified classes.
2014-04-13 03:09:55 -04:00
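The shape of the refactoring in commit 08203b516e (a common superclass holding the shared logic, with the template pattern covering small per-caller differences) can be sketched in Python. All class names, method names and the numeric default below are hypothetical illustrations, not the GATK classes or values; the sketch also shows a user-supplied heterozygosity taking precedence over a species default, which is the behaviour the bug fix above restores.

```python
from abc import ABC, abstractmethod


class HumanDefaults:
    """Species-specific constants kept in one place (value is illustrative only)."""
    SNP_HETEROZYGOSITY = 1e-3


class BaseEngine(ABC):
    """Common parts shared by both caller engines."""

    def __init__(self, snp_heterozygosity=None):
        # A user-supplied value wins; the species default is only a fallback.
        if snp_heterozygosity is None:
            snp_heterozygosity = HumanDefaults.SNP_HETEROZYGOSITY
        self.snp_heterozygosity = snp_heterozygosity

    def genotype(self, site):
        """Template method: fixed skeleton, subclass-specific hooks."""
        if not self.should_emit(site):
            return None
        return f"{self.label()}:{site}:het={self.snp_heterozygosity}"

    @abstractmethod
    def label(self):
        """Short name of the concrete caller."""

    @abstractmethod
    def should_emit(self, site):
        """Caller-specific decision on whether to emit a call for this site."""


class UnifiedEngine(BaseEngine):
    def label(self):
        return "UG"

    def should_emit(self, site):
        return True


class HaplotypeEngine(BaseEngine):
    def label(self):
        return "HC"

    def should_emit(self, site):
        return site != "reference-only"
```

Neither subclass delegates to the other; each overrides only the hooks where its behaviour differs, and `genotype` lives once in the superclass.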
Ryan Poplin 4b140c9e48 Merge pull request #600 from broadinstitute/rp_random_forest_no_QUAL
Improvements to the Random Forest pipeline based on Marathon results.
2014-04-11 13:41:05 -04:00
Ryan Poplin 04ddbac585 Improvements to the Random Forest pipeline based on Marathon results.
-- We no longer use QUAL because it scales insidiously with AC.
-- By default we exclude sites in which NA12878 is polymorphic to prevent overfitting to the knowledgebase.
-- Tweaks to training parameters were required because of the QUAL change.
-- We now test for model convergence instead of specifying the number of iterations at the command line.
2014-04-11 12:16:05 -04:00
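Commit 04ddbac585 does not say how convergence is tested, so the following Python sketch only illustrates the general idea of stopping on convergence rather than after a fixed iteration count. The function name, the tolerance, the safety cap on iterations and the score-difference criterion are all assumptions.

```python
def train_until_converged(step, initial_score, tolerance=1e-3, max_iterations=1000):
    """Run `step` until the score changes by less than `tolerance` (hypothetical criterion).

    `step(iteration)` performs one training iteration and returns the new score.
    Returns the number of iterations run and the final score.
    """
    previous = initial_score
    for iteration in range(1, max_iterations + 1):
        current = step(iteration)
        if abs(current - previous) < tolerance:
            return iteration, current
        previous = current
    # Safety cap reached without meeting the convergence criterion.
    return max_iterations, previous


# Toy score that approaches 1.0, halving its distance on every iteration.
iterations, score = train_until_converged(lambda i: 1.0 - 0.5 ** i, initial_score=0.0)
```

With this toy score the loop stops once the per-iteration improvement drops below the tolerance, so the user never has to guess an iteration count up front.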
kshakir 6d58e61f23 Merge pull request #603 from broadinstitute/ks_specify_columns_analyzerunreports
Mapping fields to explicit column names in analyzeRunReports.py
2014-04-11 04:30:31 +08:00
Khalid Shakir c84235c17c Mapping fields to explicit column names in analyzeRunReports.py.
Removed SQLSetupTable support.
2014-04-11 04:28:33 +08:00
Eric Banks e38a295ebd Merge pull request #601 from broadinstitute/ami-updateScalaScript
update scala scripts to include more of the pipeline steps
2014-04-10 16:01:02 -04:00
droazen 1590f06322 Merge pull request #602 from broadinstitute/use_version_controlled_scripts_for_s3_dl
Use version-controlled copies of scripts in GATKReports downloader
2014-04-10 15:40:37 -04:00
David Roazen 147bd88d26 Use version-controlled copies of scripts in GATKReports downloader 2014-04-10 15:39:06 -04:00
Ami Levy-Moonshine 40360ddb56 update scala scripts to include more of the pipeline steps
Add a new script for evaluating the RNA-seq downsample results
2014-04-10 15:29:17 -04:00
jmthibault79 c275d76a3e Merge pull request #599 from broadinstitute/jt_logging_test
Integration test for logging to stderr
2014-04-09 15:31:51 -04:00
Joel Thibault c84126205b Test that stdout redirects and log files do not affect output 2014-04-09 13:52:42 -04:00
Joel Thibault 1103fd231a Better exception message 2014-04-09 10:51:45 -04:00
Ryan Poplin 1001a75d0e Merge pull request #598 from broadinstitute/rp_random_forest_fix_tranches
Bug fix for correctly parsing the tranche tag in the RandomForestWalker.
2014-04-09 09:28:23 -04:00
kshakir 5b32b7b191 Merge pull request #595 from broadinstitute/ks_picard_matecigar_update
After comments from @nh13, updated latest picard and setMateInfo call.
2014-04-09 10:30:22 +08:00
Ryan Poplin edd15add7c Bug fix for correctly parsing the tranche tag in the RandomForestWalker. 2014-04-08 15:39:17 -04:00
Khalid Shakir a6b0754990 After comments from @nh13, updated latest picard and setMateInfo call. 2014-04-08 15:22:45 -04:00
kshakir cc580ac75f Merge pull request #593 from broadinstitute/ks_bqsrgatherer_missing_readgroups_68720468
BQSRGatherer handles missing read groups from some input files.
2014-04-09 03:17:53 +08:00
Khalid Shakir 3047d6ff32 BQSRGatherer handles missing read groups from some input files. [#68720468] 2014-04-08 23:58:54 +08:00
Eric Banks b07c0a6b4c Merge pull request #594 from broadinstitute/dr_vcf_sample_renaming
Extend on-the-fly sample renaming feature to vcfs
2014-04-08 11:47:45 -04:00
David Roazen af6a897479 Extend on-the-fly sample renaming feature to vcfs
-Only works with single-sample vcfs

-As with bams, the user must provide a file mapping the absolute path to
 each vcf whose samples are to be renamed to the new sample name for that
 vcf. The argument is the same as for bams: --sample_rename_mapping_file,
 and the mapping file may contain a mix of bam and vcf files should the
 user wish.

-It's an error to attempt to remap the sample names of a multi-sample
 or sites-only vcf

-Implemented at the codec level at the instant the vcf header is first
 read in to minimize the chances of downstream code examining vcf
 headers/records before renaming occurs.

-Integration tests are in sting, unit tests are in picard

-Rev picard et al. to 1.111.1902
2014-04-08 11:07:00 -04:00
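The rules in commit af6a897479 can be sketched in Python. This is an illustration under stated assumptions, not the codec-level implementation: the function names are hypothetical, and the mapping file is assumed to hold one entry per line, with the absolute file path first and the new sample name after the first run of whitespace (so the name itself may contain spaces).

```python
def load_sample_rename_map(lines):
    """Parse mapping lines of the assumed form '<absolute path> <new sample name>'."""
    mapping = {}
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        # Split on the first whitespace run only, so sample names may contain spaces.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected a path and a sample name")
        path, new_name = parts
        if not path.startswith("/"):
            raise ValueError(f"line {number}: path must be absolute: {path}")
        if path in mapping:
            raise ValueError(f"line {number}: duplicate entry for {path}")
        mapping[path] = new_name
    return mapping


def rename_single_sample(header_samples, new_name):
    """Return the renamed sample list, rejecting multi-sample and sites-only headers."""
    if len(header_samples) != 1:
        raise ValueError("sample renaming only works with single-sample vcfs")
    return [new_name]
```

Applying `rename_single_sample` as soon as the header is read mirrors the commit's intent that no downstream code sees the old name.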
Eric Banks e40cad7b50 Merge pull request #597 from broadinstitute/eb_fix_b36_chainfile
The contig is named MT, not M in b36.  Delivers PT68890442.
2014-04-08 10:04:44 -04:00
Eric Banks e690ed1a67 The contig is named MT not M in b36. Delivers PT68890442. 2014-04-08 10:03:47 -04:00
Eric Banks 85f68f610e Merge pull request #596 from broadinstitute/eb_fix_na12878_roc_curve_maker
Don't error out with ArithmeticException in ROC maker when using small sets
2014-04-08 09:56:07 -04:00
Eric Banks ad336375dc Merge pull request #590 from broadinstitute/vrr_validate_variants_unused_alleles_fix
Addresses issue with strict validation on GVCF files.
2014-04-07 22:10:49 -04:00
Ryan Poplin 416ccef0c5 Merge pull request #592 from broadinstitute/rp_random_forest_improvements
Balancing training classes between SNP/Indel and TP/FP.
2014-04-07 21:22:45 -04:00
Valentin Ruano-Rubio 5afcc8e05f Change in the command line interface of ValidateVariants.
Following reviewers' comments, the command line interface has been simplified.
All extra strict validations are performed by default (as before), and the
user has to indicate which ones he/she does not want to use with --validationTypeToExclude.

Previously he/she could select the only ones to apply with --validationType, but that option has been scrapped.

Stories:

    - https://www.pivotaltracker.com/story/show/68725164

Changes:

    - Removed the validationType argument.
    - Improved documentation.
    - Added warning log messages for suspicious argument combinations.

Tests:

    - ValidateVariantsIntegrationTest#*
2014-04-07 16:27:11 -04:00
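The exclude-only interface from commit 5afcc8e05f can be sketched in Python as "start from everything, subtract what the user excludes". The validation names and the function below are hypothetical placeholders, not the actual ValidateVariants types.

```python
# Placeholder names for the available strict validations (hypothetical).
ALL_VALIDATIONS = ("REFERENCE_CHECK", "ID_CHECK", "ALLELE_CHECK", "COUNT_CHECK")


def validations_to_run(excluded=()):
    """Return every validation except the excluded ones, keeping the default order."""
    excluded = set(excluded)
    unknown = excluded - set(ALL_VALIDATIONS)
    if unknown:
        raise ValueError(f"unknown validation type(s): {sorted(unknown)}")
    return [name for name in ALL_VALIDATIONS if name not in excluded]
```

With no exclusions every check runs, which matches the "performed by default (as before)" behaviour described in the commit message.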
Ryan Poplin 7d11b4d5f1 Balancing training classes between SNP/Indel and TP/FP.
-- This results in a much more consistent distribution of LOD scores for SNPs and Indels.
-- Removing genotype summary stats since they are now produced by default.
-- Added functionality to specify certain subsets of the training data to be used in Tranche file generation, -good:tranche=true set.vcf
2014-04-07 15:23:53 -04:00
Eric Banks de2a2442d9 Merge pull request #591 from broadinstitute/rp_add_genotype_summary_annotations
Adding GenotypeSummaries as INFO field annotations.
2014-04-07 09:21:07 -04:00