Commit Graph

4164 Commits (96e29d3d94ceaa9e7d8a478f6a5d146c19556cee)

Author SHA1 Message Date
Eric Banks 83e09b1f64 Created a new walker to do the full combination of N gVCFs from the HC single-sample ref calc pipeline.
Basically, it does 3 things (as opposed to having to call into 3 separate walkers):
1. merge the records at any given position into a single one with all alleles and appropriate PLs
2. re-genotype the record using the exact AF calculation model
3. re-annotate the record using the VariantAnnotatorEngine

In the course of this work it became clear that we couldn't just use the simpleMerge() method used
by CombineVariants; combining HC-based gVCFs is really a complicated process.  So I added a new
utility method to handle this merging and pulled any related code out of CombineVariants.  I tried
to clean up a lot of that code, but ultimately that's out of the scope of this project.

Added unit tests for correctness testing.
Integration tests cannot be used yet because the HC doesn't output correct gVCFs.
2013-12-31 12:07:56 -05:00
Menachem Fromer 48ef7a1a2f Merge remote-tracking branch 'origin/master' into mf_new_RBP 2013-12-19 10:42:20 -05:00
David Roazen 4a79831adc Add ability to specify min/max required/recommended values for numeric arguments in the @Argument annotation
-You can now add "minValue", "maxValue", "minRecommendedValue", and "maxRecommendedValue" attributes
 to @Argument annotations for command-line arguments

-"minValue" and "maxValue" specify hard limits that generate an exception if violated

-"minRecommendedValue" and "maxRecommendedValue" specify soft limits that generate a warning if violated

-Works only for numeric arguments (int, double, etc.) with @Argument annotations

-Only considers values actually specified by the user on the command line, not default values
 assigned in the code

As requested by Geraldine
2013-12-18 18:09:08 -05:00
Eric Banks 400e7c1404 Fixed bug in the filtering of lifted over variants where a deletion at the end of a contig could cause it to error out.
Added a unit test.
2013-12-11 14:07:18 -05:00
Eric Banks 418fbdfbab Added HC trio calls and NA12878 KB snapshot to resource bundle.
Also, don't touch the current link until the resources are finished being produced.
2013-12-07 22:08:34 -05:00
David Roazen 932cd3ada7 Fix 3rd-party library dependency issues in the HC/PairHMM tests
In general, test classes cannot use 3rd-party libraries that are not
also dependencies of the GATK proper without causing problems when,
at release time, we test that the GATK jar has been packaged correctly
with all required dependencies.

If a test class needs to use a 3rd-party library that is not a GATK
dependency, write wrapper methods in the GATK utils/* classes, and
invoke those wrapper methods from the test class.
2013-12-06 13:16:55 -05:00
David Roazen 0e65296efb Rev picard, sam-jdk, tribble, and variant jars to 1.104.1628
-update VariantFiltration to work with new Lazy wrapper around the
 JexlEngine in VariantContextUtils
2013-12-05 12:45:32 -05:00
Joel Thibault 5fe0531b4d Throw a GVCFIndexException when the user doesn't specify the optimal indexing strategy 2013-12-03 23:12:14 -05:00
Joel Thibault 8571a641bf Add @Advanced to variant_index_type and variant_index_parameter 2013-12-03 23:12:14 -05:00
Joel Thibault fd0a02e52e New VCF engine arguments to specify an alternate IndexCreator
- CatVariants updates to use custom VCF indices
- Scala scripts for VCF index testing
2013-12-03 13:31:02 -05:00
Joel Thibault 42f78bdb3a Add a class-based DataProvider 2013-12-03 13:31:01 -05:00
Joel Thibault cd3ee2ae7e whitespace 2013-12-03 13:31:01 -05:00
Eric Banks 6bee6a1b53 Change the behavior of SelectVariants for PL/AD when it encounters a record that has lost one or more alternate alleles.
Previously, we would strip out the PLs and AD values since they were no longer accurate.  However, this is not ideal because
then that information is just lost and 1) users complain on the forum and post it as a bug and 2) it gives us problems in both
the current and future (single sample) calling pipelines because we subset samples/alleles all the time and lose info.

Now the PLs and AD get correctly selected down.

While I was in there I also refactored some related code in subsetDiploidAlleles().  There were no real changes there - I just
broke it out into smaller chunks as per our best practices.

Added unit tests and updated integration tests.
Addressed reviews.
2013-12-03 09:23:03 -05:00
Valentin Ruano-Rubio 0f99778a59 Adding Graph-based likelihood ratio calculation to HC
To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelarate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the larger kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to spport
   local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations

 * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
     of the graph-based likelihood approach.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
     to calcualte the likelihoods using the graph as an scafold.

 * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
     used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed probrems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed probrem with HaplotypeGraph unit test

Adding Graph-based likelihood ratio calculation to HC

  To active this feature add '--likelihoodCalculationEngine GraphBased' to the HC command line.

New HC Options (both Advanced and Hidden):
==========================================

  --likelihoodCalculationEngine PairHMM/GraphBased/Random (default PairHMM)

Specifies what engine should be used to generate read vs haplotype likelihoods.

  PairHMM : standard full-PairHMM approach.
  GraphBased : using the assembly graph to accelarate the process.
  Random : generate random likelihoods - used for benchmarking purposes only.

  --heterogeneousKmerSizeResolution COMBO_MIN/COMBO_MAX/MAX_ONLY/MIN_ONLY (default COMBO_MIN)

It idicates how to merge haplotypes produced using different kmerSizes.
Only has effect when used in combination with (--likelihooCalculationEngine GraphBased)

  COMBO_MIN : use the smallest kmerSize with all haplotypes.
  COMBO_MAX : use the larger kmerSize with all haplotypes.
  MIN_ONLY : use the smallest kmerSize with haplotypes assembled using it.
  MAX_ONLY : use the larger kmerSize with haplotypes asembled using it.

Major code changes:
===================

 * Introduce multiple likelihood calculation engines (before there was just one).

 * Assembly results from different kmerSies are now packed together using the AssemblyResultSet class.

 * Added yet another PairHMM implementation with a different API in order to spport
   local PairHMM calculations, (e.g. a segment of the read vs a segment of the haplotype).

Major components:
================

 * FastLoglessPairHMM: New pair-hmm implemtation using some heuristic to speed up partial PairHMM calculations

 * GraphBasedLikelihoodCalculationEngine: delegates onto GraphBasedLikelihoodCalculationEngineInstance the exectution
     of the graph-based likelihood approach.

 * GraphBasedLikelihoodCalculationEngineInstance: one instance per active-region, implements the graph traversals
     to calcualte the likelihoods using the graph as an scafold.

 * HaplotypeGraph: haplotype threading graph where build from the assembly haplotypes. This structure is the one
     used by GraphBasedLikelihoodCalculationEngineInstance to do its work.

 * ReadAnchoring and KmerSequenceGraphMap: contain information as how a read map on the HaplotypeGraph that is
     used by GraphBasedLikelihoodCalcuationEngineInstance to do its work.

Remove mergeCommonChains from HaplotypeGraph creation

Fixed bamboo issues with HaplotypeGraphUnitTest

Fixed probrems with HaplotypeCallerIntegrationTest

Fixed issue with GraphLikelihoodVsLoglessAccuracyIntegrationTest

Fixed ReadThreadingLikelihoodCalculationEngine issues

Moved event-block iteration outside GraphBased*EngineInstance

Removed unecessary parameter from ReadAnchoring constructor.
Fixed test problem

Added a bit more documentation to EventBlockSearchEngine

Fixing some private - protected dependency issues

Further refactoring making GraphBased*Instance and HaplotypeGraph slimmer. Addressed last pull request commit comments

Fixed FastLoglessPairHMM public -> protected dependency

Fixed probrem with HaplotypeGraph unit test
2013-12-02 19:37:19 -05:00
Chris Hartl 1f777c4898 Introducing the latest-and-greatest in genotyping: CalculatePosteriors.
CalculatePosteriors enables the user to calculate genotype likelihood posteriors (and set genotypes accordingly) given one or more panels containing allele counts (for instance, calculating NA12878 genotypes based on 1000G EUR frequencies). The uncertainty in allele frequency is modeled by a Dirichlet distribution (parameters being the observed allele counts across each allele), and the genotype state is modeled by assuming independent draws (Hardy-Weinberg Equilibrium). This leads to the Dirichlet-Multinomial distribution.

Currently this is implemented only for ploidy=2. It should be straightforward to generalize. In addition there's a parameter for "EM" that currently does nothing but throw an exception -- another extension of this method is to run an EM over the Maximum A-Posteriori (MAP) allele count in the input sample as follows:
 while not converged:
  * AC = [external AC] + [sample AC]
  * Prior = DirichletMultinomial[AC]
  * Posteriors = [sample GL + Prior]
  * sample AC = MLEAC(Posteriors)

This is more useful for large callsets with small panels than for small callsets with large panels -- the latter of these being the more common usecase.

Fully unit tested.

Reviewer (Eric) jumped in to address many of his own comments plus removed public->protected dependencies.
2013-11-27 13:00:45 -05:00
Geraldine Van der Auwera 429582589f Set SAMFileWriter to create index in ReadUtils to fix SplitSamFile issue 2013-11-26 15:54:47 -05:00
Geraldine Van der Auwera 25bc6e64ae Patched Queue extensions lacking a main class definition 2013-11-22 14:57:09 -05:00
Ami Levy-Moonshine 6ad841cec5 Rewrite ReadLengthDistribution to count the read lengths into a hash table first and only at the end to produce a GATK report table.
Before that fix, the tool was couldn't work with more then one RG before.
- Address all review comments
2013-11-18 17:29:31 -05:00
Ami Levy-Moonshine 9c1023c933 fix a (ugly) weird error from last commit that changed all the scala files to end with MoleculoPipeline.scala 2013-11-18 11:44:24 -05:00
MauricioCarneiro 7f08250870 Merge pull request #417 from broadinstitute/bt_pairhmm_api_cleanup2
Improve the PairHMM API for better FPGA integration
2013-11-14 10:47:07 -08:00
bradtaylor e40a07bb58 Improve the PairHMM API for better FPGA integration
Motivation:
The API was different between the regular PairHMM and the FPGA-implementation
via CnyPairHMM. As a result, the LikelihoodCalculationEngine had
to use account for this. The goal is to change the API to be the same
for all implementations, and make it easier to access.

PairHMM
PairHMM now accepts a list of reads and a map of alleles/haplotpes and returns a PerReadAlleleLikelihoodMap.
Added a new primary method that loops the reads and haplotypes, extracts qualities,
and passes them to the computeReadLikelihoodGivenHaplotypeLog10 method.
Did not alter that method, or its subcompute method, at all.
PairHMM also now handles its own (re)initialization, so users don't have to worry about that.

CnyPairHMM
Added that same new primary access method to this FPGA class.
Method overrides the default implementation in PairHMM. Walks through a list of reads.
Individual-read quals and the full haplotype list are fed to batchAdd(), as before.
However, instead of waiting for every read to get added, and then walking through the reads
again to extract results, we just get the haplotype-results array for each read as soon as it
is generated, and pack it into a perReadAlleleLikelihoodMap for return.
The main access method is now the same no matter whether the FPGA CnyPairHMM is used or not.

LikelihoodCalculationEngine
The functionality to loop through the reads and haplotypes and get individual log10-likelihoods
was moved to the PairHMM, and so removed from here. However, this class does need to retain
the ability to pre-process the reads, and post-process the resulting likelihoods map.
Those features were separated from running the HMM and refactored into their own methods
Commented out the (unused) system for finding best N haplotypes for genotyping.

PairHMMIndelErrorModel
Similar changes were made as to the LCE. However, in this case the haplotypes are modified
based on each individual read, so the read-list we feed into the HMM only has one read.
2013-11-14 09:45:33 -05:00
Geraldine Van der Auwera f22ab033f6 Merge pull request #424 from broadinstitute/gg_yetanothergatkdocfix
Yet another gatkdoc fix
2013-11-13 11:35:59 -08:00
Geraldine Van der Auwera dac3dbc997 Improved gatkdocs for InbreedingCoefficient, ReduceReads, ErrorRatePerCycle
Clarified caveat for InbreedingCoefficient
Cleaned up docstrings for ReduceReads
Brushed up doc for ErrorRatePerCycle
2013-11-13 14:33:04 -05:00
Phillip Dexheimer 296bcc7fb1 Changed name of jobs submitted to cluster job runners
-- Added 'jobRunnerJobName' definition to QFunction, defaults to value of shortDescription
-- Edited Lsf and Drmaa JobRunners to use this string instead of description for naming jobs in the scheduler

Signed-off-by: Joel Thibault <thibault@broadinstitute.org>
2013-11-12 14:34:56 -05:00
Mauricio Carneiro 725656ae7e Generalizing the FullProcessingPipeline Qscript
We have generalized the processing script to be able to handle multiple scenarios. Originally it was
designed for PCR free data only, we added all the steps necessary to start from fastq and process
RNA-seq as well as non-human data. This is our go to script in TechDev.

   * add optional "starting from fastq" path to the pipeline
   * add mark duplicates (optionally) to the pipeline
   * add an option to run with the mouse data (without dbsnp and with single ended fastq)
   * add option to process RNA-seq data from topHat (add RG and reassign mapping quality if necessary)
   * add option to filter or include reads with N in the cigar string
   * add parameter to allow keeping the intermediate files
2013-11-07 16:34:29 -05:00
Eric Banks f15355856a Merge pull request #418 from broadinstitute/eb_fix_liftover_script
Fixing the liftover script to not require strict VCF header validation.
2013-11-07 06:04:56 -08:00
Eric Banks 2fc40a0aed Fixing the liftover script to not require strict VCF header validation.
Apparently no one has used the liftover script for a while (which I guess is a good thing)...
2013-11-07 09:02:17 -05:00
Eric Banks 0e3d83d1ef Merge pull request #413 from broadinstitute/rp_qd_and_qual_updates_in_ref_model_pipeline
Improvements to the reference model pipeline.
2013-11-05 06:33:17 -08:00
Eric Banks 09dfaf1a68 Merge pull request #416 from broadinstitute/mc_quick_fixes_to_cser_pipeline
Add interpretation to QualifyMissingIntervals
2013-11-05 06:08:13 -08:00
Eric Banks 96024403bf Update the dbsnp version in the bundle from 137 to 138; resolves PT #59771004. 2013-11-04 10:01:22 -05:00
Ryan Poplin b22c9c2cb4 Improvements to the reference model pipeline.
-- We use the RegenotypeVariants walker to recompute the qual field. (instead of the discussed idea of adding this functionality to CombineVariants)
-- QualByDepth will now be recomputed even if the stratified contexts are missing. This greatly improves the QD estimate for this pipeline. Doesn't work for multi-allelics since the qual can't be recomputed.
2013-11-01 17:58:25 -04:00
Eric Banks cafcb34855 Merge pull request #411 from broadinstitute/eb_add_exome_intervals_to_bundle_script
Updated the GATK bundle script to:
2013-10-29 07:38:44 -07:00
Eric Banks 209f2a61aa Updated the GATK bundle script to:
1. Include exome target list for b37
2. Not delete the 'current' link unless -run is applied to the command line!  (sorry, Ryan)
2013-10-29 10:33:51 -04:00
Louis Bergelson 9498950b1c Adding more specific error message when one of the scripts doesn't exist.
--Previously it gave a cryptic message:
----IO error while decoding blarg.script with UTF-8
----Please try specifying another one using the -encoding option
2013-10-21 14:57:42 -04:00
David Roazen 5a2ef37ead Tweak dcov documentation to help prevent user confusion
Geraldine-approved!
2013-10-16 15:24:33 -04:00
Mauricio Carneiro efbfdb64fe Qscript to Downsample and analyze an exome BAM
this script downsamples an exome BAM several times and makes a coverage distribution
analysis (of bases that pass filters) as well as haplotype caller calls with a NA12878
Knowledge Base assessment with comparison against multi-sample calling
with the UG.

This script was used for the "downsampling the exome" presentation
2013-10-10 14:37:33 -04:00
Chris Hartl 55bab9fa87 Merged bug fix from Stable into Unstable 2013-10-10 13:01:12 -04:00
Chris Hartl 06d28c7f8b VariantsToBinaryPed: Move .fam file writing to initialize to ensure ordering matches the ordering of the VCF. Change the documentation to clarify that the fam files are not directly copied, but subset and re-ordered. 2013-10-10 12:53:15 -04:00
Mauricio Carneiro 5d6421494b Fix mismatching number of columns in report
Quick fix the missing column header in the QualifyMissingIntervals
report.

Adding a QScript for the tool as well as a few minor updates to the
GATKReportGatherer.
2013-10-09 14:38:15 -04:00
Ryan Poplin f3a67edc24 Merge pull request #402 from broadinstitute/gg_dcov_docs
Improvements to gatkdocs related to downsampling
2013-09-27 07:07:21 -07:00
kshakir a29f1f84bf Merge pull request #397 from lbergelson/lb_scala_2.10.2
Update scala from 2.9 to 2.10.2
2013-09-26 21:51:43 -07:00
Geraldine Van der Auwera 66d0235efc Minor clarifications & formatting tweaks for dcov docs 2013-09-26 14:28:22 -04:00
Michael McCowan 5113e21437 Bug fix: annotation values ar parsed as Doubles when they should be parsed as Integers due to implicit conversion.
* Updated expected test data in which an integer annotation (MQ0) was formatted as a double.
2013-09-25 13:12:02 -04:00
Louis Bergelson c05208ecec Resolving warnings
--specifying exception types in cases where none was already specified
----mostly changed to catch Exception instead of Throwable
----EmailMessage has a point where it should only be expecting a RetryException but was catching everything

--changing build.xml so that it prints scala feature warning details

--added necessary imports needed to remove feature warnings

--updating a newly deprecated enum declaration to match the new syntax
2013-09-23 12:42:22 -04:00
Louis Bergelson b32ad99d3f Changing from scala 2.9.2 to 2.10.2.
--modified ivy dependencies
--modified scala classpath in build.xml to include scala-reflect

--changed imports to point to the new scala scala.reflect.internal.util

--set the bootclasspath in QScriptManager as well as the classpath variable.

--removing Set[File] <-> Set[String] conversions
----Set is invariant now and the conversions broke
--removing unit tests for Set[File] <-> Set[String] conversions
2013-09-23 12:42:22 -04:00
chapmanb 2f5064dd1d Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions. Allows utilization of CoveredLocusView via the API
Signed-off-by: David Roazen <droazen@broadinstitute.org>
2013-09-10 15:32:54 -04:00
Geraldine Van der Auwera 292426b504 Merge pull request #390 from broadinstitute/mc_update_clipreads
Added REVERT SOFTCLIPPED bases to ClipReads
2013-09-09 16:43:03 -07:00
Geraldine Van der Auwera 8b829255e7 Clarified docs on using clipping options 2013-09-09 19:40:03 -04:00
MauricioCarneiro 014bc4269e Merge pull request #361 from broadinstitute/bt_pairhmm_array_implementation
Add Array Logless PairHMM
2013-09-08 20:16:53 -07:00
Ryan Poplin 3503050a39 Created a single sample calling pipeline which leverages the reference model calculation mode of the HaplotypeCaller
-- Adding changes to CombineVariants to work with the Reference Model mode of the HaplotypeCaller.
-- Added -combineAnnotations mode to CombineVariants to merge the info field annotations by taking the median
-- Added new StrandBiasBySample genotype annotation for use in computing strand bias from single sample input vcfs
-- Bug fixes to calcGenotypeLikelihoodsOfRefVsAny, used in isActive() as well as the reference model
-- Added active region trimming capabilities to the reference model mode, not perfect yet, turn off with --dontTrimActiveRegions
-- We only realign reads in the reference model if there are non-reference haplotypes, a big time savings
-- We only realign reads in the reference model if the read is informative for a particular haplotype over another
-- GVCF blocks will now track and output the minimum PLs over the block

-- MD5 changes!
-- HC tests: from bug fixes in calcGenotypeLikelihoodsOfRefVsAny
-- GVCF tests: from HC changes above and adding in active region trimming
2013-09-06 16:56:34 -04:00
Mauricio Carneiro b6c3ed0295 Added REVERT SOFTCLIPPED bases to ClipReads 2013-09-06 09:30:01 -04:00
Louis Bergelson 4473b0065e adding a check for the UNAVAILABLE case of GenotypeType in CountVariants 2013-08-29 17:27:00 -04:00
bradtaylor 3671e41b0c Add Array Logless PairHMM
A new PairHMM implementation for read/haplotype likelihood calculations. Output is the same as the LOGLESS_CACHING version.

Instead of allocating an entire (read x haplotype) matrix for each HMM state, this version stores sub-computations in 1D arrays. It also accesses intersections of the (read x haplotype) alignment in a different order, proceeding over "diagonals" if we think of the alignment as a matrix.

This implementation makes use of haplotype caching. Because arrays are overwritten, it has to explicitly store mid-process information. Knowing where to capture this info requires us to look ahead at the subsequent haplotype to be analyzed. This necessitated a signature change in the primary method for all pairHMM implementations.

We also had to adjust the classes that employ the pairHMM:
LikelihoodCalculationEngine (used by HaplotypeCaller)
PairHMMIndelErrorModel (used by indel genotyping classes)

Made the array version the default in the HaplotypeCaller and the UnifiedArgumentCollection.
The latter affects classes:
ErrorModel
GeneralPloidyIndelGenotypeLikelihoodsCalculationModel
IndelGenotypeLikelihoodsCalculationModel
... all of which use the pairHMM via PairHMMIndelErrorModel
2013-08-28 17:21:23 -04:00
David Roazen 42d771f748 Remove org.apache.commons.collections.IteratorUtils dependency from the test suite
-This was a dependency of the test suite, but not the GATK proper,
 which caused problems when running the test suite on the packaged
 GATK jar at release time

-Use GATKVCFUtils.readVCF() instead
2013-08-21 19:44:02 -04:00
Eric Banks 9424008055 Merge pull request #383 from broadinstitute/dr_change_phone_home_aws_settings
Update GATK AWS phone-home configuration
2013-08-21 14:08:21 -07:00
David Roazen 9fbb4920d0 Update GATK AWS phone-home configuration
-Switch to using new GSA AWS account for storage of phone home data

-Use DNS-compliant bucket names, as per Amazon's best practices

-Encrypt publicly-distributed version of credentials. Grant only PutObject
 permission, and only for the relevant buckets.

-Store non-distributed credentials in private/GATKLogs/newAWSAccountCredentials
 for now -- need to integrate with existing python/shell scripts
 later to get the log downloading working with the new account
2013-08-21 14:31:46 -04:00
Ami Levy-Moonshine 0f5bb706ff - update picard, sam, variants and tribble after fixing bug in BCF2Utils.makeDictionary as reported in ticket 52571227
- update call for VCFSimpleHeaderLine constructor in GATKVCFUtils
2013-08-21 12:06:42 -04:00
Eric Banks e1174a582d Merge pull request #379 from broadinstitute/mc_dpp_updates_part2
Including SplitByRG in the FullProcessingPipeline
2013-08-19 18:42:12 -07:00
Eric Banks 6663d48ffe Merge pull request #381 from broadinstitute/mm_rev_picard_to_get_tribble_updates
Adaptations to accomodate Tribble API changes.
2013-08-19 18:31:02 -07:00
Michael McCowan c3a933ce84 Adaptations to accomodate Tribble API changes, comprising mostly of the following.
* Refactoring implementations of readHeader(LineReader) -> readActualHeader(LineIterator), including nullary implementations where applicable.
* Galvanizing fo generic types.
* Test fixups, mostly to pass around LineIterators instead of LineReaders.
* New rev of tribble, which incorporates a fix that addresses a problem with TribbleIndexedFeatureReader reading a header twice in some instances.
* New rev of sam, to make AbstractIterator visible (was moved from picard -> sam in Tribble API refactor).
2013-08-19 15:52:47 -04:00
Mauricio Carneiro e991307eb5 Including SplitByRG in the FullProcessingPipeline
Why wasn't it there before, you ask
----------------------------------

Before I was running it separately (by hand), but now it's integrated in
the FullProcessingPipeline.

Integration was a pain because of Queue's limitation of only allowing 1
@Output file. This forced me to write the ugliest piece of code of my
life, but it's working and it's processing the YRI from scratch using
that right now. So I'm happy... somewhat.

Other changes to the pipeline
-----------------------------

   * Add --filter_bases_not_stored to the IndelRealigner step -- sometimes BAM files have reads with no bases stored in the unmapped section (no idea why) but this disrupts the pipeline.
   * Change adaptor marking parameter to "dual indexed" instead of "pair-ended" -- for PCR Free data.
2013-08-18 00:51:32 -04:00
droazen ee5de8510d Merge pull request #380 from broadinstitute/gg_gatkdocs_arglabels
More detailed labels for arguments in the gakdocs
2013-08-16 15:34:56 -07:00
Geraldine Van der Auwera 80ed186971 More detailed labels for arguments in the gakdocs (requested by David) 2013-08-16 14:25:53 -04:00
Geraldine Van der Auwera 9bb0aac7bf Disabled the help system's printout of cmdline options when GATK errors out. Now the user has to explicitly ask for it using -h. 2013-08-16 13:09:52 -04:00
Geraldine Van der Auwera 3841635fcb Changed 'depreciated' to the more correct 'deprecated' 2013-08-16 13:06:41 -04:00
Eric Banks 08be871309 Removing unused code in VariantsToTable: GQ is not an INFO field and is taken care of by -GF and not -F. 2013-08-16 01:57:24 -04:00
Eric Banks 1a5e4cc4e7 Merge pull request #375 from broadinstitute/rp_queue_jobreport_rscript
Something changed with the ggtitle syntax in the latest version of ggplo...
2013-08-14 12:48:42 -07:00
Geraldine Van der Auwera 19a4bf9ff0 made AR an Advanced argument to discourage basic users from fiddling with it 2013-08-14 14:46:56 -04:00
Ryan Poplin d4ac183580 Something changed with the ggtitle syntax in the latest version of ggplot2. 2013-08-14 14:40:03 -04:00
Mauricio Carneiro 765f5450ac Updated Full Processing Pipeline
* add interleaved fastq option to sam2fastq
    * add optional adapter trimming path
    * add "skip_revert" option to skip reverting the bams (sometimes useful -- hidden parameter)
    * add a walker that reads in one bam file and outputs N bam files, one for each read group in the original bam. This is a very important step in any BAM reprocessing pipeline.

I am using this new pipeline to process the CEU and YRI PCR Free WGS
trios.
2013-08-13 23:35:32 -04:00
Geraldine Van der Auwera a09831489b Disabled emission of doc URLs for external codecs to avoid broken links 2013-08-10 10:04:04 -07:00
Geraldine Van der Auwera 4d20c71e09 Improvements to various gatkdocs
- Make -rod required
    - Document that contaminationFile is currently not functional with HC
    - Document liftover process more clearly
    - Document VariantEval combinations of ST and VE that are incompatible
    - Added a caveat about using MVLR from HC and UG.
    - Added caveat about not using -mte with -nt
    - Clarified masking options
    - Fixed docs based on Erics comments
2013-08-10 10:01:31 -07:00
kshakir 1f86cf13d1 Merge pull request #359 from lbergelson/lb_relax_add_parameter
Trivial update to QScript.scala
2013-08-07 06:45:01 -07:00
Mark DePristo 7aba5a2f9f Several improvements to AssessNA12878 and KB
-- Bugfix for BAMs containing reads without real (M,I,D,N) operators.  Simply needed to set validation stringency to SILENT in the read. Added a BadCigar filter to the SAMRecord stream anyway
-- Add capture all sites mode to AssessNA12878: will write all sites to the badSites VCF, regardless of whether they are bad.  It's useful if you essentially want to annotate a VCF with KB information for later analysis, such as computing ROC curves
-- Add ignore filters mode to AssessNA12878: will as expected treat all sites in the input VCF calls as PASS, even if the site has a FILTER field setting
-- Add minPNonRef argument to AssessNA12878: this will consider a site not called even if the NA12878 genotype is not 0/0 if the PLs are present and the PL for 0/0 isn't greater than this value.  It allows us to easily differentiate low confidence non-ref sites obtained via multi-sample calling from highly confident non-ref calls that might be real TP or FPs
2013-08-07 08:08:37 -04:00
lbergelson af36c7ce9a Update QScript.scala
Relaxing addAll parameter type from Seq to Traversable to make it slightly more flexible.
2013-08-02 14:09:26 -04:00
Mauricio Carneiro 285ab2ac62 Better caching for the HaplotypeCaller
Problem
-------
Caching strategy is incompatible with the current sorting of the haplotypes, and is rendering the cache nearly useless.

Before the PairHMM updates, we realized that a lexicographically sorted list of haplotypes would optimize the use of the cache. This was only true until we've added the initial condition to the first row of the deletion matrix, which depends on the length of the haplotype. Because of that, every time the haplotypes differ in length, the cache has to be wiped. A lexicographic sorting of the haplotypes will put different lengths haplotypes clustered together therefore wasting *tons* of re-compute.

Solution
-------
Very simple. Sort the haplotypes by LENGTH and then in lexicographic order.
2013-08-02 01:27:29 -04:00
Yossi Farjoun 284176cd7b moved SnpEffUtilUnitTest to public tree 2013-07-30 17:51:40 -04:00
droazen b8709b1942 Merge pull request #332 from broadinstitute/st_fpga_hmm
FPGA support for PairHMM
2013-07-30 14:21:21 -07:00
Joseph Rose d2860a5486 Adding a representation of the hierarchy of flags output by snpEff (Yossi) and a stratifier whose output states are coding regions, genes, stop_gain, stop_lost and splice sites, all determined by the snpEff hierarchy (J. Rose) 2013-07-30 15:38:32 -04:00
Chris Hartl 464a5b229d Add <pre> tags to the Genotype Concordance docs. Tables were not being displayed properly. 2013-07-29 15:48:17 -07:00
Geraldine Van der Auwera 3063d82797 Fixed example in CallableLoci gatkdoc 2013-07-26 15:51:31 -04:00
Geraldine Van der Auwera fc4a8b1dd0 Fixed example in DoC gatkdoc 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 660b075900 Added deprecation notice for SomaticIndelDetector 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 5ad99c362d Added caveat to gatkdocs for MAPQ read transformers & cleaned up AB annotation gatkdocs 2013-07-26 15:51:30 -04:00
Geraldine Van der Auwera 0ea3f8ca58 Added function to gatkdocs to specify what VCF field an annotation goes in (INFO or FORMAT) 2013-07-26 15:51:30 -04:00
Ryan Poplin 8c205dda1b Automatically order the annotation dimensions in the VQSR by their standard deviation instead of the order they were specified on the command line. 2013-07-26 10:22:43 -04:00
Louis Bergelson 7c43b5f26a Adding LibraryReadFilter.
--Moving LibraryReadFilter which has been part of Mutect into gatk public.
--Added an additional check for null values.
2013-07-26 09:32:14 -04:00
Mauricio Carneiro 31ab0824b1 quick indentation fixes to FPGA code 2013-07-24 14:09:49 -04:00
Eric Banks 6df43f730a Fixing ReadBackedPileup to represent mapping qualities as ints, not (signed) bytes.
Having them as bytes caused problems for downstream programmers who had data with high MQs.
2013-07-23 23:47:15 -04:00
David Roazen 605a5ac2e3 GATK engine: add ability to do on-the-fly BAM file sample renaming at runtime
-User must provide a mapping file via new --sample_rename_mapping_file argument.
 Mapping file must contain a mapping from absolute bam file path to new sample name
 (format is described in the docs for the argument).

-Requires that each bam file listed in the mapping file contain only one sample
 in their headers (they may contain multiple read groups for that sample, however).
 The engine enforces this, and throws a UserException if on-the-fly renaming is
 requested for a multi-sample bam.

-Not all bam files for a traversal need to be listed in the mapping file.

-On-the-fly renaming is done as the VERY first step after creating the SAMFileReaders
 in SAMDataSource (before the headers are even merged), to prevent possible consistency
 issues.

-Renaming is done ONCE at traversal start for each SAMReaders resource creation in the
 SAMResourcePool; this effectively means once per -nt thread

-Comprehensive unit/integration tests

Known issues: -if you specify the absolute path to a bam in the mapping file, and then
               provide a path to that same bam to -I using SYMLINKS, the renaming won't
               work. The absolute paths will look different to the engine due to the
               symlink being present in one path and not in the other path.

GSA-974 #resolve
2013-07-18 15:48:42 -04:00
David Roazen c15751e41e SAMReaderID: fix bug with hash code and equals() method
-Two SAMReaderIDs that pointed at the same underlying bam file through
 a relative vs. an absolute path were not being treated as equal, and
 had different hash codes. This was causing problems in the engine, since
 SAMReaderIDs are often used as the keys of HashMaps.

-Fix: explicitly use the absolute path to the encapsulated bam file in
 hashCode() and equals()

-Added tests to ensure this doesn't break again
2013-07-15 13:57:00 -04:00
sathibault 0a8f75b953 Merge branch 'master' into st_fpga_hmm
Conflicts:
	protected/java/src/org/broadinstitute/sting/gatk/walkers/haplotypecaller/HaplotypeCaller.java
2013-07-15 08:17:32 -05:00
Eric Banks b16c7ce050 A whole slew of improvements to the Haplotype Caller and related code.
1. Some minor refactorings and claenup (e.g. removing unused imports) throughout.

2. Updates to the KB assessment functionality:
   a. Exclude duplicate reads when checking to see whether there's enough coverage to make a call.
   b. Lower the threshold on FS for FPs that would easily be filtered since it's only single sample calling.

3. Make the HC consistent in how it treats the pruning factor.  As part of this I removed and archived
   the DeBruijn assembler.

4. Improvements to the likelihoods for the HC
   a. We now include a "tristate" correction in the PairHMM (just like we do with UG).  Basically, we need
      to divide e by 3 because the observed base could have come from any of the non-observed alleles.
   b. We now correct overlapping read pairs.  Note that the fragments are not merged (which we know is
      dangerous).  Rather, the overlapping bases are just down-weighted so that their quals are not more
      than Q20 (or more specifically, half of the phred-scaled PCR error rate); mismatching bases are
      turned into Q0s for now.
   c. We no longer run contamination removal by default in the UG or HC.  The exome tends to have real
      sites with off kilter allele balances and we occasionally lose them to contamination removal.

5. Improved the dangling tail merging implementation.
2013-07-12 10:09:10 -04:00
Menachem Fromer a8ea57df9e Merge branch 'master' of github.com:broadinstitute/gsa-unstable 2013-07-10 16:44:35 -04:00
Eric Banks 5dbb582be7 Merge pull request #310 from broadinstitute/mc_interval_list_to_fastq
Walker to create a fastq file from an interval list
2013-07-08 14:30:43 -07:00
Valentin Ruano Rubio ac77a4c699 Merge pull request #316 from broadinstitute/md_filter_counting
Bugfix for counting of applied filters
2013-07-08 10:58:47 -07:00
Eric Banks 921f551426 AnalyzeCovariates is no longer a deprecated tool. 2013-07-08 09:48:12 -04:00
Eric Banks 5f5c90e65c Fix bug introduced recently in the VariantAnnotator where only the last -comp was being annotated at a site.
Trivial fix, added integration test to cover it.
2013-07-05 00:04:52 -04:00
David Roazen 6d69c7dc71 Disable RetryMemoryLimit pipeline test
-This test is failing intermittently for unexplained reasons (see GSA-943)

-In the interest of keeping the rest of the pipeline test suite running, it's
 best to disable this one test until GSA-943 is resolved
2013-07-03 13:38:28 -04:00
Mark DePristo 3db02e5ef1 Merge pull request #315 from broadinstitute/md_ref_conf_hc
Reference confidence model for the haplotype caller
2013-07-02 13:04:33 -07:00