Commit Graph

55 Commits (4e099c1f448ce3aa4d273f18136868f2c8ba01dd)

Author SHA1 Message Date
Ryan Poplin ac1a397024 This warning message actually happens all the time in AssessNA12878 when we subset down to biallelic events but I've verified that it is working as intended. Moving the logging level up to debug. 2014-09-29 11:40:38 -04:00
Phillip Dexheimer 1482a53aba Added -writeFullFormat engine-level argument
* This argument forces GATK to always write every record in the VCF format field, even if some records at the end are missing and could be removed
  * Revved htsjdk and picard
  * PT 70993484
2014-09-17 08:25:27 -04:00
Valentin Ruano-Rubio 95b45443ae Updated test according to changes in the AF calculator framework.
Changes:
-------

* Updated current unit and integration test to use the new API components.
* Added unit tests for new classes AFPriorProvider and AFCalculatorProviders.
* Added integration test for mixed ploidy GenotypeGVCFs and CombineGVCFs
2014-09-12 14:59:47 -04:00
Valentin Ruano-Rubio 3cdeab6e9e GenotypingEngines and walkers now use AFCalc(ulator) providers rathern than instanciate their own (fixed) calculators directly.
Changes:
-------

* GenotypingEngine uses now a AFCalc provider instead of
  its own thread-local with one-time initialized and fixed
  AF calculator.

* All walkers that use a GenotypingEngine now are passing
  the appropiate AF calculator provider. For now most
  just use a fix calculator (FixedAFCalculatorProvider)
  except GenotypeGVCFs as this one now can cope with
  mixture of ploidies failing-over to a general-ploidy
  calculator when the preferred implementation is not
  capable to handle a site's analysis.
2014-09-12 14:25:09 -04:00
Phillip Dexheimer a35f5b8685 Moved arguments controlling options in output files into the engine
* Arguments involved are --no_cmdline_in_header, --sites_only, and --bcf for VCF files and --bam_compression, --simplifyBAM, --disable_bam_indexing, and --generate_md5 for BAM files
 * PT 52740563
 * Removed ReadUtils.createSAMFileWriterWithCompression(), replaced with ReadUtils.createSAMFileWriter(), which applies all appropriate engine-level arguments
 * Replaced hard-coded field names in ArgumentDefinitionField (Queue extension generator) with a Reflections-based lookup that will fail noisily during extension generation if there's an error
2014-09-05 21:18:11 -04:00
Valentin Ruano Rubio c7925f6e5c Merge pull request #719 from broadinstitute/vrr_generalize_ploidy_in_genotype_gvcfs
Adds support for omniploidy to GenotypeGVCFs and CombineGVCFs.
2014-09-02 16:51:02 -04:00
Valentin Ruano-Rubio d363725b4b Adds support for omniploidy to GenotypeGVCFs and CombineGVCFs.
Same changes fixed the problem for GenotypeGVCFs and CombineGVCFs.

Stories:

  - https://www.pivotaltracker.com/story/show/77626044
  - https://www.pivotaltracker.com/story/show/77626854

Changes:

  - Generalized the code for the merging in GATKVariantContextUtils to cope
    with ploidy != 2.
  - GenotypeGVCFs now check that the input's ploidy conform to the '-ploidy'
    argument.
  - Moved out Refernce Confidence VC merging code from GATKVariantContextUtils
    so that we can keep new code in protected.

Caveats:

  - GenotypeGVCFs only can deal with input files that have the same ploidy in
    all positions; the one that the user MUST indicate in the -ploidy argument
    (if different to the default 2).
  - CombineGVCFs won't necessarely complain if its passed mixed ploidy
    inputs but you won't be able to genotype it with GenotypeGVCFs.

Test:

   - Removed deprecated unit tests for GATKVariantContextUtils.
   - Moved unit-tests regarding GVCF merging from GATKVariantContextUtilsUnitTest
     to ReferenceConfidenceVariantContextUtilsUnitTest.
   - Added unit test for new code for mapping genotype indices between allele
     index encoding in GenotypeLikelihoodCalculator.
   - GenotypeGVCFs and CombineGVCFs original integration test are unaffected
     by the change.
   - Added tetraploid run integration tests to check on non-diploid execution
     of GenotypeGVCFs and CombineGVCFs.
2014-09-02 15:06:47 -04:00
Eric Banks fe86dafc41 Merge pull request #705 from broadinstitute/gg_simplify_gatkdocs_templates
Changed the GATKDocs format to PHP
2014-09-02 06:28:26 -04:00
Valentin Ruano-Rubio fc5ce4b662 Created the stand-alone AC and AF annotation AlleleCountBySample
Story:

  https://www.pivotaltracker.com/story/show/77250524

Changes:

  - Remove the annotating code in GeneralPloidyExactAFCalc (GPEAFC) class.
  - Added the asAlleleList to GenotypeAlleleCounts class and get (GPEAFC) to use that instead of implementing its own (nicer and more reusable code).
  - Removed the explicit addition of AlleleCountBySample fields to the VCF header by the walker initialize
  - Added utility methods in Utils to wrap and int[] array into a List<Integer>, and double[] array into a List<Double> efficiently.

Test:

  - Added unit-testing for asAlleleList in GenotypeAlleleCountsUnitTest (within testFirst and testNext).
  - Added unit-testing for new methods in Utils : asList(int[]) and asList(double[])
  - Changed UG General Ploidy test to add explicitly those annotations.
  - Non-trivial changes in integration tests involving non-diploid runs (namelly haploid and tetraploid) as they are not showing
    those annotations anylonger, so the MD5s have been changed accordingly.
2014-08-22 20:33:25 -04:00
Valentin Ruano-Rubio 8d9a55ae60 Moving new omniploidy likelihood calculation classes to their final package (as far as this pull-request is concerned) in org.broadinstitute.gatk.tools.walkers.genotyper 2014-08-19 11:54:29 -04:00
Valentin Ruano-Rubio 611b7f25ea Adds unit-test and integration test for new omniploidy likelihood calculation components
Added md5 to HaplotypeCallerIntegrationTest.testHaplotypeCallerSingleSampleWithDbsnp
2014-08-19 11:53:19 -04:00
Valentin Ruano-Rubio 9ee9da36bb Generalize the calculation of the genotype likelihoods in HC to cope with haploid and multiploidy
Changes in several walker to use new sample, allele closed lists and new GenotypingEngine constructors signatures

Rebase adoption of new calculation system in walkers
2014-08-19 11:53:06 -04:00
Valentin Ruano-Rubio 4f993e8dbe Added read-likelihoods array base structure to substitute existing Map-of-Map-of-Maps. 2014-08-19 11:50:12 -04:00
Valentin Ruano-Rubio 242cd0e58f Added genotype allele counts and likelihood calculator utilities for arbitrary ploidy and number of alleles 2014-08-19 11:50:12 -04:00
Valentin Ruano-Rubio b0a4cb9f0c Added close sample and allele list data-structures and utility classes 2014-08-19 11:50:12 -04:00
Geraldine Van der Auwera cdba069b02 changed the GATKDocs format to PHP 2014-08-18 18:04:07 -04:00
Eric Banks eb84091702 Update the --keepOriginalAC functionality in SelectVariants to work for sites that lose alleles in the selection. 2014-08-14 15:34:09 -04:00
Ryan Poplin 3a9a78c785 Removing an assumption that ADs were in the same order if the number of alleles matched. This happens for example when one sample is C->T and another sample is C->G. 2014-08-13 13:26:40 -04:00
Eric Banks 27193c5048 Merge pull request #700 from broadinstitute/eb_phase_HC_variants_PT74816060
Initial implementation of functionality to add physical phasing informat...
2014-08-13 12:30:32 -04:00
Eric Banks 4512940e87 Initial implementation of functionality to add physical phasing information to the output of the HaplotypeCaller.
If any pair of variants occurs on all used haplotypes together, then we propagate that information into the gVCF.
Can be enabled with the --tryPhysicalPhasing argument.
2014-08-13 12:25:31 -04:00
Geraldine Van der Auwera 49702dc695 Clarified Phone Home system details re: privacy 2014-08-12 17:23:35 -04:00
jmthibault79 6d7201a7f8 Merge pull request #698 from broadinstitute/pd_printreads_subset
Improvements to read-group filtering in PrintReads
2014-08-12 14:13:07 -04:00
Phillip Dexheimer 7e77875c81 Improvements to read-group filtering in PrintReads
- Read groups that are excluded by sample_name, platform, or read_group arguments no longer appear in the header
 - The performance penalty associated with filtering by read group has been essentially eliminated
 - Partial fulfillment of PT 73075482
2014-08-11 20:08:16 -04:00
Valentin Ruano-Rubio 9a9a68409e ReadLikelihoods class introduction final changes before merging
Stories:

        https://www.pivotaltracker.com/story/show/70222086
        https://www.pivotaltracker.com/story/show/67961652

Changes:

  Done some changes that I missed in relation with making sure that all PairHMM implentations use the same interface; as a consequence we were running always the standard PairHMM.
  Fixed some additional bugs detected when running it on full wgs single sample and exom multi sample data set.
  Updated some integration test md5s.

Fixing GraphBased bugs with new master code
Fixed ReadLikelihoods.changeReads difficult to spot bug.
Changed PairHMM interface to fix a bug
Fixed missing changes for various PairHMM implementations to get them to use the new structure.
Fixed various bugs only detectable when running with full sample(s).
Believe to have fixed the lack of annotations in UG runs
Fixed integrationt test MD5s
Updating some md5s
Fixed yet another md5 probably left out by mistake
2014-08-11 17:46:28 -04:00
Valentin Ruano-Rubio 2914ecb585 Change the Map-of-maps-of-maps for an array based implementation ReadLikelihoods to hold read likelihoods.
The array structure should be faster to populate and query (no properly benchmarked) and reduce memory footprint considerably.
    Nevertheless removing PairHMM factor (using likelihoodEngine Random) it only achieves a speed up of 15% in some example WGS dataset
    i.e. there are other bigger bottle necks in the system. Bamboo tests also seem to run significantly faster with this change.

    Stories:

      https://www.pivotaltracker.com/story/show/70222086
      https://www.pivotaltracker.com/story/show/67961652

    Changes:

       - ReadLikelihoods added to substitute  Map<String,PerSampleReadLikelihoods>
       - Operation that involve changes in full sets of ReadLikelihoods have been moved into that class.
       - Simplified a bit the code that handles the downsampling of reads based on contamination

    Caveats:

       - Still we keep Map<String,PerReadAlleleLikelihoodsMap> around to pass to annotators..., didn't feel like change the interface of so many public classes in this pull-request.
2014-08-11 17:46:28 -04:00
Valentin Ruano-Rubio 09ac3779d6 Added ReadLikelihoods component to substitute Map<String,PerReadAlleleLikelihoodMap>.
It uses a more efficient java array[] based implementation and encapsulates operations perform with such a
read-likelihood collection such as marginalization, filtering by position, poor modeling or capping
worst likelihoods and so forth.

Stories:

          https://www.pivotaltracker.com/story/show/70222086
          https://www.pivotaltracker.com/story/show/67961652
2014-08-11 17:46:28 -04:00
Eric Banks 5f31e54d67 Merge pull request #696 from broadinstitute/pd_DoC_sorting
Fix sample sort order bug in DepthOfCoverage
2014-08-06 08:35:35 -04:00
Phillip Dexheimer b0c026e671 Fix sample sort order bug in DepthOfCoverage
Rare bug triggered by hash collision between sample names
 PT 66183936
2014-08-05 21:55:34 -04:00
Phillip Dexheimer 593663d9b6 Improved detection of missing argument values
In particular, it was possible to specify arguments for Files or Compound types without values
 Added a special "none" value for annotations, since a bare "-A" is no longer allowed
 Delivers PT 71792842 and 59360374
2014-08-05 20:31:31 -04:00
Phillip Dexheimer 359fe150c9 Documentation fix (closed HTML tag) 2014-08-04 23:19:16 -04:00
Laura Gauthier 4373922ee6 Update GATK to work with latest htsjdk
ValidationStringency was moved from htsjdk.samtools.SAMFileReader to htsjdk.samtools
samtools find BAM index file method was also moved (and made public!)
2014-07-30 12:05:14 -04:00
Eric Banks 84af1fc75f The copy constructor for a GATKSAMRecord (used for testing only) should use the actual read's contig index, not its mate's. 2014-07-23 15:31:03 -04:00
Geraldine Van der Auwera a6f632874b Various documentation improvements
- Edited intervals merging docs for correctness & clarity
- Edited VQSR arg docs and made mode required (+added -mode SNP to VQSR tests)
- Moved PaperGenotyper to Toy Walkers to declutter the actually useful docs
- Moved GenotypeGVCFs to Variant Discovery category and clarified a few points
- Clarified that the -resource argument depends on using the -V:tag format
- Clarified how the pcr indel model works
- Added caveat for -U ALLOW_N_CIGAR_READS
- Added MathJax support for displaying equations in GATKDocs
- Updated HC example commands and caveats
2014-07-14 12:03:03 -04:00
Eric Banks ecefcb383d Disable the complex variant merging for now, as requested by ATGU 2014-07-11 17:27:40 -04:00
Laura Gauthier 99026eb51b Update VQSR Rnd BQSR script generation code for compatibility with latest ggplot version. Update queueJobReport.R and public/gsalib/src/R/R/gsa.variantqc.utils.R also 2014-07-09 15:36:58 -04:00
Eric Banks bad7865078 When converting a haplotype to a set of variants we now check for cases that are overly complex.
In these cases, where the alignment contains multiple indels, we output a single complex
variant instead of the multiple partial indels.

We also re-enable dangling tail recovery by default.
2014-07-01 14:18:59 -04:00
Ryan Poplin 0127799cba Reads are now realigned to the most likely haplotype before being used by the annotations.
-- AD,DP will now correspond directly to the reads that were used to construct the PLs
-- RankSumTests, etc. will use the bases from the realigned reads instead of the original alignments
-- There is now no additional runtime cost to realign the reads when using bamout or GVCF mode
-- bamout mode no longer sets the mapping quality to zero for uninformative reads, instead the read will not be given an HC tag
2014-06-30 10:35:50 -04:00
Phillip Dexheimer 65eeb4a7ab Recast the "Invalid JEXL expression detected" error in SelectVariants from a RuntimeException to a UserException
- PT 68931448
2014-06-20 00:05:23 -04:00
Phillip Dexheimer da5e567b73 Added functionality to CatVariants to process .list files with -V
- Pivotal 70305712
2014-06-19 21:46:13 -04:00
Ryan Poplin da1dab6c32 Merge pull request #661 from broadinstitute/jw_allele_balance_gvcf
Enable AB annotation in reference model pipeline. Incorporates patches f...
2014-06-19 13:10:41 -04:00
Eric Banks 1092dd6e25 From Carlos Barroto: switch outputRoot in SplitSamFile to an empty string instead of null. 2014-06-19 11:06:55 -04:00
Ryan Poplin 8b75428a90 Enable AB annotation in reference model pipeline. Incorporates patches from John Wallace to public github account 2014-06-19 09:35:04 -04:00
Nigel Delaney 7570666f2a Merge pull request #655 from broadinstitute/nfd_mathutil_opts
Optimization of function to calculate the logged sum of exponentiated values
2014-06-17 17:07:42 -04:00
Nigel Delaney 5e258bfeff Minor optimization to function to calculate the log of exponentials.
* Avoids calling Math.Pow whenever possible (skips -Inf and 0 values),
leads to better performance.
2014-06-17 15:26:10 -04:00
Chris Whelan ba1d23e535 Created a new tool, SiblingIBD, which finds Identical-By-Descent regions in two siblings.
-When parental genotypes are available, implements an HMM on genotype observations in the quartet.
   -Outputs IBD regions as well as per-site posterior probabilities of being in each IBD state.
   -Includes an experimental heuristic based mode for when parental genotypes are not available.
   -Made a method in MendelianViolation public static to reuse code.
   -Added the mockito library to private/gatk-tools-private/pom.xml
2014-06-13 09:41:37 -04:00
Phillip Dexheimer 4eb9858461 Ensure that output files are specified in a writeable location
-PT 69579780
2014-06-02 21:13:59 -04:00
EvolvedMicrobe ef7531d4a5 Merge pull request #640 from broadinstitute/IntegerSWImplementation
Change SmithWaterman to use integers instead of doubles.
2014-05-28 15:10:05 -04:00
Nigel Delaney cc45e62e8e Change SmithWaterman to use integers instead of doubles. 2014-05-28 13:13:14 -04:00
Eric Banks ff43b1f298 Merge pull request #636 from broadinstitute/pd_log10_refactor
Replaced the static, fixed MathUtils.log10Cache array with a dynamic Log...
2014-05-28 08:46:49 -04:00
Phillip Dexheimer 6122b2805d Legibility improvements to ProgressMeter
- Fields in the header are delimited with the pipe character
 - Header is now split into two lines to improve spacing
 - Field width in header and progress lines auto-adjusts to length of "processing units" label (sites, active regions, etc)
 - Addresses PT 69725930
2014-05-27 23:52:42 -04:00