Commit Graph

571 Commits (232afdcbeaf0cf0f341c911e22f6e27286db08ee)

Author SHA1 Message Date
droazen 0fd9f0e77c Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837
Fixed the logic of the @Output annotation and its interaction with 'required'
2013-03-14 12:52:00 -07:00
Ryan Poplin 38914384d1 Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the simplified stats for AssessNA12878. 2013-03-14 14:44:18 -04:00
Geraldine Van der Auwera 61349ecefa Cleaned up annotations
- Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private
  - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental
  - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message
  - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives
  - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden
  - Removed unnecessary check in AverageAltAlleleLength
2013-03-14 14:26:48 -04:00
Eric Banks 7cab709a88 Fixed the logic of the @Output annotation and its interaction with 'required'.
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:

I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
  * The 'defaultToStdout' tags lets walkers specify whether to default to stdout if -o is not provided.
  * The logic for @Output is now:
    * if required==true then -o MUST be provided or a User Error is generated.
    * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
      * this is the default behavior (i.e. @Output with no modifiers).
    * if required==false and defaultToStdout==false then the output object is null.
      * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).

  * I have updated walkers so that previous behavior has been maintained (as best I could).
    * In general, all @Outputs with default long/short names have required=false.
    * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
  * I added unit tests for @Output changes with David's help (thanks!).
  * #resolve GSA-837
2013-03-14 11:58:51 -04:00
Mark DePristo b5b63eaac7 New GATKSAMRecord concept of a strandless read, update to FS
-- Strandless GATK reads are ones where they don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads.  Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands.  This means that that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called.  Added new GATKSAMRecord method setReducedCounts() that does the right thing.  Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation.  Differences are just minor updates to the FS
2013-03-13 11:16:36 -04:00
MauricioCarneiro 4403e3572a Merge pull request #94 from broadinstitute/gg_gatkdoc_docfixes_GSATDG-111 2013-03-12 13:02:35 -07:00
MauricioCarneiro 3a16ba04d4 Merge pull request #97 from broadinstitute/eb_refactor_sliding_window
Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug
2013-03-12 12:27:26 -07:00
Geraldine Van der Auwera f972963918 Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
2013-03-12 10:57:14 -04:00
Eric Banks 05e69b6294 Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug.
* Allow RR to write its BAM to stdout by setting required=true for @Output.
  * Fixed bug in sliding window where a break in coverage after a long stretch without
     a variant region was causing a doubling of all the reads before the break.
  * Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods.
  * Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().
2013-03-12 09:06:55 -04:00
Ryan Poplin c96fbcb995 Use the indel heterozygosity prior when calling indels with the HC 2013-03-11 14:12:43 -04:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (in the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be aligneable. There are third party tools that remove adaptor and potentially merge read pairs, but are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors is such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
Eric Banks 3759d9dd67 Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine.
* ReadTransformers can say they must be first, must be last, or don't care.
  * By default, none of the existing ones care about ordering except BQSR (must be first).
    * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
  * The engine now orders the read transformers up front before applying iterators.
  * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
  * Added unit tests.
2013-03-06 12:38:59 -05:00
Eric Banks 78721ee09b Added new walker to split MNPs into their allelic primitives (SNPs).
* Can be extended to complex alleles at some point.
  * Currently only works for bi-allelics (documented).
  * Added unit and integration tests.
2013-03-05 23:16:42 -05:00
Eric Banks bbbaf9ad20 Revert push from stable (I forgot that pushing from stable overwrites current unstable changes) 2013-03-05 09:06:02 -05:00
Eric Banks a037423225 Merged bug fix from Stable into Unstable 2013-03-05 09:03:48 -05:00
Eric Banks 7e1bfd6a7c Included an accidental change from unstable into the previous push 2013-03-05 09:03:31 -05:00
Eric Banks bd4e4f4ee3 Merged bug fix from Stable into Unstable 2013-03-04 23:24:44 -05:00
Eric Banks b715218bfe Fix for mismatching indel quals erro: need to adjust for softclips just like we do for bases and normal quals. 2013-03-04 23:23:18 -05:00
Ryan Poplin ce7554e9d6 Merged bug fix from Stable into Unstable 2013-03-04 12:36:04 -05:00
Ryan Poplin 0697594778 Active regions that don't contain any usable reads should just be skipped over instead of throwing an IllegalStateException. 2013-03-04 12:35:40 -05:00
Mark DePristo 42d3919ca4 Expanded functionality for writing BAMs from HaplotypeCaller
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fix bug in SWPairwiseAlignment that produces cigar elements with 0 size, and are now fixed with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment.  Refactored out the big main and supplementary static methods.  Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode.  This test only ensures that the code runs and doesn't exception out.  It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left aligns indels in the reads, as SW can return a random placement of a read against the haplotype.  Calls leftAlign to make the alignments more clear, with unit test of real read to cover this case
-- Writes out haplotypes for both all haplotype and called haplotype mode
-- Haplotype writers now get the active region call, regardless of whether an actual call was made.  Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
2013-03-03 12:07:29 -05:00
David Roazen c5c99c8339 Split long-running integration test classes into multiple classes
This is to facilitate the current experiment with class-level test
suite parallelism. It's our hope that with these changes, we can get
the runtime of the integration test suite down to 20 minutes or so.

-UnifiedGenotyper tests: these divided nicely into logical categories
 that also happened to distribute the runtime fairly evenly

-UnifiedGenotyperPloidy: these had to be divided arbitrarily into two
 classes in order to halve the runtime

-HaplotypeCaller: turns out that the tests for complex and symbolic
 variants make up half the runtime here, so merely moving these into
 a separate class was sufficient

-BiasedDownsampling: most of these tests use excessively large intervals
 that likely can't be reduced without defeating the goals of the tests. I'm
 disabling these tests for now until they can either be redesigned to use smaller
 intervals around the variants of interest, or refactored into unit tests
 (creating a JIRA for Yossi for this task)
2013-03-01 13:55:23 -05:00
depristo cac3f80c64 Merge pull request #73 from broadinstitute/eb_remove_nested_hashmap_GSA-732
Replace uses of NestedHashMap with NestedIntegerArray.
2013-02-28 05:19:56 -08:00
Eric Banks d2904cb636 Update docs for RTC. 2013-02-27 14:56:44 -05:00
Eric Banks 69b8173535 Replace uses of NestedHashMap with NestedIntegerArray.
* Removed from codebase NestedHashMap since it is unused and untested.
 * Integration tests change because the BQSR CSV is now sorted automatically.
 * Resolves GSA-732
2013-02-27 14:03:39 -05:00
Alec Wysoker c8368ae2a5 Eliminate 7-element arrays in BaseCounts and BaseAndQualsCount and replace with in-line primitive attributes. This is ugly but reduces heap overhead, and changes are localized. When used in conjunction with Mauricio's FastUtil changes it saves and additional 9% or so of execution time. 2013-02-27 12:49:56 -05:00
David Roazen 6466463d5a Merged bug fix from Stable into Unstable 2013-02-26 21:54:54 -05:00
David Roazen 12a3d7ecad Fix licenses on files modified in 2.4-1 2013-02-26 21:53:17 -05:00
David Roazen a53b4a7521 Merged bug fix from Stable into Unstable 2013-02-26 21:41:13 -05:00
David Roazen 65d31ba4ad Fix runtime public -> protected dependencies in the test suite
-replace unnecessary uses of the UnifiedGenotyper by public integration tests
 with PrintReads

-move NanoSchedulerIntegrationTest to protected, since it's completely dependent
 on the UnifiedGenotyper
2013-02-26 21:19:12 -05:00
depristo 93205154b5 Merge pull request #63 from broadinstitute/eb_fix_pairhmm_unittest_GSA-776
Eb fix pairhmm unittest gsa 776
2013-02-26 11:56:58 -08:00
Eric Banks 734353e9df Merge pull request #60 from broadinstitute/mc_fastutil_GSATDG-83
Brought all of ReduceReads to fastutils
2013-02-26 11:56:41 -08:00
David Roazen 8b29030467 Change default downsampling coverage target for the HaplotypeCaller to 250
-was previously set to 30, which seems far too aggressive given that with
 ActiveRegionWalkers, as with LocusWalkers, this limits the depth of any
 pileup returned by LIBS

-250 is a more conservative default used by the UG

-can adjust down/up later based on further experiments (GSA-699 will
 remain open)

-verified with Ryan that all integration test differences are either
 innocent or represent an improvement

GSA-699
2013-02-26 09:33:25 -05:00
Eric Banks 396b7e0933 Fixed the intermittent PairHMM unit test failure.
The issue here is that the OptimizedLikelihoodTestProvider uses the same basic underlying class as the
BasicLikelihoodTestProvider and we were using the BasicTestProvider functionality to pull out tests of
that class; so if the optimized tests were run first we were unintentionally running those same tests
again with the basic ones (but expecting different results).
2013-02-25 15:05:13 -05:00
Eric Banks 7519484a38 Refactored PairHMM.initialize to first take haplotype max length and then the read max length so that it is consistent with other PairHMM methods. 2013-02-25 15:04:23 -05:00
Ryan Poplin 89e2943dd1 The maximum kmer length is derived from the reads.
-- This is done to take advantage of longer reads which can produce less ambiguous haplotypes
-- Integration tests change for HC and BiasedDownsampling
2013-02-25 14:40:25 -05:00
Mauricio Carneiro 0ff3343282 Addressing Eric's comments
-- added @param docs to the new variables
-- made all variables final
-- switched to string builder instead of String for performance.

GSATDG-83
2013-02-25 13:33:47 -05:00
Mauricio Carneiro 9e5a31b595 Brought all of ReduceReads to fastutils
-- Added unit tests to ReduceReads name compression
-- Updated reduce reads walker for unit testing

GSATDG-83
2013-02-23 22:53:23 -05:00
Ryan Poplin 6a639c8ffc Replace Smith-Waterman alignment with the bubble traversal.
-- Instead of doing a full SW alignment against the reference we read off bubbles from the assembly graph.
-- Smith-Waterman is run only on the base composition of the bubbles which drastically reduces runtime.
-- Refactoring graph functions into a new DeBruijnAssemblyGraph class.
-- Bug fix in path.getBases().
-- Adding validation code to the assembly engine.
-- Renaming SimpleDeBruijnAssembler to match the naming of the new Assembly graph class.
-- Adding bug fixes, docs and unit tests for DeBruijnAssemblyGraph and KBestPaths classes.
-- Added ability to ignore bubbles that are too divergent from the reference
-- Max kmer can't be bigger than the extension size.
-- Reverse the order that we create the assembly graphs so that the bigger kmers are used first.
-- New algorithm for determining unassembled insertions based on the bubble traversal instead of the full SW alignment.
-- Don't need the full read span reference loc for anything any more now that we clip down to the extended loc for both assembly and likelihood evaluation.
-- Updating HaplotypeCaller and BiasedDownsampling integration tests.
-- Rebased everything into one commit as requested by Eric
-- improvements to the bubble traversal are coming as a separate push
2013-02-22 15:42:16 -05:00
Mauricio Carneiro e3f01673e1 Implementation of the find and diagnose Queue script
-- Added 'uncovered intervals' output for FindCoveredIntervals
-- updated scala script to make use of it.
2013-02-22 10:19:01 -05:00
Ryan Poplin 62e14f5b58 Bug fix in LikelihoodCalculationEngine: Mapping quality was being cast to a byte and overflowing for reads with large mapping quality scores. 2013-02-21 14:34:17 -05:00
Eric Banks 6996a953a8 Haplotype/Allele based optimizations for the HaplotypeCaller that knock off nearly 20% of the total runtime (multi-sample).
These 2 changes improve runtime performance almost as much as Ryan's previous attempt (with ID-based comparisons):
* Don't unnecessarily overload Allele.getBases() in the Haplotype class.
  * Haplotype.getBases() was calling clone() on the byte array.
* Added a constructor to Allele (and Haplotype) that takes in an Allele as input.
  * It makes a copy of he given allele without having to go through the validation of the bases (since the Allele has already been validated).
  * Rev'ed the variant jar accordingly.

For the reviewer: all tests passed before rebasing, so this should be good to go as far as correctness.
2013-02-21 10:14:11 -05:00
Eric Banks 551d33686c Merge pull request #47 from broadinstitute/aw_reduceread_perf_1_GSA-761
Reduce memory footprint of SyntheticRead by replacing several Lists with...
2013-02-20 04:49:07 -08:00
Eric Banks 9dfdb9528b Merge pull request #49 from broadinstitute/gda_hidden_ug_args
Hide arguments related to reference sample operation in UG - for interna...
2013-02-19 16:18:32 -08:00
Eric Banks 0055a6f1cd Merge pull request #45 from broadinstitute/mc_fix_indelrealigner_GSA-774
Fix to the Indel Realigner bug described in GSA-774
2013-02-19 16:16:48 -08:00
Guillermo del Angel 5a0a9bc488 Hide arguments related to reference sample operation in UG - for internal use only until paper is published and docs are polished. 2013-02-19 19:06:42 -05:00
Mauricio Carneiro 371ea2f24c Fixed IndelRealigner reference length bug (GSA-774)
-- modified ReadBin GenomeLoc to keep track of softStart() and softEnd() of the reads coming in, to make sure the reference will always be sufficient even if we want to use the soft-clipped bases
-- changed the verification from readLength to aligned bases to allow reads with soft-clipped bases
-- switched TreeSet -> PriorityQueue in the ConstrainedMateFixer as some different reads can be considered equal by picard's SAMRecordCoordinateComparator (the Set was replacing them)
-- pulled out ReadBin class so it can be testable
-- added unit tests for ReadBin with soft-clips
-- added tests for getMismatchCount (AlignmentUtils) to make sure it works with soft-clipped reads

GSA-774 #resolve
2013-02-19 16:00:36 -05:00
Alec Wysoker ab75e053da Reduce memory footprint of SyntheticRead by replacing several Lists with a single List of a small private static
class that contains the attributes that were scattered across the several Lists.
2013-02-19 15:33:33 -05:00
Ryan Poplin c025e84c8b Fix for calculating read pos rank sum test with reads that are informative but don't actually overlap the variant due to some hard clipping.
-- Updated a few integration tests for HC, UG, and UG general ploidy
2013-02-19 14:09:24 -05:00