12051 Commits (0cf5d30dacbff0a4e69baa2192b68a68ddca44fe)

Author SHA1 Message Date
Ryan Poplin 0cf5d30dac Bug fix in assembly for edge case in which the extendPartialHaplotype function was filling in deletions in the middle of haplotypes. 2013-03-15 14:20:25 -04:00
droazen 9d6d1f94b0 Merge pull request #112 from broadinstitute/dr_parallel_tests_print_unfinished_classes
parallel tests: start printing the names of unfinished test classes once...
2013-03-15 10:57:59 -07:00
Mark DePristo 4a042e9bff Merge pull request #111 from broadinstitute/rp_no_ref_padding_bug_GSA-860
Fix for edge case bug of trying to create insertions/deletions on the ed...
2013-03-15 10:34:45 -07:00
David Roazen f42a52c090 parallel tests: start printing the names of unfinished test classes once there are < 10 jobs left
This will let us see in real time in Bamboo which classes are preventing
our runs from finishing
2013-03-15 13:34:30 -04:00
Ryan Poplin b8991f5e98 Fix for edge case bug of trying to create insertions/deletions on the edge of contigs.
-- Added integration test using MT that previously failed
2013-03-15 12:32:13 -04:00
David Roazen 0fd40dbde9 parallel tests: use experimental Class A storage
(We were previously using Class C storage)
2013-03-15 10:20:27 -04:00
Ryan Poplin daa0f8b551 Merge pull request #109 from broadinstitute/md_qd_fix_for_high_depth
QualityByDepth remaps QD values > 40 to a gaussian around 30
2013-03-15 07:05:32 -07:00
Mark DePristo 8317cc155e Merge pull request #108 from broadinstitute/eb_bqsr_out_of_bounds_fix
Added check in the MalformedReadFilter for reads without stored bases (i...
2013-03-14 17:29:35 -07:00
MauricioCarneiro 6f0269df2c Merge pull request #107 from broadinstitute/eb_fix_bqsr_clip_exception 2013-03-14 14:40:06 -07:00
Eric Banks 232afdcbea Added check in the MalformedReadFilter for reads without stored bases (i.e. that use '*').
* We now throw a User Error for such reads
  * User can override this to filter instead with --filter_bases_not_stored
  * Added appropriate unit test
2013-03-14 17:17:26 -04:00
Mark DePristo 2d35065238 QualityByDepth remaps QD values > 40 to a gaussian around 30
-- This is a temporary fix / hack to deal with the very high QD values that are generated by the haplotype caller when nearby events occur within reads.  In that case, the QUAL field can be many fold higher than normal, and results in an inflated QD value.  This hack projects such high QD values back into the good range (as these are good variants in general) so they aren't filtered away by VQSR.
-- The long-term solution to this problem is to move the HaplotypeCaller to the full bubble calling algorithm
-- Update md5s
2013-03-14 16:09:41 -04:00
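The remapping described in commit 2d35065238 can be illustrated with a minimal sketch. The threshold (40) and the center (30) come from the commit message; the class name, the method name and the standard deviation of 3.0 are assumptions made for illustration, and this is not the actual QualityByDepth code.

```java
import java.util.Random;

/** Illustrative sketch only; not the GATK QualityByDepth implementation. */
final class QdRemapSketch {
    static final double MAX_QD = 40.0;      // threshold named in the commit message
    static final double TARGET_MEAN = 30.0; // center named in the commit message
    static final double ASSUMED_SD = 3.0;   // ASSUMPTION: the spread is not stated in the commit

    /** Returns qd unchanged when it is at most MAX_QD, otherwise a Gaussian draw around TARGET_MEAN. */
    static double remap(double qd, Random rng) {
        if (qd <= MAX_QD) {
            return qd;
        }
        return TARGET_MEAN + ASSUMED_SD * rng.nextGaussian();
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed so repeated runs give the same draw
        System.out.println(remap(25.0, rng));  // in range: returned unchanged
        System.out.println(remap(120.0, rng)); // inflated: replaced by a draw near 30
    }
}
```

Passing the random generator in as a parameter keeps the sketch deterministic under a fixed seed, which matters for anything checked by MD5.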
droazen 0fd9f0e77c Merge pull request #104 from broadinstitute/eb_fix_output_annotation_GSA-837
Fixed the logic of the @Output annotation and its interaction with 'required'
2013-03-14 12:52:00 -07:00
David Roazen c3b5f66386 run_parallel_tests: further attempts to work around git issues in bamboo 2013-03-14 15:35:55 -04:00
Mark DePristo 5d6faef50e Merge pull request #106 from broadinstitute/rp_unknown_sites_assess_as_tp_in_kb
Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the s...
2013-03-14 11:50:12 -07:00
Ryan Poplin 38914384d1 Changing CALLED_IN_DB_UNKNOWN_STATUS to count as TRUE_POSITIVEs in the simplified stats for AssessNA12878. 2013-03-14 14:44:18 -04:00
Eric Banks 6d6264b108 Merge pull request #105 from broadinstitute/gg_annotations_cleanup_45802765
Cleaned up annotations
2013-03-14 11:35:00 -07:00
delangel ec43112d28 Merge pull request #100 from broadinstitute/eb_maxIndelSize_SV_fix
Fixed bug in SelectVariants where maxIndelSize argument wasn't getting a...
2013-03-14 11:32:56 -07:00
Geraldine Van der Auwera 61349ecefa Cleaned up annotations
- Moved AverageAltAlleleLength, MappingQualityZeroFraction and TechnologyComposition to Private
  - VariantType, TransmissionDisequilibriumTest, MVLikelihoodRatio and GCContent are no longer Experimental
  - AlleleBalanceBySample, HardyWeinberg and HomopolymerRun are Experimental and available to users with a big bold caveat message
  - Refactored getMeanAltAlleleLength() out of AverageAltAlleleLength into GATKVariantContextUtils in order to make QualByDepth independent of where AverageAltAlleleLength lives
  - Unrelated change, bundled in for convenience: made HC argument includeUnmappedreads @Hidden
  - Removed unnecessary check in AverageAltAlleleLength
2013-03-14 14:26:48 -04:00
Eric Banks 7cab709a88 Fixed the logic of the @Output annotation and its interaction with 'required'.
ALL GATK DEVELOPERS PLEASE READ NOTES BELOW:

I have updated the @Output annotation to behave differently and to include a 'defaultToStdout' tag.
  * The 'defaultToStdout' tag lets walkers specify whether to default to stdout if -o is not provided.
  * The logic for @Output is now:
    * if required==true then -o MUST be provided or a User Error is generated.
    * if required==false and defaultToStdout==true then the output is assigned to stdout if no -o is provided.
      * this is the default behavior (i.e. @Output with no modifiers).
    * if required==false and defaultToStdout==false then the output object is null.
      * use this combination for truly optional outputs (e.g. the -badSites option in AssessNA12878).

  * I have updated walkers so that previous behavior has been maintained (as best I could).
    * In general, all @Outputs with default long/short names have required=false.
    * Walkers with nWayOut options must have required==false and defaultToStdout==false (I added checks for this)
  * I added unit tests for @Output changes with David's help (thanks!).
  * #resolve GSA-837
2013-03-14 11:58:51 -04:00
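The @Output rules listed in commit 7cab709a88 can be summarized as a small decision function. This is a hedged sketch: the class, enum and method names are hypothetical, a plain IllegalArgumentException stands in for the GATK User Error, and the choice to honor -o whenever it is given is an assumption the commit message does not spell out.

```java
/** Illustrative sketch of the rules in the commit message; not GATK's argument-parsing code. */
final class OutputResolutionSketch {
    /** Where the output ends up. NONE models the null output object. */
    enum Target { USER_FILE, STDOUT, NONE }

    /**
     * @param required        value of the annotation's 'required' field
     * @param defaultToStdout value of the annotation's 'defaultToStdout' field
     * @param outputPath      the -o value, or null when -o was not provided
     */
    static Target resolve(boolean required, boolean defaultToStdout, String outputPath) {
        if (outputPath != null) {
            return Target.USER_FILE; // ASSUMPTION: an explicit -o always wins
        }
        if (required) {
            // The commit says a User Error is generated; a plain exception stands in for it.
            throw new IllegalArgumentException("User Error: -o must be provided");
        }
        return defaultToStdout ? Target.STDOUT : Target.NONE;
    }

    public static void main(String[] args) {
        System.out.println(resolve(false, true, null));     // default @Output behavior
        System.out.println(resolve(false, false, null));    // truly optional output
        System.out.println(resolve(true, true, "out.vcf")); // -o supplied
    }
}
```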
Eric Banks 573ed07ad0 Fixed reported bug in BQSR for RNA seq alignments with Ns.
* ClippingOp updated to incorporate Ns in the hard clips.
  * ReadUtils.getReadCoordinateForReferenceCoordinate() updated to account for Ns.
  * Added test that covers the BQSR case we saw.
  * Created GSA-856 (for Mauricio) to add lots of tests to ReadUtils.
    * It will require refactoring code and is not in the scope of what I was willing to do to fix this.
2013-03-14 11:26:52 -04:00
David Roazen acaa96f853 parallel_tests: use a safer method to copy the working dir into an LSF-accessible location
-"git clone" was failing intermittently with disturbing error messages about
 missing certain files. Use cp -r instead.

-Add extra checks and steps to try to ensure we have a complete checkout
 with no missing files.
2013-03-14 11:23:56 -04:00
David Roazen be729410b9 run_parallel_tests: use independent java.io.tmpdir for each run
-Turns out the Java 6 JCE crypto library (used to decrypt our AWS keys)
 uses the current list of files in the java.io.tmpdir as a source of
 entropy. This file list operation was prohibitively slow with a large,
 shared temp directory.

-Starting with an independent, empty temp dir for each run should solve
 this problem, and get rid of all/most of the test timeouts we've been
 seeing.
2013-03-14 08:55:26 -04:00
Eric Banks ff87b62fe3 Fixed bug in SelectVariants where maxIndelSize argument wasn't getting applied to deletions.
Added unit tests and docs.
2013-03-13 15:11:34 -04:00
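Commit ff87b62fe3 does not say why deletions escaped the maxIndelSize check. As a sketch of a check that treats insertions and deletions symmetrically, the size can be taken as the absolute difference in allele lengths. The class and method names, the use of plain strings for alleles and the inclusive threshold are all assumptions for illustration, not SelectVariants code.

```java
/** Illustrative sketch only; not the SelectVariants implementation. */
final class IndelSizeSketch {
    /** Size of the event: how many bases the alt allele adds or removes relative to ref. */
    static int indelSize(String ref, String alt) {
        return Math.abs(alt.length() - ref.length());
    }

    /** ASSUMPTION: the threshold is inclusive. */
    static boolean passesMaxIndelSize(String ref, String alt, int maxIndelSize) {
        return indelSize(ref, alt) <= maxIndelSize;
    }

    public static void main(String[] args) {
        System.out.println(passesMaxIndelSize("A", "ACGT", 3));  // 3-base insertion
        System.out.println(passesMaxIndelSize("ACGTA", "A", 3)); // 4-base deletion
    }
}
```

Without the absolute value, a deletion would yield a negative size and pass any positive threshold, which is one way a filter like this could miss deletions.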
Ryan Poplin 3b4dca1b94 Merge pull request #103 from broadinstitute/md_fragutils
Cleanup FragmentUtils; Add concept of strandless reads
2013-03-13 10:12:40 -07:00
Mark DePristo b5b63eaac7 New GATKSAMRecord concept of a strandless read, update to FS
-- Strandless GATK reads are ones that don't really have a meaningful strand value, such as Reduced Reads or fragment merged reads.  Added GATKSAMRecord support for such reads, along with unit tests
-- The merge overlapping fragments code in FragmentUtils now produces strandless merged fragments
-- FisherStrand annotation generalized to treat strandless as providing 1/2 the representative count for both strands.  This means that merged fragments are properly handled from the HC, so we don't hallucinate fake strand-bias just because we managed to merge a lot of reads together.
-- The previous getReducedCount() wouldn't work if a read was made into a reduced read after getReducedCount() had been called.  Added new GATKSAMRecord method setReducedCounts() that does the right thing.  Updated SlidingWindow and SyntheticRead to explicitly call this function, and so the readTag parameter is now gone.
-- Update MD5s for change to FS calculation.  Differences are just minor updates to the FS
2013-03-13 11:16:36 -04:00
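The strandless counting rule from commit b5b63eaac7 (half the representative count to each strand) might look roughly like the sketch below. The class and method names are hypothetical, and the use of fractional doubles is an assumption: the commit does not say how odd counts are rounded before they enter the FS table.

```java
/** Illustrative sketch only; not the FisherStrand implementation. */
final class StrandCountSketch {
    /** Returns {forwardContribution, reverseContribution} for one read. */
    static double[] strandContribution(boolean strandless, boolean reverseStrand, int representativeCount) {
        if (strandless) {
            // Strandless reads split their count evenly between the two strands.
            double half = representativeCount / 2.0;
            return new double[] {half, half};
        }
        return reverseStrand
                ? new double[] {0.0, representativeCount}
                : new double[] {representativeCount, 0.0};
    }

    public static void main(String[] args) {
        double[] merged = strandContribution(true, false, 5);
        System.out.println(merged[0] + " forward, " + merged[1] + " reverse");
    }
}
```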
Mark DePristo 925846c65f Cleanup of FragmentUtils
-- Code was undocumented, big, and not well tested.  All three things fixed.
-- Currently not passing, but the framework works well for testing
-- Added concat(byte[] ... arrays) to utils
2013-03-13 07:36:20 -04:00
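A varargs helper with the signature named in commit 925846c65f, concat(byte[] ... arrays), could be written as follows. Only the signature comes from the commit message; the enclosing class name and the body are an illustrative guess, not the code that was added to the GATK utils.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; the class name is hypothetical. */
final class ByteConcatSketch {
    /** Concatenates the given arrays, in order, into one newly allocated array. */
    static byte[] concat(byte[]... arrays) {
        int total = 0;
        for (byte[] a : arrays) {
            total += a.length;
        }
        byte[] result = new byte[total];
        int offset = 0;
        for (byte[] a : arrays) {
            System.arraycopy(a, 0, result, offset, a.length);
            offset += a.length;
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] left = "AC".getBytes(StandardCharsets.US_ASCII);
        byte[] right = "GT".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(concat(left, right), StandardCharsets.US_ASCII)); // prints ACGT
    }
}
```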
David Roazen 8ed78b453f Increase timeout for a test in the EngineFeaturesIntegrationTest
-This test was intermittently failing when run on the farm
2013-03-12 23:53:26 -04:00
David Roazen 3847de5290 run_parallel_tests: detect farm glitches
-add a function to detect the case where there were no ant test failures,
 but one or more jobs exited with an error
2013-03-12 23:26:33 -04:00
Mark DePristo c289103c7d Merge pull request #102 from broadinstitute/dr_parallel_test_runner_improvements
parallel test runner: support multiple kinds of tests per run, logging, ...
2013-03-12 18:04:55 -07:00
David Roazen 7d06d15f3c parallel test runner: support multiple kinds of tests per run, logging, improved script output
-script now supports a variable number of test class suffixes (e.g., UnitTest,
 IntegrationTest, etc.) meaning we can, for example, dispatch all unit
 and integration tests at once in a single job array

-write an entry to a log file at the end of each run including the build ID,
 exit status (COMPLETED or TIMED_OUT), total runtime, and time spent waiting
 for farm jobs to complete

-more detailed output: print how many jobs are pending vs. running vs. done,
 instead of just how many jobs are unfinished

-all errors now go to stderr rather than stdout
2013-03-12 20:46:38 -04:00
Mark DePristo b3f67899b5 Merge pull request #101 from broadinstitute/dr_fix_failing_parallel_tests
Fix more tests that fail when run in parallel on the farm
2013-03-12 14:11:02 -07:00
David Roazen cdb1fa1105 Fix more tests that fail when run in parallel on the farm
-Allow the default S3 put timeout of 30 seconds for GATKRunReports
 to be overridden via a constructor argument, and use a timeout
 of 300 seconds for tests. The timeout remains 30 seconds in all
 other cases.

-Change integration tests that themselves dispatch farm jobs
 into pipeline tests. Necessary because some farm nodes are
 not set up as submit hosts. Pipeline tests are still run
 directly on gsa4.

-Bump up the timeout for the MaxRuntimeIntegrationTest even more
 (was still occasionally failing on the farm!)
2013-03-12 16:53:30 -04:00
MauricioCarneiro 4403e3572a Merge pull request #94 from broadinstitute/gg_gatkdoc_docfixes_GSATDG-111 2013-03-12 13:02:35 -07:00
MauricioCarneiro 3a16ba04d4 Merge pull request #97 from broadinstitute/eb_refactor_sliding_window
Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug
2013-03-12 12:27:26 -07:00
droazen dcdd6e3e60 Merge pull request #96 from broadinstitute/md_assess_only_reviewed
Add mode to AssessNA12878 that will only consider reviewed sites
2013-03-12 10:29:07 -07:00
Geraldine Van der Auwera f972963918 Fixed issues raised by Appistry QA (mostly small fixes, corrections & clarifications to GATKDocs)
GATK-73 updated docs for bqsr args
GATK-9 differentiate CountRODs from CountRODsByRef
GATK-76 generate GATKDoc for CatVariants
GATK-4 made resource arg required
GATK-10 added -o, some docs to CountMales; some docs to CountLoci
GATK-11 fixed by MC's -o change; straightened out the docs.
GATK-77 fixed references to wiki
GATK-76 Added Ami's doc block
GATK-14 Added note that these annotations can only be used with VariantAnnotator
GATK-15 specified required=false for two arguments
GATK-23 Added documentation block
GATK-33 Added documentation
GATK-34 Added documentation
GATK-32 Corrected arg name and docstring in DiffObjects
GATK-32 Added note to DO doc about reference (required but unused)
GATK-29 Added doc block to CountIntervals
GATK-31 Added @Output PrintStream to enable -o
GATK-35 Touched up docs
GATK-36 Touched up docs, specified verbosity is optional
GATK-60 Corrected GCContent annot module location in gatkdocs
GATK-68 touched up docs and arg docstrings
GATK-16 Added note of caution about calling RODRequiringAnnotations as a group
GATK-61 Added run requirements (num samples, min genotype quality)
Tweaked template and generic doc block formatting (h2 to h3 titles)
GATK-62 Added a caveat to HR annot
Made experimental annotation hidden
GATK-75 Added setup info regarding BWA
GATK-22 Clarified some argument requirements
GATK-48 Clarified -G doc comments
GATK-67 Added arg requirement
GATK-58 Added annotation and usage docs
GSATDG-96 Corrected doc
Updated MD5 for DiffObjectsIntegrationTests (only change is link in table title)
2013-03-12 10:57:14 -04:00
Mark DePristo 01c2e6e9fa Merge pull request #99 from broadinstitute/ami-fix-compilationError-LScallingPipeline
Ami fix compilation error l scalling pipeline
2013-03-12 07:47:57 -07:00
Ami Levy-Moonshine e2d4d1da20 fix compilation error in ReduceReadsScript (missing import) 2013-03-12 10:31:57 -04:00
Ami Levy-Moonshine eaf9c30257 fix compilation error (change from org.broadinstitute.variant.variantcontext.VariantContextUtils.FilteredRecordMergeType.KEEP_IF_ANY_UNFILTERED to GATKVariantContextUtils.FilteredRecordMergeType.KEEP_IF_ANY_UNFILTERED) 2013-03-12 10:31:57 -04:00
Mark DePristo 72f9abfcab Merge pull request #98 from broadinstitute/rp_hc_glm_both
Use the indel heterozygosity prior when calling indels with the HC
2013-03-12 07:09:43 -07:00
Eric Banks 05e69b6294 Refactoring of SlidingWindow class in RR to reduce complexity and fix important bug.
* Allow RR to write its BAM to stdout by setting required=true for @Output.
  * Fixed bug in sliding window where a break in coverage after a long stretch without
     a variant region was causing a doubling of all the reads before the break.
  * Refactored SlidingWindow.updateHeaderCounts() into 3 separate tested methods.
  * Refactored polyploid consensus code out of SlidingWindow.compressVariantRegion().
2013-03-12 09:06:55 -04:00
Mark DePristo 08db3b5155 Add mode to AssessNA12878 that will only consider reviewed sites 2013-03-11 21:31:02 -04:00
Ryan Poplin c96fbcb995 Use the indel heterozygosity prior when calling indels with the HC 2013-03-11 14:12:43 -04:00
Mark DePristo 7dce4f8630 Merge pull request #95 from broadinstitute/dr_parallel_tests_with_job_arrays
run_parallel_tests: add job array support
2013-03-11 10:57:39 -07:00
David Roazen df9821614c run_parallel_tests: add job array support
-With one bsub command per job, dispatch time could vary from 2 minutes to 2 hours (!)

-By dispatching all jobs at once using a job array, this potential bottleneck
 is removed
2013-03-11 13:36:55 -04:00
Eric Banks 508b58376c Merge pull request #93 from broadinstitute/gda_ancient_dna
Two features useful for ancient DNA processing. Ancient DNA sequencing d...
2013-03-10 17:57:28 -07:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (on the order of 50 bp), so typical Illumina libraries sequenced in, say, 100bp HiSeq will have a large adaptor component being read after the insert.
If this adaptor is not removed, data will not be alignable. There are third party tools that remove adaptor and potentially merge read pairs, but they are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors if such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
droazen 21a6b4add2 Merge pull request #92 from broadinstitute/yf_allow_spaces_in_sampleID_in_contam_file
Changed loadContaminationFile file parser to delimit by tab only (not spaces)
2013-03-07 12:07:51 -08:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
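The tab-only parsing described in commit baad965a57 can be sketched as below. The two-column layout (sample ID, then contamination fraction), the handling of blank lines and all names are assumptions for illustration; this is not the loadContaminationFile code itself.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the loadContaminationFile implementation. */
final class ContaminationFileSketch {
    /** Parses lines of the ASSUMED form sampleID<TAB>fraction, keeping spaces inside sample IDs. */
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> bySample = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // ASSUMPTION: blank lines are skipped
            }
            String[] fields = line.split("\t"); // tab only, so "NA 12878" stays one field
            if (fields.length != 2) {
                throw new IllegalArgumentException("Expected 2 tab-separated fields: " + line);
            }
            bySample.put(fields[0], Double.parseDouble(fields[1]));
        }
        return bySample;
    }

    public static void main(String[] args) {
        System.out.println(parse(Arrays.asList("NA 12878\t0.02", "NA12891\t0.1")));
    }
}
```

A line separated by spaces instead of a tab yields a single field and is rejected, which mirrors the commit's note that the failing tests were changed accordingly.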
Mark DePristo ecb2599cde Merge pull request #91 from broadinstitute/dr_fix_failing_parallel_tests
Fix tests that were consistently or intermittently failing when run in parallel on the farm
2013-03-06 11:47:36 -08:00