12013 Commits (01c2e6e9fad5c8f8f0173746c9c64620de67f97f)

Author SHA1 Message Date
Mark DePristo 01c2e6e9fa Merge pull request #99 from broadinstitute/ami-fix-compilationError-LScallingPipeline
Ami fix compilation error l scalling pipeline
2013-03-12 07:47:57 -07:00
Ami Levy-Moonshine e2d4d1da20 fix compilation error in ReduceReadsScript (missing import) 2013-03-12 10:31:57 -04:00
Ami Levy-Moonshine eaf9c30257 fix compilation error (change from org.broadinstitute.variant.variantcontext.VariantContextUtils.FilteredRecordMergeType.KEEP_IF_ANY_UNFILTERED to GATKVariantContextUtils.FilteredRecordMergeType.KEEP_IF_ANY_UNFILTERED) 2013-03-12 10:31:57 -04:00
Mark DePristo 72f9abfcab Merge pull request #98 from broadinstitute/rp_hc_glm_both
Use the indel heterozygosity prior when calling indels with the HC
2013-03-12 07:09:43 -07:00
Ryan Poplin c96fbcb995 Use the indel heterozygosity prior when calling indels with the HC 2013-03-11 14:12:43 -04:00
Mark DePristo 7dce4f8630 Merge pull request #95 from broadinstitute/dr_parallel_tests_with_job_arrays
run_parallel_tests: add job array support
2013-03-11 10:57:39 -07:00
David Roazen df9821614c run_parallel_tests: add job array support
-With one bsub command per job, dispatch time could vary from 2 minutes to 2 hours (!)

-By dispatching all jobs at once using a job array, this potential bottleneck
 is removed
2013-03-11 13:36:55 -04:00
Eric Banks 508b58376c Merge pull request #93 from broadinstitute/gda_ancient_dna
Two features useful for ancient DNA processing. Ancient DNA sequencing d...
2013-03-10 17:57:28 -07:00
Guillermo del Angel 695723ba43 Two features useful for ancient DNA processing.
Ancient DNA sequencing data is in many ways different from modern data, and methods to analyze it need to be adapted accordingly.
Feature 1: Read adaptor trimming. Ancient DNA libraries typically have very short inserts (on the order of 50 bp), so typical Illumina libraries sequenced with, say, 100 bp HiSeq reads will have a large adaptor component read after the insert.
If this adaptor is not removed, the data will not be alignable. There are third-party tools that remove adaptors and potentially merge read pairs, but they are cumbersome to use and require precise knowledge of the library construction and adaptor sequence.
-- New walker ReadAdaptorTrimmer walks through paired end data, computes pair overlap and trims auto-detected adaptor sequence.
-- Unit tests added for trimming operation.
-- Utility walker (may be retired later) DetailedReadLengthDistribution computes insert size or read length distribution stratified by read group and mapping status and outputs a GATKReport with data.
-- Renamed MaxReadLengthFilter to ReadLengthFilter and added ability to specify minimum read length as a filter (may be useful if, as a consequence of adaptor trimming, we're left with a lot of very short reads which will map poorly and will just clutter output BAMs).
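
The pair-overlap idea behind Feature 1 can be illustrated with a minimal Python sketch. This is not the ReadAdaptorTrimmer code: the function names, the exact-match comparison, and the `min_overlap` parameter are assumptions made for illustration only. When the insert is shorter than the read length, the first bases of read 1 are the insert and the first bases of read 2 are its reverse complement, so anything past the longest such match is treated as adaptor.

```python
# Hypothetical sketch; not the GATK ReadAdaptorTrimmer implementation.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def detect_insert_length(read1, read2, min_overlap=4):
    """Infer the insert length from a read pair, or return None.

    Assumes exact matching: the longest prefix length L for which
    read1[:L] equals the reverse complement of read2[:L] is taken
    to be the insert length.
    """
    longest = min(len(read1), len(read2))
    for length in range(longest, min_overlap - 1, -1):
        if read1[:length] == reverse_complement(read2[:length]):
            return length
    return None


def trim_pair(read1, read2, min_overlap=4):
    """Trim both reads to the inferred insert length, if one is found."""
    length = detect_insert_length(read1, read2, min_overlap)
    if length is None:
        return read1, read2
    return read1[:length], read2[:length]
```

A real trimmer would also have to tolerate sequencing errors and trim base qualities alongside the bases; the sketch ignores both.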

Feature 2: Unbiased site QUAL estimation: many times ancestral allele status is not known and VCF fields like QUAL, QD, GQ, etc. are affected by the pop. gen. prior at a site. This might introduce subtle biases in studies where a species is aligned against the reference of another species, so an option for UG and HC not to apply such prior is introduced.
-- Added -noPrior argument to StandardCallerArgumentCollection.
-- Added option not to fill priors if such argument is set.
-- Added an integration test.
2013-03-09 18:18:13 -05:00
droazen 21a6b4add2 Merge pull request #92 from broadinstitute/yf_allow_spaces_in_sampleID_in_contam_file
Changed loadContaminationFile file parser to delimit by tab only (not spaces)
2013-03-07 12:07:51 -08:00
Yossi Farjoun baad965a57 - Changed loadContaminationFile file parser to delimit by tab only. This allows spaces in sampleIDs, which apparently are allowed.
- This was needed since samples with spaces in their names are regularly found in the picard pipeline.
- Modified the tests to account for this (removed spaces from the good tests, and changed the failing tests accordingly)
- Cleaned up the unit tests using a @DataProvider (I'm in love...).
- Moved AlleleBiasedDownsamplingUtilsUnitTest to public to match location of class it is testing (due to the way bamboo operates)
2013-03-07 13:04:24 -05:00
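
The tab-only parsing described in baad965a57 can be sketched as follows. This is a minimal Python illustration, not the loadContaminationFile code: the function name and the two-column layout (sample ID, then a contamination fraction) are assumptions. The point is that splitting on tabs alone keeps spaces inside a sample ID intact.

```python
# Hypothetical sketch; the two-column file layout is an assumption.
def load_contamination_lines(lines):
    """Parse 'sample_id<TAB>fraction' lines into a dict.

    Only the tab character separates fields, so sample IDs may contain spaces.
    Blank lines are skipped; any other malformed line raises ValueError.
    """
    fractions = {}
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated fields, got {len(fields)}"
            )
        sample_id, value = fields
        fractions[sample_id] = float(value)
    return fractions
```

A line such as `NA 12878<TAB>0.02` yields the key `NA 12878`, whereas the same line with a space instead of the tab is rejected.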
Mark DePristo ecb2599cde Merge pull request #91 from broadinstitute/dr_fix_failing_parallel_tests
Fix tests that were consistently or intermittently failing when run in parallel on the farm
2013-03-06 11:47:36 -08:00
David Roazen 3ab78543a7 Fix tests that were consistently or intermittently failing when run in parallel on the farm
-Make MaxRuntimeIntegrationTest more lenient by assuming that startup overhead
 might be as long as 120 seconds on a very slow node, rather than the original
 assumption of 20 seconds

-In TraverseActiveRegionsUnitTest, write temp bam file to the temp directory, not
 to the current working directory

-SimpleTimerUnitTest: This test was internally inconsistent. It asserted that
 a particular operation should take no more than 10 milliseconds, and then asserted
 again that this same operation should take no more than 100 microseconds (= 0.1 millisecond).
 On a slow node it could take slightly longer than 100 microseconds, however.
 Changed the test to assert that the operation should require no more than 10000 microseconds
 (= 10 milliseconds)

-change global default test timeout from 20 to 40 minutes (things just take longer
 on the farm!)

-build.xml: allow runtestonly target to work with scala test classes
2013-03-06 13:56:54 -05:00
Mark DePristo 7d833256e8 Merge pull request #90 from broadinstitute/eb_allow_read_transform_ordering
Added the functionality to impose a relative ordering on ReadTransformer...
2013-03-06 09:52:26 -08:00
Eric Banks 3759d9dd67 Added the functionality to impose a relative ordering on ReadTransformers in the GATK engine.
* ReadTransformers can say they must be first, must be last, or don't care.
  * By default, none of the existing ones care about ordering except BQSR (must be first).
    * This addresses a bug reported on the forum where BAQ is incorrectly applied before BQSR.
  * The engine now orders the read transformers up front before applying iterators.
  * The engine checks for enabled RTs that are not compatible (e.g. both must be first) and blows up (gracefully).
  * Added unit tests.
2013-03-06 12:38:59 -05:00
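
The ordering rule described in 3759d9dd67 can be sketched in a few lines of Python. This is not the GATK engine code: the enum, its member names, and `order_transformers` are hypothetical. The sketch assumes at most one transformer may claim the first slot and at most one the last, and that everything else keeps its original relative order.

```python
# Hypothetical sketch; not the GATK engine's ReadTransformer code.
from enum import Enum


class OrderingConstraint(Enum):
    MUST_BE_FIRST = 0
    DO_NOT_CARE = 1
    MUST_BE_LAST = 2


def order_transformers(transformers):
    """Order (name, OrderingConstraint) pairs and return the names.

    Raises ValueError when two enabled transformers both demand the
    same end of the chain, since they cannot both be satisfied.
    """
    for constraint in (OrderingConstraint.MUST_BE_FIRST, OrderingConstraint.MUST_BE_LAST):
        claimants = [name for name, c in transformers if c is constraint]
        if len(claimants) > 1:
            raise ValueError(f"incompatible transformers for {constraint.name}: {claimants}")
    # sorted() is stable, so DO_NOT_CARE transformers keep their input order.
    ordered = sorted(transformers, key=lambda item: item[1].value)
    return [name for name, _ in ordered]
```

With BQSR marked as must-be-first and BAQ as don't-care, BQSR is applied before BAQ regardless of the order in which they were registered.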
Mark DePristo 446cd61f7e Merge pull request #84 from broadinstitute/eb_allelic_primitives
Added new walker to split MNPs into their allelic primitives (SNPs).
2013-03-06 09:02:21 -08:00
Mark DePristo dadc079dbc Merge pull request #89 from broadinstitute/mc_fix_output_annotation_GSA-820
Turning @Output required to false
2013-03-06 09:01:20 -08:00
Mark DePristo 64a9ccded6 Merge pull request #77 from broadinstitute/mc_postqc_tsca
One line change to the post calling QC pipeline
2013-03-06 07:13:10 -08:00
Eric Banks 78721ee09b Added new walker to split MNPs into their allelic primitives (SNPs).
* Can be extended to complex alleles at some point.
  * Currently only works for bi-allelics (documented).
  * Added unit and integration tests.
2013-03-05 23:16:42 -05:00
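
For a bi-allelic MNP, the split into allelic primitives in 78721ee09b can be sketched as below. This is an illustration in Python, not the walker's code: `split_mnp` and its tuple representation are hypothetical, and skipping positions where the reference and alternate bases agree is an assumption.

```python
# Hypothetical sketch; not the GATK walker implementation.
def split_mnp(position, ref, alt):
    """Split a bi-allelic MNP into per-base SNPs.

    Returns a list of (position, ref_base, alt_base) tuples, one for
    each offset at which the two equal-length alleles differ.
    """
    if len(ref) != len(alt):
        raise ValueError("MNP alleles must have equal length")
    return [
        (position + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]
```

For example, an `ACG` to `TCA` MNP at position 100 yields SNPs at positions 100 and 102, because the middle base is unchanged.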
Mauricio Carneiro e2d41f0282 Turning @Output required to false
By default all output is assigned to stdout if a -o is not provided. Technically this makes @Output a non-required parameter, and the documentation is misleading because it's reading from the annotation.
GSA-820 #resolve
2013-03-05 17:26:16 -05:00
delangel f10723df3b Merge pull request #85 from broadinstitute/md_simple_kb_report
AssessNA12878 now emits a simplified assessment table by default
2013-03-05 10:39:39 -08:00
Eric Banks 2be57fbcfb Merged bug fix from Stable into Unstable 2013-03-05 13:28:46 -05:00
Eric Banks 5e89f01e10 Don't allow the use of compressed (.gz) references in the GATK. 2013-03-05 13:28:19 -05:00
Mark DePristo 92ac9e7f65 AssessNA12878 now emits a simplified assessment table by default
-- New report collapses the detailed states into the 5 key states (TP, FP, FN, TN, unknown), as in the following example:

Name     VariantType  AssessmentType           Count
variant  SNPS         TRUE_POSITIVE                6
variant  SNPS         FALSE_POSITIVE               9
variant  SNPS         FALSE_NEGATIVE            1213
variant  SNPS         TRUE_NEGATIVE              172
variant  SNPS         CALLED_NOT_IN_DB_AT_ALL      0
variant  INDELS       TRUE_POSITIVE               19
variant  INDELS       FALSE_POSITIVE              13
variant  INDELS       FALSE_NEGATIVE             262
variant  INDELS       TRUE_NEGATIVE               57
variant  INDELS       CALLED_NOT_IN_DB_AT_ALL     39

-- Use --detailed to see the previous full version
-- Expanded unit tests for Assessment
2013-03-05 11:51:38 -05:00
Eric Banks b5a07da04c Merge pull request #88 from broadinstitute/eb_fix_pairHMM_from_stable
Revert push from stable
2013-03-05 06:07:50 -08:00
Eric Banks bbbaf9ad20 Revert push from stable (I forgot that pushing from stable overwrites current unstable changes) 2013-03-05 09:06:02 -05:00
Eric Banks a037423225 Merged bug fix from Stable into Unstable 2013-03-05 09:03:48 -05:00
Eric Banks 7e1bfd6a7c Included an accidental change from unstable into the previous push 2013-03-05 09:03:31 -05:00
Mauricio Carneiro 3e118a5b41 Adding interval list to Postcalling QC script
It used to accept only interval strings, but I needed to pass it interval files for custom targeted projects.
2013-03-05 08:17:19 -05:00
David Roazen 74a5cd5956 run_parallel_tests: archive working directories for completed runs
-deleting is too time-consuming and adds precious minutes to each run

-old working directories can be deleted later by a cron job

-delete working directory if global timeout has elapsed, however,
 since in that case we've already spent an excessive amount of time
 on the run
2013-03-05 05:49:25 -05:00
David Roazen 754226907e run_parallel_tests.sh: improved test class search and post-test cleanup
-search for compiled classes rather than source files to avoid picking
 up archived tests

-add function (currently disabled) to remove test working directory when
 run completes

-better log messages
2013-03-05 04:22:51 -05:00
Eric Banks bd4e4f4ee3 Merged bug fix from Stable into Unstable 2013-03-04 23:24:44 -05:00
Eric Banks b715218bfe Fix for mismatching indel quals error: need to adjust for softclips just like we do for bases and normal quals. 2013-03-04 23:23:18 -05:00
Mark DePristo 1b7164ccdb Merge pull request #86 from broadinstitute/mc_fix_exception_messages
Just a quick cleanup of the exception messages; no need to wait for bamboo.
2013-03-04 13:55:00 -08:00
Mauricio Carneiro d0c8105387 Cleaning up hilarious exception messages
Too many users (with RNASeq reads) are hitting these exceptions that were never supposed to happen. Let's give them (and us) a better and clearer error message.
2013-03-04 16:52:22 -05:00
Ryan Poplin ce7554e9d6 Merged bug fix from Stable into Unstable 2013-03-04 12:36:04 -05:00
Ryan Poplin 0697594778 Active regions that don't contain any usable reads should just be skipped over instead of throwing an IllegalStateException. 2013-03-04 12:35:40 -05:00
Ryan Poplin b3ecbb011d Merge pull request #81 from broadinstitute/md_hc_bam_writing
Expanded functionality of writing BAMs from the haplotype caller
2013-03-04 06:39:19 -08:00
Mark DePristo 42d3919ca4 Expanded functionality for writing BAMs from HaplotypeCaller
-- The new code includes a new mode to write out a BAM containing reads realigned to the called haplotypes from the HC, which can be easily visualized in IGV.
-- Previous functionality maintained, with bug fixes
-- Haplotype BAM writing code now lives in utils
-- Created a base class that includes most of the functionality of writing reads realigned to haplotypes onto haplotypes.
-- Created two subclasses, one that writes all haplotypes (previous functionality) and a CalledHaplotypeBAMWriter that will only write reads aligned to the actually called haplotypes
-- Extended PerReadAlleleLikelihoodMap.getMostLikelyAllele to optionally restrict set of alleles to consider best
-- Massive increase in unit tests in AlignmentUtils, along with several new powerful functions for manipulating cigars
-- Fixed a bug in SWPairwiseAlignment that produced cigar elements with 0 size; these are now cleaned up with consolidateCigar in AlignmentUtils
-- HaplotypeCaller now tracks the called haplotypes in the GenotypingEngine, and returns this information to the HC for use in visualization.
-- Added extensive docs to HaplotypeCaller on how to use this capability
-- BUGFIX -- don't modify the read bases in GATKSAMRecord in LikelihoodCalculationEngine in the HC
-- Cleaned up SWPairwiseAlignment.  Refactored out the big main and supplementary static methods.  Added a unit test with a bug TODO to fix what seems to be an edge case bug in SW
-- Integration test to make sure we can actually write a BAM for each mode.  This test only ensures that the code runs and doesn't throw an exception.  It doesn't actually enforce any MD5s
-- HaplotypeBAMWriter also left-aligns indels in the reads, as SW can return a random placement of a read against the haplotype.  Calls leftAlign to make the alignments clearer, with a unit test of a real read to cover this case
-- Writes out haplotypes in both the all-haplotype and called-haplotype modes
-- Haplotype writers now get the active region call, regardless of whether an actual call was made.  Only emitting called haplotypes is moved down to CalledHaplotypeBAMWriter
2013-03-03 12:07:29 -05:00
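
The zero-size cigar element cleanup mentioned in 42d3919ca4 can be pictured with a small Python sketch. This is not the AlignmentUtils.consolidateCigar code: the function name, the `(length, operator)` tuple representation, and the merging of adjacent elements that share an operator are assumptions for illustration. The commit itself only states that zero-size elements are fixed.

```python
# Hypothetical sketch; not AlignmentUtils.consolidateCigar itself.
def consolidate_cigar(elements):
    """Clean up a cigar given as a list of (length, operator) tuples.

    Zero-length elements are dropped, and neighbouring elements that end
    up adjacent with the same operator are merged into one (an assumption).
    """
    consolidated = []
    for length, op in elements:
        if length == 0:
            continue
        if consolidated and consolidated[-1][1] == op:
            previous_length, _ = consolidated[-1]
            consolidated[-1] = (previous_length + length, op)
        else:
            consolidated.append((length, op))
    return consolidated
```

Under these assumptions, `10M0D5M2I` becomes `15M2I`.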
Mark DePristo ec3bf9f362 Adding 1mb of 2x250 bp PCR-free reads to private testdata 2013-03-01 20:44:17 -05:00
Mark DePristo b1ea2f6125 Merge pull request #83 from broadinstitute/dr_gatk_jar_with_private_GSA-803
Ant target to package a GATK jar with private included
2013-03-01 13:15:57 -08:00
David Roazen 2a1a20fc9d Parallel tests: switch working directory from /humgen/gsa-scr1 to /humgen/gsa-hpprojects
Hoping that the higher class of storage will get us down from the current
~40 minutes for a parallel run of the integration tests to the goal of
~20 minutes.
2013-03-01 16:11:29 -05:00
David Roazen a0be74c2ef Ant target to package a GATK jar with private included
Needed before we can start emitting full unstable jars from
Bamboo for our internal use.
2013-03-01 15:33:59 -05:00
David Roazen 3f7d888ea5 run_parallel_tests.sh: further improvements
-accept global timeout as a command-line argument

-kill outstanding jobs when timeout reached

-print job output files to stdout so that they get recorded in bamboo's logs

-periodically print number of jobs outstanding during run

-documentation / comments
2013-03-01 14:59:10 -05:00
Mark DePristo 0cff9b8027 Merge pull request #82 from broadinstitute/dr_split_long_integration_test_classes
Split long-running integration test classes into multiple classes
2013-03-01 11:07:23 -08:00
David Roazen c5c99c8339 Split long-running integration test classes into multiple classes
This is to facilitate the current experiment with class-level test
suite parallelism. It's our hope that with these changes, we can get
the runtime of the integration test suite down to 20 minutes or so.

-UnifiedGenotyper tests: these divided nicely into logical categories
 that also happened to distribute the runtime fairly evenly

-UnifiedGenotyperPloidy: these had to be divided arbitrarily into two
 classes in order to halve the runtime

-HaplotypeCaller: turns out that the tests for complex and symbolic
 variants make up half the runtime here, so merely moving these into
 a separate class was sufficient

-BiasedDownsampling: most of these tests use excessively large intervals
 that likely can't be reduced without defeating the goals of the tests. I'm
 disabling these tests for now until they can either be redesigned to use smaller
 intervals around the variants of interest, or refactored into unit tests
 (creating a JIRA for Yossi for this task)
2013-03-01 13:55:23 -05:00
depristo 6204e6ccc9 Merge pull request #76 from broadinstitute/md_kb_bugfix_GSA-795
Bug fixes and optimizations for NA12878 KB
2013-03-01 10:52:16 -08:00
depristo c05d1352b1 Merge pull request #80 from broadinstitute/eb_cleanup_genomelocsortedset_GSA-775
Fixed the add functionality of GenomeLocSortedSet.
2013-03-01 08:35:20 -08:00
Eric Banks ebd5404124 Fixed the add functionality of GenomeLocSortedSet.
* Fixed GenomeLocSortedSet.add() to ensure that overlapping intervals are detected and an exception is thrown.
* Fixed GenomeLocSortedSet.addRegion() by merging it with the add() method; it now produces sorted inputs in all cases.
* Cleaned up duplicated code throughout the engine to create a list of intervals over all contigs.
* Added more unit tests for add functionality of GLSS.
* Resolves GSA-775.
2013-02-28 23:31:00 -05:00
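
The add() behaviour described in ebd5404124 (keep the set sorted, reject overlaps) can be sketched as follows. This is a Python illustration, not the GenomeLocSortedSet code: the class name, the `(contig_index, start, stop)` tuples with inclusive coordinates, and the use of ValueError are assumptions.

```python
# Hypothetical sketch; not the GATK GenomeLocSortedSet implementation.
import bisect


class SortedIntervalSet:
    """Keeps disjoint (contig_index, start, stop) intervals in sorted order."""

    def __init__(self):
        self._intervals = []

    def add(self, contig_index, start, stop):
        """Insert an interval in sorted position; raise ValueError on overlap."""
        if start > stop:
            raise ValueError("start must not exceed stop")
        candidate = (contig_index, start, stop)
        index = bisect.bisect_left(self._intervals, candidate)
        # Stored intervals are sorted and disjoint, so only the immediate
        # neighbours of the insertion point can overlap the candidate.
        for other in self._intervals[max(index - 1, 0):index + 1]:
            if other[0] == contig_index and other[1] <= stop and start <= other[2]:
                raise ValueError(f"{candidate} overlaps existing interval {other}")
        self._intervals.insert(index, candidate)

    def intervals(self):
        """Return a copy of the stored intervals in sorted order."""
        return list(self._intervals)
```

Intervals can be added in any order and are read back sorted by contig and start; adding one that overlaps an existing interval fails without modifying the set.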
David Roazen 6a77eee5f4 parallel tests script: pass in bamboo build number to make globally unique working directories for each run 2013-02-28 18:06:18 -05:00