Commit Graph

12394 Commits (a7cb599945889d9011e5d71d9ffcee94c4bba5ee)

Author SHA1 Message Date
David Roazen a7cb599945 Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage
-Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200,
 since low dcov values can result in problematic downsampling artifacts for locus-based
 traversals.

-Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals
 controls the number of reads per alignment start position, and even a dcov value of 1 might
 be safe/desirable in some circumstances.

-Also reorganize the global downsampling defaults so that they are specified as annotations
 to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the
 DownsamplingMethod class.

-The default downsampling settings have not been changed: they are still -dcov 1000
 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.
2013-05-29 12:07:12 -04:00
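The guard described above can be sketched as follows. This is a minimal illustration, not the actual GATK code: the class name, the `MINIMUM_DCOV_FOR_LOCUS_WALKERS` constant, and the use of `IllegalArgumentException` in place of GATK's `UserException` are all assumptions.

```java
// Hypothetical sketch of a minimum -dcov guard for locus-based traversals.
public class DownsamplingGuard {
    // Assumed constant; the commit enforces a floor of 200 for -dcov.
    static final int MINIMUM_DCOV_FOR_LOCUS_WALKERS = 200;

    // Reject too-low coverage targets for Locus/ActiveRegion walkers only.
    static void validateDcov(boolean isLocusBasedWalker, int dcov) {
        if (isLocusBasedWalker && dcov < MINIMUM_DCOV_FOR_LOCUS_WALKERS) {
            throw new IllegalArgumentException("-dcov " + dcov
                + " is below the minimum of " + MINIMUM_DCOV_FOR_LOCUS_WALKERS
                + " for locus-based traversals");
        }
    }

    public static void main(String[] args) {
        validateDcov(true, 1000);   // the default for Locus/ActiveRegion walkers: fine
        validateDcov(false, 1);     // read-based traversals have no minimum
        boolean rejected = false;
        try {
            validateDcov(true, 50); // too low for a locus-based traversal
        } catch (IllegalArgumentException e) {
            rejected = true;
        }
        if (!rejected) throw new AssertionError("expected -dcov 50 to be rejected");
        System.out.println("ok");
    }
}
```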
Ryan Poplin 3516daff52 Merge pull request #240 from broadinstitute/rp_gga_implies_both_models
Bugfix for GGA mode in UG silently ignoring indels
2013-05-24 11:13:57 -07:00
Ryan Poplin 85905dba92 Bugfix for GGA mode in UG silently ignoring indels
-- Started by Mark. Finished up by Ryan.
-- GGA mode still respected the glm argument for the SNP and INDEL models, so if -glm INDEL wasn't provided you would silently fail to genotype indels at all; the sites would still be emitted, so you'd see records in the VCF but all alleles would be no-calls.
-- https://www.pivotaltracker.com/story/show/48924339 for more information
-- [resolves #48924339]
2013-05-24 13:47:26 -04:00
Mark DePristo b7fab31e25 Merge pull request #239 from broadinstitute/mc_quick_dt_output_fix
Make the missing targets output never use stdout
2013-05-23 07:00:57 -07:00
Mauricio Carneiro da21924b44 Make the missing targets output never use stdout
Problem
--------
Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided

Solution
--------
Make it NOT default to stdout

[Delivers #50386741]
2013-05-22 14:22:54 -04:00
MauricioCarneiro 32c33901d7 Merge pull request #237 from broadinstitute/md_banded_pairhmm
Archived banded logless PairHMM
2013-05-22 10:25:18 -07:00
Mark DePristo d167743852 Archived banded logless PairHMM
BandedHMM
---------
-- An implementation of a linear-runtime, linear-memory banded logless PairHMM.  Though thought to be about 50% faster than the current PairHMM, this implementation will be superseded by the GraphHMM when it becomes available.  The implementation is being archived for future reference

Useful infrastructure changes
-----------------------------
-- Split PairHMM into an N2MemoryPairHMM that allows smarter implementations to avoid allocating the double[][] matrices if they don't need them, an allocation that was previously always occurring in the base class PairHMM
-- Added functionality (controlled by a private static boolean) to write out likelihood call information to a file from inside LikelihoodCalculationEngine for use in unit or performance testing.  Added an example of 100kb of data to private/testdata.  It can be easily read in with the PairHMMTestData class.
-- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.
2013-05-22 12:24:00 -04:00
delangel 925232b0fc Merge pull request #236 from broadinstitute/md_simple_hc_performance_improvements
3 simple performance improvements for HaplotypeCaller
2013-05-22 07:58:28 -07:00
Mark DePristo adec22e748 Merge pull request #238 from broadinstitute/eb_optimize_filter_counting
Optimized counting of filtered records by filter.
2013-05-22 04:57:26 -07:00
Eric Banks 881b2b50ab Optimized counting of filtered records by filter.
Don't map class to counts in the ReadMetrics (necessitating 2 HashMap lookups for every increment).
Instead, wrap the ReadFilters with a counting version and then set those counts only when updating global metrics.
2013-05-21 21:54:49 -04:00
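The decorator idea described above can be sketched like this. The names (`CountingFilter`, the `Predicate` interface standing in for a GATK ReadFilter) are illustrative assumptions, not the actual classes touched by the commit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: wrap each filter in a counting decorator so the hot
// path does a single long increment instead of two HashMap lookups, and fold
// the counts into the global metrics map only once at the end.
public class CountingFilterDemo {
    static final class CountingFilter<T> implements Predicate<T> {
        final String name;
        final Predicate<T> inner;
        long filteredCount = 0;  // cheap increment on the hot path

        CountingFilter(String name, Predicate<T> inner) {
            this.name = name;
            this.inner = inner;
        }

        @Override
        public boolean test(T read) {
            boolean rejected = inner.test(read);
            if (rejected) filteredCount++;
            return rejected;
        }
    }

    public static void main(String[] args) {
        CountingFilter<Integer> lowQual = new CountingFilter<>("LowQual", q -> q < 20);
        int[] quals = {5, 30, 10, 40, 15};
        for (int q : quals) lowQual.test(q);

        // Counts reach the global metrics map in one batch, after filtering.
        Map<String, Long> globalMetrics = new HashMap<>();
        globalMetrics.put(lowQual.name, lowQual.filteredCount);
        if (globalMetrics.get("LowQual") != 3L)
            throw new AssertionError("expected 3 filtered reads");
        System.out.println("ok");
    }
}
```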
Mark DePristo 010034a650 Optimization/bugfix for PerReadAlleleLikelihoodMap
-- The add() call had a misplaced map.put() call, so that we were always putting the result of get() back into the map, when what we really intended was to put the value back in only if the original get() returned null and we had therefore initialized the result
2013-05-21 16:18:57 -04:00
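The get-or-initialize pattern the fix restores looks like this sketch (class and field names are illustrative, not the actual PerReadAlleleLikelihoodMap code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: put() only when get() returned null, rather than re-inserting the
// existing value back into the map on every call.
public class GetOrInitDemo {
    static final Map<String, List<Double>> likelihoods = new HashMap<>();

    static void add(String read, double likelihood) {
        List<Double> perRead = likelihoods.get(read);
        if (perRead == null) {               // initialize only on first sight...
            perRead = new ArrayList<>();
            likelihoods.put(read, perRead);  // ...and put only then
        }
        perRead.add(likelihood);
    }

    public static void main(String[] args) {
        add("read1", -1.5);
        add("read1", -0.3);
        if (likelihoods.get("read1").size() != 2)
            throw new AssertionError("expected 2 likelihoods for read1");
        System.out.println("ok");
    }
}
```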
Mark DePristo a1093ad230 Optimization for ActiveRegion.removeAll
-- The previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion.  This can be very slow when there are lots of reads, as ArrayList.removeAll() ultimately calls indexOf(), which searches through the list calling equals() on each element.  The new version takes a set, and uses an iterator on the list to remove() any read that is in the set.  Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one.
-- Update HaplotypeCaller filterReadsInRegion to use a Set not a List.
-- Expanded the unit tests a bit for ActiveRegion.removeAll
2013-05-21 16:18:57 -04:00
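The iterator-based removal described above can be sketched as follows. With a HashSet the membership test is O(1), so one pass over the list replaces the quadratic ArrayList.removeAll() behavior. (The generic helper here is an illustration, not the ActiveRegion code itself.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Sketch: single pass with an iterator, O(1) set lookups per element.
public class RemoveAllDemo {
    static <T> void removeAll(List<T> reads, Set<T> toRemove) {
        for (Iterator<T> it = reads.iterator(); it.hasNext(); ) {
            if (toRemove.contains(it.next()))
                it.remove();  // safe removal during iteration
        }
    }

    public static void main(String[] args) {
        List<String> reads = new ArrayList<>(Arrays.asList("r1", "r2", "r3", "r4"));
        removeAll(reads, new HashSet<>(Arrays.asList("r2", "r4")));
        if (!reads.equals(Arrays.asList("r1", "r3")))
            throw new AssertionError("unexpected remaining reads: " + reads);
        System.out.println("ok");
    }
}
```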
Mark DePristo d9cdc5d006 Optimization: track alleles in the PerReadAlleleLikelihoodMap with a HashSet
-- The previous version of PerReadAlleleLikelihoodMap only stored the alleles in an ArrayList, and used ArrayList.contains() to determine if an allele was already present in the map.  This is very slow with many alleles.  Now keeps both the ArrayList (for get() performance) and a Set of alleles for contains().
2013-05-21 16:18:56 -04:00
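Keeping both structures in sync can be sketched like this (names are illustrative, and String stands in for the Allele type):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: a parallel ArrayList (for indexed get()) and HashSet (for O(1)
// contains()) of alleles, updated together so they never diverge.
public class AlleleIndexDemo {
    final List<String> alleleList = new ArrayList<>();
    final Set<String> alleleSet = new HashSet<>();

    void addIfAbsent(String allele) {
        if (alleleSet.add(allele))      // O(1) membership test replaces List.contains()
            alleleList.add(allele);     // list kept for cheap positional access
    }

    public static void main(String[] args) {
        AlleleIndexDemo d = new AlleleIndexDemo();
        d.addIfAbsent("A");
        d.addIfAbsent("C");
        d.addIfAbsent("A");  // duplicate, ignored
        if (d.alleleList.size() != 2)
            throw new AssertionError("expected 2 distinct alleles");
        System.out.println("ok");
    }
}
```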
Mark DePristo 3cfe2dcc64 Merge pull request #235 from broadinstitute/eb_several_traversal_printout_fixes
Eb several traversal printout fixes
2013-05-21 13:13:11 -07:00
Eric Banks 20c7a89030 Fixes to get accurate read counts for Read traversals
1. Don't clone the dataSource's metrics object (because then the engine won't continue to get updated counts)
 2. Use the dataSource's metrics object in the CountingFilteringIterator and not the first shard's object!
 3. Synchronize ReadMetrics.incrementMetrics to prevent race conditions.

Also:
 * Make sure users realize that the read counts are approximate in the print outs.
 * Removed a lot of unused cruft from the metrics object while I was in there.
 * Added test to make sure that the ReadMetrics read count does not overflow ints.
 * Added unit tests for traversal metrics (reads, loci, and active region traversals); these test counts of reads and records.
2013-05-21 15:24:07 -04:00
Eric Banks 58f4b81222 Count Reads should use a Long instead of an Integer for counts to prevent overflows. Added unit test. 2013-05-21 15:23:51 -04:00
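A tiny illustration of why a counter that can exceed ~2.1 billion reads must be a long rather than an int:

```java
// Sketch: an int counter wraps to a negative value past Integer.MAX_VALUE,
// while a long keeps counting correctly.
public class OverflowDemo {
    public static void main(String[] args) {
        int intCount = Integer.MAX_VALUE;
        long longCount = Integer.MAX_VALUE;
        intCount++;   // wraps to Integer.MIN_VALUE
        longCount++;  // 2147483648, still correct
        if (intCount >= 0) throw new AssertionError("int should have wrapped negative");
        if (longCount != 2147483648L) throw new AssertionError("long miscounted");
        System.out.println("ok");
    }
}
```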
Eric Banks 1f3624d204 Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing 2013-05-21 11:35:05 -04:00
Mark DePristo 7252238271 Merge pull request #234 from broadinstitute/md_update_gatkperformance_over_time
Use GATK 2.3 and 2.5 in GATKPerformanceOverTime
2013-05-21 06:14:07 -07:00
Mark DePristo e3d6443d3e Use GATK 2.3 and 2.5 in GATKPerformanceOverTime 2013-05-21 09:13:16 -04:00
Valentin Ruano Rubio 71bbb25c9e Merge pull request #231 from broadinstitute/md_combinevariants_bugfix
CombineVariants no longer adds PASS to unfiltered records
2013-05-20 14:28:20 -07:00
Mark DePristo 62fc88f92e CombineVariants no longer adds PASS to unfiltered records
-- [Delivers #49876703]
-- Add integration test and test file
-- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!
2013-05-20 16:53:51 -04:00
Mark DePristo 2d20e38149 Merge pull request #232 from broadinstitute/rp_hc_gga_mode_active_region_extension
Active region boundary parameters need to be bigger when running in GGA ...
2013-05-20 12:22:08 -07:00
Ryan Poplin 507853c583 Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result.
-- The trouble stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.
2013-05-20 14:29:04 -04:00
Mark DePristo b239cb76d4 Merge pull request #230 from broadinstitute/mc_gsalib_toR3
Updating gsalib for R-3.0 compatibility
2013-05-19 04:32:58 -07:00
Mauricio Carneiro c8b1c47764 Updating gsalib for R-3.0 compatibility
* add package namespace that exports all the visible objects
   * list gsalib dependencies in the package requirements

[fixes #49987933]
2013-05-18 12:43:38 -04:00
Eric Banks 665e45f0fc Merge pull request #229 from broadinstitute/eb_liftover_variants_output_required
@Output needs to be required for LiftoverVariants to prevent a NPE and d...
2013-05-17 07:49:55 -07:00
Eric Banks 8a442d3c9f @Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating. 2013-05-17 10:04:10 -04:00
MauricioCarneiro 92072e1815 Merge pull request #224 from broadinstitute/yf_emit_insert_length_with_pileup
added a @hidden option to PileupWalker that causes it to emit insert sizes
2013-05-16 11:30:19 -07:00
Yossi Farjoun 3e2a0b15ed - Added a @Hidden option ( -outputInsertLength ) to PileupWalker that causes it to emit insert sizes together with the pileup (to assist Mark Daly's investigation of the contamination dependence on insert length)
- Converted my old GATKBAMIndexText (within PileupWalkerIntegrationTest) to use a dataProvider
- Added two integration tests to test -outputInsertLength option
2013-05-16 12:47:16 -04:00
Yossi Farjoun 9234a0efcd Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs
Bug fixes and missing interval functionality for Diagnose Targets

While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred a unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule).

It is with these caveats that I accept the pull request.
2013-05-16 09:25:54 -07:00
droazen a733a5e9b7 Merge pull request #228 from broadinstitute/mc_fine_grained_maxruntime
Subshard timeouts in the GATK
2013-05-15 05:25:43 -07:00
Mark DePristo 371f3752c1 Subshard timeouts in the GATK
-- The previous implementation of the maxRuntime would require us to wait until all of the work was completed within a shard, which can be a substantial amount of work in the case of a locus walker with 16kb shards.
-- This implementation ensures that we exit from the traversal very soon after the max runtime is exceeded, without completing all of our work within the shard.  This is done by updating all of the traversal engines to return false for hasNext() in the nano scheduled input provider.  So as soon as the timeout is exceeded, we stop generating additional data to process, and we only have to wait until the currently executing data processing unit (locus, read, active region) completes.
-- In order to implement this timeout efficiently at this fine scale, the progress meter now lives in the genome analysis engine, and the exceedsTimeout() call in the engine looks at a periodically updated runtime variable in the meter.  This variable contains the elapsed runtime of the engine, but is updated by the progress meter daemon thread so that the engine doesn't call System.nanoTime() in each cycle of the engine, which would be very expensive.  Instead we basically wait for the daemon to update this variable, so our precision in timing out is limited by the update frequency of the daemon, which is on the order of every few hundred milliseconds, totally fine for a timeout.
-- Added integration tests to ensure that subshard timeouts are working properly
2013-05-15 07:00:39 -04:00
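The cheap-timeout idea above can be sketched as a daemon thread publishing elapsed runtime into a volatile field, so the engine's hot loop reads a field instead of calling System.nanoTime() on every cycle. All names here are illustrative, not the actual progress meter classes.

```java
// Sketch: a daemon "progress meter" thread periodically updates a volatile
// elapsed-time field; the work loop checks that field as its timeout.
public class TimeoutMeterDemo {
    static volatile long elapsedNanos = 0;

    public static void main(String[] args) throws InterruptedException {
        final long start = System.nanoTime();
        Thread meter = new Thread(() -> {
            while (true) {
                elapsedNanos = System.nanoTime() - start;  // periodic update
                try {
                    Thread.sleep(50);  // update granularity bounds timeout precision
                } catch (InterruptedException e) {
                    return;
                }
            }
        });
        meter.setDaemon(true);  // won't keep the JVM alive
        meter.start();

        long maxRuntimeNanos = 200_000_000L;  // 200 ms budget for this demo
        int unitsProcessed = 0;
        while (elapsedNanos <= maxRuntimeNanos) {  // cheap volatile read per cycle
            Thread.sleep(10);                       // stand-in for one unit of work
            unitsProcessed++;
        }
        if (unitsProcessed <= 0)
            throw new AssertionError("expected some work before timeout");
        System.out.println("ok");
    }
}
```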
chartl 6137e5f89b Merge pull request #227 from broadinstitute/chartl_genotypeconcordance_integration_md5_fix
Update GCIT md5s to account for trivial changes to description strings
2013-05-14 17:26:24 -07:00
Mark DePristo 43e78286a0 Merge pull request #226 from broadinstitute/hc_ceu_trio_calling
Trivial update to ceutrio.ped file to make it really the CEU trio samples
2013-05-14 17:02:52 -07:00
Chris Hartl 6da0aed30f Update GCIT md5s to account for trivial changes to description strings 2013-05-14 19:45:30 -04:00
Yossi Farjoun 409a202492 Merge pull request #214 from broadinstitute/chartl_genotype_concordance_diploid_and_OGC
Add overall genotype concordance to the genotype concordance tool. In ad...
2013-05-14 14:19:54 -07:00
Mark DePristo 7d78a77f17 Trivial update to ceutrio.ped file to make it really the CEU trio sample names 2013-05-14 17:08:13 -04:00
Mark DePristo e9427d817f Merge pull request #225 from broadinstitute/eb_change_consensus_logic_in_kb
Fixed the logic of FP and DISCORDANT consensus.
2013-05-14 11:49:43 -07:00
Eric Banks b951ab1337 Fixed the logic of FP and DISCORDANT consensus.
* added a good unit test for this case
  * also, added the Reviewed status to the INFO field of the MVC record
2013-05-14 13:13:30 -04:00
Mark DePristo 349521e4c2 Merge pull request #220 from broadinstitute/md_nano_hc
NanoScheduling for the HaplotypeCaller!
2013-05-13 09:07:13 -07:00
Mauricio Carneiro adcbf947bf Update MD5s and the Diagnose Target scala script 2013-05-13 12:06:17 -04:00
Mauricio Carneiro 9eceae793a Tool to manipulate intervals outside the GATK
Performs basic set operations on intervals like union, intersect and difference between two or more intervals. Useful for techdev and QC purposes.
2013-05-13 11:56:24 -04:00
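One of the set operations mentioned, intersection, can be sketched on half-open [start, end) intervals as follows. This is a generic illustration of the technique, not the GATK tool's code, and the representation (sorted, disjoint int[] pairs) is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: intersect two sorted lists of disjoint half-open intervals
// [start, end) with a two-pointer sweep.
public class IntervalOpsDemo {
    static List<int[]> intersect(List<int[]> a, List<int[]> b) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int lo = Math.max(a.get(i)[0], b.get(j)[0]);
            int hi = Math.min(a.get(i)[1], b.get(j)[1]);
            if (lo < hi) out.add(new int[]{lo, hi});  // overlapping piece
            // advance whichever interval ends first
            if (a.get(i)[1] < b.get(j)[1]) i++; else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> a = Arrays.asList(new int[]{1, 5}, new int[]{8, 12});
        List<int[]> b = Arrays.asList(new int[]{3, 10});
        List<int[]> r = intersect(a, b);
        if (r.size() != 2 || r.get(0)[0] != 3 || r.get(0)[1] != 5
                || r.get(1)[0] != 8 || r.get(1)[1] != 10)
            throw new AssertionError("wrong intersection");
        System.out.println("ok");
    }
}
```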
Mauricio Carneiro 3dbb86b052 Outputting missing intervals in DiagnoseTargets
Problem
------
Diagnose Targets identifies holes in the coverage of a targeted experiment, but it only reports them and doesn't list the actual missing loci

Solution
------
This commit implements an optional intervals file output listing the exact loci that did not pass filters

Itemized changes
--------------
   * Cache callable statuses (to avoid recalculation)
   * Add functionality to output missing intervals
   * Implement new tool to qualify the missing intervals (QualifyMissingIntervals) by gc content, size, type of missing coverage and origin (coding sequence, intron, ...)
2013-05-13 11:51:56 -04:00
Mauricio Carneiro 1466396a31 Diagnose target is outputting intervals out of order
Problem
-------
When the interval had no reads, it was being sent to the VCF before the intervals that just got processed, therefore violating the sort order of the VCF.

Solution
--------
Use a linked hash map, and make the insertion and removal all happen in one place regardless of having reads or not. Since the input is ordered, the output has to be ordered as well.

Itemized changes
--------------
   * Clean up code duplication in LocusStratification and SampleStratification
   * Add number of uncovered sites and number of low covered sites to the VCF output.
   * Add new VCF format fields
   * Fix outputting multiple statuses when the threshold is 0 (the ratio must be GREATER THAN, not equal to, the threshold to get reported)

[fixes #48780333]
[fixes #48787311]
2013-05-13 11:50:22 -04:00
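The ordering fix above rests on LinkedHashMap preserving insertion order: if intervals are inserted in genomic order and flushed from a single place, the emitted records stay sorted. A minimal sketch with illustrative names:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: LinkedHashMap keeps insertion order, so intervals come out in the
// same genomic order they went in, whether or not they had reads.
public class OrderedFlushDemo {
    public static void main(String[] args) {
        Map<String, String> pending = new LinkedHashMap<>();
        pending.put("chr1:100-200", "no reads");   // inserted in input order
        pending.put("chr1:300-400", "processed");
        pending.put("chr1:500-600", "processed");

        // Single flush point: emission order == insertion order.
        List<String> emitted = new ArrayList<>(pending.keySet());
        if (!emitted.equals(Arrays.asList(
                "chr1:100-200", "chr1:300-400", "chr1:500-600")))
            throw new AssertionError("output out of order: " + emitted);
        System.out.println("ok");
    }
}
```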
Mark DePristo 39e4396de0 New ActiveRegionShardBalancer allows efficient NanoScheduling
-- Previously we used the LocusShardBalancer for the haplotype caller, which meant that TraverseActiveRegions saw its shards grouped in chunks of 16kb bits on the genome.  These locus shards are useful when you want to use the HierarchicalMicroScheduler, as they provide fine-grained access to the underlying BAM, but they have two major drawbacks: (1) we have to fairly frequently reset our state in TAR to handle moving between shard boundaries and (2) with the nano-scheduled TAR we end up blocking at the end of each shard while our threads all finish processing.
-- This commit changes the system over to using an ActiveRegionShardBalancer, which combines all of the shard data for a single contig into a single combined shard.  This ensures that TAR, and by extension the HaplotypeCaller, gets all of the data on a single contig together so that the NanoScheduler runs efficiently instead of blocking over and over at shard boundaries.  This simple change allows us to scale efficiently to around 8 threads in the nano scheduler:
  -- See https://www.dropbox.com/s/k7f280pd2zt0lyh/hc_nano_linear_scale.pdf
  -- See https://www.dropbox.com/s/fflpnan802m2906/hc_nano_log_scale.pdf
-- Misc. changes throughout the codebase so we use the ActiveRegionShardBalancer where appropriate.
-- Added unit tests for ActiveRegionShardBalancer to confirm it does the merging as expected.
-- Fix bad toString in FilePointer
2013-05-13 11:09:02 -04:00
Mark DePristo b4f482a421 NanoScheduled ActiveRegionTraversal and HaplotypeCaller
-- Made CountReadsInActiveRegions Nano schedulable, confirming identical results for linear and nano results
-- Made Haplotype NanoScheduled, requiring misc. changes in the map/reduce type so that the map() function returns a List<VariantContext> and reduce actually prints out the results to disk
-- Tests for NanoScheduling
  -- CountReadsInActiveRegionsIntegrationTest now does NCT 1, 2, 4 with CountReadsInActiveRegions
  -- HaplotypeCallerParallelIntegrationTest does NCT 1,2,4 calling on 100kb of PCR free data
-- Some misc. code cleanup of HaplotypeCaller
-- Analysis scripts to assess performance of nano scheduled HC
-- In order to make the haplotype caller thread safe we needed to use an AtomicInteger for the class-specific static ID counter in SeqVertex and MultiDebrujinVertex, avoiding a race condition where multiple new Vertex() could end up with the same id.
2013-05-13 11:09:02 -04:00
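The AtomicInteger fix mentioned in the last bullet can be sketched as follows: an atomic counter guarantees unique ids even when vertices are created from multiple threads, where a plain static `int++` could hand two threads the same id. The class here is a generic illustration, not SeqVertex itself.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a class-level AtomicInteger makes id assignment race-free.
public class VertexIdDemo {
    static final AtomicInteger idCounter = new AtomicInteger(0);
    final int id = idCounter.incrementAndGet();  // unique per instance

    public static void main(String[] args) throws InterruptedException {
        Set<Integer> ids = Collections.synchronizedSet(new HashSet<>());
        Runnable make = () -> {
            for (int i = 0; i < 1000; i++) ids.add(new VertexIdDemo().id);
        };
        Thread t1 = new Thread(make), t2 = new Thread(make);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // With a plain int++ counter, duplicates would be possible here.
        if (ids.size() != 2000)
            throw new AssertionError("duplicate ids: only " + ids.size() + " distinct");
        System.out.println("ok");
    }
}
```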
Mark DePristo 82e6a77385 Merge pull request #221 from broadinstitute/eb_greedy_SW_v2_GSA-936
New faster Smith-Waterman implementation that is edge greedy and assumes...
2013-05-13 07:50:50 -07:00
Eric Banks 2f5ef6db44 New faster Smith-Waterman implementation that is edge greedy and assumes that ref and haplotype have same global start/end points.
* This version inherits from the original SW implementation so it can use the same matrix creation method.
   * A bunch of refactoring was done to the original version to clean it up a bit and to have it do the
     right thing for indels at the edges of the alignments.
     * Enum added for the overhang strategy to use; added implementation for the INDEL version of this strategy.
   * Lots of systematic testing added for this implementation.
   * NOT HOOKED UP TO HAPLOTYPE CALLER YET. Committing so that people can play around with this for now.
2013-05-13 09:36:39 -04:00
Mark DePristo cb49b8cc71 Merge pull request #222 from broadinstitute/dr_convenient_diff_engine_md5_display_in_bamboo_GSA-915
Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements
2013-05-11 10:58:42 -07:00
David Roazen 639030bd6d Enable convenient display of diff engine output in Bamboo, plus misc. minor test-related improvements
-Diff engine output is now included in the actual exception message thrown as a
 result of an MD5 mismatch, which allows it to be conveniently viewed on the
 main page of a build in Bamboo.

Minor Additional Improvements:

-WalkerTestSpec now auto-detects test class name via new JVMUtils.getCallingClass()
 method, and the test class name is now included as a regular part of integration
 test output for each test.

-Fix race condition in MD5DB.ensureMd5DbDirectory()

-integrationtests dir is now cleaned by "ant clean"

GSA-915 #resolve
2013-05-10 19:00:33 -04:00