gatk-3.8

Commit Graph

Author	SHA1	Message	Date
Eric Banks	a96f48bc39	Merge pull request #249 from broadinstitute/rp_hc_gga_mode New implementation of the GGA mode in the HaplotypeCaller	2013-05-31 10:54:50 -07:00
droazen	a665d759cd	Merge pull request #251 from broadinstitute/md_mapq_reassign Command-line read filters are now applied before Walker default filters	2013-05-31 09:05:24 -07:00
David Roazen	ed4f19d79b	Restore scala compilation by default in build.xml -This was accidentally clobbered in a recent commit. -If you want to compile Java-only, easiest thing to do is run "ant gatk" rather than modifying build.xml	2013-05-31 11:28:29 -04:00
Ryan Poplin	b5b9d745a7	New implementation of the GGA mode in the HaplotypeCaller -- We now inject the given alleles into the reference haplotype and add them to the graph. -- Those paths are read off of the graph and then evaluated with the appropriate marginalization for GGA mode. -- This unifies how Smith-Waterman is performed between discovery and GGA modes. -- Misc minor cleanup in several places.	2013-05-31 10:35:36 -04:00
Eric Banks	50b4c130ca	Merge pull request #250 from broadinstitute/chartl_mathutils_checks_for_valid_inputs-GSA-767 MathUtils checks for argument values GSA-767	2013-05-30 21:58:33 -07:00
Chris Hartl	199476eae1	Three squashed commits: 1) Add in checks for input parameters in MathUtils method. I was careful to use the bottom-level methods whenever possible, so that parameters don't needlessly go through multiple checks (so for instance, the parameters n and k for a binomial aren't checked on log10binomial, but rather in the log10binomialcoefficient subroutine). This addresses JIRA GSA-767 Unit tests pass (we'll let bamboo deal with the integrations) 2) Address reviewer comments (change UserExceptions to IllegalArgumentExceptions). 3) .isWellFormedDouble() tests for infinity and not strictly positive infinity. Allow negative-infinity values for log10sumlog10 (as these just correspond to p=0). After these commits, unit and integration tests now pass, and GSA-767 is done. rebase and fix conflict: public/java/src/org/broadinstitute/sting/utils/MathUtils.java	2013-05-31 00:26:50 -04:00
Mark DePristo	b16de45ce4	Command-line read filters are now applied before Walker default filters -- This allows us to use -rf ReassignMappingQuality to reassign mapping qualities to 60 before the BQSR filters them out with MappingQualityUnassignedFilter. -- delivers #50222251	2013-05-30 16:54:18 -04:00
Mark DePristo	ac90e6765e	Merge pull request #248 from broadinstitute/rp_chartl_mathutils_checks_for_valid_inputs-GSA-767 Create a new normalDistributionLog10 function that is unit tested for us...	2013-05-30 13:39:33 -07:00
Ryan Poplin	61af37d0d2	Create a new normalDistributionLog10 function that is unit tested for use in the VQSR.	2013-05-30 16:00:08 -04:00
Mark DePristo	56b14be4bc	Merge pull request #247 from broadinstitute/eb_fix_RR_negative_header_problem Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads.	2013-05-29 18:10:19 -07:00
Mark DePristo	50a12df68c	Merge pull request #246 from broadinstitute/dr_fix_read_shard_balancer_log_output Fix confusing extraneous log output from the ReadShardBalancer at traversal end	2013-05-29 13:50:26 -07:00
Eric Banks	a5a68c09fa	Fix for the "Removed too many insertions, header is now negative" bug in ReduceReads. The problem ultimately was that ReadUtils.readStartsWithInsertion() ignores leading hard/softclips, but ReduceReads does not. So I refactored that method to include a boolean argument as to whether or not clips should be ignored. Also rebased so that return type is no longer a Pair. Added unit test to cover this situation.	2013-05-29 16:41:01 -04:00
David Roazen	eb206e9f71	Fix confusing log output from the engine -ReadShardBalancer was printing out an extra "Loading BAM index data for next contig" message at traversal end, which was confusing users and making the GATK look stupid. Suppress the extraneous message, and reword the log messages to be less confusing. -Improve log message output when initializing the shard iterator in GenomeAnalysisEngine. Don't mention BAMs when there are none, and say "Preparing for traversal" rather than mentioning the meaningless-for-users concept of "shard strategy" -These log messages are needed because the operations they surround might take a while under some circumstances, and the user should know that the GATK is actively doing something rather than being hung.	2013-05-29 16:17:04 -04:00
Mark DePristo	684c91c2e7	Merge pull request #245 from broadinstitute/dr_enforce_min_dcov Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage	2013-05-29 09:52:13 -07:00
Mark DePristo	60c3c83b94	Merge pull request #244 from broadinstitute/mc_add_missing_example_vcf_idx Somehow the index of exampleDBSNP.vcf was missing	2013-05-29 09:48:24 -07:00
Mark DePristo	677e6bd6cf	Merge pull request #243 from broadinstitute/mc_turn_off_downsampling_in_diagnose_targets Turn off downsampling for DiagnoseTargets	2013-05-29 09:48:08 -07:00
David Roazen	a7cb599945	Require a minimum dcov value of 200 for Locus and ActiveRegion walkers when downsampling to coverage -Throw a UserException if a Locus or ActiveRegion walker is run with -dcov < 200, since low dcov values can result in problematic downsampling artifacts for locus-based traversals. -Read-based traversals continue to have no minimum for -dcov, since dcov for read traversals controls the number of reads per alignment start position, and even a dcov value of 1 might be safe/desirable in some circumstances. -Also reorganize the global downsampling defaults so that they are specified as annotations to the Walker, LocusWalker, and ActiveRegionWalker classes rather than as constants in the DownsamplingMethod class. -The default downsampling settings have not been changed: they are still -dcov 1000 for Locus and ActiveRegion walkers, and -dt NONE for all other walkers.	2013-05-29 12:07:12 -04:00
Mauricio Carneiro	38e765f00d	Somehow the index of exampleDBSNP.vcf was missing This was missed when we added all the indices of our testdata	2013-05-28 15:29:43 -04:00
Mauricio Carneiro	f1affa9fbb	Turn off downsampling for DiagnoseTargets Diagnose targets should never be downsampled. (and I didn't know there was a default downsampling going on for locus walkers)	2013-05-28 14:58:50 -04:00
Ryan Poplin	3516daff52	Merge pull request #240 from broadinstitute/rp_gga_implies_both_models Bugfix for GGA mode in UG silently ignoring indels	2013-05-24 11:13:57 -07:00
Ryan Poplin	85905dba92	Bugfix for GGA mode in UG silently ignoring indels -- Started by Mark. Finished up by Ryan. -- GGA mode still respected glm argument for SNP and INDEL models, so that you would silently fail to genotype indels at all if the -glm INDEL wasn't provided, but you'd still emit the sites, so you'd see records in the VCF but all alleles would be no calls. -- https://www.pivotaltracker.com/story/show/48924339 for more information -- [resolves #48924339]	2013-05-24 13:47:26 -04:00
Mark DePristo	b7fab31e25	Merge pull request #239 from broadinstitute/mc_quick_dt_output_fix Make the missing targets output never use stdout	2013-05-23 07:00:57 -07:00
Mauricio Carneiro	da21924b44	Make the missing targets output never use stdout Problem -------- Diagnose Targets is outputting missing intervals to stdout if the argument -missing is not provided Solution -------- Make it NOT default to stdout [Delivers #50386741]	2013-05-22 14:22:54 -04:00
MauricioCarneiro	32c33901d7	Merge pull request #237 from broadinstitute/md_banded_pairhmm Archived banded logless PairHMM	2013-05-22 10:25:18 -07:00
Mark DePristo	d167743852	Archived banded logless PairHMM BandedHMM --------- -- An implementation of a linear runtime, linear memory usage banded logless PairHMM. Though about 50% faster than the current PairHMM, this implementation will be superseded by the GraphHMM when it becomes available. The implementation is being archived for future reference. Useful infrastructure changes ----------------------------- -- Split PairHMM into a N2MemoryPairHMM that allows smarter implementations to not allocate the double[][] matrices if they don't want to, which was previously occurring in the base class PairHMM -- Added functionality (controlled by private static boolean) to write out likelihood call information to a file from inside of LikelihoodCalculationEngine for use in unit or performance testing. Added example of 100kb of data to private/testdata. Can be easily read in with the PairHMMTestData class. -- PairHMM now tracks the number of possible cell evaluations, and the LoglessCachingPairHMM updates the nCellsEvaluated so we can see how many cells are saved by the caching calculation.	2013-05-22 12:24:00 -04:00
delangel	925232b0fc	Merge pull request #236 from broadinstitute/md_simple_hc_performance_improvements 3 simple performance improvements for HaplotypeCaller	2013-05-22 07:58:28 -07:00
Mark DePristo	adec22e748	Merge pull request #238 from broadinstitute/eb_optimize_filter_counting Optimized counting of filtered records by filter.	2013-05-22 04:57:26 -07:00
Eric Banks	881b2b50ab	Optimized counting of filtered records by filter. Don't map class to counts in the ReadMetrics (necessitating 2 HashMap lookups for every increment). Instead, wrap the ReadFilters with a counting version and then set those counts only when updating global metrics.	2013-05-21 21:54:49 -04:00
Mark DePristo	010034a650	Optimization/bugfix for PerReadAlleleLikelihoodMap -- Add() call had a misplaced map.put call, so that we were always putting the result of get() back into the map, when what we really intended was to only put the value back in if the original get() resulted in a null and so initialized the result	2013-05-21 16:18:57 -04:00
Mark DePristo	a1093ad230	Optimization for ActiveRegion.removeAll -- Previous version took a Collection<GATKSAMRecord> to remove, and called ArrayList.removeAll() on this collection to remove reads from the ActiveRegion. This can be very slow when there are lots of reads, as ArrayList.removeAll ultimately calls indexOf() that searches through the list calling equals() on each element. New version takes a set, and uses an iterator on the list to remove() from the iterator any read that is in the set. Given that we were already iterating over the list of reads to update the read span, this algorithm is actually simpler and faster than the previous one. -- Update HaplotypeCaller filterReadsInRegion to use a Set not a List. -- Expanded the unit tests a bit for ActiveRegion.removeAll	2013-05-21 16:18:57 -04:00
Mark DePristo	d9cdc5d006	Optimization: track alleles in the PerReadAlleleLikelihoodMap with a HashSet -- The previous version of PerReadAlleleLikelihoodMap only stored the alleles in an ArrayList, and used ArrayList.contains() to determine if an allele was already present in the map. This is very slow with many alleles. Now keeps both the ArrayList (for get() performance) and a Set of alleles for contains().	2013-05-21 16:18:56 -04:00
Mark DePristo	3cfe2dcc64	Merge pull request #235 from broadinstitute/eb_several_traversal_printout_fixes Eb several traversal printout fixes	2013-05-21 13:13:11 -07:00
Eric Banks	20c7a89030	Fixes to get accurate read counts for Read traversals 1. Don't clone the dataSource's metrics object (because then the engine won't continue to get updated counts) 2. Use the dataSource's metrics object in the CountingFilteringIterator and not the first shard's object! 3. Synchronize ReadMetrics.incrementMetrics to prevent race conditions. Also: * Make sure users realize that the read counts are approximate in the print outs. * Removed a lot of unused cruft from the metrics object while I was in there. * Added test to make sure that the ReadMetrics read count does not overflow ints. * Added unit tests for traversal metrics (reads, loci, and active region traversals); these test counts of reads and records.	2013-05-21 15:24:07 -04:00
Eric Banks	58f4b81222	Count Reads should use a Long instead of an Integer for counts to prevent overflows. Added unit test.	2013-05-21 15:23:51 -04:00
Eric Banks	1f3624d204	Base Recalibrator doesn't recalibrate all reads, so the final output line was confusing	2013-05-21 11:35:05 -04:00
Mark DePristo	7252238271	Merge pull request #234 from broadinstitute/md_update_gatkperformance_over_time Use GATK 2.3 and 2.5 in GATKPerformanceOverTime	2013-05-21 06:14:07 -07:00
Mark DePristo	e3d6443d3e	Use GATK 2.3 and 2.5 in GATKPerformanceOverTime	2013-05-21 09:13:16 -04:00
Valentin Ruano Rubio	71bbb25c9e	Merge pull request #231 from broadinstitute/md_combinevariants_bugfix CombineVariants no longer adds PASS to unfiltered records	2013-05-20 14:28:20 -07:00
Mark DePristo	62fc88f92e	CombineVariants no longer adds PASS to unfiltered records -- [Delivers #49876703] -- Add integration test and test file -- Update SymbolicAlleles combine variant tests, which was turning unfiltered records into PASS!	2013-05-20 16:53:51 -04:00
Mark DePristo	2d20e38149	Merge pull request #232 from broadinstitute/rp_hc_gga_mode_active_region_extension Active region boundary parameters need to be bigger when running in GGA ...	2013-05-20 12:22:08 -07:00
Ryan Poplin	507853c583	Active region boundary parameters need to be bigger when running in GGA mode. CGL performance is quite a bit better as a result. -- The trouble stems from the fact that we may be trying to genotype indels even though it appears there are only SNPs in the reads.	2013-05-20 14:29:04 -04:00
Mark DePristo	b239cb76d4	Merge pull request #230 from broadinstitute/mc_gsalib_toR3 Updating gsalib for R-3.0 compatibility	2013-05-19 04:32:58 -07:00
Mauricio Carneiro	c8b1c47764	Updating gsalib for R-3.0 compatibility * add package namespace that exports all the visible objects * list gsalib dependencies in the package requirements [fixes #49987933]	2013-05-18 12:43:38 -04:00
Eric Banks	665e45f0fc	Merge pull request #229 from broadinstitute/eb_liftover_variants_output_required @Output needs to be required for LiftoverVariants to prevent a NPE and d...	2013-05-17 07:49:55 -07:00
Eric Banks	8a442d3c9f	@Output needs to be required for LiftoverVariants to prevent a NPE and documentation needed updating.	2013-05-17 10:04:10 -04:00
MauricioCarneiro	92072e1815	Merge pull request #224 from broadinstitute/yf_emit_insert_length_with_pileup added a @hidden option to PileupWalker that causes it to emit insert sizes	2013-05-16 11:30:19 -07:00
Yossi Farjoun	3e2a0b15ed	- Added a @Hidden option ( -outputInsertLength ) to PileupWalker that causes it to emit insert sizes together with the pileup (to assist Mark Daly's investigation of the contamination dependence on insert length) - Converted my old GATKBAMIndexText (within PileupWalkerIntegrationTest) to use a dataProvider - Added two integration tests to test -outputInsertLength option	2013-05-16 12:47:16 -04:00
Yossi Farjoun	9234a0efcd	Merge pull request #223 from broadinstitute/mc_dt_gaddy_outputs Bug fixes and missing interval functionality for Diagnose Targets While the code seems fine, the complex parts of it are untested. This is probably fine for now, but private code can have a tendency to creep into the codebase once accepted. I would have preferred a unit test OR a big comment stating that the code is untested (and thus broken by Mark's rule). It is with these caveats that I accept the pull request.	2013-05-16 09:25:54 -07:00
droazen	a733a5e9b7	Merge pull request #228 from broadinstitute/mc_fine_grained_maxruntime Subshard timeouts in the GATK	2013-05-15 05:25:43 -07:00
Mark DePristo	371f3752c1	Subshard timeouts in the GATK -- The previous implementation of the maxRuntime would require us to wait until all of the work was completed within a shard, which can be a substantial amount of work in the case of a locus walker with 16kb shards. -- This implementation ensures that we exit from the traversal very soon after the max runtime is exceeded, without completing all of our work within the shard. This is done by updating all of the traversal engines to return false for hasNext() in the nano scheduled input provider. So as soon as the timeout is exceeded, we stop generating additional data to process, and we only have to wait until the currently executing data processing unit (locus, read, active region) completes. -- In order to implement this timeout efficiently at this fine scale, the progress meter now lives in the genome analysis engine, and the exceedsTimeout() call in the engine looks at a periodically updated runtime variable in the meter. This variable contains the elapsed runtime of the engine, but is updated by the progress meter daemon thread so that the engine doesn't call System.nanoTime() in each cycle of the engine, which would be very expensive. Instead we basically wait for the daemon to update this variable, and so our precision of timing out is limited by the update frequency of the daemon, which is on the order of every few hundred milliseconds, totally fine for a timeout. -- Added integration tests to ensure that subshard timeouts are working properly	2013-05-15 07:00:39 -04:00
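
The ActiveRegion.removeAll change described in commit a1093ad230 above can be sketched as follows. This is a minimal, hypothetical illustration rather than GATK's actual code: the class name RemoveAllSketch, the generic element type, and the sample read names are assumptions. Only the technique comes from the commit message: take a Set of reads to remove, walk the list with an iterator, and remove matching elements through the iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; this is NOT GATK's ActiveRegion class.
class RemoveAllSketch {

    /**
     * Removes from {@code reads} every element contained in {@code toRemove}.
     * Membership is checked against the Set, so no per-element search of a
     * second list with equals() is needed.
     */
    static <T> void removeAll(List<T> reads, Set<T> toRemove) {
        Iterator<T> it = reads.iterator();
        while (it.hasNext()) {
            if (toRemove.contains(it.next())) {
                it.remove(); // safe removal while iterating
            }
        }
    }

    public static void main(String[] args) {
        // Placeholder read names, assumed for illustration only.
        List<String> reads = new ArrayList<>(Arrays.asList("read1", "read2", "read3", "read4"));
        Set<String> toRemove = new HashSet<>(Arrays.asList("read2", "read4"));
        removeAll(reads, toRemove);
        System.out.println(reads);
    }
}
```

With a HashSet, each membership check is an expected constant-time lookup, which is the cost the commit message says it wanted to avoid paying through ArrayList.removeAll on a Collection of reads.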

12412 Commits (a96f48bc39717e1b3ec7549f6078cbb1c7cf6534)