1089 Commits (1cc0b48caab07426a3d54b34db3043ca96a28a4e)

Author SHA1 Message Date
Mark DePristo 7e4b9c9e6e Fix failing unit tests for VariantContextUtilsUnitTest
-- Previous version was adding multiple samples with the same name to the variant context
2012-11-27 14:26:23 -05:00
Joel Thibault 9bfe39411e Equal overlap should match right/later region 2012-11-27 13:03:13 -05:00
Joel Thibault d83ad906ef Add profile range contract 2012-11-27 13:03:13 -05:00
Joel Thibault cc550b4145 Add a read and interval on a different contig 2012-11-27 13:03:13 -05:00
Eric Banks 9531e58445 Merged bug fix from Stable into Unstable 2012-11-27 11:00:50 -05:00
Eric Banks 4543ece088 Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case. 2012-11-27 11:00:33 -05:00
Eric Banks a82ec7ad80 Merged bug fix from Stable into Unstable 2012-11-27 10:27:08 -05:00
Eric Banks e199562c25 I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever). 2012-11-27 10:26:17 -05:00
Eric Banks 405f3c675d Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc. 2012-11-27 01:07:00 -05:00
Eric Banks 4f7fa3009a I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA. 2012-11-26 11:34:59 -05:00
Mark DePristo 48f271c5bd Adding 80% support for multi-allelic variants
-- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here.  It's not perfect and needs some discussion on how to handle het/alt variants
-- Adding splitInBiallelic function to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)
2012-11-21 17:24:59 -05:00
Joel Thibault c68bc95db6 Initial read mapping tests
- Failing tests are commented out
2012-11-21 17:16:46 -05:00
Joel Thibault 3ad9128800 Add some reads
- Move intervals and reads to init
- Update intervals and reads
2012-11-21 17:16:46 -05:00
Joel Thibault 3fa3b00f4a Add ActiveRegion tests and refactor 2012-11-21 17:16:45 -05:00
Joel Thibault e8defcb20d Test multiple bases and intervals 2012-11-21 17:16:45 -05:00
Joel Thibault c08b782743 Count isActive calls directly 2012-11-21 17:16:45 -05:00
Eric Banks 72e2d569c5 The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is exceeded (with a helpful error message). Added unit tests to cover this new functionality. 2012-11-20 22:41:57 -05:00
Eric Banks ff87642a91 Enable cycle covariate unit tests 2012-11-20 22:29:56 -05:00
Eric Banks 937ac7290f Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results. 2012-11-20 16:13:29 -05:00
Joel Thibault b70fd4a242 Initial testing of the Active Region Traversal contract
- TODO: many more tests and test cases
2012-11-15 10:08:00 -05:00
Eric Banks e9183d9fe0 Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec. 2012-11-08 15:07:47 -05:00
David Roazen 6185e8c432 Allow large-scale tests to run for 5 hours each 2012-11-01 17:48:58 -04:00
Mark DePristo 872abddfce Add custom TestNGTestTransformer that adds a maximum test runtime of 10 minutes to all testng tests
-- Closes GSA-494 / Add maximum runtime for integration tests, running them in timeout thread
-- Needed to debug locking issues
-- Needed to debug excessively long running integration tests
-- Added build.xml maximum runtime for all testng tests of 10 hours.  We will ultimately fail the build if it goes on for more than 10 hours
2012-11-01 15:34:12 -04:00
Mark DePristo 1444cd753b Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome
-- The logic for determining active regions was a bit broken in the HC when intervals were used in the system
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered.  Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determining active / inactive state.  This means that each exon will stand on its own when deciding if it should be active or inactive
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
2012-11-01 15:34:04 -04:00
Mark DePristo 9cd04c335c Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data
-- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases.  This is now done by default, unless requested explicitly to preserve the original bases.
-- This is really the correct place to do this for a variety of reasons.  First, you don't need to worry about upper casing bases throughout the code.  Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves.  Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase()
-- Added unit tests to ensure this functionality works correctly.
-- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarantees that the reference bases are all upper case.
-- Added contracts to ensure this is the case.
-- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code
2012-11-01 15:34:03 -04:00
Eric Banks 47a0f5859e Don't run these tests if not GATK lite 2012-10-31 22:56:38 -04:00
Eric Banks f8af8a2355 Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite. 2012-10-31 21:28:07 -04:00
Eric Banks 2aa28abe0a Fixing md5s to reflect the new HapMap file 2012-10-30 14:27:10 -04:00
Eric Banks b6a1967f12 Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it. 2012-10-29 21:47:09 -04:00
Eric Banks 43625f652e Shoot, mixed up the md5s last time. 2012-10-27 19:43:46 -04:00
Eric Banks 682a72faf7 Hmm, thought I got all the md5s last time. Apparently not. 2012-10-26 16:10:12 -04:00
Mark DePristo 251983b8fb Add GATK-wide command line argument to control the maximum runtime allowed for the GATK
-- Providing this optional argument -maxRuntime (in -maxRuntimeUnits units) causes the GATK to exit cleanly when the max. runtime has been exceeded.  By cleanly I mean that the engine simply stops at the next available cycle in the walker as though the end of processing had been reached.  This means that all output files are closed properly, etc.
-- Emits an info message that looks like "INFO  10:36:52,723 MicroScheduler - Aborting execution (cleanly) because the runtime has exceeded the requested maximum 10.0000 s".  Otherwise there's currently no way to differentiate a truly completed run from a timelimit exceeded run, which may be a useful thing for a future update
-- Resolves GSA-630 / GATK max runtime to deal with bad LSA calling?
-- Added new JIRA entry for Ami to restart chr1 macarthur with this argument set to -maxRuntime 1 -maxRuntimeUnits DAYS to see if we can do all of chr1 in one weekend.
2012-10-26 13:18:34 -04:00
Eric Banks ed11b7dab2 Fix UG parallelization test 2012-10-26 12:10:44 -04:00
Eric Banks 7a706ed345 Fix some of the broken integration tests 2012-10-26 11:23:44 -04:00
Eric Banks ebebec7fdb Accidentally left one test disabled 2012-10-26 02:15:32 -04:00
Eric Banks a53e03d525 Do not let reduced reads get removed in the contamination down-sampling 2012-10-26 02:13:04 -04:00
Eric Banks bf3d61ce82 The default value for --contamination_fraction_to_filter is now 0.05 (5%) in both UG and HC. Users of GATK-lite get pushed down to 0% by default (since it's not enabled) or get a user error if they try to set it. 2012-10-26 01:04:51 -04:00
Eric Banks 91f2c847a3 Fixing problem reported on forum for VF: DP couldn't be filtered from the FORMAT field, only from the INFO field. Fixed and added integration test. 2012-10-26 00:57:40 -04:00
Eric Banks e93ff3ea6e Let's go back to having the SB/SLOD NOT computed by default. If you recall, it was only enabled by default because we thought we were going to use it when we made VQSR use random forests. But since we decided not to change VQSR, there's no reason to triple the computation for every variant site anymore. 2012-10-25 12:45:23 -04:00
Eric Banks c53c55da12 Re-enable tests 2012-10-25 09:37:08 -04:00
Eric Banks e6652f7777 Added integration test for contamination down-sampling 2012-10-25 09:36:05 -04:00
Mark DePristo 6e421a72d6 Add more exhaustive unit tests for input errors to NanoScheduler
-- Resolves issue GSA-515 / Nanoscheduler GSA-605 / Seems that -nct may deadlock as not reproducible
-- It seems that it's not an input error problem (or at least cannot be provoked with unit tests)
-- I'll keep an eye on this later
2012-10-23 20:11:29 -04:00
Mark DePristo f838815343 Updating MD5s for confidence ref site estimation in IndependentAllelesDiploidExactAFCalc
-- Included logic to only add priors for alleles with sufficient evidence to be called polymorphic.  If no alleles are poly make sure to add priors of first allele
2012-10-23 06:47:53 -04:00
Mark DePristo 15b28e61cd Retiring TraverseReads and TraverseLoci after testing confirms nano scheduler version in single threaded version is fine
-- There's been no report of problems with the nano scheduled version of TraverseLoci and TraverseReads, so I'm removing the old versions since they are no longer needed
-- Removing unnecessary intermediate base classes
-- GSA-515 / Nanoscheduler GSA-549 / https://jira.broadinstitute.org/browse/GSA-549
2012-10-22 16:55:06 -04:00
Mark DePristo 90f59803fd MaxAltAlleles now defaults to 6, no more MaxAltAllelesForIndels
-- Updated StandardCallerArgumentCollection to remove MaxAltAllelesForIndels. Previous argument is deprecated with meaningful doc message for people to use maxAltAlleles
-- All constructors, factory methods, and test builders and their users updated to provide just a single argument
-- Updating MD5s for integration tests that change due to genotyping more alleles
-- Adding more alleles to genotyping results in slight changes in the QUAL value for multi-allelic loci where one or more alleles aren't polymorphic.  That's simply due to the way that alternative hypotheses contribute as reference evidence against each true allele.  The effect can be large (new qual = old qual / 2 in one case here).
-- If we want more precision in our estimates we could decide (Eric, should we discuss?) to actually separately do a discovery phase in the genotyping, eliminate all variants not considered polymorphic, and then do a final round of calling to get the exact QUAL value for only those that are segregating.  This would have the value of having the QUAL stay constant as more alleles are genotyped, at the cost of some code complexity increase and runtime.  Might be worth it though
2012-10-22 13:47:56 -04:00
Khalid Shakir 97dc3664c9 Fixed yet another NPE related to the ArgumentTypeDescriptor vs. ArgumentMatchValue. Added integration test based on GSA-621. 2012-10-22 12:05:32 -04:00
Mark DePristo eb6c9a1a79 Disable EfficiencyMonitoringThreadFactoryUnitTest
-- This is no longer a core GATK activity, and the tests need to run for so long (2 min each) that it's just too painful to run them.  Should be re-enabled if we come to care about this capability again, or if we can run these tests all in parallel in the future.
2012-10-21 12:43:46 -04:00
Mark DePristo d21e42608a Updating integration tests for minor changes due to switching to EXACT_INDEPENDENT model by default 2012-10-21 12:43:46 -04:00
Mark DePristo 6b6caf8e3a Bugfix for indel DP calculations using reduced reads
-- Adding tests for SNP and indel calling on reduced BAM
2012-10-21 12:42:32 -04:00
Ryan Poplin a647f1e076 Refactoring the PairHMM util class to allow for multiple implementations which can be specified by the callers via an enum argument. Adding an optimized PairHMM implementation which caches per-read calculations as well as a logless implementation which drastically reduces the runtime of the HMM while also increasing the precision of the result. In the HaplotypeCaller we now lexicographically sort the haplotypes to take maximal benefit of the haplotype offset optimization which only recalculates the HMM matrices after the first differing base in the haplotype. Many thanks to Mauricio for all the initial groundwork for these optimizations. The change to the one HC integration test is in the fourth decimal of HaplotypeScore. 2012-10-20 16:38:18 -04:00