Commit Graph

1110 Commits (07b369ca7ec734a0dcbaf60d9412c9e60729075c)

Author SHA1 Message Date
David Roazen 07b369ca7e Move VCF/BCF2/VariantContext to new standalone org.broadinstitute.variant package
This is an intermediate commit so that there is a record of these changes in our
commit history. Next step is to isolate the test classes as well, and then move
the entire package to the Picard repository and replace it with a jar in our repo.

-Removed all dependencies on org.broadinstitute.sting (still need to do the test classes,
though)

-Had to split some of the utility classes into "GATK-specific" vs generic methods
(eg., GATKVCFUtils vs. VCFUtils)

-Placement of some methods and choice of exception classes to replace the StingExceptions
and UserExceptions may need to be tweaked until everyone is happy, but this can be
done after the move.
2012-12-19 10:25:22 -05:00
Mark DePristo 1ca13f9581 Fundamentally better model for the NanoScheduler
-- Now each map job reads a value, performs map, and does as much reducing as possible.  This ensures that we scale performance with the nct value, so -nct 2 should result in 2x performance, -nct 3 3x, etc.  All of this is accomplished using exactly NCT% of the CPU of the machine.
-- Has the additional value of actually simplifying the code
-- Resolves a long-standing annoyance with the nano scheduler.
2012-12-19 09:31:31 -05:00
Joel Thibault a29df3e094 oops 2012-12-18 19:03:12 -05:00
Joel Thibault ee22c1bf44 More TODOs 2012-12-18 18:47:43 -05:00
Joel Thibault 2b1db519d7 Add reads which overstep a boundary by a single base 2012-12-18 18:47:43 -05:00
Joel Thibault 9828b2990f Reads off the end of a contig fail SAM validation when using actual BAMs 2012-12-18 18:47:43 -05:00
Joel Thibault 72e2394b26 Create actual BAM 2012-12-18 18:47:43 -05:00
Joel Thibault d69d1f8988 Fun with varargs 2012-12-18 18:47:42 -05:00
Joel Thibault 1158c1529f Refactor region/read comparisons 2012-12-18 18:47:42 -05:00
Yossi Farjoun 19dd2d628a some changes.
some changes.
2012-12-14 17:21:32 -05:00
Eric Banks 696bf95fba Fix for PBT bug reported on the forum: the AD is actually output correctly now (rather than with 'null' or some gibberish memory pointer). 2012-12-13 23:28:30 +00:00
Ami Levy-Moonshine 2f99569dda change the md5 in one of the CV intergration tests, since it wasn't use the priority list when printing the origin of the annotation (the setValue field) 2012-12-10 22:48:15 -05:00
David Roazen 46edab6d6a Use the new downsampling implementation by default
-Switch back to the old implementation, if needed, with --use_legacy_downsampler

-LocusIteratorByStateExperimental becomes the new LocusIteratorByState, and
the original LocusIteratorByState becomes LegacyLocusIteratorByState

-Similarly, the ExperimentalReadShardBalancer becomes the new ReadShardBalancer,
with the old one renamed to LegacyReadShardBalancer

-Performance improvements: locus traversals used to be 20% slower in the new
downsampling implementation, now they are roughly the same speed.

-Tests show a very high level of concordance with UG calls from the previous
implementation, with some new calls and edge cases that still require more examination.

-With the new implementation, can now use -dcov with ReadWalkers to set a limit
on the max # of reads per alignment start position per sample. Appropriate value
for ReadWalker dcov may be in the single digits for some tools, but this too
requires more investigation.
2012-12-10 09:44:50 -05:00
Eric Banks 574d5b467f Bug fix for indel HMM: protect against situation where long reads (e.g. Sanger) in a pileup can lead to a read starting after the haplotype end for a given haplotype. 2012-12-09 02:09:34 -05:00
Mark DePristo dbf721968d PrintReads large-scale test to protect against another major low-level performance issue 2012-12-05 21:36:27 -05:00
Mark DePristo 465694078e Major performance improvement to the GATK engine
-- The NanoSchedule timing code (in NSRuntimeProfile) was crazy expensive, but never showed up in the profilers.  Removed all of the timing code from the NanoScheduler, the NSRuntimeProfile itself, and updated the unit tests.
-- For tools that largely pass through data quickly, this change reduces runtimes by as much as 10x.  For the RealignerTargetCreator example, the runtime before this commit was 3 hours, and after is 30 minutes (6x improvement).
-- Took this opportunity to improve the GATK ProgressMeter.  NotifyOfProgress now just keeps track of the maximum position seen, and a separate daemon thread ProgressMeterDaemon periodically wakes up and prints the current progress.  This removes all inner loop calls to the GATK timers.
-- The history of the bug started here: http://gatkforums.broadinstitute.org/discussion/comment/2402#Comment_2402
2012-12-05 14:49:22 -05:00
Mark DePristo 2b601571e7 Better error handling in NanoScheduler
-- The previous nanoscheduler would deadlock in the case where an Error, not an Exception, was thrown.  Errors, like out of memory, would cause the whole system to die.  This bugfix resolves that issue
2012-12-05 14:49:22 -05:00
Eric Banks 5fed9df295 Quick fix: base qual array in the GATKSAMRecord stores the actual phred values (-33) and not the original bytes (duh). 2012-12-03 12:18:20 -05:00
Eric Banks b6839b3049 Added checking in the GATK for mis-encoded quality scores.
The check is performed by a Read Transformer that samples (currently set to once
every 1000 reads so that we don't hurt overall GATK performance) from the input
reads and checks to make sure that none of the base quals is too high (> Q60). If
we encounter such a base then we fail with a User Error.

* Can be over-ridden with --allow_potentially_misencoded_quality_scores.
* Also, the user can choose to fix his quals on the fly (presumably using PrintReads
  to write out a fixed bam) with the --fix_misencoded_quality_scores argument.

Added unit tests.
2012-12-03 11:18:41 -05:00
Joel Thibault c76c808268 Reads are required to be sorted
- Remove the extended_only case because it's outside intervals
2012-11-28 13:59:58 -05:00
Joel Thibault 198923b597 Add ActiveRegionReadState handling 2012-11-28 13:59:57 -05:00
Mark DePristo 7e4b9c9e6e Fix failing unit tests for VariantContextUtilsUnitTest
-- Previous version was adding multiple samples with the same name to the variant context
2012-11-27 14:26:23 -05:00
Joel Thibault 9bfe39411e Equal overlap should match right/later region 2012-11-27 13:03:13 -05:00
Joel Thibault d83ad906ef Add profile range contract 2012-11-27 13:03:13 -05:00
Joel Thibault cc550b4145 Add a read and interval on a different contig 2012-11-27 13:03:13 -05:00
Eric Banks 9531e58445 Merged bug fix from Stable into Unstable 2012-11-27 11:00:50 -05:00
Eric Banks 4543ece088 Fixing parsing of genomelocs that contain colons in the contig names (which is allowed by the spec) as reported on the forum. Added unit test for this case. 2012-11-27 11:00:33 -05:00
Eric Banks a82ec7ad80 Merged bug fix from Stable into Unstable 2012-11-27 10:27:08 -05:00
Eric Banks e199562c25 I have pulled out all of the documentation URLs and put them into the HelpUtils class as static variables; this way, Appistry can change links as needed to point commercial users to their own internal forum without having to muck things up all over our source. Added some TODOs for Geraldine to update links in the GATK docs that still point to the old wiki. Sorry that I am pushing into stable, but that's what Appistry is pulling from for their release next week (and unstable has been failing forever). 2012-11-27 10:26:17 -05:00
Eric Banks 405f3c675d Fix for GSA-649: GenomeLocSortedSet.overlaps is crazy slow. Also improved GenomeLocSortedSet.sizeBeforeLoc. 2012-11-27 01:07:00 -05:00
Eric Banks 4f7fa3009a I forget why I thought that the VariantAnnotator couldn't run multi-threaded because it works just fine. Now you can specify -nt with VA. 2012-11-26 11:34:59 -05:00
Mark DePristo 48f271c5bd Adding 80% support for multi-allelic variants
-- Multi-allelic variants are split into their bi-allelic version, trimmed, and we attempt to provide a meaningful genotype for NA12878 here.  It's not perfect and needs some discussion on how to handle het/alt variants
-- Adding splitInBiallelic funtion to VariantContextUtils as well as extensive unit tests that also indirectly test reverseTrimAlleles (which worked perfectly FYI)
2012-11-21 17:24:59 -05:00
Joel Thibault c68bc95db6 Initial read mapping tests
- Failing tests are commented out
2012-11-21 17:16:46 -05:00
Joel Thibault 3ad9128800 Add some reads
- Move intervals and reads to init
- Update intervals and reads
2012-11-21 17:16:46 -05:00
Joel Thibault 3fa3b00f4a Add ActiveRegion tests and refactor 2012-11-21 17:16:45 -05:00
Joel Thibault e8defcb20d Test multiple bases and intervals 2012-11-21 17:16:45 -05:00
Joel Thibault c08b782743 Count isActive calls directly 2012-11-21 17:16:45 -05:00
Eric Banks 72e2d569c5 The user can now set the maximum allowable cycle on the command-line with --maximum_cycle_value. This value is (now) enforced in the Cycle covariate and a User Error is thrown if the maximum value is passed (with a helpful error message). Added unit tests to cover this new functionality. 2012-11-20 22:41:57 -05:00
Eric Banks ff87642a91 Enable cycle covariate unit tests 2012-11-20 22:29:56 -05:00
Eric Banks 937ac7290f Lots more GGA fixes for the HC now that I understand what's going on internally. Integration tests pass except for the GGA test which I believe now produces better results. 2012-11-20 16:13:29 -05:00
Joel Thibault b70fd4a242 Initial testing of the Active Region Traversal contract
- TODO: many more tests and test cases
2012-11-15 10:08:00 -05:00
Eric Banks e9183d9fe0 Fix bugs as reported on the forum: BED needs to be explicitly set as the default output format and the output didn't actually adhere to the BED spec. 2012-11-08 15:07:47 -05:00
David Roazen 6185e8c432 Allow large-scale tests 5 hours each to run 2012-11-01 17:48:58 -04:00
Mark DePristo 872abddfce Add custom TestNGTestTransformer that adds a maximum test runtime of 10 minutes to all testng tests
-- Closes GSA-494 / Add maximum runtime for integration tests, running them in timeout thread
-- Needed to debug locking issues
-- Needed to debug excessively long running integrationtests
-- Added build.xml maximum runtime for all testng tests of 10 hours.  We will ultimately fail the build if it goes on for more than 10 hours
2012-11-01 15:34:12 -04:00
Mark DePristo 1444cd753b Bugfix for GSA-647 HaplotypeCaller misses good variant because the active region doesn't trigger for an exome
-- The logic for determining active regions was a bit broken in the HC when intervals were used in the system
-- TraverseActiveRegions now uses the AllLocus view, since we always want to see all reference sites, not just those covered.  Simplifies logic of TAR
-- Non-overlapping intervals are always treated as separate objects for determing active / inactive state.  This means that each exon will stand on its own when deciding if it should be active or inactive
-- Misc. cleanup, docs of some TAR infrastructure to make it safer and easier to debug in the future.
-- Committing the SingleExomeCalling script that I used to find this problem, and will continue to use in evaluating calling of a single exome with the HC
-- Make sure to get all of the reads into the set of potentially active reads, even for genomic locations that themselves don't overlap the engine intervals but may have reads that overlap the regions
-- Remove excessively expensive calls to check bases are upper cased in ReferenceContext
-- Update md5s after a lot of manual review and discussion with Ryan
2012-11-01 15:34:04 -04:00
Mark DePristo 9cd04c335c Work on GSA-508 / CachingIndexedFastaReader should internally upper case bases loading data
-- As one might expect, CachingIndexedFastaSequenceFile now internally upper cases the FASTA reference bases.  This is now done by default, unless requested explicitly to preserve the original bases.
-- This is really the correct place to do this for a variety of reasons.  First, you don't need to work about upper casing bases throughout the code.  Second, the cache is only upper cased once, no matter how often the bases are accessed, which walkers cannot optimize themselves.  Finally, this uses the fastest function for this -- Picard's toUpperCase(byte[]) which is way better than String.toUpperCase()
-- Added unit tests to ensure this functionality works correct.
-- Removing unnecessary upper casing of bases in some core GATK tools, now that RefContext guarentees that the reference bases are all upper case.
-- Added contracts to ensure this is the case.
-- Remove a ton of sh*t from BaseUtils that was so old I had no idea what it was doing any longer, and didn't have any unit tests to ensure it was correct, and wasn't used anywhere in our code
2012-11-01 15:34:03 -04:00
Eric Banks 47a0f5859e Don't run these tests if not GAKT lite 2012-10-31 22:56:38 -04:00
Eric Banks f8af8a2355 Moving UG integration tests to protected since they use protected-only contamination filtering. Adding a new UGLite integration test to confirm that contamination filtering is ignored in lite. 2012-10-31 21:28:07 -04:00
Eric Banks 2aa28abe0a Fixing md5s to reflect the new HapMap file 2012-10-30 14:27:10 -04:00
Eric Banks b6a1967f12 Better documentation for ValidateVariants so that people realize it's used for strict validation of the VCF file. Added an option to turn off strict validation and an integration test to cover it. 2012-10-29 21:47:09 -04:00