Ryan Poplin
f4c72a26d5
A few quick, minor findbugs fixes.
2012-08-09 16:30:58 -04:00
Ryan Poplin
c7f22e410f
A few quick, minor findbugs fixes.
2012-08-09 16:22:08 -04:00
Eric Banks
def077c4e5
There's actually a subtle but important difference between foo++ and ++foo
2012-08-09 12:42:50 -04:00
Ryan Poplin
e48727dae3
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-09 10:31:10 -04:00
Guillermo del Angel
5be7e0621d
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-09 09:58:34 -04:00
Guillermo del Angel
71ee8d87b3
Rename per-sample ML allelic fractions and counts so that they don't have the same name as the per-site INFO fields, and clarify wording in VCF header
2012-08-09 09:58:20 -04:00
Eric Banks
35cec8530c
Make coverage threshold in FindCoveredIntervals a command-line argument
2012-08-08 21:44:24 -04:00
Ryan Poplin
1223d77546
Removing argument from HaplotypeCaller that was made unnecessary by recent improvements to triggering around large events
2012-08-08 15:13:20 -04:00
Eric Banks
0a2a646a52
Other random FindBugs fixes
2012-08-08 14:56:27 -04:00
Eric Banks
4c84cc9486
Quick pass of FindBugs 'should be static inner class' fixes.
2012-08-08 14:42:06 -04:00
Eric Banks
a0196c9f5b
Quick pass of FindBugs 'method invokes inefficient Number constructor' fixes.
2012-08-08 14:34:16 -04:00
Eric Banks
4b2e3cec0b
Quick pass of FindBugs 'inefficient use of keySet iterator instead of entrySet iterator' fixes for core tools.
2012-08-08 14:29:41 -04:00
Guillermo del Angel
3e2752667c
Intermediate checkin for ReducedReads with HaplotypeCaller - change min read count over k-mer to average count over k-mer when doing assembly of a reduced read (not optimal, currently trying max and then will decide on best approach), fix merge conflicts
2012-08-08 12:07:33 -04:00
David Roazen
a7811d673f
Update URL for phone home / GATK key documentation output by the GATK upon error
2012-08-08 09:29:54 -04:00
Mark DePristo
cda8d944b7
Bugfixes for BCF with VQSR
-- Old version converted doubles directly from strings. New version uses VariantContext getAttributeAsDouble() that looks at the values directly to determine how to convert from Object to Double (via Double.valueOf, (Double), or (Double)(Integer)).
-- getAttributeAsDouble() is now smart in converting integers to doubles as needed
-- Removed unnecessary logging info in BCF2Codec
-- Added integration tests to ensure that VQSR works end-to-end with BCF2 using sites version of the file khalid sent to me
-- Added vqsr.bcf_test.snps.unfiltered.bcf file for this integration test
2012-08-07 17:22:39 -04:00
Mark DePristo
80b94a4f9a
AdaptiveContexts implement pruning to a given chi2 p value
-- Added bonferroni corrected p-value pruning, so you tell it how significant a difference you are willing to collapse in the tree, and it prunes the tree down to this maximum threshold
-- Penalty is now a phred-scaled p-value not the raw chi2 value
-- Split command line arguments in VisualizeContextTree into separate arguments for each type of pruning
2012-08-07 17:22:39 -04:00
Mark DePristo
982c735c76
VisualizeAdaptiveTree now considers only leaf nodes when computing max/min penalty
2012-08-07 17:22:39 -04:00
Ryan Poplin
15085bf03e
The UnifiedGenotyper now makes use of base insertion and base deletion quality scores if they exist in the reads.
2012-08-07 13:58:22 -04:00
Guillermo del Angel
97c5ed4feb
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-06 20:22:31 -04:00
Guillermo del Angel
238d55cb61
Fixes for running HaplotypeCaller with reduced reads: a) minor refactoring, pulled out code to compute mean representative count to ReadUtils, b) Don't use min representative count over kmer when constructing de Bruijn graph - this creates many paths with multiplicity=1 and makes us lose a lot of SNP's at edge of capture targets. Use mean instead
2012-08-06 20:22:12 -04:00
Mark DePristo
00858f16a6
Deleting empty unit test for AdaptiveContexts
2012-08-06 12:58:13 -04:00
Ryan Poplin
f1c30c3a59
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-06 12:02:26 -04:00
Mark DePristo
44f160f29f
indelGOP and indelGCP are now advanced, not hidden arguments
2012-08-06 11:42:55 -04:00
Mark DePristo
2f004665fb
Fixing public -> private dep
2012-08-06 11:42:55 -04:00
Mark DePristo
7bf5ca51ee
Major bugfix for adaptive contexts
-- Basically I was treating the context history in the wrong direction, effectively predicting the further bases in the context based on the closer one. Totally backward. Updated the code to build the tree in the right direction.
-- Added a few more useful outputs for analysis (minPenalty and maxPenalty)
-- Misc. cleanup of the code
-- Overall I'm not 100% certain this is even the right way to think about the problem. Clearly this is producing a reasonable output but the sum of chi2 values over the entire tree is just enormous. Perhaps an MCMC convergence / sampling criterion would be a better way to think about this problem?
2012-08-06 11:42:55 -04:00
Mark DePristo
b4841548f1
Bug fixes and misc. improvements to running the adaptive context tools
-- Better output file name defaults
-- Fixed nasty bug where I included nonexistent quals in the contexts to process because they showed up in the Cycle covariate
-- Data is processed in qual order now, so it's easier to see progress
-- Logger messages explaining where we are in the process
-- When in UPDATE mode we still write out the information for an equivalent prune by depth for post analysis
2012-08-06 11:42:55 -04:00
Ryan Poplin
b8709d8c67
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-06 11:41:28 -04:00
Eric Banks
210db5ec27
Update -maxAlleles argument to -maxAltAlleles to make it more accurate. The hidden GSA production -capMaxAllelesForIndels argument also gets updated.
2012-08-06 11:31:18 -04:00
Eric Banks
8f95a03bb6
Prevent NumberFormatExceptions when parsing the VCF POS field
2012-08-06 11:19:54 -04:00
Ryan Poplin
b7eec2fd0e
Bug fixes related to the changes in allele padding. If a haplotype started with an insertion it led to array index out of bounds. Haplotype allele insert function is now very simple because all alleles are treated the same way. HaplotypeUnitTest now uses a variant context instead of creating Allele objects directly.
2012-08-05 12:29:10 -04:00
Mark DePristo
e1bba91836
Ready for full-scale evaluation adaptive BQSR contexts
-- VisualizeContextTree now can write out an equivalent BQSR table determined after adaptive context merging of all RG x QUAL x CONTEXT trees
-- Docs, algorithm descriptions, etc so that it makes sense what's going on
-- VisualizeContextTree should really be simplified into a single tool that just visualizes the trees when / if we decide to make adaptive contexts a standard part of BQSR
-- Misc. cleaning, organization of the code (recalibration tests were in private but corresponding actual files were public)
2012-08-03 16:02:53 -04:00
Guillermo del Angel
6f8e7692d4
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-08-03 12:24:37 -04:00
Guillermo del Angel
9e25b209e0
First pass of implementation of Reduced Reads with HaplotypeCaller. Main changes: a) Active region: scale PL's by representative count to determine whether region is active. b) Scale per-read, per-haplotype likelihoods by read representative counts. A read representative count is (temporarily) defined as the average representative count over all bases in read, TBD whether this is good enough to avoid biases in GL's. c) DeBruijn assembler inserts kmers N times in graph, where N is min representative count of read over kmer span - TBD again whether this is the best approach. d) Bug fixes in FragmentUtils: logic to merge fragments was wrong in cases where there is discrepancy of overlaps between unclipped/soft clipped bases. Didn't affect things before, but RR makes hard-clipped bases in CIGARs more prevalent, so this was exposed. e) Cache read representative counts along with read likelihoods associated with a Haplotype. Code can/should be cleaned up and unified with PairHMMIndelErrorModelCode, as well as refactored to support arbitrary ploidy in HaplotypeCaller
2012-08-03 12:24:23 -04:00
Ryan Poplin
8817fc70d1
Merged bug fix from Stable into Unstable
2012-08-03 10:45:01 -04:00
Ryan Poplin
f40d0a0a28
Updating VQSR to work with the MNP and symbolic variants that are coming out of the HaplotypeCaller. Integration tests change because of the MNPs in dbSNP.
2012-08-03 10:44:36 -04:00
Joel Thibault
51bd03cc36
Add RemoveProgramRecords annotation to ActiveRegionWalker
2012-08-03 09:54:16 -04:00
Joel Thibault
addbfd6437
Add a RemoveProgramRecords annotation
* Add the RemoveProgramRecords annotation to LocusWalker
2012-08-03 09:54:16 -04:00
Joel Thibault
524d7ea306
Choose whether to keep program records based on Walker
* Add keepProgramRecords argument
* Make removeProgramRecords / keepProgramRecords override default
2012-08-03 09:54:16 -04:00
Mark DePristo
e04989f76d
Bugfix for new PASS position in dictionary in BCF2
2012-08-03 09:42:21 -04:00
Mark DePristo
fb5dabce18
Update BCF2 to include a minor version number so we can rev (and report errors) with BCF2
-- We are now likely to fail with an error when reading old BCF files, rather than just giving bad results
-- Added new class BCFVersion that consolidates all of the version management of BCF
2012-08-02 17:30:30 -04:00
Eric Banks
e3f89fb054
Missing/malformed GATK report files are user errors
2012-08-02 11:33:21 -04:00
Mark DePristo
c3c3d18611
Update BCF2 to put PASS as offset 0 not at the end
-- Unfortunately this commit breaks backward compatibility with all existing BCF2 files...
2012-08-01 17:09:22 -04:00
Mark DePristo
ccac77d888
Bugfix for incorrect allele counting in IndelSummary
-- Previous version would count all alt alleles as present in a sample, even if only 1 were present, because of the way VariantEval subsetted VCs
-- Updated code for subsetting VCs by sample to be clearer about how it handles rederiving alleles
-- Update a few pieces of code to get previous correct behavior
-- Updated a few MD5s as now ref calls at sites in dbSNP are counted as having a comp site, and therefore show up in known sites when Novelty strat is on (which I think is correct)
-- Walkers that used old subsetting function with true are now using clearer version that does rederive alleles by default
2012-08-01 15:45:12 -04:00
Joel Thibault
2b25df3d53
Add removeProgramRecords argument
* Add unit test for the removeProgramRecords
2012-08-01 15:33:05 -04:00
Ryan Poplin
d53105668b
Merged bug fix from Stable into Unstable
2012-08-01 14:53:06 -04:00
Ryan Poplin
fabca66d09
Another fix to VQSR docs
2012-08-01 14:52:49 -04:00
Ryan Poplin
2be29ebd22
Merged bug fix from Stable into Unstable
2012-08-01 14:35:30 -04:00
Ryan Poplin
4093909a56
Updating VQSR docs. Removing references to old best practices pages.
2012-08-01 14:30:24 -04:00
Eric Banks
52b93cab62
Merged bug fix from Stable into Unstable
2012-08-01 13:17:36 -04:00
Eric Banks
22bf052828
Fixing BQSR GATK docs
2012-08-01 13:17:16 -04:00
Eric Banks
459832ee16
Fixed bug in FastaAlternateReferenceMaker when input VCF has overlapping deletions as reported a while back on GS
2012-08-01 10:45:04 -04:00
Eric Banks
a4a41458ef
Update docs of FastaAlternateReferenceMaker as promised in older GS thread
2012-08-01 10:33:41 -04:00
Eric Banks
38e5419b11
Merged bug fix from Stable into Unstable
2012-08-01 09:50:31 -04:00
Eric Banks
56f8afab97
Requested by Geraldine: adding a utility to register deprecated walkers (and the major version of the first release since they were removed) so that the User Error printed out for e.g. CountCovariates now states: Walker CountCovariates is no longer available in the GATK; it has been deprecated since version 2.0.
2012-08-01 09:50:00 -04:00
Guillermo del Angel
0528337467
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-31 18:17:50 -04:00
Guillermo del Angel
4a23f3cd11
Simple cleanup of pool caller code - since usage is much more general than just calling pools, AF calculation models and GL calculation models are renamed from Pool -> GeneralPloidy. Also, don't have users specify special arguments for -glm and -pnrm. Instead, when running UG with sample ploidy != 2, the correct general ploidy modules are automatically detected and loaded. -glm now reverts to old [SNP|INDEL|BOTH] usage
2012-07-31 16:34:20 -04:00
Eric Banks
6cb10cef96
Fixed older GS reported bug. Actually, the problem really lies in Picard (can't set max records in RAM without it throwing an exception, reported on their JIRA) so I just masked out the problem by removing this never-used argument from this rarely-used tool.
2012-07-31 16:00:36 -04:00
Eric Banks
ab53d73459
Quick fix to user error catching
2012-07-31 15:50:32 -04:00
Eric Banks
10111450aa
Fixed AlignmentUtils bug for handling Ns in the CIGAR string. Added a UG integration test that calls a BAM with such reads (provided by a user on GetSatisfaction).
2012-07-31 15:37:22 -04:00
Mark DePristo
f7133ffc31
Cleanup syntax errors from BQSR reorganization
2012-07-31 08:11:05 -04:00
Mark DePristo
dad9bb1192
Changes order of writing BaseRecalibrator results so that if R blows up you still get a meaningful tree
2012-07-31 08:11:04 -04:00
Mark DePristo
0c4e729e13
Working version of adaptive context calculations
-- Uses chi2 test for independence to determine if subcontext is worth representing. Gives excellent visual results
-- Writes out analysis output file producing excellent results in R
-- Trivial reformatting of MathUtils
2012-07-31 08:11:04 -04:00
Mark DePristo
93640b382e
Preliminary version of adaptive context covariate algorithm
-- Works according to visual inspection of output tree
2012-07-31 08:11:04 -04:00
Mark DePristo
315d25409f
Improvement to RecalDatum and VisualizeContextTree
-- Reorganize functions in RecalDatum so that error rate can be computed independently. Added unit tests. Removed equals() method, which is buggy without its associated implementation for hashcode
-- New class RecalDatumTree based on QualIntervals that inherits from RecalDatum but includes the concept of sub data
-- VisualizeContextTree now uses RecalDatumTree and can trivially compute the penalty function for merging nodes, which it displays in the graph
2012-07-31 08:11:04 -04:00
Mark DePristo
57b45bfb1e
Extensive unit tests, contracts, and documentation for RecalDatum
2012-07-31 08:11:03 -04:00
Mark DePristo
e00ed8bc5e
Cleanup BQSR classes
-- Moved most of BQSR classes (which are used throughout the codebase) to utils.recalibration. It's better in my opinion to keep commonly used code in utils, and only specialized code in walkers. As code becomes embedded throughout GATK it should be refactored to live in utils
-- Removed unnecessary imports of BQSR in VQSR v3
-- Now ready to refactor QualQuantizer and unit test into a subclass of RecalDatum, refactor unit tests into RecalDatum unit tests, and generalize into hierarchical recal datum that can be used in QualQuantizer and the analysis of adaptive context covariate
-- Update PluginManager to sort the plugins and interfaces. This allows us to have a deterministic order in which the plugin classes come back, which caused BQSR integration tests to temporarily change because I moved my classes around a bit.
2012-07-31 08:11:03 -04:00
Mark DePristo
191294eedc
Initial cleanup of RecalDatum for move and further refactoring
-- Moved Datum, the now unnecessary superclass, into RecalDatum
-- Fixed some obviously dangerous synchronization errors in RecalDatum, though these may not have caused problems because they may not have been called in parallel mode
2012-07-31 08:11:03 -04:00
Mark DePristo
0670316288
Be clearer that dcov 50 is good for 4x, should use 200 for >30x
2012-07-31 08:11:02 -04:00
Mark DePristo
874dbf5b58
Maximum wait for GATK run report upload reduced to 10 seconds
2012-07-31 08:11:02 -04:00
Guillermo del Angel
e6b326c189
Merge branch 'master' of ssh://gsa4.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-30 21:32:19 -04:00
Guillermo del Angel
6c9d3ec155
Remerge after changes to allele construction code. More cleanups/fixes to artificial read pileup provider
2012-07-30 21:32:03 -04:00
Ryan Poplin
7ed06ee7b9
Updating FindCoveredIntervals to use the changes to the ActiveRegionWalker.
2012-07-30 12:16:27 -04:00
Ryan Poplin
13591b169f
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-07-30 12:13:24 -04:00
Eric Banks
0b30588d67
Catch yet another class of User Errors
2012-07-30 11:59:56 -04:00
Eric Banks
5743694196
Merged bug fix from Stable into Unstable
2012-07-30 11:35:28 -04:00
Eric Banks
79195b97a3
Adding categories for the remaining uncategorized walkers
2012-07-30 11:35:08 -04:00
Guillermo del Angel
5b9a1af7fe
Intermediate fix for pool GL unit test: a) fix up artificial read pileup provider to give consistent data. b) Increase downsampling in pool integration tests with reference sample, and shorten MT tests so they don't last too long
2012-07-30 09:56:10 -04:00
Eric Banks
7630c929a7
Re-enabling the unit tests for reverse allele clipping
2012-07-29 22:24:56 -04:00
Eric Banks
b07bf1950b
Adding an integration test for another feature that I snuck in during a previous commit: we now allow lower-case bases in the REF/ALT alleles of a VCF and upper-case them (this had been turned off because the previous version used Strings to do the uppercasing whereas we stick with byte operations now).
2012-07-29 22:19:49 -04:00
Eric Banks
c4ae9c6cfb
With the new Allele representation we can finally handle complex events (because they aren't so complex anymore). One place this manifests itself is with the strict VCF validation (ValidateVariants used to skip these events but doesn't anymore) so I've added a new test with complex events to the VV integration test.
2012-07-29 19:22:02 -04:00
Eric Banks
99b15b2b3a
Final checkpoint: all tests pass. Note that there were bugs in the PoolGenotypeLikelihoodsUnitTest that needed fixing and eventually led to my needing to disable one of the tests (with a note for Guillermo to look into it). Also note that while I have moved over the GATK to use the new non-null representation of Alleles, I didn't remove all of the now-superfluous code throughout to do padding checking on merges; we'll need to do this on a subsequent push.
2012-07-29 01:07:59 -04:00
Eric Banks
2b1b00ade5
All integration tests and VC/Allele unit tests are passing
2012-07-27 17:03:49 -04:00
Eric Banks
beb7610195
Resolving merge conflicts
2012-07-27 15:52:02 -04:00
Eric Banks
27e7e11ec0
Allele refactoring checkpoint #3 : all integration tests except for PoolCaller are passing now. Fixed a couple of bugs from old code that popped up during md5 difference review. Added VariantContextUtils.requiresPaddingBase() method for tools that create alleles to use for determining whether or not to add the ref padding base. One of the HaplotypeCaller tests wasn't passing because of RankSumTest differences, so I added a TODO for Ryan to look into this.
2012-07-27 15:48:40 -04:00
Ryan Poplin
22bb4804f0
HaplotypeCaller now uses an excessive number of high quality soft clips as a triggering signal in order to capture both end points of a large deletion in a single active region.
2012-07-27 12:44:02 -04:00
Ryan Poplin
a0890126a8
ActiveRegionWalker's isActive function returns a results object now instead of just a double.
2012-07-27 11:01:39 -04:00
Eric Banks
ef335b6213
Several more walkers have been brought up to use the new Allele representation.
2012-07-27 02:14:25 -04:00
Eric Banks
9e2209694a
Re-enable reverse trimming of alleles in UG engine when sub-selecting alleles after genotyping. UG integration tests now pass.
2012-07-27 00:47:15 -04:00
Eric Banks
baf3e33730
Allele refactoring checkpoint 2: all code finally compiles, AD and STR annotations are fixed, and most of the UG integration tests pass.
2012-07-26 23:27:11 -04:00
Ryan Poplin
35e803e110
Merged bug fix from Stable into Unstable
2012-07-26 14:00:04 -04:00
Ryan Poplin
4f741b4cd7
Smoothing in the BQSR bins should be one error observation and one non-error observation.
2012-07-26 13:59:02 -04:00
Guillermo del Angel
2ae890155c
Improvements to indel calling in pool caller: a) Compute per-read likelihoods in reference sample to determine whether a read is informative or not. b) Fixed bugs in unit tests. c) Fixed padding-related bugs when computing matches/mismatches in ErrorModel, d) Added a couple of more integration tests to increase test coverage, including testing odd ploidy
2012-07-26 13:43:00 -04:00
Eric Banks
a694d1b5de
Merge branch 'master' into allelePadding
2012-07-26 01:53:14 -04:00
Eric Banks
32516a2f60
Initial checkpoint commit of VariantContext/Allele refactoring. There were just too many problems associated with the different representation of alleles in VCF (padded) vs. VariantContext (unpadded). We are moving VC to use the VCF representation. No more reference base for indels in VC and no more trimming and padding of alleles. Even reverse trimming has been stopped (the theory being that writers of VCF now know what they are doing and often want the reverse padding if they put it there; this has been requested on GetSatisfaction). Code compiles but presumably pretty much all tests with indels will fail at this point.
2012-07-26 01:50:39 -04:00
Mark DePristo
8c418a15da
Sorting out HMS error handling (fingers crossed)
-- Check if a traversal error occurred in the last shard
-- Catch ExecutionException from the TreeReducer and throw as our HMS exception
-- ShardTraverser just throws the exception as formatted by the HMS, rather than wrapping it as a RuntimeException itself
-- EngineFeaturesIntegrationTests now uses public exampleFASTA (faster), and does 1000x iterations (slower)
2012-07-25 23:13:12 -04:00
Mark DePristo
9242f63a4d
On the way to really sorting out HMS error handling
-- Better error message when a traversal error occurs (a real bug)
-- EngineFeaturesIntegrationTest runs the multi-threaded error testing routines 50 times
-- A bit of cleanup in WalkerTest
2012-07-25 22:11:10 -04:00
Mark DePristo
5671992db3
RMDTrackBuilderUnitTest now uses private/testdata file to avoid filesystem race conditions
2012-07-25 22:05:04 -04:00
Eric Banks
7eb3f54750
Added category docs for the remaining public walkers (I think I got them all). I removed a couple of totally unnecessary walkers.
2012-07-25 21:40:28 -04:00
Eric Banks
2982b24c4b
Merge branch 'master' of ssh://gsa2.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/stable
2012-07-25 20:36:53 -04:00
Eric Banks
0a98a6aa8d
Adding extraDocs tag per Mauricio's request
2012-07-25 18:23:18 -04:00