Commit Graph

154 Commits (d6277b70d8e860ac4ef37d7438687480e79eb111)

Author SHA1 Message Date
Eric Banks d6277b70d8 Forgot to consider the optimized case in hasAllele 2012-04-24 11:32:28 -04:00
Eric Banks 74ad008163 Adding VariantContext.hasAlternateAllele functionality 2012-04-24 11:07:46 -04:00
Eric Banks 63aa79df82 Slightly better error message 2012-04-23 09:37:28 -04:00
Eric Banks 4edb005411 Catch poorly formatted PL/GL fields 2012-04-23 09:33:50 -04:00
Eric Banks 1f23d99dfa If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did. 2012-04-20 17:00:05 -04:00
Eric Banks 818e8c2fb9 Resolving merge conflicts 2012-04-12 15:19:44 -04:00
Eric Banks 0dd571928d Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones). 2012-04-12 15:16:29 -04:00
Eric Banks f77a6d18b8 Bad conflict merge before 2012-04-12 09:56:49 -04:00
Eric Banks cc71baf691 Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing. 2012-04-12 09:18:44 -04:00
Guillermo del Angel 719ec9144a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-09 14:53:19 -04:00
Guillermo del Angel 550179a1f7 Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors. 2012-04-09 14:53:05 -04:00
Mark DePristo 52ef4a3e26 Function to compute whether a VariantContext indel is part of a TandemRepeat
Returns true iff VC is an non-complex indel where every allele represents an expansion or
 contraction of a series of identical bases in the reference.

 The logic of this function is pretty simple.  Take all of the non-null alleles in VC.  For
 each insertion allele of n bases, check if that allele matches the next n reference bases.
 For each deletion allele of n bases, check if this matches the reference bases at n - 2 n,
 as it must necessarily match the first n bases.  If this test returns true for all
 alleles you are a tandem repeat, otherwise you are not.  Note that in this context n is the
 base differences between the ref and alt alleles
2012-04-06 16:07:46 -04:00
Eric Banks 5c3ddec4c2 Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update. 2012-04-05 10:49:08 -04:00
Eric Banks 2c956efa53 Minor fixups to GenotypeLikelihoods 2012-04-05 09:14:37 -04:00
Guillermo del Angel 820216dc68 More pool caller cleanups: ove common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation 2012-04-04 16:23:10 -04:00
Eric Banks 9e32a975f8 Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore. 2012-04-04 13:47:59 -04:00
Guillermo del Angel 05d8400468 Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet) 2012-04-03 20:51:24 -04:00
Guillermo del Angel 9e11b4f9a7 Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced. 2012-04-03 15:43:32 -04:00
Eric Banks e4a225ed09 Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally. 2012-03-29 11:07:37 -04:00
Mark DePristo 12aa72f200 Merged bug fix from Stable into Unstable 2012-03-27 22:43:00 -04:00
Mark DePristo 979a84a252 Bugfix for thread unsafe PL cache
-- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Solution is to use a fixed cache that's never updated on the fly.  My changes limit us to having no more than 500 alleles at a site, which I hope is ok but easy enough to up to a ridiculously large number.
2012-03-27 22:42:30 -04:00
Eric Banks 2511839068 Merged bug fix from Stable into Unstable 2012-03-23 13:51:33 -04:00
Eric Banks d3f2bc4361 Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles. 2012-03-23 13:51:00 -04:00
Eric Banks 9223e451a3 Merged bug fix from Stable into Unstable 2012-03-18 00:54:19 -04:00
Eric Banks 344a938a70 When checking to make sure that we have cached enough data in the PL array, use the converted index value since that's what will be used as an index into the array. 2012-03-18 00:36:30 -04:00
Eric Banks be9e48ba29 Merged bug fix from Stable into Unstable 2012-03-16 14:33:53 -04:00
Eric Banks dce6b91f7d Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appriopriate for committing to stable as it will break all of the external tools (e.g. MuTec) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing. 2012-03-16 11:14:37 -04:00
Eric Banks 41068b6985 The commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code. 2012-03-15 16:08:58 -04:00
Ryan Poplin 1da8928407 HC GenotypingEngine marginalizes over haplotypes when outputing events that were found on a subset of the called haplotypes. 2012-03-14 15:22:21 -04:00
Ryan Poplin 78a4e7e45e Major restructuring of HaplotypeCaller's LikelihoodCalculationEngine and GenotypingEngine. We no longer create an ugly event dictionary and genotype events found on haplotypes independently by finding the haplotype with the max likelihood. Lots of code has been rewritten to be much cleaner. 2012-03-14 12:05:05 -04:00
Eric Banks a4a279ce80 Damn you, Mark 2012-02-28 10:09:09 -05:00
Eric Banks bd398e30fd Another quick optimization 2012-02-28 09:25:35 -05:00
Eric Banks 40bdadbda5 Minor optimization as per Mark 2012-02-28 09:24:07 -05:00
Eric Banks d7928ad669 Drat, missed one: handle null alleles being passed in. 2012-02-27 21:31:54 -05:00
Mark DePristo 24356f11b7 Merged bug fix from Stable into Unstable
-- Resolved conflict

Conflicts:
	public/java/src/org/broadinstitute/sting/gatk/datasources/reads/SAMDataSource.java
2012-02-27 17:13:17 -05:00
Mark DePristo 100ddef930 Fix typo in VariantContextBuilder 2012-02-27 17:06:45 -05:00
Mark DePristo 729bb954e2 Throws ReviewedStingException for a bug when parent VariantContext argument is null 2012-02-27 15:09:00 -05:00
Eric Banks 998ed8fff3 Bug fix to deal with VCF records that don't have GTs. While in there, optimized a bunch of related functions (including removing a copy of the method calculateChromosomeCounts(); why did we have 2 copies? very dangerous). 2012-02-27 14:56:10 -05:00
Eric Banks 64754e7870 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-02-27 11:31:41 -05:00
Eric Banks dfdf4f989b Enabling Fisher Strand for multi-allelics: use the alt allele with max AC. Added minor optimization to the method in the VC. 2012-02-27 09:50:09 -05:00
Mark DePristo e0c189909f Added support for breakpoint alleles
-- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Added integrationtest to ensure that we can parse and write out breakpoint example
2012-02-23 12:14:48 -05:00
Eric Banks 718da7757e Fixes to ValidateVariants as per GS post: ref base of mixed alleles were sometimes wrong, error print out of bad ACs was throwing a RuntimeException, don't validate ACs if there are no genotypes. 2012-02-07 13:15:58 -05:00
Eric Banks 702a2d768f Initial version of multi-allelic summary module in VariantEval 2012-01-25 19:42:55 -05:00
Menachem Fromer db645a94ca Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES 2012-01-25 16:10:59 -05:00
Menachem Fromer 066da80a3d Added KEEP_UNCONDTIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output 2012-01-19 18:19:58 -05:00
Mark DePristo 1994c3e3bc Only print warning about allele incompatibility when running there are genotypes in the file in CombineVariants 2011-12-16 16:50:51 -05:00
Mark DePristo 71b4bb12b7 Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
Mark DePristo a3aef8fa53 Final performance optimization for GenotypesContext 2011-11-22 17:19:30 -05:00
Mark DePristo 7087310373 Embarassing bug fixed 2011-11-22 10:16:36 -05:00
Mark DePristo e484625594 GenotypesContext now updates cached data for add, set, replace operations when possible
-- Involved separately managing the sample -> offset and sample sorted list operations.  This should improve performance throughout the system
2011-11-22 08:40:48 -05:00