Eric Banks
d6277b70d8
Forgot to consider the optimized case in hasAllele
2012-04-24 11:32:28 -04:00
Eric Banks
74ad008163
Adding VariantContext.hasAlternateAllele functionality
2012-04-24 11:07:46 -04:00
Eric Banks
63aa79df82
Slightly better error message
2012-04-23 09:37:28 -04:00
Eric Banks
4edb005411
Catch poorly formatted PL/GL fields
2012-04-23 09:33:50 -04:00
Eric Banks
1f23d99dfa
If we are subsetting alleles in the UG (either because there were too many or because some were not polymorphic), then we may need to trim the alleles (because the original VariantContext may have had to pad at the end). Thanks to Ryan for reporting this. Only one of the integration tests had even partially covered this case, so I added one that did.
2012-04-20 17:00:05 -04:00
Eric Banks
818e8c2fb9
Resolving merge conflicts
2012-04-12 15:19:44 -04:00
Eric Banks
0dd571928d
Let's not have the indel model emit more than the max possible number of genotypable alt alleles (since we may not be able to subset down to the best ones).
2012-04-12 15:16:29 -04:00
Eric Banks
f77a6d18b8
Bad conflict merge before
2012-04-12 09:56:49 -04:00
Eric Banks
cc71baf691
Don't allow users to try to genotype more than the max possible value (catch and throw a User Error at startup). Better docs explaining that users shouldn't play with this value unless they know what they are doing.
2012-04-12 09:18:44 -04:00
Guillermo del Angel
719ec9144a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-09 14:53:19 -04:00
Guillermo del Angel
550179a1f7
Major refactorings/optimizations of pool caller, output still bit-true to older version: a) Move DEFAULT_PLOIDY from UnifiedGenotyperEngine to VariantContextUtils. b) Optimize iteration through all possible allele combinations. c) Don't store log PL's in hashmap from allele conformations to double, it was too slow. Things can still be optimized much more down the line if needed. d) Remove remaining traces of genotype priors.
2012-04-09 14:53:05 -04:00
Mark DePristo
52ef4a3e26
Function to compute whether a VariantContext indel is part of a TandemRepeat
Returns true iff VC is a non-complex indel where every allele represents an expansion or
contraction of a series of identical bases in the reference.
The logic of this function is pretty simple. Take all of the non-null alleles in VC. For
each insertion allele of n bases, check if that allele matches the next n reference bases.
For each deletion allele of n bases, check if it matches the reference bases from n to 2n,
as it must necessarily match the first n bases. If this test returns true for all
alleles you are a tandem repeat, otherwise you are not. Note that in this context n is the
difference in length (in bases) between the ref and alt alleles.
2012-04-06 16:07:46 -04:00
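Illustration for 52ef4a3e26: the test described in that commit message can be sketched in Java roughly as follows. This is a simplified sketch, not the committed code. The class and method names are hypothetical, alleles are plain strings with any padding base already removed, and refAfter is assumed to hold the reference bases starting at the first base affected by the indel.

```java
import java.util.List;

final class TandemRepeatSketch {
    /** An insertion of n bases is a repeat expansion if it equals the next n reference bases. */
    static boolean insertionRepeats(String inserted, String refAfter) {
        return !inserted.isEmpty() && refAfter.startsWith(inserted);
    }

    /**
     * A deletion of n bases removes refAfter[0, n), so it trivially matches those bases.
     * It is a repeat contraction if refAfter[n, 2n) holds the same bases again.
     */
    static boolean deletionRepeats(int n, String refAfter) {
        return n > 0
                && refAfter.length() >= 2 * n
                && refAfter.regionMatches(n, refAfter, 0, n);
    }

    /** True only if there is at least one allele and every allele passes its check. */
    static boolean isTandemRepeat(String refAfter, List<String> insertions, List<Integer> deletionLengths) {
        if (insertions.isEmpty() && deletionLengths.isEmpty()) {
            return false;
        }
        for (String inserted : insertions) {
            if (!insertionRepeats(inserted, refAfter)) return false;
        }
        for (int n : deletionLengths) {
            if (!deletionRepeats(n, refAfter)) return false;
        }
        return true;
    }
}
```

With refAfter = "ACACACGT", an inserted "AC" and a 2-base deletion both pass, while an inserted "AG" or a 3-base deletion does not.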
Eric Banks
5c3ddec4c2
Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update.
2012-04-05 10:49:08 -04:00
Eric Banks
2c956efa53
Minor fixups to GenotypeLikelihoods
2012-04-05 09:14:37 -04:00
Guillermo del Angel
820216dc68
More pool caller cleanups: Move common duplicated code between Pool and Exact AF calculation models up to super-class to avoid duplication. TMP: Have pool genotypes include the GT field. Mostly because without genotypes we can't get the site-wide AF,AC annotations, but it's unwieldy because it makes the genotype columns very long, TBD final implementation
2012-04-04 16:23:10 -04:00
Eric Banks
9e32a975f8
Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore.
2012-04-04 13:47:59 -04:00
Guillermo del Angel
05d8400468
Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)
2012-04-03 20:51:24 -04:00
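Illustration for 05d8400468: the distinction between total alleles and alt alleles matters because the number of genotype likelihoods is a function of the total allele count (ref included). The sketch below is an assumption about the underlying arithmetic, not the signature or body of GenotypeLikelihoods.calcNumLikelihoods; the class and method names are hypothetical. It counts unordered genotypes as multisets of size ploidy drawn from numAlleles alleles, i.e. C(numAlleles + ploidy - 1, ploidy).

```java
final class NumLikelihoodsSketch {
    /**
     * Number of distinct unordered genotypes for the given TOTAL allele count
     * (reference allele included) and ploidy.
     */
    static long numLikelihoods(int numAlleles, int ploidy) {
        if (numAlleles < 1 || ploidy < 1) {
            throw new IllegalArgumentException("numAlleles and ploidy must be positive");
        }
        long result = 1;
        // After step i, result equals C(numAlleles - 1 + i, i), so each division is exact.
        for (int i = 1; i <= ploidy; i++) {
            result = result * (numAlleles - 1 + i) / i;
        }
        return result;
    }
}
```

For a diploid biallelic site the call is numLikelihoods(2, 2), which gives 3 (AA, AB, BB). Passing the alt-allele count (1) instead would give 1, which is the kind of mismatch the commit message warns about.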
Guillermo del Angel
9e11b4f9a7
Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.
2012-04-03 15:43:32 -04:00
Eric Banks
e4a225ed09
Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally.
2012-03-29 11:07:37 -04:00
Mark DePristo
12aa72f200
Merged bug fix from Stable into Unstable
2012-03-27 22:43:00 -04:00
Mark DePristo
979a84a252
Bugfix for thread unsafe PL cache
-- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Solution is to use a fixed cache that's never updated on the fly. My changes limit us to having no more than 500 alleles at a site, which I hope is ok but easy enough to up to a ridiculously large number.
2012-03-27 22:42:30 -04:00
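Illustration for 979a84a252: one way to get a cache that is "never updated on the fly" is to build it once in a static initializer and only read from it afterwards. The sketch below assumes the cache maps a PL index to its diploid allele pair in VCF ordering; the class name, method names, and table layout are hypothetical, and only the 500-allele limit comes from the commit message.

```java
final class FixedPLCacheSketch {
    /** Upper bound on alleles per site, as stated in the commit message. */
    static final int MAX_ALLELES = 500;

    /** Built once during class initialization and never written to again. */
    private static final int[][] PAIRS = build(MAX_ALLELES);

    private static int[][] build(int maxAlleles) {
        int[][] pairs = new int[maxAlleles * (maxAlleles + 1) / 2][];
        int index = 0;
        // VCF ordering: genotype (j, k) with j <= k sits at index k*(k+1)/2 + j.
        for (int k = 0; k < maxAlleles; k++) {
            for (int j = 0; j <= k; j++) {
                pairs[index++] = new int[] {j, k};
            }
        }
        return pairs;
    }

    /** Read-only lookup; returns a copy so callers cannot modify the shared table. */
    static int[] allelePair(int plIndex) {
        if (plIndex < 0 || plIndex >= PAIRS.length) {
            throw new IllegalArgumentException("PL index out of range: " + plIndex);
        }
        return PAIRS[plIndex].clone();
    }
}
```

Because the table is never resized or refilled at lookup time, concurrent readers need no locking. The cost is a hard cap, which is the trade-off the commit message describes.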
Eric Banks
2511839068
Merged bug fix from Stable into Unstable
2012-03-23 13:51:33 -04:00
Eric Banks
d3f2bc4361
Pre-allocate 10 alt alleles worth of PLs in the cache for efficiency. This effectively means that we never need to re-allocate the cache in the future because we can't ever really handle that many alt alleles.
2012-03-23 13:51:00 -04:00
Eric Banks
9223e451a3
Merged bug fix from Stable into Unstable
2012-03-18 00:54:19 -04:00
Eric Banks
344a938a70
When checking to make sure that we have cached enough data in the PL array, use the converted index value since that's what will be used as an index into the array.
2012-03-18 00:36:30 -04:00
Eric Banks
be9e48ba29
Merged bug fix from Stable into Unstable
2012-03-16 14:33:53 -04:00
Eric Banks
dce6b91f7d
Add a conversion from the deprecated PL ordering to the new one. We need this for the DiploidSNPGenotypeLikelihoods which still use the old ordering. My intention is for this to be a temporary patch, but changing the ordering in DiploidSNPGenotypeLikelihoods is not appropriate for committing to stable as it will break all of the external tools (e.g. MuTect) that are built on top of the class. We will have to talk to e.g. Kristian to see how disruptive this will be. Added unit tests to the GL conversions and indexing.
2012-03-16 11:14:37 -04:00
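Illustration for dce6b91f7d: a conversion between two PL orderings amounts to an index permutation. The sketch below assumes the deprecated ordering lists genotypes row by row (0/0, 0/1, 0/2, 1/1, 1/2, 2/2) and the new ordering is the VCF one, where genotype j/k with j <= k sits at index k*(k+1)/2 + j. The row-by-row assumption, the class name, and the method names are all hypothetical and are not taken from the GATK source.

```java
final class PLOrderingSketch {
    /** VCF index of the unordered diploid genotype j/k. */
    static int vcfIndex(int j, int k) {
        int lo = Math.min(j, k);
        int hi = Math.max(j, k);
        return hi * (hi + 1) / 2 + lo;
    }

    /** map[oldIndex] = newIndex, where the old ordering is assumed to be row by row. */
    static int[] rowMajorToVcf(int numAlleles) {
        int[] map = new int[numAlleles * (numAlleles + 1) / 2];
        int oldIndex = 0;
        for (int j = 0; j < numAlleles; j++) {
            for (int k = j; k < numAlleles; k++) {
                map[oldIndex++] = vcfIndex(j, k);
            }
        }
        return map;
    }

    /** Returns a copy of the likelihoods rearranged from the old ordering into VCF ordering. */
    static double[] reorder(double[] oldLikelihoods, int numAlleles) {
        int[] map = rowMajorToVcf(numAlleles);
        if (oldLikelihoods.length != map.length) {
            throw new IllegalArgumentException("expected " + map.length + " likelihoods");
        }
        double[] reordered = new double[map.length];
        for (int i = 0; i < map.length; i++) {
            reordered[map[i]] = oldLikelihoods[i];
        }
        return reordered;
    }
}
```

For three alleles the two orderings differ only in the positions of 0/2 and 1/1, so the map is {0, 1, 3, 2, 4, 5}. Any bounds check on a cache indexed in the new ordering should use the converted index, not the original one.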
Eric Banks
41068b6985
This commit constitutes a major refactoring of the UG as far as the genotype likelihoods are concerned. I hate to do this in stable, but the VCFs currently being produced by the UG are totally busted. I am trying to make just the necessary changes in stable, doing everything else in unstable later. Now all GL calculations are unified into the GenotypeLikelihoods class - please try and use this functionality from now on instead of duplicating the code.
2012-03-15 16:08:58 -04:00
Ryan Poplin
1da8928407
HC GenotypingEngine marginalizes over haplotypes when outputting events that were found on a subset of the called haplotypes.
2012-03-14 15:22:21 -04:00
Ryan Poplin
78a4e7e45e
Major restructuring of HaplotypeCaller's LikelihoodCalculationEngine and GenotypingEngine. We no longer create an ugly event dictionary and genotype events found on haplotypes independently by finding the haplotype with the max likelihood. Lots of code has been rewritten to be much cleaner.
2012-03-14 12:05:05 -04:00
Eric Banks
a4a279ce80
Damn you, Mark
2012-02-28 10:09:09 -05:00
Eric Banks
bd398e30fd
Another quick optimization
2012-02-28 09:25:35 -05:00
Eric Banks
40bdadbda5
Minor optimization as per Mark
2012-02-28 09:24:07 -05:00
Eric Banks
d7928ad669
Drat, missed one: handle null alleles being passed in.
2012-02-27 21:31:54 -05:00
Mark DePristo
24356f11b7
Merged bug fix from Stable into Unstable
-- Resolved conflict
Conflicts:
public/java/src/org/broadinstitute/sting/gatk/datasources/reads/SAMDataSource.java
2012-02-27 17:13:17 -05:00
Mark DePristo
100ddef930
Fix typo in VariantContextBuilder
2012-02-27 17:06:45 -05:00
Mark DePristo
729bb954e2
Throws ReviewedStingException for a bug when parent VariantContext argument is null
2012-02-27 15:09:00 -05:00
Eric Banks
998ed8fff3
Bug fix to deal with VCF records that don't have GTs. While in there, optimized a bunch of related functions (including removing a copy of the method calculateChromosomeCounts(); why did we have 2 copies? very dangerous).
2012-02-27 14:56:10 -05:00
Eric Banks
64754e7870
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-02-27 11:31:41 -05:00
Eric Banks
dfdf4f989b
Enabling Fisher Strand for multi-allelics: use the alt allele with max AC. Added minor optimization to the method in the VC.
2012-02-27 09:50:09 -05:00
Mark DePristo
e0c189909f
Added support for breakpoint alleles
-- See https://getsatisfaction.com/gsa/topics/support_vcf_4_1_structural_variation_breakend_alleles?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Added integrationtest to ensure that we can parse and write out breakpoint example
2012-02-23 12:14:48 -05:00
Eric Banks
718da7757e
Fixes to ValidateVariants as per GS post: the ref base of mixed alleles was sometimes wrong, the error printout for bad ACs was throwing a RuntimeException, and ACs are no longer validated if there are no genotypes.
2012-02-07 13:15:58 -05:00
Eric Banks
702a2d768f
Initial version of multi-allelic summary module in VariantEval
2012-01-25 19:42:55 -05:00
Menachem Fromer
db645a94ca
Added options to make the batch-merger more all-inclusive: keep all indels, SNPs (even filtered ones) but maintain their annotations. Also, VariantContextUtils.simpleMerge can now merge variants of all types using the Hidden non-default enum MultipleAllelesMergeType=MIX_TYPES
2012-01-25 16:10:59 -05:00
Menachem Fromer
066da80a3d
Added KEEP_UNCONDTIONAL option which permits even sites with only filtered records to be included as unfiltered sites in the output
2012-01-19 18:19:58 -05:00
Mark DePristo
1994c3e3bc
Only print warning about allele incompatibility in CombineVariants when there are genotypes in the file
2011-12-16 16:50:51 -05:00
Mark DePristo
71b4bb12b7
Bug fix for incorrect logic in subsetSamples
-- Now properly handles the case where a sample isn't present (no longer adds a null to the genotypes list)
-- Fix for logic failure where if the number of requested samples equals the number of known genotypes then all of the records were returned, which isn't correct when there are missing samples.
-- Unit tests added to handle these cases
2011-12-14 16:14:26 -05:00
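Illustration for 71b4bb12b7: the two fixes listed in that commit (skip samples that are not present, and do not shortcut when the requested count equals the genotype count) can be sketched as below. This is a simplified stand-in, not the committed subsetSamples code: genotypes are modeled as a plain map from sample name to a genotype string, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class SubsetSamplesSketch {
    /**
     * Returns only the genotypes whose sample name was requested.
     * Missing samples are skipped rather than added as null entries, and there is
     * no "same size, so return everything" shortcut, because equal sizes do not
     * imply equal sample sets.
     */
    static Map<String, String> subset(Map<String, String> genotypes, Set<String> requestedSamples) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String sample : requestedSamples) {
            String genotype = genotypes.get(sample);
            if (genotype != null) {
                result.put(sample, genotype);
            }
        }
        return result;
    }
}
```

With genotypes for samples A and B and a request for A and C, both collections have size 2, yet the correct result contains only A.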
Mark DePristo
a3aef8fa53
Final performance optimization for GenotypesContext
2011-11-22 17:19:30 -05:00
Mark DePristo
7087310373
Embarrassing bug fixed
2011-11-22 10:16:36 -05:00
Mark DePristo
e484625594
GenotypesContext now updates cached data for add, set, replace operations when possible
-- Involved separately managing the sample -> offset and sample sorted list operations. This should improve performance throughout the system
2011-11-22 08:40:48 -05:00