VariantEvalWalker now passes a reference to itself to each Stratification via setVariantEvalWalker (with an associated get method), so that stratifications can read VEWalker variables to obtain the information their calculations need, such as the list of eval samples. This is a better interface, in my opinion, than the current approach of extending the abstract base Stratification with an initialize function that takes every argument any Strat might need.
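A minimal sketch of the back-reference pattern described above. All class names here (SketchWalker, SketchStratification, SketchAlleleCountStrat) are hypothetical stand-ins, not the real VariantEval classes, and the "one state per allele count for diploid samples" rule is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for VariantEvalWalker; only the eval sample list is modeled.
class SketchWalker {
    private final List<String> evalSamples;

    SketchWalker(List<String> evalSamples) {
        this.evalSamples = evalSamples;
    }

    List<String> getEvalSamples() {
        return evalSamples;
    }

    // The walker hands the stratification a reference to itself, then initializes it.
    void register(SketchStratification strat) {
        strat.setVariantEvalWalker(this);
        strat.initialize();
    }
}

// Hypothetical base class: initialize() takes no stratification-specific arguments.
abstract class SketchStratification {
    private SketchWalker walker;

    void setVariantEvalWalker(SketchWalker walker) {
        this.walker = walker;
    }

    SketchWalker getVariantEvalWalker() {
        return walker;
    }

    abstract void initialize();

    abstract List<String> getStates();
}

// Assumed rule: one state per possible allele count, 0..2N for N diploid samples.
class SketchAlleleCountStrat extends SketchStratification {
    private final List<String> states = new ArrayList<>();

    @Override
    void initialize() {
        // The sample list is pulled from the walker instead of being passed in.
        int nChromosomes = 2 * getVariantEvalWalker().getEvalSamples().size();
        for (int ac = 0; ac <= nChromosomes; ac++) {
            states.add(String.valueOf(ac));
        }
    }

    @Override
    List<String> getStates() {
        return states;
    }
}

class StratSketchDemo {
    public static void main(String[] args) {
        SketchWalker walker = new SketchWalker(Arrays.asList("NA12878", "NA12891"));
        SketchAlleleCountStrat byAC = new SketchAlleleCountStrat();
        walker.register(byAC);
        System.out.println(byAC.getStates());
    }
}
```

The point of the design is that a new stratification needing some other walker variable only adds a getter call in its own initialize(), and the base class signature stays unchanged.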
JEXL expressions now provide access to the VariantContext vc object itself, so you can write JEXL expressions on the command line that directly use VariantContext and its associated functions.
ExomePostQC Queue script now creates a byAC eval using the new strat, and no longer produces a byAF file (as this was not exact, and led to strange punctate behavior when the actual AF quantization was out of sync with the fixed quantization of the AF strat).
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6015 348d0f76-0448-11de-a6fe-93d51630548a
VariantContextUtils now has a utility function that creates a sitesOnlyVariantContext from an input VC
Added a complex merge test of SNPs and indels from the new batch merge wiki at:
http://www.broadinstitute.org/gsa/wiki/index.php/Merging_batched_call_sets
with multiple alleles for an indel. Created a BatchMergeIntegrationTest that uses GGA with the complex merged input alleles to genotype SNPs and Indels with multiple alleles simultaneously in NA12878. Looks great.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5959 348d0f76-0448-11de-a6fe-93d51630548a
a) All rank sum tests now work for indels, including multiallelic sites. For the latter, the rank sum test is REF vs. the most common allele.
b) Redid computation of HaplotypeScore for indels. It's now trivially easy to do because we are already computing likelihoods of each read vs haplotypes in GL computation so we reuse that if available. For multiallelic case, we score against N haplotypes where N is total called alleles.
The drawback is that all cases need the information contained in the likelihood table, which stores a likelihood for each pileup element and each allele. If this table is not available we don't annotate, so right now we can only fully annotate indels when running UG, not when running VariantAnnotator alone.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5947 348d0f76-0448-11de-a6fe-93d51630548a
- Verified that exact calculations do agree with R's dwilcox()
- Verified that exact calculations do not agree with R's wilcox.test
+ This is because R does a correction, and calculates CDFs rather than PDFs (e.g. sums over dwilcox() values)
- Can now specify MWU to calculate cumulative exact tests, rather than point probabilities
- Z-scores are now calculated properly for exact tests
+ Previously, z-values were calculated by inverting the normal CDF from the U-statistic PDF
+ Now both inversions are done, with a smart heuristic (biased variance) to make the point-calculated Z-value more accurate
+ Additional tests
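To make the point-versus-cumulative distinction above concrete, here is a small sketch of the exact U distribution using the standard counting recursion. ExactMwuSketch and its method names are hypothetical and are not the MannWhitneyU class's API; ties are assumed absent. pointProbability plays the role of a dwilcox()-style value, and cumulativeProbability is the sum of those values up to u.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; assumes no tied observations.
class ExactMwuSketch {

    // Counts the orderings of m + n distinct values in which U == u, where U is
    // the number of (x, y) pairs with x > y and x drawn from the set of size m.
    static long countOrderings(int u, int m, int n, long[][][] memo) {
        if (u < 0 || u > m * n) return 0;
        if (m == 0 || n == 0) return 1; // only u == 0 reaches this line
        if (memo[m][n][u] >= 0) return memo[m][n][u];
        // If the largest value comes from the first set it adds n to U;
        // if it comes from the second set it adds nothing.
        long count = countOrderings(u - n, m - 1, n, memo)
                   + countOrderings(u, m, n - 1, memo);
        memo[m][n][u] = count;
        return count;
    }

    // Binomial coefficient "total choose k", computed in floating point.
    static double binomial(int total, int k) {
        double result = 1.0;
        for (int i = 1; i <= k; i++) {
            result = result * (total - k + i) / i;
        }
        return result;
    }

    // Point probability P(U == u).
    static double pointProbability(int u, int m, int n) {
        long[][][] memo = new long[m + 1][n + 1][m * n + 1];
        for (long[][] plane : memo) {
            for (long[] row : plane) {
                Arrays.fill(row, -1L);
            }
        }
        return countOrderings(u, m, n, memo) / binomial(m + n, m);
    }

    // Cumulative probability P(U <= u), as a sum of point probabilities.
    static double cumulativeProbability(int u, int m, int n) {
        double sum = 0.0;
        for (int k = 0; k <= u; k++) {
            sum += pointProbability(k, m, n);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("point:      " + pointProbability(2, 2, 2));
        System.out.println("cumulative: " + cumulativeProbability(2, 2, 2));
    }
}
```

The recursion is exponential without the memo table, so a sketch like this is only practical for the small set sizes where an exact test is used in the first place.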
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5911 348d0f76-0448-11de-a6fe-93d51630548a
Fixed minor problem with WalkerTest for "" (for parameterization) md5s.
Added an explicit integrationtest for BAQ NONE
Now only creates the BAQ'd pileup if the useBAQPileup parameter is provided in initializeAlternateAllele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5891 348d0f76-0448-11de-a6fe-93d51630548a
Also allows for the slightly more awesome name "MWUnitTest"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5850 348d0f76-0448-11de-a6fe-93d51630548a
- Running a test when there are no observations of at least one of the sets now breaks the MWU contract
+ MWU returns Pair(Double.NaN,Double.NaN) in these instances to maintain the contract of never returning null
+ No more Double.Infinity values will appear
- RankSumTests now probe the return values for NaNs, and don't annotate if they appear
- For small sets where the probability is calculated recursively, the z-value is now the inversion of the error function
and not the approximate z-value
- UG and Annotator integration tests updated to reflect changes
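The never-null, NaN-on-empty contract and the caller-side check can be sketched as follows. RankSumResultSketch and RankSumNaNSketch are hypothetical names; the result here carries (U, z) from a plain normal approximation, which is an assumption for illustration and not necessarily what the real Pair holds.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical result holder standing in for the Pair returned by the test.
final class RankSumResultSketch {
    final double u;
    final double z;

    RankSumResultSketch(double u, double z) {
        this.u = u;
        this.z = z;
    }
}

class RankSumNaNSketch {

    // Never returns null. If either set is empty, both fields are NaN.
    static RankSumResultSketch runTest(double[] ref, double[] alt) {
        int m = ref.length;
        int n = alt.length;
        if (m == 0 || n == 0) {
            return new RankSumResultSketch(Double.NaN, Double.NaN);
        }
        // U counts (alt, ref) pairs where the alt observation is larger; ties count half.
        double u = 0.0;
        for (double a : alt) {
            for (double r : ref) {
                if (a > r) {
                    u += 1.0;
                } else if (a == r) {
                    u += 0.5;
                }
            }
        }
        // Normal approximation to the U distribution (no tie correction).
        double mean = m * n / 2.0;
        double sd = Math.sqrt(m * n * (m + n + 1) / 12.0);
        return new RankSumResultSketch(u, (u - mean) / sd);
    }

    // Caller side: probe for NaN and skip the annotation if it appears.
    static Map<String, Object> annotate(String key, double[] ref, double[] alt) {
        Map<String, Object> annotations = new LinkedHashMap<>();
        RankSumResultSketch result = runTest(ref, alt);
        if (!Double.isNaN(result.z)) {
            annotations.put(key, result.z);
        }
        return annotations;
    }

    public static void main(String[] args) {
        System.out.println(annotate("ExampleRankSum", new double[] {1, 2}, new double[] {3, 4}));
        System.out.println(annotate("ExampleRankSum", new double[] {1, 2}, new double[] {}));
    }
}
```

Returning NaN rather than null or an infinity keeps the check on the caller side to a single Double.isNaN() test.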
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5845 348d0f76-0448-11de-a6fe-93d51630548a
Right now, if you select a multi-sample VCF file (or, I see, one with filters) down to a smaller set of samples, and the site isn't polymorphic in that subgroup, then the alt allele is lost. For example, when selecting NA12878 from the OMNI, I previously received the following VCF:
1 82154 rs4477212 A . . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C . . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
Here the first two records lost the ALT allele because NA12878 is hom-ref at those sites. My change results in a VCF that looks like:
1 82154 rs4477212 A G . PASS AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0 GT:GC 0/0:0.7205
1 534247 SNP1-524110 C T . PASS AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0 GT:GC 0/0:0.6491
1 565286 SNP1-555149 C T . PASS AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0 GT:GC 1/1:0.3471
1 569624 SNP1-559487 T C . PASS AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0 GT:GC 1/1:0.3942
The genotype remains unchanged, but the ALT allele is now preserved. I think this is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().
isVariant => is there an ALT allele?
isPolymorphic => is some sample non-ref in the samples?
In part this is complicated by the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth adopting a single approach to handling this at some point. Wiki docs updated.
Does anyone have critical infrastructure that depends on the previous convention? Let me know so we can coordinate the change.
There's a new function subContextFromGenotypes() that also takes a Set<Allele> to handle this type of behavior.
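The difference between the two subsetting behaviors, and between isVariant and isPolymorphic, can be sketched like this. SiteSketch is a hypothetical toy class, not VariantContext, and the second sample "OTHER" with genotype A/G is invented purely for illustration. The two-argument subContext mirrors the idea of passing an explicit allele set, as the new subContextFromGenotypes() overload does.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of a site: alleles are plain strings, "." is a no-call.
class SiteSketch {
    final String ref;
    final Set<String> alts;
    final Map<String, List<String>> genotypes; // sample name -> called alleles

    SiteSketch(String ref, Set<String> alts, Map<String, List<String>> genotypes) {
        this.ref = ref;
        this.alts = alts;
        this.genotypes = genotypes;
    }

    // isVariant => is there an ALT allele?
    boolean isVariant() {
        return !alts.isEmpty();
    }

    // isPolymorphic => is some sample non-ref in the samples?
    boolean isPolymorphic() {
        for (List<String> gt : genotypes.values()) {
            for (String allele : gt) {
                if (isAlt(allele)) return true;
            }
        }
        return false;
    }

    private boolean isAlt(String allele) {
        return !allele.equals(ref) && !allele.equals(".");
    }

    private Map<String, List<String>> select(Set<String> samples) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : genotypes.entrySet()) {
            if (samples.contains(e.getKey())) kept.put(e.getKey(), e.getValue());
        }
        return kept;
    }

    // Previous behavior: ALT alleles are re-derived from the selected genotypes.
    SiteSketch subContext(Set<String> samples) {
        Map<String, List<String>> kept = select(samples);
        Set<String> seenAlts = new LinkedHashSet<>();
        for (List<String> gt : kept.values()) {
            for (String allele : gt) {
                if (isAlt(allele)) seenAlts.add(allele);
            }
        }
        return new SiteSketch(ref, seenAlts, kept);
    }

    // New behavior: the caller supplies the ALT alleles to preserve.
    SiteSketch subContext(Set<String> samples, Set<String> altsToKeep) {
        return new SiteSketch(ref, altsToKeep, select(samples));
    }

    public static void main(String[] args) {
        Map<String, List<String>> gts = new LinkedHashMap<>();
        gts.put("NA12878", Arrays.asList("A", "A"));
        gts.put("OTHER", Arrays.asList("A", "G")); // invented sample
        SiteSketch site = new SiteSketch("A", new LinkedHashSet<>(Arrays.asList("G")), gts);

        Set<String> keep = Collections.singleton("NA12878");
        SiteSketch before = site.subContext(keep);
        SiteSketch after = site.subContext(keep, site.alts);
        System.out.println("old: variant=" + before.isVariant() + " polymorphic=" + before.isPolymorphic());
        System.out.println("new: variant=" + after.isVariant() + " polymorphic=" + after.isPolymorphic());
    }
}
```

With the allele set preserved, the subset site stays a variant site (ALT is still G) while correctly reporting that it is not polymorphic in the selected sample.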
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5832 348d0f76-0448-11de-a6fe-93d51630548a
GenomeLocs can officially have any start/stop values from -Inf to +Inf. Bounds w.r.t. the reference are enforced, optionally, by GenomeLocParser. General code cleanup throughout the subsystem.
All validation code for GLs is now centralized, and all I/O systems now validate their inputs. Because of this, the Picard interval processing code has been changed to examine whether an interval is valid, and only keep the valid intervals. Note that the scatter/gather test was changed, because the original hg18 chr20 interval file was actually malformed (all records for some reason were on chr20).
Many interval processing routines were moved to IntervalUtils, as this is their natural home.
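A rough sketch of the split between an unconstrained location object and a parser that optionally enforces reference bounds. LocSketch and LocParserSketch are hypothetical names, not the GenomeLoc or GenomeLocParser API; the 1-based inclusive coordinates, the start <= stop rule, and the contig length used below are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical immutable location: any start/stop is representable.
final class LocSketch {
    final String contig;
    final long start;
    final long stop;

    LocSketch(String contig, long start, long stop) {
        this.contig = contig;
        this.start = start;
        this.stop = stop;
    }
}

// Hypothetical parser that owns all validation against the reference.
class LocParserSketch {
    private final Map<String, Long> contigLengths;

    LocParserSketch(Map<String, Long> contigLengths) {
        this.contigLengths = new HashMap<>(contigLengths);
    }

    // Centralized check: known contig, 1-based inclusive bounds within its length.
    boolean isValid(String contig, long start, long stop) {
        Long length = contigLengths.get(contig);
        return length != null && start >= 1 && start <= stop && stop <= length;
    }

    // Bounds are only enforced when the caller asks for it.
    LocSketch createLoc(String contig, long start, long stop, boolean mustBeOnReference) {
        if (mustBeOnReference && !isValid(contig, start, stop)) {
            throw new IllegalArgumentException(
                    "Location is not on the reference: " + contig + ":" + start + "-" + stop);
        }
        return new LocSketch(contig, start, stop);
    }

    // Interval-list processing: keep only the intervals that validate.
    List<LocSketch> keepValid(List<LocSketch> intervals) {
        List<LocSketch> valid = new ArrayList<>();
        for (LocSketch loc : intervals) {
            if (isValid(loc.contig, loc.start, loc.stop)) valid.add(loc);
        }
        return valid;
    }

    public static void main(String[] args) {
        Map<String, Long> lengths = new HashMap<>();
        lengths.put("chr20", 1000L); // made-up length
        LocParserSketch parser = new LocParserSketch(lengths);

        List<LocSketch> intervals = new ArrayList<>();
        intervals.add(parser.createLoc("chr20", 10, 20, true));
        intervals.add(parser.createLoc("chr20", -5, 2000, false)); // allowed, but off the reference
        System.out.println(parser.keepValid(intervals).size() + " valid interval(s)");
    }
}
```

Keeping the check in one place means every input path can call the same isValid() rather than carrying its own bounds logic.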
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5830 348d0f76-0448-11de-a6fe-93d51630548a
Refactored several interval utilities from GenomeLocParser into IntervalUtils, where one would expect them to live
Removed the GenomeLoc.clone() method, as it was not correctly implemented and is actually unnecessary, since GenomeLocs are immutable. Several iterator classes were changed to remove their use of clone()
Removed misc. unnecessary imports
Disabled, temporarily, the validating pileup integration test, as it uses reads mapped to a different reference sequence for ecoli, and this now does not satisfy the contracts for GenomeLoc
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5827 348d0f76-0448-11de-a6fe-93d51630548a