1) There are now separate parameters for sample name (-sn), sample file (-sf) and sample expression (-se). The unexpected behavior of the previous implementation was too tricky to leave unchecked (if you had a file or directory named after a sample name, SelectVariants wouldn't work).
1b) Fixed a TODO added by Eric: the output VCF now always has the samples sorted alphabetically regardless of input (this came as a byproduct of the implementation of 1).
2) Discordance and Concordance now work in combination with all other parameters.
3) Discordance now follows Guillermo's suggestion where the discordance track is your VCF and the variant track is the one you are comparing to. I have updated the example in the wiki to reflect this change in interpretation.
4) If you DON'T provide any samples (-sn, -se or -sf), SelectVariants works with all samples from the VCF and ignores sample/genotype information when doing concordance or discordance. That is, it will report every "missing line" or "concordant line" in the two VCFs, regardless of sample or genotype information.
5) When samples are provided (-sn, -se or -sf), discordance and concordance go down to the genotypes to determine whether or not you have a discordance/concordance event. In this case, a concordance happens only when the two VCFs display the same sample/genotype information for that locus, and a discordance happens when the disc track is missing the line or has different genotype information for that sample.
6) When dealing with multiple samples, concordance only happens if ALL your samples agree, and discordance happens if AT LEAST ONE of your samples disagrees.
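The rules in 4) to 6) can be sketched as follows. This is a minimal illustration, not the SelectVariants code: it assumes a hypothetical model in which a VCF line at one locus is a dict from sample name to genotype string and None means the track has no line there. The function names are made up for this sketch.

```python
from typing import Dict, Iterable, Optional

# Hypothetical model: one VCF line = {sample name: genotype string};
# None means the track has no line at this locus.
Record = Optional[Dict[str, str]]


def is_concordant(variant: Record, conc: Record, samples: Iterable[str] = ()) -> bool:
    """Concordance: the line is in both tracks and, if samples are given, ALL agree."""
    if variant is None or conc is None:
        return False
    wanted = list(samples)
    if not wanted:
        # No samples requested: genotype information is ignored.
        return True
    return all(
        s in variant and s in conc and variant[s] == conc[s] for s in wanted
    )


def is_discordant(variant: Record, disc: Record, samples: Iterable[str] = ()) -> bool:
    """Discordance: the disc track misses the line, or AT LEAST ONE sample differs."""
    if variant is None:
        return False
    if disc is None:
        # "Missing line" case, reported with or without samples.
        return True
    wanted = list(samples)
    if not wanted:
        # No samples requested: a line present in both tracks is not discordant.
        return False
    return any(variant.get(s) != disc.get(s) for s in wanted)


if __name__ == "__main__":
    a = {"NA12878": "0/1", "NA12891": "0/0"}
    b = {"NA12878": "0/1", "NA12891": "1/1"}
    print(is_concordant(a, b))                          # samples ignored
    print(is_concordant(a, b, ["NA12878", "NA12891"]))  # one sample disagrees
    print(is_discordant(a, b, ["NA12878", "NA12891"]))
```

Note that with samples provided, the two predicates are complementary for a line present in both tracks, which is what "ALL agree" versus "AT LEAST ONE disagrees" implies.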
---
Integration tests:
1) Discordance and concordance test added
2) All other tests updated to comply with the new 'sorted output' format and different inputs for samples.
---
Methods for handling sample expressions and files with lists of samples were added to SampleUtils. I recommend *NOT USING* the old getSamplesFromCommandLineInput, as its mixing of sample names with expressions and files creates a rogue error that can be challenging to catch.
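A rough sketch of keeping the three sample inputs separate, written in Python for brevity. It is not the SampleUtils API: all function names are hypothetical, the sample file is assumed to hold one name per line, and expressions are assumed to be regular expressions that must match the whole sample name.

```python
import re
from pathlib import Path
from typing import Iterable, List, Set


def samples_from_files(paths: Iterable[str]) -> Set[str]:
    """Read sample names from files (assumed format: one name per line)."""
    names: Set[str] = set()
    for path in paths:
        for line in Path(path).read_text().splitlines():
            if line.strip():
                names.add(line.strip())
    return names


def samples_matching(vcf_samples: Iterable[str], expressions: Iterable[str]) -> Set[str]:
    """Keep VCF samples whose full name matches any expression (assumed regex)."""
    patterns = [re.compile(e) for e in expressions]
    return {s for s in vcf_samples if any(p.fullmatch(s) for p in patterns)}


def resolve_samples(
    vcf_samples: Iterable[str],
    names: Iterable[str] = (),
    files: Iterable[str] = (),
    expressions: Iterable[str] = (),
) -> List[str]:
    """Union of the three inputs, sorted; all VCF samples if none were given."""
    vcf_samples = list(vcf_samples)
    selected = (
        set(names)
        | samples_from_files(files)
        | samples_matching(vcf_samples, expressions)
    )
    if not selected:
        selected = set(vcf_samples)
    return sorted(selected)


if __name__ == "__main__":
    header = ["NA12891", "NA12878", "NA19240"]
    print(resolve_samples(header, names=["NA19240"], expressions=["NA1289.*"]))
    print(resolve_samples(header))
```

Because a name is never tried as a file path or as an expression, a file or directory that happens to share a sample's name cannot change the result.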
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6072 348d0f76-0448-11de-a6fe-93d51630548a
BinomialCumulativeProbability was reformatted to follow the now standard parameter order.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6057 348d0f76-0448-11de-a6fe-93d51630548a
VariantEvalWalker now passes a pointer to itself to the Stratification setVariantEvalWalker (and associated get method) so that stratifications can look at VEWalker variables to obtain information necessary for their calculations, like the list of eval samples. This is a better interface, in my opinion, than the current approach of extending the base abstract Stratification to include an initialize function that has all arguments necessary for any Strat.
JEXL expressions now provide access to the VariantContext vc object itself, so you can write JEXL's that directly use VariantContext and associated functions from the command line.
ExomePostQC Queue script now creates a byAC eval using the new strat, and no longer produces a byAF file (as this was not exact, and led to strange punctile behavior when the actual AF quantization was out of sync with the fixed quantization of the AF strat).
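A rough illustration of why fixed AF bins can fall out of sync with the actual AF values. It assumes diploid samples, so the only possible allele frequencies are AC/(2N), and a hypothetical fixed bin width of 1% (the real bin width of the AF strat is not stated here). The function name is made up for this sketch.

```python
from collections import Counter


def af_bin_occupancy(n_samples: int, bin_width_percent: int = 1, ploidy: int = 2) -> Counter:
    """Count how many distinct AC values land in each fixed-width AF bin."""
    n_chrom = ploidy * n_samples
    counts: Counter = Counter()
    for ac in range(1, n_chrom + 1):
        # AF = ac / n_chrom; integer arithmetic avoids float rounding at bin edges.
        bin_index = (ac * 100) // (n_chrom * bin_width_percent)
        counts[bin_index] += 1
    return counts


if __name__ == "__main__":
    occupancy = af_bin_occupancy(30)  # 60 chromosomes, AF step of 1/60
    for b in range(0, 11):
        print(f"AF bin {b}%: {occupancy[b]} distinct AC value(s)")
```

With 30 samples some 1% bins receive one AC value and others receive none, so a byAF plot shows gaps and spikes that a byAC stratification avoids by construction.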
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@6015 348d0f76-0448-11de-a6fe-93d51630548a
VariantContextUtils now has a utility function that creates a sitesOnlyVariantContext from an input VC
Added a complex merge test of SNPs and indels from the new batch merge wiki at:
http://www.broadinstitute.org/gsa/wiki/index.php/Merging_batched_call_sets
with multiple alleles for an indel. Created a BatchMergeIntegrationTest that uses GGA with the complex merged input alleles to genotype SNPs and Indels with multiple alleles simultaneously in NA12878. Looks great.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5959 348d0f76-0448-11de-a6fe-93d51630548a
a) All rank sum tests now work for indels, including multiallelic sites. For the latter, the rank sum test is REF vs. the most common allele.
b) Redid computation of HaplotypeScore for indels. It's now trivially easy to do because we are already computing likelihoods of each read vs haplotypes in GL computation so we reuse that if available. For multiallelic case, we score against N haplotypes where N is total called alleles.
The drawback is that all cases need the information contained in the likelihood table, which stores the likelihood for each pileup element, for each allele. If this table is not available we don't annotate, so right now we can only fully annotate indels when running UG, not when running VariantAnnotator alone.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5947 348d0f76-0448-11de-a6fe-93d51630548a
- Verified that exact calculations do agree with R's dwilcox()
- Verified that exact calculations do not agree with R's wilcox.test
+ This is because R does a correction, and calculates CDFs rather than PDFs (e.g. sums over dwilcox() values)
- Can now specify MWU to calculate cumulative exact tests, rather than point probabilities
- Z-scores are now calculated properly for exact tests
+ Previously, z-values were calculated by inverting the normal CDF from the U-statistic PDF
+ Now both inversions are done, with a smart heuristic (biased variance) to make the point-calculated Z-value more accurate
+ Additional tests
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5911 348d0f76-0448-11de-a6fe-93d51630548a
Fixed minor problem with WalkerTest for "" (for parameterization) md5s.
Added an explicit integrationtest for BAQ NONE
Now only creates the BAQ'd pileup if the useBAQPileup parameter is provided in initializeAlternateAllele.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5891 348d0f76-0448-11de-a6fe-93d51630548a
Also allows for the slightly more awesome name "MWUnitTest"
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@5850 348d0f76-0448-11de-a6fe-93d51630548a