a) Sites were numerically well behaved now, but another hard fact of life is that the AF iteration is defined in linear Pr space, not in log likelihood space, and the math doesn't work out in log space. So, we need to convert back and forth from lin to log space.
b) As a consequence of a), the code got a major slowdown, and calling the 629 samples was about 15 times slower than before (sic).
c) To solve b), log10 of integers are now cached at init, and numerical approximations are now made. Most importantly, I'm using the approximation that log(exp(a) + exp(b)) ~= max(a,b) which seems almost inconsequential in practical performance but reduces computation time to what it was before. More detailes analyses are forthcoming. This approximation can be refined further on to avoid expensive log-exp conversions if further profiling and analysis deems it necessary.
Also, two other issues were solved:
a) Strand bias computation was actually wrong in the case where the optimal AC was bigger than max(forward reads,reverse reads). Now the code is exactly as buggy as the grid search model (all bugs are equal, but some are more equal than others)
b) Genotype likelihoods are now computed in a better way and if a likelihood < 0 we don't just cap to 0 but do something a bit smarter.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4600 348d0f76-0448-11de-a6fe-93d51630548a
interaction with Tribble. In Tribble, keeping these references in memory until
the shard is flushed means keeping one 512K character buffer per object in
memory. Fixed by purging the reference to the object at the end of the
shard traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4599 348d0f76-0448-11de-a6fe-93d51630548a
of traversal to avoid holding a reference to the microscheduler, which holds a reference to
the engine, which in turn holds a reference to the walker, which itself holds a reference to
all the data aggregated during the course of the traversal.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4594 348d0f76-0448-11de-a6fe-93d51630548a
Updated a call to swapExt to specify the directory.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4586 348d0f76-0448-11de-a6fe-93d51630548a
parsing engine. Hugely lowers our memory footprint in integrationtests, but not yet enough to
run Mark's new parallelized VariantEvalIntegrationTests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4585 348d0f76-0448-11de-a6fe-93d51630548a
- Forcing user to set the temp directory via -Djava.io.tmpdir to avoid filling up /tmp.
- By default deleting job outputs tagged as intermediate.
- Defaulting pipeline to scatter count 1 (no reads deleted).
- Cleaning up temp classes even when scripting fails.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4573 348d0f76-0448-11de-a6fe-93d51630548a
with -nt option. When last I checked in, Ryan was seeing a ~25% speedup
per shard by not indexing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4556 348d0f76-0448-11de-a6fe-93d51630548a
with a more sensible strategy for sharding w/o BAMs at some point after
ASHG.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4549 348d0f76-0448-11de-a6fe-93d51630548a
a) Fix it up because it broke with a recent checkin to annotate vcf with unfiltered depth.
b) Printout of ref/alt alleles in output vcf was incorrect because the start/stop positions of associated GenomeLoc were incorrectly computed in case of a deletion.
c) Redid Beagle input/output walkers as not assume that ref was a single base, not to assume that variant was a vcf and generalized it to be indel-capable, so now the Beagle walkers can be used for indels as well.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4541 348d0f76-0448-11de-a6fe-93d51630548a
Queue GATK generated .intervals is now a List(File) again removing special case handling in the generator.
Instead of using @Scatter annotation, using ScatterFunction instance to determine if a job can be scattered.
Implemented special VcfGatherFunction which only uses the header from the first file, even if the other files differ in their headers.
Added a -deleteIntermediates to Queue to delete the outputs from intermediate commands after a successful run.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4536 348d0f76-0448-11de-a6fe-93d51630548a
some subset need
A C 1/1 --> C A 0/0
while another subset need
A C 1/1 --> C A 1/1
it's unclear how big these subsets are (or even if one is empty). What I do know is, doing the first one totally screws up concordance metrics for the 421-sample chip. So either something else needs to be done, or there's a bug in this walker. Until I know for sure, I've added an initialize exception to disable this thing...
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4523 348d0f76-0448-11de-a6fe-93d51630548a
forgot to add the samples to the header. How could the VCFWriter let me get away with something so boneheaded?!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4513 348d0f76-0448-11de-a6fe-93d51630548a
Modified - SelectVariants: Hook up to VariantContextUtils to recalculate AC/AF/AN, which uses the accessor in VariantContext to do this. Somehow sites that were selected down to hom-ref genotypes only wound up getting positive AC.
**IMPORTANT** I kind of need input here. The header of a file used for an integration test specifies AC as being an integer. Recalculating it casts it into an integer list (which it should be, as it allows for alternate alleles). However this appears to clash with what the jexl expression is looking for? For now, the integration test itself needed to be changed -- it's unclear what to do when the header specifies AC of being one class, but recalculating it casts to another class, and I'm not sure what to do.
I'm committing my omni_qc pipeline because I'm almost certain 2 months down the road I'm going to wonder what the heck I did to generate my results.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4511 348d0f76-0448-11de-a6fe-93d51630548a
1: -sample can now include a file, which will be parsed for sample-name entries
2: If you request a sample to run analysis on, but it is not present in any of your RODs, VEW will exception out
3: Change added to parse Integer, String, and List<Integer> type Allele Count annotations (error otherwise)
4 [slightly problematic]: The count objects now maintain row-keys in order, as the keys were taking an inordinate amount of time in onTraversalDone (multiple calls to getRowKeys(), so many multiple sorts of the same underlying unsorted object, very bad)
There is a legacy comparison object which is unused which I will strip out soon.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4502 348d0f76-0448-11de-a6fe-93d51630548a
The core walker has been modified so that when variant contexts (eval and comp) are subset to command-line-specified sample(s), the chromosome count annotations (AC/AN/AF) are altered to reflect the AC/AN/AF of only those samples involved in the comparison. No more getting AC500 when you're comparing a 10-sample overlap. Interestingly enough, this didn't break any integration tests.
GenotypeConcordance now has two additional tables: Allele Count Statistics, and Allele Count Summary Statistics. These work exactly identically to the Sample Statistics and Sample Summary Statistics tables, except that the partition being used is no longer the sample, but instead the allele count of the variant sites. These tables stratify by both eval and comp ACs, e.g.
evalAC0
evalAC1
evalAC2
compAC0
compAC1
compAC2
Differences with previous integration tests were verified to only be in the Allele Count tables (by grepping them out of the diff); a new test has been added for the simple case of an AC=1 site in the eval becoming an AC=2 site in the comp.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4491 348d0f76-0448-11de-a6fe-93d51630548a
by the fact that the GATKSAMRecord, by design, needs to both inherit from
SAMRecord and wrap a 'member' SAMRecord, and method calls that aren't
implemented as explicit passthroughs can compromise the content of the
SAMRecord in subtle ways.
Will be automatically fixed when Picard moves to a lightweight SAMRecord
interface rather than the current heavyweight implementation. But in
the short-term, there's no obvious fix.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4489 348d0f76-0448-11de-a6fe-93d51630548a
that have 0 aligned bases in the genome. We'll have to fix walkers as faults
appear.
Also added JIRA GSA-406: finer-grained control of MalformedReadFilter: want
to exception out by default in these cases but pass them with a warning with
a corresponding -U flag.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4476 348d0f76-0448-11de-a6fe-93d51630548a
- ProduceBeagleInputWalker
+ Now takes a validation ROD and a prior to give it, will use those genotypes in place of the variant genotypes if both are present
+ Takes a bootstrap argument -- can use some given %age of the validation sites
+ Optionally takes a bootstrap output argument -- re-prints the validation VCF, filtering those sites used as part of the bootstrap
-BeagleOutputToVCFWalker
+ Now filters sites where the genotypes have been reverted to hom ref
+ Now calls in to the new VCUtils to calculate AC/AN
-Queue
+ New pipeline libraries for easy qscript creation, still a work in progress, but this is a considerable prototype
+ full calling pipeline v2 uses the above libraries
+ minor changes to some of my own scripts
+ no more need for contig interval lists, these will be parsed out of your normal interval list when it is provided
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4459 348d0f76-0448-11de-a6fe-93d51630548a
a) In Indel genotyper: we can't deal yet with extended events correctly and we are still triggering at each extended event which results in repeated records on a vcf. So, to avoid this, keep track of start position of candidate variantes we've visited and if we've visited a variant before we don't do it again.
b) Avoid infinite terms in QUAL and in genotype likelihoods which can happen if posterior AF happens to be exactly zero. For now, hard-code a minimum value of each term of the posterior AF likelihood to be -300 (ie 1e-300 in lin space). This can be solved with better and smarter log-to-lin conversions and some precision fixes in AF calculation.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4455 348d0f76-0448-11de-a6fe-93d51630548a
Previous output spec contained 3 columns:
haplotypeReference,haplotypeAlternate,haplotypeStrand
where haplotypeReference was always on the + strand, and haplotypeAlternate was on the strand specified by haplotypeStrand.
The new specification contains 3 columns:
haplotypeReference,haplotypeAlternate,transcriptStrand
where haplotypeRef and haplotypeAlt are required to be on the + strand. transcriptStrand now specifies the strand of the transcript, which is needed for interpreting the haplotypes.
Bugfix #1: fix incorrect assignment of variantCodon and variantAA
(Previously variantCodon was incorrectly set to referenceCodon)
Bugfix #2: fix incorrect codingCoordStr values for - strands (bug reported by Giulio Genovese), and incorrect usage of "m." for mitochondrial transcripts (bug reported by Steve Hershman)
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4444 348d0f76-0448-11de-a6fe-93d51630548a
Queue now submits new LSF jobs only after previous functions have completed successfully.
When the Queue process is shutdown (ex: via Control-C) sends a bkill command for any running jobs.
Ported commands like creating directories and scatter/gather interval list to scala functions.
Updates to LSF status tracking by porting the python to internally generated bash scripts.
Temporarily disabled job name submission to LSF. Plus side is that the full command is now available in "bjobs -w". TODO: Put back jobName passing to LSF based on an option?
Changed BaseTest to allow scala to access paths to references.
Changed the extension generator to default the analysis name to the walker "name".
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4442 348d0f76-0448-11de-a6fe-93d51630548a
Why v3, you ask? Why not? Simply because v2 was a String so old and clunky, the sun would fizzle out and grow cold before any VCF could be successfully parsed.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4421 348d0f76-0448-11de-a6fe-93d51630548a
Off to Yosemite in 4 hours, enjoy the week gsa folks!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4410 348d0f76-0448-11de-a6fe-93d51630548a
a) Fixed bugs in new dynamic programming-based genotyper
b) Fixed up temp hack that handles extended pileups for now.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4398 348d0f76-0448-11de-a6fe-93d51630548a
- Brought over exact AF estimation from branch (which is now dead). Exact model is default in UnifiedGenotyperV2.
- Implemented completely new genotyping algorithm given best AF estimate using dynamic programming, which in theory should be better than both greedy search and any HWE-based genotyper.
- Integrated and added new Dindel likelihood estimation model.
- Corrected annotators that would call readBasePileup: since we can be annotating extended events, best way is to interrogate context for kind of pileup and either readBasePileup or readExtendedEventPileup.
All changes above except last one are still in playground since they require more testing.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4396 348d0f76-0448-11de-a6fe-93d51630548a
where, when the system can't find a plugin of the correct name, the system
prefers to crap all over itself and throw an unintelligible NullPointerException
rather than displaying an intelligent error.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4393 348d0f76-0448-11de-a6fe-93d51630548a
the empty list. Score another victory for the integration tests.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4391 348d0f76-0448-11de-a6fe-93d51630548a
argument. Brought it back, and added an integrationtest to make sure it
stays around.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4390 348d0f76-0448-11de-a6fe-93d51630548a
the pileup at the next alignment start is large, we don't add as many of those
incoming reads as we should. No integration tests were affected.
Thanks, Chris!
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4378 348d0f76-0448-11de-a6fe-93d51630548a
-- getToolkit().subContextFromSampleProperty(): filters a VariantContext to genotypes that come from samples that have a given property value
-- getToolkit().getSamplesWithProperty(): gets all samples with a given property
-- getToolkit().getSamplesFromVariantContext(): sample objects that are referenced by name in a VariantContext
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4361 348d0f76-0448-11de-a6fe-93d51630548a
This will allow other programs like Queue to reuse the functionality.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4351 348d0f76-0448-11de-a6fe-93d51630548a
pieces of core that depend on playground. Most of these have been eliminated by
(temporarily) promoting Aaron's report system to core in this checkin. I'll
follow up with other changes in separately.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4350 348d0f76-0448-11de-a6fe-93d51630548a
The GAE half has all the walker specific code. The new "Abstract" GAE has the rest of the logic.
More refactoring to come, with the end goal of having a tool that other java analysis programs (Queue, etc.) can use to read in genomic data.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4339 348d0f76-0448-11de-a6fe-93d51630548a
Thanks to Ryan for identifying the problem.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4336 348d0f76-0448-11de-a6fe-93d51630548a
is. Required to fix bug ZjhCJAdwhtFq1x54ZlmlN8pFNcbrRpdJ and similar. We
might want to change this particular case to a ReviewedStingException after
we gain a bit more experience with it.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4333 348d0f76-0448-11de-a6fe-93d51630548a
mOVsxGfDiiSMxVs2PPTVjzYTVbizlD6e
f9kUHUADFsZ0LiTGxRL5zPmq9kZcA4cQ
8eGHWJFAlBVmgxwPi3sMd1RmiN2PwHOf
iLhvHWveypKb2F8vKS5irHylc3pYvlOb
HDttXKUMEVoPrvVeWrH7E0htxYyNydMx
plus a bit of cleanup of custom exceptions in the sharding system.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4330 348d0f76-0448-11de-a6fe-93d51630548a
Added a pipeline java bean and YAML utility to serialize java beans.
Added a getFirehosePipelineYaml.sh that can pull firehose data into the pipeline yaml file format.
Updated the fullCallingPipeline.q to begin using the pipeline yaml file format for bams and reference.
More changes to come as this code gets tested out in the fullCallingPipeline.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4329 348d0f76-0448-11de-a6fe-93d51630548a
Changed integration tests, adding the -NO_HEADER argument, for walkers that previously did not include the command-line
arg headers.
git-svn-id: file:///humgen/gsa-scr1/gsa-engineering/svn_contents/trunk@4326 348d0f76-0448-11de-a6fe-93d51630548a