Eric Banks
5c3ddec4c2
Large refactoring of the genotyping codebase. Deprecated several of the old classes that had the wrong allele ordering and made new better copies with the correct ordering; eventually we'll push the new ones into the place of the old ones but for now we'll give users a chance to update their code. Also, removed (or deprecated as needed) the genotype priors classes since we never use them and all they serve to do is make reading the code more complicated. I expect to finish this refactoring in GATK 1.7 (or 2.0?) so that should give Kristian ample time to update.
2012-04-05 10:49:08 -04:00
Eric Banks
2c956efa53
Minor fixups to GenotypeLikelihoods
2012-04-05 09:14:37 -04:00
Guillermo del Angel
6913710e89
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-04 20:17:18 -04:00
Mark DePristo
76e4100d89
By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots
...
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
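Note on 76e4100d89: a minimal sketch of the binning choice, not the IndelLengthHistogram code itself. The function name, `max_size`, and the `collapse` flag are hypothetical, and skipping out-of-range events when collapsing is off is an assumption (the commit only says they are no longer folded into the last bin).

```python
from collections import Counter

def indel_length_histogram(lengths, max_size=10, collapse=False):
    """Count indel lengths in bins -max_size..max_size (negative = deletion)."""
    counts = Counter()
    for length in lengths:
        if abs(length) > max_size:
            if not collapse:
                continue  # assumption: out-of-range events are skipped
            # Old behaviour: pile large events into the outermost bin.
            length = max_size if length > 0 else -max_size
        counts[length] += 1
    return dict(counts)
```

With collapsing on, a handful of very large events produces a spike in the outermost bin, which is presumably the "weird looking" plot the message refers to.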
Guillermo del Angel
820216dc68
More pool caller cleanups: move common code duplicated between the Pool and Exact AF calculation models up to the superclass. TMP: have pool genotypes include the GT field, mostly because without genotypes we can't get the site-wide AF and AC annotations. It's unwieldy because it makes the genotype columns very long; final implementation TBD.
2012-04-04 16:23:10 -04:00
Ryan Poplin
bfad26353a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-04 16:04:50 -04:00
Ryan Poplin
dda2173c66
Moved the Smith-Watermaning of haplotypes to earlier in the process so that alleles sent to genotyping would have the exact genomic sequence of the active region they represent. As a side effect cleaned up some edge case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was replicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC were left aligned.
2012-04-04 16:04:29 -04:00
Mark DePristo
fcdd65a0f4
Bugfix for IndelLengthHistogram
...
-- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.
2012-04-04 15:37:43 -04:00
Mark DePristo
1ccea866d8
VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
...
-- Updated EvalModules to work with new parameter
-- adding test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Eric Banks
9e32a975f8
Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore.
2012-04-04 13:47:59 -04:00
Eric Banks
337ff7887a
When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.
2012-04-04 10:57:05 -04:00
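Note on 337ff7887a: a rough sketch of the stop-position rule, assuming the raw INFO column is available as a string. `stop_position` is a hypothetical helper, not the VariantContext construction code.

```python
def stop_position(pos, ref, info):
    """Return the 1-based inclusive stop of a VCF record.

    info is the raw INFO column, e.g. "SVTYPE=DEL;END=1500".
    """
    for field in info.split(";"):
        key, _, value = field.partition("=")
        if key == "END" and value:
            return int(value)  # END tag present: it defines the stop
    # No END tag: the record spans its reference allele.
    return pos + len(ref) - 1
```

This matters for -L intervals because a symbolic allele has a one-base REF, so without END the record would appear to cover only its start position.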
Guillermo del Angel
05d8400468
Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)
2012-04-03 20:51:24 -04:00
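Note on 05d8400468: a small sketch of the count this change is about, assuming diploid genotypes. The function name is hypothetical and the real calcNumLikelihoods signature is not shown in this log. The key point is that the argument is the total allele count (ref plus alts).

```python
def num_diploid_likelihoods(num_alleles):
    """Number of unordered diploid genotypes over num_alleles alleles (ref + alts)."""
    if num_alleles < 1:
        raise ValueError("need at least the reference allele")
    return num_alleles * (num_alleles + 1) // 2
```

For a biallelic site the caller passes 2 (not 1) and gets 3 likelihoods: hom-ref, het, hom-alt.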
Guillermo del Angel
5abb07da5d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 17:00:45 -04:00
Christopher Hartl
a6837d31d4
Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting, or at transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it.
2012-04-03 16:13:16 -04:00
Guillermo del Angel
63b1e737c6
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 15:43:50 -04:00
Guillermo del Angel
9e11b4f9a7
Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced.
2012-04-03 15:43:32 -04:00
Eric Banks
f9ce9962c4
Minor changes to verbose mode
2012-04-03 10:53:48 -04:00
Eric Banks
f6aa95685d
OutOfMemory exceptions are User Errors
2012-04-02 22:46:56 -04:00
Eric Banks
659b82e74d
Old -B syntax is long gone at this point. Safe to remove the warning.
2012-04-02 22:25:16 -04:00
Eric Banks
99d27ddcc4
Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now.
2012-04-02 14:27:36 -04:00
Mark DePristo
6b7a00061a
VariantsToTable now works with multiple input VCFs
2012-04-02 09:13:35 -04:00
Mark DePristo
fbbb8509ad
Final commits to VariantEval
...
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integrationtests
2012-03-30 20:11:06 -04:00
Mark DePristo
4b45a2c99d
Final version of new VariantEval infrastructure.
...
*** WAY FASTER ***
-- 3x performance for multiple sample analysis with 1000 samples
-- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
-- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2
-- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables as @DataPoints. You get an error if you do.
-- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
2012-03-30 15:31:56 -04:00
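Note on 4b45a2c99d: a sketch of what a "molten" table looks like, assuming it means one row per (variable, value) entry of the map. `melt` and its parameter names are hypothetical, and sorting by key is a choice made here for stable output.

```python
def melt(values, variable_name="variable", value_name="value"):
    """Turn a {variable: value} map into one table row per entry."""
    return [
        {variable_name: key, value_name: val}
        for key, val in sorted(values.items())
    ]
```

For example, an indel length histogram held as `{-1: 3, 2: 7}` becomes two rows, one per length, instead of one wide row with a column per length.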
Mark DePristo
8c0718a7c9
Fixed missing import
2012-03-30 15:31:55 -04:00
Mark DePristo
097ed4ecc4
Memory usage optimizations and safety improvements to StratNode and StratificationManager
...
-- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable Hashmaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation. The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals
2012-03-30 15:31:55 -04:00
Mark DePristo
b335c22f6d
Fully refactored, mostly cleaned up version of VariantEval using StratificationManager
2012-03-30 15:31:55 -04:00
Mark DePristo
c8086a79e3
New StratificationManager based VariantEval passes unmodified integration tests
...
-- Now needs cleanup and optimizations
2012-03-30 15:31:55 -04:00
Mark DePristo
d37f31e349
First version of VariantEval that runs (approximately correctly) with new StratificationManager
2012-03-30 15:31:54 -04:00
Mark DePristo
8971b54b21
Phase II of Stratification manager
...
-- Renamed and reorganized infrastructure
-- StratificationManager is now a Map from List<Object> -> V. All key functions are implemented; the less commonly used ones are still TODO
-- Ready for hookup to VE
2012-03-30 15:31:54 -04:00
Mark DePristo
9f1cd0ff66
Lots of new functionality for StratificationStates manager
...
-- Really working according to unit tests
-- A nCombination utils
2012-03-30 15:31:54 -04:00
Mark DePristo
a3d896d80e
Part I of creating a fast state space lookup for VE
...
-- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
2012-03-30 15:31:53 -04:00
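Note on a3d896d80e: a sketch of the static mapping idea, using a flat dictionary built up front rather than the tree the commit describes. `build_offsets` and the example state names are hypothetical.

```python
from itertools import product

def build_offsets(strat_states):
    """Map every combination of stratification states to a vector offset.

    strat_states holds one list of state names per stratification.
    """
    return {combo: i for i, combo in enumerate(product(*strat_states))}
```

Usage, assuming two stratifications with 2 and 3 states:

```python
offsets = build_offsets([["eval", "comp"], ["known", "novel", "all"]])
contexts = [None] * len(offsets)   # one slot per combination
slot = offsets[("comp", "novel")]  # direct lookup at update time
```

Because every combination is enumerated once, the per-site update is a lookup plus an index into a preallocated vector.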
Eric Banks
533c283783
Deprecating AlignmentContext.getExtendedEventPileup(). At this point the only walkers left with any reliance on extended events are Guillermo's pooled code (he'll update soon) and the Pileup walker. David, I'll leave that last one for you (it should be easy). We can now officially rip the extended event code from the engine.
2012-03-30 10:37:14 -04:00
Eric Banks
6b49af253b
Removing dependence on extended events from the RealignerTargetCreator. Did some minor refactoring while I was in there.
2012-03-30 10:33:30 -04:00
Eric Banks
b467cd1dae
Removing dependence on extended events for the remaining Variant Annotator modules.
2012-03-30 09:05:26 -04:00
Eric Banks
b21889812d
Removing some more usages of extended events. Not done yet, but almost there.
2012-03-30 01:51:37 -04:00
Eric Banks
ad6ace2439
Resolving merge conflicts
2012-03-30 01:51:09 -04:00
Eric Banks
f4d4969f23
Don't ever return null for the list of GL models
2012-03-30 00:22:40 -04:00
Eric Banks
44ac49aa34
Removing dependencies in the annotations on extended events. Some refactoring involved in this.
2012-03-30 00:17:02 -04:00
Mauricio Carneiro
cbd21c6339
Nasty, nasty.....
...
VariantEval is overly abusive of the GATKReport (lack of) spec.
1. It converts numeric values (longs, integers and doubles) to string before sending to the Report, then expects it to decipher that those were actually numbers.
2. Worse, the stratification modules, instead of sending the actual values to the report table, send a string with the value "unknown" and then abuse the GATKReport spec to replace those "unknown" placeholder values with numbers. Then again, it expects the report to know those are numbers, not strings.
Now that the GATKReport HAS specs, VariantEval needs to be overhauled to conform with that. In the meantime, I have added special ad-hoc treatment to these wrong contracts. It works, and the integration tests all passed without changing any MD5's, but right after Mark and Ryan commit their VariantEval refactors, I will step in to change the way it interacts with the GATKReport, so we can clean up the GATKReport.
No wonder, the printing needed to be O(n^2).
2012-03-29 17:49:53 -04:00
Eric Banks
c2e27729c7
Renaming PileupElement.isBeforeDeletion() to PileupElement.isBeforeDeletedBase() so that it's more clear that it can still be true while inside a deletion. Added PileupElement.isBeforeDeletionStart() to cover the case that I want where we only trigger before the actual deletion event. Similarly for after a deletion. Updated counting code in ConsensusAlleleCounter accordingly.
2012-03-29 17:08:25 -04:00
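Note on c2e27729c7: a sketch of the distinction between the two predicates, assuming a read's alignment is expanded into one CIGAR operation per pileup position. These are hypothetical free functions over a list of ops, not the PileupElement methods.

```python
def is_before_deleted_base(ops, i):
    """True if the next position is a deleted base (also true mid-deletion)."""
    return i + 1 < len(ops) and ops[i + 1] == "D"

def is_before_deletion_start(ops, i):
    """True only at the last non-deleted position before a deletion begins."""
    return ops[i] != "D" and is_before_deleted_base(ops, i)
```

For ops `M M D D M`, position 1 satisfies both, while position 2 (already inside the deletion) satisfies only the first. Counting deletion events therefore needs the second predicate.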
Ryan Poplin
6da9571829
resolving merge conflicts.
2012-03-29 16:16:28 -04:00
Ryan Poplin
ca96544ed0
All the zero quality N bases in the solid reads are adding lots of extra paths in the assembly graph. We now require a minimum base quality for every base in the kmer before adding it to the graph. The large number of solid reads with unmapped mates was also triggering the active region traversal at every base. We now ignore that check for solid reads.
2012-03-29 16:14:29 -04:00
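Note on ca96544ed0: a minimal sketch of the kmer filter, assuming bases and qualities come as parallel sequences. The function and parameter names are hypothetical and this is not the assembler's code.

```python
def high_quality_kmers(bases, quals, k, min_base_quality):
    """Yield each k-mer whose bases all meet min_base_quality."""
    for start in range(len(bases) - k + 1):
        if min(quals[start:start + k]) >= min_base_quality:
            yield bases[start:start + k]
```

A single zero-quality N rejects every k-mer that overlaps it, so it no longer contributes extra paths to the graph.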
Eric Banks
e4469a83ee
First attempt at removing all traces of extended events from UG; integration tests are expected to fail.
2012-03-29 14:59:29 -04:00
Eric Banks
e61e162c81
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-29 12:33:13 -04:00
Mauricio Carneiro
cf364f26a0
Fixing alignment issue with the GATKReportColumn algorithm
...
Numeric columns were being left-aligned when they should be right-aligned. Fixed it.
2012-03-29 12:28:49 -04:00
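Note on cf364f26a0: a sketch of the alignment rule, assuming a column counts as numeric only when every value is an int or float. `format_column` is a hypothetical helper, not the GATKReportColumn algorithm.

```python
def format_column(values):
    """Pad a column to a common width: numbers right-aligned, text left-aligned."""
    cells = [str(v) for v in values]
    width = max(len(c) for c in cells)
    numeric = all(isinstance(v, (int, float)) for v in values)
    return [c.rjust(width) if numeric else c.ljust(width) for c in cells]
```

Right-aligning numbers keeps the ones, tens, and hundreds digits stacked, which is what makes a numeric column readable.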
Mauricio Carneiro
f80bd4276a
fixed estimated Q reported calculation in the gatherer
2012-03-29 12:28:43 -04:00
Mauricio Carneiro
8a9fb514b6
simplifying GATKReportColumn constructor logic
2012-03-29 12:28:37 -04:00
Eric Banks
e861106398
Accidentally erased important line
2012-03-29 11:08:54 -04:00
Eric Banks
e4a225ed09
Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally.
2012-03-29 11:07:37 -04:00
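Note on e4a225ed09: a sketch of what "restructuring the PLs" involves when alleles are dropped, assuming diploid genotypes in the VCF ordering (genotype j/k with j <= k sits at index k(k+1)/2 + j) and assuming the result is rescaled so the best genotype is 0. `subset_pls` is hypothetical; whether VariantContextUtils does exactly this is not stated in the log.

```python
def subset_pls(pls, kept_alleles):
    """Subset diploid PLs to kept_alleles (original allele indices, 0 = ref)."""
    def pl_index(j, k):
        # VCF ordering of genotype j/k with j <= k
        return k * (k + 1) // 2 + j

    new_pls = []
    for new_k, old_k in enumerate(kept_alleles):
        for old_j in kept_alleles[:new_k + 1]:
            lo, hi = sorted((old_j, old_k))
            new_pls.append(pls[pl_index(lo, hi)])
    best = min(new_pls)
    return [pl - best for pl in new_pls]  # rescale so the best PL is 0
```

With three alleles the PLs are ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2, so keeping alleles 0 and 2 picks out entries 0, 3, and 5.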
Guillermo del Angel
c9c3f6b0fc
Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity
2012-03-29 11:05:42 -04:00