Mark DePristo
76e4100d89
By default, IndelLengthHistogram no longer collapses large events into the last bin, since collapsing produces weird-looking plots
...
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
Ryan Poplin
bfad26353a
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-04 16:04:50 -04:00
Ryan Poplin
dda2173c66
Moved the Smith-Waterman alignment of haplotypes to earlier in the process so that alleles sent to genotyping have the exact genomic sequence of the active region they represent. As a side effect, cleaned up some edge-case problems with variants, both real and false, that show up on the edges of active regions. Removed code that was duplicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC are left-aligned.
2012-04-04 16:04:29 -04:00
Mark DePristo
1473e166f7
Update VariantCallQC script
...
-- Fixed indel length histogram to work with new table style
-- Removed unused expectations list
-- By-AC plots for indels use log10(n_Indels) as their weight
-- smoothing is now weighted by the log10 count of n_Indels
2012-04-04 15:37:44 -04:00
Mark DePristo
2cc1e8d871
PostCallingQC uses IndelLengthHistogram now
2012-04-04 15:37:43 -04:00
Mark DePristo
fcdd65a0f4
Bugfix for IndelLengthHistogram
...
-- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.
2012-04-04 15:37:43 -04:00
Mark DePristo
3593996a87
G1K summary table needs to use the -keepAC0 flag
...
-- AC = 0 sites look about as good as singletons, and are likely only AC 0 because they cannot be easily imputed. We keep them in our counting.
2012-04-04 15:37:43 -04:00
Mark DePristo
1ccea866d8
VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
...
-- Updated EvalModules to work with new parameter
-- Added test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Guillermo del Angel
15e26fec04
Many cosmetic fixes to pool caller (not done yet): better docs, some code reorganization, change iterator inside PoolGenotypeLikelihoods so that we store conformations and convert to/from PL vectors in the same order as the non-pool case (it used to be flipped), to maintain better legibility. Improved unit tests (not done yet)
2012-04-04 15:08:19 -04:00
Eric Banks
9e32a975f8
Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore.
2012-04-04 13:47:59 -04:00
Eric Banks
337ff7887a
When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals.
2012-04-04 10:57:05 -04:00
Guillermo del Angel
1248f9025d
a) Clean up and fix bug in PoolGenotypeLikelihoodsUnitTest. b) Cosmetic fixes to PoolAFCalculationModel: don't print PL vectors per pool if they're too long, or else VCFs are too hard to make sense of
2012-04-03 21:27:53 -04:00
Guillermo del Angel
05d8400468
Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet)
2012-04-03 20:51:24 -04:00
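Note on 05d8400468: counting from the total number of alleles (reference plus alternates) matches the usual combinatorics, where a site with n alleles at ploidy p has C(n + p - 1, p) unordered genotypes, which is n(n+1)/2 in the diploid case. A minimal Java sketch of that count follows. The class and method names are hypothetical and are not the actual GenotypeLikelihoods API.

```java
// Hypothetical helper for illustration only; not the GATK GenotypeLikelihoods API.
final class GenotypeCountSketch {
    /**
     * Number of unordered genotypes for numAlleles TOTAL alleles (ref + alts)
     * at the given ploidy: C(numAlleles + ploidy - 1, ploidy).
     */
    static long numGenotypes(int numAlleles, int ploidy) {
        if (numAlleles < 1 || ploidy < 1) {
            throw new IllegalArgumentException("numAlleles and ploidy must be >= 1");
        }
        long result = 1;
        // After iteration i, result == C(numAlleles - 1 + i, i); each division is exact.
        for (int i = 1; i <= ploidy; i++) {
            result = result * (numAlleles - 1 + i) / i;
        }
        return result;
    }
}
```

For a diploid biallelic site, numGenotypes(2, 2) gives the three familiar genotypes (hom-ref, het, hom-alt). Passing only the alt count (1) would give 1, which is the kind of mismatch the commit message describes.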
Guillermo del Angel
5a10f173ea
Bug fix: BaseTest change shouldn't have been committed. First cleanup of SNP pool code (more to follow)
2012-04-03 18:55:52 -04:00
Guillermo del Angel
61e1ec6cdd
More bug fixing PoolAFCalculationModel
2012-04-03 18:12:39 -04:00
Guillermo del Angel
baad840598
Bug fixing PoolAFCalculationModel
2012-04-03 17:03:42 -04:00
Guillermo del Angel
5abb07da5d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 17:00:45 -04:00
Guillermo del Angel
4d33f63986
Bug fixing PoolAFCalculationModel
2012-04-03 16:58:12 -04:00
Christopher Hartl
a6837d31d4
Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting or at transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it.
2012-04-03 16:13:16 -04:00
Guillermo del Angel
63b1e737c6
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-03 15:43:50 -04:00
Guillermo del Angel
9e11b4f9a7
Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools: correct, but still O(N^2) and not yet complete for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). May need several iterations of optimization to make it runnable at large scale. Still need to correct the function chooseMostLikelyAlternateAlleles before full runs can be produced.
2012-04-03 15:43:32 -04:00
Eric Banks
f9ce9962c4
Minor changes to verbose mode
2012-04-03 10:53:48 -04:00
Eric Banks
8ca4df38ed
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-02 22:47:19 -04:00
Eric Banks
f6aa95685d
OutOfMemory exceptions are User Errors
2012-04-02 22:46:56 -04:00
Eric Banks
659b82e74d
Old -B syntax is long gone at this point. Safe to remove the warning.
2012-04-02 22:25:16 -04:00
Ryan Poplin
0b37c556f5
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-04-02 15:05:08 -04:00
Ryan Poplin
02a9b73360
Bug fix for SOLiD reads still triggering the active region traversal at nearly every locus.
2012-04-02 15:04:17 -04:00
Eric Banks
326220c91c
Removing extended event related unit tests
2012-04-02 14:40:36 -04:00
Eric Banks
99d27ddcc4
Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now.
2012-04-02 14:27:36 -04:00
Mark DePristo
4075bc7d3d
BUGFIX to not skip chr22
2012-04-02 09:14:07 -04:00
Mark DePristo
6b7a00061a
VariantsToTable now works with multiple input VCFs
2012-04-02 09:13:35 -04:00
Mark DePristo
8072060496
Update G1K summary table to actually work with autosome / X distinction
2012-04-01 16:10:44 -04:00
Ryan Poplin
f05cc96991
Adding multiallelics and phased variant examples to the HaplotypeCaller integration tests. Lowering max alternate alleles back to the UG default value because of memory constraints with many samples.
2012-03-31 10:53:37 -04:00
Mark DePristo
4f73ea902f
Final update for VE. VCFStreaming wasn't yet updated
2012-03-30 21:52:01 -04:00
Mark DePristo
c9b2e376d3
Change variable name from Count to Freq in variantCallQC.R
2012-03-30 20:11:59 -04:00
Mark DePristo
fbbb8509ad
Final commits to VariantEval
...
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integration tests
2012-03-30 20:11:06 -04:00
Ryan Poplin
f997b1dbf9
Don't allow active region paths with cycles because they lead to Smith-Waterman issues
2012-03-30 16:06:40 -04:00
Mark DePristo
4b45a2c99d
Final version of new VariantEval infrastructure.
...
*** WAY FASTER ***
-- 3x performance for multiple sample analysis with 1000 samples
-- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
-- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2
-- Remove the TableType system, as this was way too complex. No longer possible to embed what were effectively multiple tables in a single Evaluator. You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis. IndelLengthHistogram is now a @Molten data type. GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables as @DataPoints. You get an error if you do.
-- Simplified entire IO system of VE. Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE workaround from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
2012-03-30 15:31:56 -04:00
Mark DePristo
8c0718a7c9
Fixed missing import
2012-03-30 15:31:55 -04:00
Mark DePristo
976bac0452
BaseTest now has a global variable to turn off network connection requirement
2012-03-30 15:31:55 -04:00
Mark DePristo
097ed4ecc4
Memory usage optimizations and safety improvements to StratNode and StratificationManager
...
-- Added memory and safety optimizations to StratNode and StratificationManager. Fresh, immutable HashMaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users.
-- Added the ability for a stratification to declare incompatible evaluators. The two strats using this are AC and Sample, which are incompatible with VariantSummary: it computes per-sample averages, so combining them results in an O(n^2) memory requirement. Added integration test to cover incompatible strats and evals
2012-03-30 15:31:55 -04:00
Mark DePristo
b335c22f6d
Fully refactored, mostly cleaned up version of VariantEval using StratificationManager
2012-03-30 15:31:55 -04:00
Mark DePristo
c8086a79e3
New StratificationManager based VariantEval passes unmodified integration tests
...
-- Now needs cleanup and optimizations
2012-03-30 15:31:55 -04:00
Mark DePristo
d37f31e349
First version of VariantEval that runs (approximately correctly) with new StratificationManager
2012-03-30 15:31:54 -04:00
Mark DePristo
8971b54b21
Phase II of Stratification manager
...
-- Renamed and reorganized infrastructure
-- StratificationManager is now a Map from List<Object> -> V. All key functions are implemented. Less commonly used ones are still TODO
-- Ready for hookup to VE
2012-03-30 15:31:54 -04:00
Mark DePristo
9f1cd0ff66
Lots of new functionality for StratificationStates manager
...
-- Really working according to unit tests
-- An nCombination utility
2012-03-30 15:31:54 -04:00
Mark DePristo
91c5353c4c
By default use 4 threads for 1000G table
2012-03-30 15:31:54 -04:00
Mark DePristo
a3d896d80e
Part I of creating a fast state space lookup for VE
...
-- Created a unit-tested tree mapping from a List<String> -> integer (StratificationStates). This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
2012-03-30 15:31:53 -04:00
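Note on a3d896d80e: the idea of a static mapping from every stratification combination to an offset in a flat vector can be illustrated with a mixed-radix index. This is a deliberately simpler stand-in for the tree described in the commit, not the StratificationStates implementation. The class name, method names, and example state names are all hypothetical.

```java
import java.util.List;

// Hypothetical illustration; uses a mixed-radix index instead of the tree in the commit.
final class StratOffsetSketch {
    // statesPerStrat.get(i) lists the allowed states of the i-th stratification.
    private final List<List<String>> statesPerStrat;

    StratOffsetSketch(List<List<String>> statesPerStrat) {
        this.statesPerStrat = statesPerStrat;
    }

    /** Total number of combinations, i.e. the length of the flat vector. */
    int size() {
        int n = 1;
        for (List<String> states : statesPerStrat) {
            n *= states.size();
        }
        return n;
    }

    /** Maps one state per stratification to a unique offset in [0, size()). */
    int offsetOf(List<String> combination) {
        if (combination.size() != statesPerStrat.size()) {
            throw new IllegalArgumentException("expected one state per stratification");
        }
        int offset = 0;
        for (int i = 0; i < statesPerStrat.size(); i++) {
            List<String> states = statesPerStrat.get(i);
            int idx = states.indexOf(combination.get(i));
            if (idx < 0) {
                throw new IllegalArgumentException("unknown state: " + combination.get(i));
            }
            // Treat each stratification as one digit of a mixed-radix number.
            offset = offset * states.size() + idx;
        }
        return offset;
    }
}
```

With the offset computed once per combination, per-site updates reduce to an array lookup rather than repeated map lookups keyed on lists of strings.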
Ryan Poplin
a36f4570c4
Adding two new output options in HaplotypeCaller for Menachem. The first controls the lookahead distance for turning consecutive SNPs into MNPs, and the second outputs the full haplotype sequence for the active region instead of using Smith-Waterman to find all the variants and output them in individual VCF records.
2012-03-30 14:09:38 -04:00
Guillermo del Angel
11645568a8
Removed all functional code from PoolIndelGenotypeLikelihoodsCalculationModel - it was just copied over from non-pool version. Real pool indel functionality not ready yet. Several pool caller cleanups and tmp additions
2012-03-30 10:42:32 -04:00