Commit Graph

9238 Commits (76e4100d89b8a11b1eaaeccc1406d0b3d08c6dbe)

Author SHA1 Message Date
Mark DePristo 76e4100d89 By default, IndelLengthHistogram won't collapse large events into the last bin, as it produces weird looking plots
-- Updated integration tests as well
2012-04-04 18:48:03 -04:00
Ryan Poplin bfad26353a Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-04 16:04:50 -04:00
Ryan Poplin dda2173c66 Moved the Smith-Waterman alignment of haplotypes earlier in the process so that alleles sent to genotyping have the exact genomic sequence of the active region they represent. As a side effect, cleaned up some edge-case problems with variants, both real and false, which show up on the edges of active regions. Removed code that was duplicated between the Haplotype class and ReadUtils. Finally figured out how to ensure that the indel calls coming out of the HC are left-aligned. 2012-04-04 16:04:29 -04:00
Mark DePristo 1473e166f7 Update VariantCallQC script
-- Fixed indel length histogram to work with new table style
-- Removed unused expectations list
-- by AC plots for indels include log10(n_Indels) as their weight
-- smoothing is now weighted by the log10 count of n_Indels
2012-04-04 15:37:44 -04:00
Mark DePristo 2cc1e8d871 PostCallingQC uses IndelLengthHistogram now 2012-04-04 15:37:43 -04:00
Mark DePristo fcdd65a0f4 Bugfix for IndelLengthHistogram
-- Wasn't requiring the allele to actually be polymorphic in the samples, so it wasn't working correctly with the Sample strat.
2012-04-04 15:37:43 -04:00
Mark DePristo 3593996a87 G1K summary table needs to use the -keepAC0 flag
-- AC = 0 sites look about as good as singletons, and are likely only AC 0 because they cannot be easily imputed.  We keep them in our counting.
2012-04-04 15:37:43 -04:00
Mark DePristo 1ccea866d8 VariantEval now includes -keepAC0 argument to include sites with alt alleles but AC 0 in analyses
-- Updated EvalModules to work with new parameter
-- adding test file for keepAC0 to public/testdata and integration tests
2012-04-04 15:37:12 -04:00
Guillermo del Angel 15e26fec04 Many cosmetic fixes to pool caller (not done yet): better docs, some code reorganization, change iterator inside PoolGenotypeLikelihoods so that we store conformations and convert to/from PL vectors in the same order as the non-pool case (it used to be flipped), to maintain better legibility. Improved unit tests (not done yet) 2012-04-04 15:08:19 -04:00
Eric Banks 9e32a975f8 Wow, symbolic alleles were all busted internally and this finally bubbled up after my previous commit. For some reason we were inconsistently forcing allele trimming/padding if one was present. Not anymore. 2012-04-04 13:47:59 -04:00
Eric Banks 337ff7887a When constructing VariantContexts from symbolic alleles, check for the END tag in the INFO field; if present, set the stop position of the VC accordingly. Added integration test to ensure that this is working properly for use with -L intervals. 2012-04-04 10:57:05 -04:00
Guillermo del Angel 1248f9025d a) Clean up and fix bug in PoolGenotypeLikelihoodsUnitTest. b) Cosmetic fixes to PoolAFCalculationModel: don't print PL vectors per pool if they're too long, or else VCFs are too hard to make sense of 2012-04-03 21:27:53 -04:00
Guillermo del Angel 05d8400468 Fix up broken non-pool UG tests: GenotypeLikelihoods.calcNumLikelihoods now expects total # of alleles, not # of alt ones. Add doc to new function implementation. Add unit test for function. Add unit test for PoolGenotypeLikelihoods (not fully done yet) 2012-04-03 20:51:24 -04:00
Guillermo del Angel 5a10f173ea Bug fix: BaseTest change shouldn't have been committed, first cleanup of SNP pool code (more to follow) 2012-04-03 18:55:52 -04:00
Guillermo del Angel 61e1ec6cdd More bug fixing PoolAFCalculationModel 2012-04-03 18:12:39 -04:00
Guillermo del Angel baad840598 Bug fixing PoolAFCalculationModel 2012-04-03 17:03:42 -04:00
Guillermo del Angel 5abb07da5d Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-03 17:00:45 -04:00
Guillermo del Angel 4d33f63986 Bug fixing PoolAFCalculationModel 2012-04-03 16:58:12 -04:00
Christopher Hartl a6837d31d4 Success! A fast and low-memory converter from VCF into a binary ped file. This is mostly so I don't have to listen to Pierre/Jason complain about how slow and inefficient plinkseq is at converting or transposing. This automatically writes to individual-major mode. It will eat up space on /tmp if you don't run with -Djava.io.tmpdir, so be careful if you use it. 2012-04-03 16:13:16 -04:00
Guillermo del Angel 63b1e737c6 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-03 15:43:50 -04:00
Guillermo del Angel 9e11b4f9a7 Major refactor/completion of new Pool Caller under UnifiedGenotyper framework. PoolAFCalculationModel implements new math to combine pools - correct, but still O(N^2) and not complete yet for multiallelics. Pool likelihoods are better encapsulated and kept in an internal hashmap from int[] -> double for space efficiency (likelihoods can be big for pool calls when in initial discovery mode with 4 alleles). Maybe need several iterations of optimization to make it runnable at large scale. Still need to correct function chooseMostLikelyAlternateAlleles before full runs can be produced. 2012-04-03 15:43:32 -04:00
Eric Banks f9ce9962c4 Minor changes to verbose mode 2012-04-03 10:53:48 -04:00
Eric Banks 8ca4df38ed Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-02 22:47:19 -04:00
Eric Banks f6aa95685d OutOfMemory exceptions are User Errors 2012-04-02 22:46:56 -04:00
Eric Banks 659b82e74d Old -B syntax is long gone at this point. Safe to remove the warning. 2012-04-02 22:25:16 -04:00
Ryan Poplin 0b37c556f5 Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable 2012-04-02 15:05:08 -04:00
Ryan Poplin 02a9b73360 Bug fix for Solid reads still triggering the active region traversal at nearly every locus. 2012-04-02 15:04:17 -04:00
Eric Banks 326220c91c Removing extended event related unit tests 2012-04-02 14:40:36 -04:00
Eric Banks 99d27ddcc4 Had some free time, so I unplugged extended events from the walkers. Now they exist only in LocusIteratorByState, but ReadProperties.generateExtendedEvents() always returns false so that block is never actually executed anymore. I don't want to touch LIBS because I think David is in there right now. 2012-04-02 14:27:36 -04:00
Mark DePristo 4075bc7d3d BUGFIX to not skip chr22 2012-04-02 09:14:07 -04:00
Mark DePristo 6b7a00061a VariantsToTable now works with multiple input VCFs 2012-04-02 09:13:35 -04:00
Mark DePristo 8072060496 Update g1K summary table to actually work with autosome / X distinction 2012-04-01 16:10:44 -04:00
Ryan Poplin f05cc96991 Adding multiallelics and phased variant examples to the HaplotypeCaller integration tests. Lowering max alternate alleles back to the UG default value because of memory constraints with many samples. 2012-03-31 10:53:37 -04:00
Mark DePristo 4f73ea902f Final update for VE. VCFStreaming wasn't yet updated 2012-03-30 21:52:01 -04:00
Mark DePristo c9b2e376d3 Change variable name from Count to Freq in variantCallQC.R 2012-03-30 20:11:59 -04:00
Mark DePristo fbbb8509ad Final commits to VariantEval
-- Molten now supports variableName and valueName so you don't have to use variable and value if you don't want to.
-- Cleanup code, reorganize a bit more.
-- Fix for broken integration tests
2012-03-30 20:11:06 -04:00
Ryan Poplin f997b1dbf9 Don't allow active region paths with cycles because it leads to Smith-Waterman issues 2012-03-30 16:06:40 -04:00
Mark DePristo 4b45a2c99d Final version of new VariantEval infrastructure.
*** WAY FASTER ***
 -- 3x performance for multiple sample analysis with 1000 samples
 -- Analyzing 1MB of the ESP call set (3100 samples) takes 40 secs, compared to several minutes in the previous version
 -- According to JProfiler all of the runtime is now spent decoding genotypes, which will only get better when we move to BCF2

-- Remove the TableType system, as this was way too complex.  No longer possible to embed what were effectively multiple tables in a single Evaluator.  You now have to have 1 table per eval
-- Replaced it with @Molten, which allows an evaluator to provide a single Map from variable -> value for analysis.  IndelLengthHistogram is now a @Molten data type.  GenotypeConcordance is also.
-- No longer allow Evaluators to use private and protected variables as @DataPoints.  You get an error if you do.
-- Simplified entire IO system of VE.  Refactored into VariantEvalReportWriter.
-- Commented out GenotypePhasingEvaluator, as it uses the retired TableType
-- Stratifications are all fully typed, so it's easy for GATKReports to format them.
-- Removed old VE work around from GATKReportColumn
-- General code cleanup throughout
-- Updated integration tests
2012-03-30 15:31:56 -04:00
Mark DePristo 8c0718a7c9 Fixed missing import 2012-03-30 15:31:55 -04:00
Mark DePristo 976bac0452 BaseTest now has a global variable to turn off network connection requirement 2012-03-30 15:31:55 -04:00
Mark DePristo 097ed4ecc4 Memory usage optimizations and safety improvements to StratNode and StratificationManager
-- Added memory and safety optimizations to StratNode and StratificationManager.  Fresh, immutable Hashmaps are allocated for final data structures, so they are exactly the correct size and cannot be changed by users.
-- Added ability of a stratification to specify incompatible evaluation.  The two strats using this are AC and Sample with VariantSummary, as this computes per-sample averages and so combining these results in an O(n^2) memory requirement.  Added integration test to cover incompatible strats and evals
2012-03-30 15:31:55 -04:00
Mark DePristo b335c22f6d Fully refactored, mostly cleaned up version of VariantEval using StratificationManager 2012-03-30 15:31:55 -04:00
Mark DePristo c8086a79e3 New StratificationManager based VariantEval passes unmodified integration tests
-- Now needs cleanup and optimizations
2012-03-30 15:31:55 -04:00
Mark DePristo d37f31e349 First version of VariantEval that runs (approximately correctly) with new StratificationManager 2012-03-30 15:31:54 -04:00
Mark DePristo 8971b54b21 Phase II of Stratification manager
-- Renamed and reorganized infrastructure
-- StratificationManager now a Map from List<Object> -> V.  All key functions are implemented.  Less commonly used TODO
-- Ready for hookup to VE
2012-03-30 15:31:54 -04:00
Mark DePristo 9f1cd0ff66 Lots of new functionality for StratificationStates manager
-- Really working according to unit tests
-- A nCombination utils
2012-03-30 15:31:54 -04:00
Mark DePristo 91c5353c4c By default use 4 threads for 1000G table 2012-03-30 15:31:54 -04:00
Mark DePristo a3d896d80e Part I of creating a fast state space lookup for VE
-- Created a unit tested tree mapping from a List<String> -> integer (StratificationStates).  This class is the key infrastructure necessary to create a complete static mapping from all stratification combinations to an offset in a vector of EvaluationContexts for update in map.
-- Minor code cleanup throughout VE (removing unused headers, for example)
2012-03-30 15:31:53 -04:00
Ryan Poplin a36f4570c4 Adding two new output options in HaplotypeCaller for Menachem. The first controls the look ahead distance for turning consecutive SNPs into MNPs and the second outputs the full haplotype sequence for the active region instead of using Smith Waterman to find all the variants and output them in individual VCF records. 2012-03-30 14:09:38 -04:00
Guillermo del Angel 11645568a8 Removed all functional code from PoolIndelGenotypeLikelihoodsCalculationModel - it was just copied over from non-pool version. Real pool indel functionality not ready yet. Several pool caller cleanups and tmp additions 2012-03-30 10:42:32 -04:00