Guillermo del Angel
c9c3f6b0fc
Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity
2012-03-29 11:05:42 -04:00
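A minimal sketch of the idea in the commit above, assuming the chromosome count is simply ploidy times the number of samples. The class and method names are hypothetical and are not taken from the GATK source.

```java
import java.util.Set;

final class ChromosomeCountSketch {
    // Hypothetical helper: derive the chromosome count from the sample set
    // and the ploidy, instead of passing a separate sample count around.
    static int numChromosomes(Set<String> sampleNames, int ploidy) {
        if (ploidy < 1) {
            throw new IllegalArgumentException("ploidy must be >= 1");
        }
        return ploidy * sampleNames.size();
    }
}
```

With three diploid samples this yields six chromosomes, so callers no longer need to keep a sample count in sync with the sample set.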
Guillermo del Angel
a0843f125e
Forgot to add file itself for new unit test
2012-03-28 21:08:18 -04:00
Guillermo del Angel
250adca350
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 21:01:49 -04:00
Guillermo del Angel
e0ab4e4b30
Refactoring so that ConsensusAlleleCounter can use regular pileups and can operate correctly. This involved adding utility functions to ReadBackedPileup to count # of insertions/deletions right after current position. Added unit test for IndelGenotypeLikelihoods, esp. ConsensusAlleleCounter logic
2012-03-28 21:01:31 -04:00
Mauricio Carneiro
8f0e9d74ce
GATKReportTable output refactor
...
writing out a GATKReportTable was O(n^2)!!!!!
New implementation is O(n). What a difference, when N = 2^16...
2012-03-28 17:19:12 -04:00
Menachem Fromer
5ff24f8cf3
We want the DoC calculations for CNVs to include N bases in the reference just so that the PAR regions in chr Y (set to all Ns) end up with mean coverage of 0/L, where L = target length (and not NaN = 0/0). Including Ns in any other targets is probably OK too, since most (if not all) non-chromosome Y targets will not have any Ns
2012-03-28 16:42:34 -04:00
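The 0/L versus 0/0 distinction in the commit above can be sketched as follows. This is only an illustration of the floating-point arithmetic; the class and method names are hypothetical and do not come from the DoC code.

```java
final class MeanCoverageSketch {
    // Hypothetical helper: mean depth over the bases that were counted.
    // The division is done in double precision, so 0 / 0 gives NaN rather
    // than throwing, while 0 / L gives 0.0 for any target length L > 0.
    static double meanCoverage(long totalDepth, int basesCounted) {
        return (double) totalDepth / basesCounted;
    }
}
```

If N bases are skipped, an all-N target has zero counted bases and the mean is NaN; if they are included, the same target has L counted bases and a mean of 0.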
Guillermo del Angel
62ee31afba
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 16:00:38 -04:00
Guillermo del Angel
1eee9d512d
a) Make computeConsensusAlleles protected inside IndelGenotypeLikelihoodsCalculationModel so we can use it in unit tests, b) make ConsensusAlleleCounter work if no extended event pileup is present (necessary for ext. event removal)
2012-03-28 15:41:39 -04:00
Ryan Poplin
4a396d357d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 15:37:33 -04:00
Ryan Poplin
c17e2c61ef
Adding more assembly unit tests
2012-03-28 15:37:18 -04:00
Mauricio Carneiro
bb36cd4adf
Quick fixes to BQSRGatherer and GATKReportTable
...
* when gathering, be aware that some keys will be missing from some tables.
* when a gatktable has no elements, it should still output the header so we know it had no records
2012-03-28 09:07:54 -04:00
Roger Zurawicki
63cf7ec7ec
Added more primitives to GATK Report Column Type
...
- The Integer column type now accepts byte and shorts
- Updated Unit Tests and added a new testParse() test
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00
Guillermo del Angel
d2586911a4
Forgot to add tolerance to new MathUtils unit tests
2012-03-28 08:18:36 -04:00
Guillermo del Angel
08f7d47d7c
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 07:42:09 -04:00
Ryan Poplin
7f211c66fa
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-27 23:09:18 -04:00
Ryan Poplin
60c9ada5d8
Adding some basic AssemblyEngine unit tests.
2012-03-27 23:09:04 -04:00
Mark DePristo
12aa72f200
Merged bug fix from Stable into Unstable
2012-03-27 22:43:00 -04:00
Mark DePristo
979a84a252
Bugfix for thread unsafe PL cache
...
-- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Solution is to use a fixed cache that's never updated on the fly. My changes limit us to having no more than 500 alleles at a site, which I hope is ok, but the limit is easy enough to raise to a ridiculously large number.
2012-03-27 22:42:30 -04:00
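A minimal sketch of the "fixed cache" approach described in the commit above, assuming the cache maps a PL index to its pair of allele indices using the standard VCF ordering for diploid genotypes, index(j/k) = k(k+1)/2 + j with j <= k. The class and method names are hypothetical; the actual GATK change may be structured differently.

```java
final class FixedPLIndexCacheSketch {
    // Upper bound on alleles per site, mirroring the limit in the commit message.
    static final int MAX_ALLELES = 500;

    // Built once at class initialization and never modified afterwards,
    // so concurrent readers cannot observe a partially updated cache.
    private static final int[][] CACHE = build(MAX_ALLELES);

    private static int[][] build(int maxAlleles) {
        int[][] table = new int[maxAlleles * (maxAlleles + 1) / 2][];
        for (int k = 0; k < maxAlleles; k++) {
            for (int j = 0; j <= k; j++) {
                table[k * (k + 1) / 2 + j] = new int[] {j, k};
            }
        }
        return table;
    }

    // Returns a defensive copy so callers cannot mutate the shared table.
    static int[] allelePair(int plIndex) {
        if (plIndex < 0 || plIndex >= CACHE.length) {
            throw new IllegalArgumentException("PL index out of range: " + plIndex);
        }
        return CACHE[plIndex].clone();
    }
}
```

The trade-off is the one the commit names: a fixed table is safe to share across threads without locking, at the cost of a hard upper bound on the number of alleles.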
Guillermo del Angel
8f34412fb8
First Pool Caller exact model: silly straightforward math implementation of biallelic pool caller exact likelihood model, no attempt at any smartness or optimization, no support yet for generalized multiallelic form, just hooking up for testing
2012-03-27 20:59:44 -04:00
Ryan Poplin
74615c42df
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-27 15:19:46 -04:00
Ryan Poplin
80b11cfe3c
Bug fix for the case of zero reads remaining in the active region after clipping and filtering.
2012-03-27 15:19:34 -04:00
Guillermo del Angel
ed322bd73f
Fix merge issues again
2012-03-27 15:03:13 -04:00
Guillermo del Angel
b4a7c0d98d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-27 15:01:03 -04:00
Guillermo del Angel
343a061b1c
Fix merge issues when incorporating new AF calculations changes
2012-03-27 15:00:44 -04:00
Mauricio Carneiro
1b75663178
BQSR Gatherer implementation and integration tests
...
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
* optimized empirical qual calculation when merging recalibration reports
* centralized the quality score quantization functionalities
* unified the creating/loading of all the key manager/hash table structures.
* added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
* added integration tests for BQSR and on-the-fly recalibration
2012-03-27 13:50:22 -05:00
Joel Thibault
ea9c04b8c2
Updated license year
2012-03-27 14:32:02 -04:00
Ryan Poplin
5dbd3625cd
Initial algorithm for choosing best alternate haplotypes to genotype based on the likelihoods from all samples instead of choosing for each sample independently. Simple tradeoff of penalty for increasing model complexity and likelihood of the data.
2012-03-27 13:38:52 -04:00
Eric Banks
c112e0824a
I was adding verbose output to the Pileup output for a one-off and decided that I might as well commit it as an option. Updated deprecated calls while I was in there.
2012-03-27 11:09:03 -05:00
Mark DePristo
a638996fe2
Cleanup of VariantEval, diatribe about performance problems with StateKey
...
-- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear
-- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.
2012-03-27 11:56:24 -04:00
Mark DePristo
679bb03014
Simple utility function for converting an Iterable<T> to Collection<T>
2012-03-27 11:54:58 -04:00
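A utility of the kind named in the commit above might look like the following. This is a sketch with hypothetical names, not the function that was actually committed.

```java
import java.util.ArrayList;
import java.util.Collection;

final class IterableSketch {
    // Hypothetical helper: copy every element of an Iterable into a new
    // Collection, preserving iteration order.
    static <T> Collection<T> toCollection(Iterable<T> items) {
        Collection<T> out = new ArrayList<T>();
        for (T item : items) {
            out.add(item);
        }
        return out;
    }
}
```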
Mark DePristo
1f5f737c8b
Optimizing the GATKReportTable.write
...
-- Better iteration, caching of strings, better printf calls, to improve the writing performance of GATKReportTables
2012-03-27 11:54:35 -04:00
Mark DePristo
9c384b4813
Near final version of 1000G Phase I table script
2012-03-27 11:52:07 -04:00
Mark DePristo
913c8b231f
Fix ErrorRatePerCycle to overload equals and hashcode
...
-- Fixes failing integration tests
2012-03-27 10:35:32 -04:00
Eric Banks
c07a577ba3
Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously).
2012-03-27 00:27:44 -05:00
Guillermo del Angel
e8bb8ade1a
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-26 16:42:03 -04:00
Guillermo del Angel
1a2a4848e8
Added integration test for ValidationSiteSelector, correct MD5's
2012-03-26 16:39:55 -04:00
Mark DePristo
34ea443cdb
Better algorithm for choosing which indel alleles are present in samples
...
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the count towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel-containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
2012-03-26 16:28:49 -04:00
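One possible reading of the per-sample rule in the commit above, as a sketch. The threshold X is not given in the message, so it is a parameter here (`minFraction`); the class name, method name, and array-based inputs are all hypothetical.

```java
final class IndelEvidenceSketch {
    // Hypothetical helper: sum indel-carrying reads across samples, but only
    // from samples where the fraction of reads carrying any indel exceeds
    // minFraction. The caller then compares the result against the overall
    // copy threshold (5 in the commit message).
    static int countedIndelReads(int[] indelReads, int[] totalReads, double minFraction) {
        int counted = 0;
        for (int i = 0; i < indelReads.length; i++) {
            if (totalReads[i] == 0) {
                continue; // no data for this sample
            }
            double fraction = (double) indelReads[i] / totalReads[i];
            if (fraction > minFraction) {
                counted += indelReads[i];
            }
        }
        return counted;
    }
}
```

Under this reading, a sample with 1 indel read out of 2 contributes, while a deep sample with 2 indel reads out of 100 (plausibly sequencing error) does not.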
Mark DePristo
11b6fd990a
GATKReportColumn optimizations
...
-- Was TreeMap even though the sorting wasn't used. Replaced with LinkedHashMap.
2012-03-26 16:28:49 -04:00
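To illustrate the TreeMap to LinkedHashMap swap in the commit above: a LinkedHashMap keeps keys in insertion order with constant-time puts, whereas a TreeMap pays O(log n) per put to maintain a sorted order that, per the commit, was never used. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnOrderSketch {
    // Hypothetical helper: insert keys into a LinkedHashMap and return them
    // in the order the map iterates, which is insertion order.
    static List<String> keysInInsertionOrder(String... keys) {
        Map<String, Integer> column = new LinkedHashMap<String, Integer>();
        for (String key : keys) {
            column.put(key, column.size());
        }
        return new ArrayList<String>(column.keySet());
    }
}
```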
Mark DePristo
a0dcac9ec6
Indel length distributions are now boxplots
2012-03-26 16:28:48 -04:00
Mark DePristo
6be5e82860
VariantEval scalability optimizations
...
-- StateKey no longer extends TreeMap. It's now a final immutable data structure that caches its toString and hashcode values. TODO optimizations to entirely remove the TreeMap and just store the HashMap for performance and use the tree for the sorted tostring function.
-- NewEvaluationContext has a method makeStateKey() that contains all of the functionality that once was spread around VEUtils
-- AnalysisModuleScanner uses an annotationCache to speed up the reflections getAnnotations() call when invoked over and over on the same objects. Still expensive to convert each field to a string for the cache, but the only way around that is a complete refactoring of the toTransversalDone of VE
-- VariantEvaluator base class has a cached getSimpleName() function
-- VEUtils: general cleanup due to refactoring of StateKey
-- VEWalker: much better iteration of map data structures. If you need access to iterate over all key/value pairs use the Map.Entry construct with entrySet. This is far better than iterating over the keys and calling get() on each key.
2012-03-26 16:28:48 -04:00
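The Map.Entry advice in the commit above can be sketched as follows. The class and method names are hypothetical; only `Map.entrySet()` and `Map.Entry` are from the standard library.

```java
import java.util.Map;

final class EntryIterationSketch {
    // Iterating entrySet() visits each key/value pair once, with no extra
    // lookup. Iterating keySet() and calling get(key) would perform a second
    // lookup per key (O(log n) each for a TreeMap).
    static long sumValues(Map<String, Integer> counts) {
        long total = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            total += entry.getValue();
        }
        return total;
    }
}
```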
Guillermo del Angel
1c424c0daf
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-26 15:15:50 -04:00
Ryan Poplin
8f0828daa6
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-26 11:32:58 -04:00
Ryan Poplin
019145175b
Major optimizations to graph construction through better use of built-in graph.containsVertex and vertex.equals methods. Minor optimizations to MathUtils.approximateLog10SumLog10 method
2012-03-26 11:32:44 -04:00
Mark DePristo
dcc354c38e
Continuing to improve stack trace formatting. Now includes error message as well
2012-03-26 08:18:21 -04:00
Ryan Poplin
1fa66f76c9
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-25 23:04:47 -04:00
Ryan Poplin
3bf650ec18
bug fix for concurrent modification exception when path cache size was exceeded.
2012-03-25 20:44:45 -04:00
Guillermo del Angel
ce617b2dfc
Bug fix to previous UnifiedGenotyperEngine refactoring, removed debug code
2012-03-25 10:20:21 -04:00
Guillermo del Angel
db54c2625f
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-25 09:53:35 -04:00
Eric Banks
c4f0d7eb73
Added mode to call SNPs and indels separately. Added functional annotation step back in.
2012-03-25 09:06:28 -04:00
Guillermo del Angel
deb4586559
Next intermediate commit for new pool caller structure: a) Bug fixes in pool GL computation. Now, correct GL's are returned for each pool to the UG engine. Work still needs to be done in redoing interface with exact model. b) Added unit tests for new MathUtils dot product and logDotProduct functions. c) Refactorings of UnifiedGenotyperEngine since N (size of prior/posterior arrays) is no longer necessarily nSamples+1 but, in general, nSamplesPerPool*nPools+1
2012-03-24 21:49:43 -04:00