Eric Banks
16bef191c6
UG integration tests updated. A handful of sites are lost because there are only 5 indels and one starts at the beginning of the read so it no longer passes our min threshold (now consistent with GGA), but mostly the depth changes ever so slightly once in a while between extended and normal pileups (I think the normal pileups are correct). I have looked thoroughly in IGV at ALL differences and am happy with the new results. As an aside, the AD is now calculated more accurately for indels.
2012-03-30 01:35:49 -04:00
Eric Banks
f4d4969f23
Don't ever return null for the list of GL models
2012-03-30 00:22:40 -04:00
Eric Banks
44ac49aa34
Removing dependencies in the annotations on extended events. Some refactoring involved in this.
2012-03-30 00:17:02 -04:00
Eric Banks
c2e27729c7
Renaming PileupElement.isBeforeDeletion() to PileupElement.isBeforeDeletedBase() so that it's more clear that it can still be true while inside a deletion. Added PileupElement.isBeforeDeletionStart() to cover the case that I want where we only trigger before the actual deletion event. Similarly for after a deletion. Updated counting code in ConsensusAlleleCounter accordingly.
2012-03-29 17:08:25 -04:00
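The distinction drawn in c2e27729c7 can be sketched as follows. This is a minimal illustration, not the PileupElement code: it assumes a read's alignment is given as one operation character per reference offset ('M' for an aligned base, 'D' for a deleted base), and the class name and helper signatures are hypothetical.

```java
final class DeletionFlagsSketch {
    // ops[i] is the alignment operation at offset i: 'M' = aligned base, 'D' = deleted base.
    // This array representation is an assumption made for the sketch.
    static boolean isDeletion(char[] ops, int i) {
        return i >= 0 && i < ops.length && ops[i] == 'D';
    }

    // True whenever the next base is deleted, even if the current base is deleted too.
    static boolean isBeforeDeletedBase(char[] ops, int i) {
        return isDeletion(ops, i + 1);
    }

    // True only on the last non-deleted base before a deletion begins.
    static boolean isBeforeDeletionStart(char[] ops, int i) {
        return !isDeletion(ops, i) && isDeletion(ops, i + 1);
    }
}
```

For the operations MMDDM, both predicates are true at offset 1, but only isBeforeDeletedBase is true at offset 2, which sits inside the deletion.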
Eric Banks
e4469a83ee
First attempt at removing all traces of extended events from UG; integration tests are expected to fail.
2012-03-29 14:59:29 -04:00
Eric Banks
e61e162c81
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-29 12:33:13 -04:00
Mauricio Carneiro
cf364f26a0
Fixing alignment issue with the GATKReportColumn algorithm
...
Numeric columns were being left-aligned when they should be right-aligned. Fixed it.
2012-03-29 12:28:49 -04:00
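The alignment rule described in cf364f26a0 can be illustrated with plain String.format. This is only a sketch of the rule (numbers right-aligned, everything else left-aligned); the class and method names are hypothetical and it is not the GATKReportColumn implementation.

```java
final class ColumnAlignSketch {
    // Pads a cell to the given width: numeric values are right-aligned,
    // all other values are left-aligned.
    static String formatCell(Object value, int width) {
        // "%6s" right-aligns within 6 characters; "%-6s" left-aligns.
        String pattern = (value instanceof Number) ? "%" + width + "s" : "%-" + width + "s";
        return String.format(pattern, value);
    }
}
```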
Mauricio Carneiro
f80bd4276a
fixed estimated Q reported calculation in the gatherer
2012-03-29 12:28:43 -04:00
Mauricio Carneiro
8a9fb514b6
simplifying GATKReportColumn constructor logic
2012-03-29 12:28:37 -04:00
Eric Banks
e861106398
Accidentally erased important line
2012-03-29 11:08:54 -04:00
Eric Banks
e4a225ed09
Move the code to subset a Variant Context to fewer alleles (including restructuring the PLs appropriately) into VariantContextUtils where it can be used generally.
2012-03-29 11:07:37 -04:00
Guillermo del Angel
c9c3f6b0fc
Minor UG Engine refactoring/cleanup: instead of passing in the # of samples separately from sample set, pass in ploidy instead and compute # of chromosomes internally - will help later on with code clarity
2012-03-29 11:05:42 -04:00
Guillermo del Angel
a0843f125e
Forgot to add file itself for new unit test
2012-03-28 21:08:18 -04:00
Guillermo del Angel
250adca350
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 21:01:49 -04:00
Guillermo del Angel
e0ab4e4b30
Refactoring so that ConsensusAlleleCounter can use regular pileups and can operate correctly. This involved adding utility functions to ReadBackedPileup to count # of insertions/deletions right after current position. Added unit test for IndelGenotypeLikelihoods, esp. ConsensusAlleleCounter logic
2012-03-28 21:01:31 -04:00
Mauricio Carneiro
8f0e9d74ce
GATKReportTable output refactor
...
writing out a GATKReportTable was O(n^2)!!!!!
New implementation is O(n). What a difference, when N = 2^16...
2012-03-28 17:19:12 -04:00
Menachem Fromer
5ff24f8cf3
We want the DoC calculations for CNVs to include N bases in the reference just so that the PAR regions in chr Y (set to all Ns) end up with mean coverage of 0/L, where L = target length (and not NaN = 0/0). Including Ns in any other targets is probably OK too, since most (if not all) non-chromosome Y targets will not have any Ns
2012-03-28 16:42:34 -04:00
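The arithmetic behind 5ff24f8cf3 is easy to reproduce. The sketch below assumes mean coverage is total depth divided by the number of bases counted in the target; the class and method names are hypothetical.

```java
final class MeanCoverageSketch {
    // Mean depth over the bases counted for a target.
    static double meanCoverage(long totalDepth, int countedBases) {
        // Floating-point division: 0.0 / 0 yields NaN rather than throwing.
        return (double) totalDepth / countedBases;
    }
}
```

If N bases are excluded, an all-N target of length L contributes zero counted bases and the mean is 0/0 = NaN. If they are included, the same target yields 0/L = 0.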
Guillermo del Angel
62ee31afba
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 16:00:38 -04:00
Guillermo del Angel
1eee9d512d
a) Make computeConsensusAlleles protected inside IndelGenotypeLikelihoodsCalculationModel so we can use it in unit tests, b) make ConsensusAlleleCounter work if no extended event pileup is present (necessary for ext. event removal)
2012-03-28 15:41:39 -04:00
Ryan Poplin
4a396d357d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 15:37:33 -04:00
Ryan Poplin
c17e2c61ef
Adding more assembly unit tests
2012-03-28 15:37:18 -04:00
Mauricio Carneiro
bb36cd4adf
Quick fixes to BQSRGatherer and GATKReportTable
...
* when gathering, be aware that some keys will be missing from some tables.
* when a GATKReportTable has no elements, it should still output the header so we know it had no records
2012-03-28 09:07:54 -04:00
Roger Zurawicki
63cf7ec7ec
Added more primitives to GATK Report Column Type
...
- The Integer column type now accepts byte and shorts
- Updated Unit Tests and added a new testParse() test
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2012-03-28 09:07:54 -04:00
Guillermo del Angel
d2586911a4
Forgot to add tolerance to new MathUtils unit tests
2012-03-28 08:18:36 -04:00
Guillermo del Angel
08f7d47d7c
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-28 07:42:09 -04:00
Ryan Poplin
7f211c66fa
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-27 23:09:18 -04:00
Ryan Poplin
60c9ada5d8
Adding some basic AssemblyEngine unit tests.
2012-03-27 23:09:04 -04:00
Mark DePristo
12aa72f200
Merged bug fix from Stable into Unstable
2012-03-27 22:43:00 -04:00
Mark DePristo
979a84a252
Bugfix for thread unsafe PL cache
...
-- See https://getsatisfaction.com/gsa/topics/unifiedgenotyper_error_indel?utm_content=topic_link&utm_medium=email&utm_source=new_topic
-- Solution is to use a fixed cache that's never updated on the fly. My changes limit us to having no more than 500 alleles at a site, which I hope is OK; it is easy enough to raise the limit to a ridiculously large number.
2012-03-27 22:42:30 -04:00
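The idea in 979a84a252 of a fixed cache that is never updated on the fly can be sketched as a table built once during class initialization and only read afterwards. What the real PL cache stores is not stated in the commit; this sketch assumes it maps a diploid PL index to its allele pair using the VCF genotype ordering, and all names here are hypothetical.

```java
final class FixedPairCacheSketch {
    static final int MAX_ALLELES = 500; // the limit named in the commit message

    // Built once during class initialization and never written afterwards,
    // so concurrent readers need no locking.
    private static final int[][] INDEX_TO_PAIR = build(MAX_ALLELES);

    private static int[][] build(int maxAlleles) {
        int[][] table = new int[maxAlleles * (maxAlleles + 1) / 2][];
        int index = 0;
        // VCF ordering of diploid genotypes: (0,0), (0,1), (1,1), (0,2), (1,2), (2,2), ...
        for (int k = 0; k < maxAlleles; k++) {
            for (int j = 0; j <= k; j++) {
                table[index++] = new int[] {j, k};
            }
        }
        return table;
    }

    // Returns the allele pair for a PL index, or throws if the index exceeds the fixed table.
    static int[] allelePair(int plIndex) {
        if (plIndex < 0 || plIndex >= INDEX_TO_PAIR.length) {
            throw new IllegalArgumentException("PL index out of range: " + plIndex);
        }
        return INDEX_TO_PAIR[plIndex].clone(); // defensive copy keeps the table unmodified
    }
}
```

The trade-off is the one the commit describes: a lazily grown cache has no upper bound but needs synchronization, while a fixed table is safe to share across threads at the cost of a hard limit on alleles per site.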
Guillermo del Angel
8f34412fb8
First Pool Caller exact model: silly straightforward math implementation of biallelic pool caller exact likelihood model, no attempt at any smartness or optimization, no support yet for generalized multiallelic form, just hooking up for testing
2012-03-27 20:59:44 -04:00
Ryan Poplin
74615c42df
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-27 15:19:46 -04:00
Ryan Poplin
80b11cfe3c
Bug fix for the case of zero reads remaining in the active region after clipping and filtering.
2012-03-27 15:19:34 -04:00
Guillermo del Angel
ed322bd73f
Fix merge issues again
2012-03-27 15:03:13 -04:00
Guillermo del Angel
b4a7c0d98d
Merge branch 'master' of ssh://nickel.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-27 15:01:03 -04:00
Guillermo del Angel
343a061b1c
Fix merge issues when incorporating new AF calculations changes
2012-03-27 15:00:44 -04:00
Mauricio Carneiro
1b75663178
BQSR Gatherer implementation and integration tests
...
* restructured the hash tables into one class (RecalibrationReport) that has all the functionality for the different tables and key managers
* optimized empirical qual calculation when merging recalibration reports
* centralized the quality score quantization functionalities
* unified the creating/loading of all the key manager/hash table structures.
* added unit tests for the gatherer (disabled because gatk report needs to be sorted for automated testing)
* added integration tests for BQSR and on-the-fly recalibration
2012-03-27 13:50:22 -05:00
Joel Thibault
ea9c04b8c2
Updated license year
2012-03-27 14:32:02 -04:00
Ryan Poplin
5dbd3625cd
Initial algorithm for choosing best alternate haplotypes to genotype based on the likelihoods from all samples instead of choosing for each sample independently. Simple tradeoff of penalty for increasing model complexity and likelihood of the data.
2012-03-27 13:38:52 -04:00
Eric Banks
c112e0824a
I was adding verbose output to the Pileup output for a one-off and decided that I might as well commit it as an option. Updated deprecated calls while I was in there.
2012-03-27 11:09:03 -05:00
Mark DePristo
a638996fe2
Cleanup of VariantEval, diatribe about performance problems with StateKey
...
-- Minor refactoring of state key iteration in VEW.map to make the dependencies more clear
-- Long discussion about the performance problems with StateKey, and how to fix it, which I have run out of time to address before ESP meeting.
2012-03-27 11:56:24 -04:00
Mark DePristo
679bb03014
Simple utility function for converting an Iterable<T> to Collection<T>
2012-03-27 11:54:58 -04:00
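A utility of the kind described in 679bb03014 might look like the sketch below. The class and method names are hypothetical, and returning the input unchanged when it is already a Collection is a design choice of this sketch, not something the commit confirms.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class IterableUtilsSketch {
    // Converts any Iterable<T> into a Collection<T>.
    static <T> Collection<T> toCollection(Iterable<T> iterable) {
        if (iterable instanceof Collection) {
            return (Collection<T>) iterable; // already a collection, no copy needed
        }
        List<T> copy = new ArrayList<>();
        for (T item : iterable) {
            copy.add(item);
        }
        return copy;
    }
}
```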
Mark DePristo
1f5f737c8b
Optimizing the GATKReportTable.write
...
-- Better iteration, caching of strings, better printf calls, to improve the writing performance of GATKReportTables
2012-03-27 11:54:35 -04:00
Mark DePristo
9c384b4813
Near final version of 1000G Phase I table script
2012-03-27 11:52:07 -04:00
Mark DePristo
913c8b231f
Fix ErrorRatePerCycle to override equals and hashCode
...
-- Fixes failing integration tests
2012-03-27 10:35:32 -04:00
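Why equals and hashCode matter can be shown with a small key class. The commit does not say which fields ErrorRatePerCycle compares or how the objects are used; the read group and cycle fields and the class name below are hypothetical, chosen only to illustrate the pattern.

```java
import java.util.Objects;

final class CycleKeySketch {
    private final String readGroup; // hypothetical field
    private final int cycle;        // hypothetical field

    CycleKeySketch(String readGroup, int cycle) {
        this.readGroup = readGroup;
        this.cycle = cycle;
    }

    // Two keys are equal when all of their fields match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof CycleKeySketch)) return false;
        CycleKeySketch that = (CycleKeySketch) other;
        return cycle == that.cycle && Objects.equals(readGroup, that.readGroup);
    }

    // Must agree with equals: equal keys produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(readGroup, cycle);
    }
}
```

Without these overrides, two separately constructed keys with the same field values are treated as different entries by hash-based collections such as HashMap.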
Eric Banks
c07a577ba3
Significant restructuring of the Exact model, as discussed within the dev group last week. There is no more marginalizing over alternate alleles, and we now keep track of the MLE and MAP. Important notes: 1) integration tests change because the previous marginalization wasn't done correctly (as pointed out by Guillermo) and our confidences were too high for many multi-allelic sites; 2) there is a major TO-DO item that needs to be discussed within the dev group (so they should expect a follow up email); 3) this code is still in flux as I am awaiting feedback from Ryan now on its performance with the Haplotype Caller (the good news, Ryan, is that we recover that site that we were losing previously).
2012-03-27 00:27:44 -05:00
Guillermo del Angel
e8bb8ade1a
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2012-03-26 16:42:03 -04:00
Guillermo del Angel
1a2a4848e8
Added integration test for ValidationSiteSelector, correct MD5's
2012-03-26 16:39:55 -04:00
Mark DePristo
34ea443cdb
Better algorithm for choosing which indel alleles are present in samples
...
-- The previous approach (requiring > 5 copies among all reads) is breaking down in many samples (>1000) just from sequencing errors.
-- This breakdown is producing spurious clustered indels (lots of these!) around real common indels
-- The new approach requires >X% of reads in a sample to carry an indel of any type (no allele matching) to be included in the count towards 5. This actually makes sense in that if you have enough data we expect most reads to have the indel, but the allele might be wrong because of alignment, etc. If you have very few reads, then the threshold is crossed with any indel-containing read, and it's counted.
-- As far as I can tell this is the right thing to do in general. We'll make another call set in ESP and see how it works at scale.
-- Added integration tests to ensure that the system is behaving as I expect on the site I developed the code on from ESP
2012-03-26 16:28:49 -04:00
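One reading of the counting rule in 34ea443cdb is sketched below, under stated assumptions: the filter is applied per sample, a sample's indel reads count towards the site total only when the fraction of its reads carrying any indel exceeds X, and X is passed in as minIndelFraction because the commit does not give its value. The SampleCounts structure and all names are hypothetical; this is not the UnifiedGenotyper code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class IndelAlleleFilterSketch {
    // Summary of one sample's pileup at a site (hypothetical structure).
    static final class SampleCounts {
        final int totalReads;
        final Map<String, Integer> indelAlleleCounts; // allele -> reads carrying it

        SampleCounts(int totalReads, Map<String, Integer> indelAlleleCounts) {
            this.totalReads = totalReads;
            this.indelAlleleCounts = indelAlleleCounts;
        }

        // Reads carrying an indel of any type, with no allele matching.
        int readsWithAnyIndel() {
            int sum = 0;
            for (int count : indelAlleleCounts.values()) sum += count;
            return sum;
        }
    }

    // Sums allele counts over samples, skipping samples whose indel fraction is at or below the threshold.
    static Map<String, Integer> countAlleles(List<SampleCounts> samples, double minIndelFraction) {
        Map<String, Integer> totals = new HashMap<>();
        for (SampleCounts sample : samples) {
            if (sample.totalReads == 0) continue;
            double fraction = (double) sample.readsWithAnyIndel() / sample.totalReads;
            if (fraction <= minIndelFraction) continue; // likely sequencing error in a deep sample
            for (Map.Entry<String, Integer> entry : sample.indelAlleleCounts.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
            }
        }
        return totals;
    }
}
```

With a threshold of 0.25, a sample with 1 indel read out of 100 is ignored, while a sample with a single read that carries the indel crosses the threshold and is counted, matching the behavior the message describes for low-coverage samples.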
Mark DePristo
11b6fd990a
GATKReportColumn optimizations
...
-- Was TreeMap even though the sorting wasn't used. Replaced with LinkedHashMap.
2012-03-26 16:28:49 -04:00
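The trade-off behind 11b6fd990a can be seen with the standard java.util maps: TreeMap keeps keys sorted at O(log n) per insertion, while LinkedHashMap keeps insertion order with constant-time hashing. The helper below is a hypothetical illustration and not the GATKReportColumn code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MapChoiceSketch {
    // Inserts three keys in a fixed order and reports the map's iteration order.
    static List<String> keysAfterInserts(Map<String, Integer> map) {
        map.put("b", 1);
        map.put("a", 2);
        map.put("c", 3);
        return new ArrayList<>(map.keySet());
    }
}
```

Passing a LinkedHashMap yields the keys in insertion order (b, a, c), while a TreeMap yields them sorted (a, b, c). When the sorted order is never used, the cheaper map is sufficient.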
Mark DePristo
a0dcac9ec6
Indel length distributions are now boxplots
2012-03-26 16:28:48 -04:00