Eric Banks
1e90d602a4
Optimization: cache up front the PL index to the pair of alleles it represents for all possible numbers of alternate alleles.
2011-12-14 13:38:20 -05:00
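A minimal sketch of the caching idea in the commit above, assuming the genotype ordering from the VCF specification, where the allele pair j/k (j <= k, allele 0 being the reference) sits at PL index k*(k+1)/2 + j. The class name PlIndexCache and its method are hypothetical and are not taken from the GATK source.

```java
// Hypothetical illustration, not the GATK class: precomputes, for every number of
// alternate alleles up to a maximum, the allele pair that each PL index represents.
final class PlIndexCache {
    // cache[numAlt][plIndex] = {j, k} with j <= k; allele 0 is the reference.
    private final int[][][] cache;

    PlIndexCache(int maxAltAlleles) {
        cache = new int[maxAltAlleles + 1][][];
        for (int numAlt = 0; numAlt <= maxAltAlleles; numAlt++) {
            int numAlleles = numAlt + 1;
            // A diploid site with n alleles has n*(n+1)/2 unordered genotypes.
            int[][] pairs = new int[numAlleles * (numAlleles + 1) / 2][];
            for (int k = 0; k < numAlleles; k++) {
                for (int j = 0; j <= k; j++) {
                    // Assumed ordering (VCF specification): index = k*(k+1)/2 + j.
                    pairs[k * (k + 1) / 2 + j] = new int[] {j, k};
                }
            }
            cache[numAlt] = pairs;
        }
    }

    // Constant-time lookup instead of re-deriving the pair on every call.
    int[] pairFor(int numAltAlleles, int plIndex) {
        return cache[numAltAlleles][plIndex];
    }
}
```

Under that ordering, a site with two alternate alleles has six PL entries, and index 4 maps to the pair 1/2.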
Eric Banks
988d60091f
Forgot to add in the new result class
2011-12-14 13:37:15 -05:00
Eric Banks
106bf13056
Use a thread local result object to collect the results of the exact calculation instead of passing in multiple pre-allocated arrays.
2011-12-14 12:05:50 -05:00
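The pattern named in the commit above can be sketched as follows. This is an illustration under assumptions, not the actual GATK code: the class names ExactResult and ExactResultHolder, the field names, and the fixed capacity are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical result holder; class and field names are illustrative only.
final class ExactResult {
    final double[] log10Likelihoods;
    final double[] log10Posteriors;

    ExactResult(int size) {
        log10Likelihoods = new double[size];
        log10Posteriors = new double[size];
    }

    // Clear both arrays so the object can be reused for the next calculation.
    void reset() {
        Arrays.fill(log10Likelihoods, Double.NEGATIVE_INFINITY);
        Arrays.fill(log10Posteriors, Double.NEGATIVE_INFINITY);
    }
}

final class ExactResultHolder {
    private static final int MAX_SIZE = 16; // assumed capacity for this sketch

    // One result object per thread, allocated lazily on first use.
    private static final ThreadLocal<ExactResult> RESULT =
            ThreadLocal.withInitial(() -> new ExactResult(MAX_SIZE));

    // Returns the calling thread's result object, reset and ready to fill.
    static ExactResult get() {
        ExactResult result = RESULT.get();
        result.reset();
        return result;
    }
}
```

Each thread reuses its own object, so callers no longer need to pass several pre-allocated arrays into the calculation, and threads never share mutable result state.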
Eric Banks
7648521718
Add check for mixed genotype so that we don't throw an exception for a valid record
2011-12-14 11:26:43 -05:00
Eric Banks
9497e9492c
Bug fix for complex records: do not ever reverse clip out a complete allele.
2011-12-14 11:21:28 -05:00
Eric Banks
09a5a9eac0
Don't update lineNo for decodeLoc - only for decode (otherwise they get double-counted). Even still, because of the way the GATK currently utilizes Tribble we can parse the same line multiple times, which knocks the line counter out of sync. For now, I've added a TODO in the code to remind us and the error messages note that it's an approximate line number.
2011-12-14 10:43:52 -05:00
Eric Banks
d3f4a5a901
Fail gracefully when encountering malformed VCFs without enough data columns
2011-12-14 10:37:38 -05:00
Eric Banks
079932ba2a
The log10cache needs to be larger if we want to handle 10K samples in the UG.
2011-12-13 23:36:10 -05:00
Ryan Poplin
7fa1ab1bae
Fix to allow haplotype caller to call indels after UG engine entry points were unified. Adding Haplotype Caller integration test
2011-12-13 17:19:40 -05:00
Eric Banks
e47a113c9f
Enabled multi-allelic SNP discovery in the UG. Needs loads of testing so do not use yet. While working in the UG engine, I removed the extraneous and unnecessary MultiallelicGenotypeLikelihoods class: now a VariantContext with PL-annotated Genotypes is passed around instead. Integration tests pass so it must all work, right?
2011-12-12 23:02:45 -05:00
Mauricio Carneiro
5cc1e72fdb
Parallelized SelectVariants
* can now use -nt with SelectVariants for significant speedup in large files
* added parallelization integration tests for SelectVariants
2011-12-12 18:41:14 -05:00
Mauricio Carneiro
a70a0f25fb
Better debug output for SAMDataSource
output the name and number of the files being loaded by the GATK instead of "coordinate sorted".
2011-12-12 17:57:29 -05:00
Mark DePristo
d03425df2f
TODO optimization targets
2011-12-12 17:39:51 -05:00
Laurent Francioli
7cf27bb66e
Updated md5sum for MendelianViolationEvaluator test to reflect the change in column alignment in VariantEval.
2011-12-12 12:22:43 +01:00
Laurent Francioli
025bdfe2cc
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-12 12:19:44 +01:00
Eric Banks
7b6338c742
Merge branch 'master' into trialleles
2011-12-11 00:28:46 -05:00
Eric Banks
7c4b9338ad
The old bi-allelic implementation of the Exact model has been completely deprecated - you can only use the multi-allelic implementation now.
2011-12-11 00:23:33 -05:00
Eric Banks
044f211a30
Don't collapse likelihoods over all alt alleles - that's just not right. For now, the QUAL is calculated for just the most likely of the alt alleles; I need to think about the right way to handle this properly.
2011-12-10 23:57:14 -05:00
Eric Banks
364f1a030b
Plumbing added so that the UG engine can handle multiple alleles and they can successfully be genotyped. Alleles that aren't likely are not allowed to be used when assigning genotypes, but otherwise the greedy PL-based approach is what is used. Moved assign genotypes code to UG engine since it has nothing to do with the Exact model. Still have some TODOs in here before I can push this out to everyone.
2011-12-09 14:25:28 -05:00
Mauricio Carneiro
8475328b2c
Turning off test that breaks read clipper
until we define the desired behavior for clipping in this particular case.
2011-12-09 11:53:12 -05:00
Roger Zurawicki
4cbd1f0dec
Reorganized the testing code and created ClipReadsTestUtils
Tests are more rigorous and include many more test cases.
We can test custom cigars and the generated cigars.
*Still needs debugging because code is not working.
Created test classes to be used across several tests.
Some cases are still commented out.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:52:34 -05:00
Roger Zurawicki
0e9c2cefa2
testHardClipSoftClippedBases works with Matches and Deletions
Insertions are a problem so cigar cases with "I" are commented out.
The test works with multiple deletions and matches.
This is still not a complete test. A lot of cigar test cases are commented out.
Added insertions to ReadClipperUnitTest
ReadClipper now tests for all indels.
Signed-off-by: Mauricio Carneiro <carneiro@broadinstitute.org>
2011-12-09 11:43:37 -05:00
Eric Banks
64dad13e2d
Don't carry around an extra copy of the code for the Haplotype Caller
2011-12-09 11:09:40 -05:00
Eric Banks
442ceb6ad9
The Exact model now computes both the likelihoods and posteriors (in separate arrays); likelihoods are used for assigning genotypes, not the posteriors.
2011-12-09 10:16:44 -05:00
Laurent Francioli
a79144f7db
Merge branch 'master' of ssh://copper.broadinstitute.org/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-09 15:57:24 +01:00
Laurent Francioli
72fbfba97d
Added UnitTests for getFamilies() and getChildrenWithParents()
2011-12-09 15:57:07 +01:00
Laurent Francioli
5a06170804
Corrected bug causing getChildrenWithParents() to not take the last family member into consideration.
2011-12-09 14:51:34 +01:00
Eric Banks
aa4a8c5303
No dynamic programming solution for assigning genotypes; just done greedily now. Fixed QualByDepth to skip no-call genotypes. No-calls are no longer given annotations (attributes).
2011-12-09 02:25:06 -05:00
Eric Banks
2fe50c64da
Updating md5s
2011-12-09 00:47:01 -05:00
Eric Banks
8777288a9f
Don't throw a UserException if the user tries to genotype too many alt alleles. Instead, I've added an argument that allows the user to set the max number of alt alleles to genotype and the UG warns and skips any sites with more than that number.
2011-12-09 00:00:20 -05:00
Eric Banks
3e7714629f
Scrapped the whole idea of an int/long as an index into the ACset: with lots of alternate alleles we run into overflow issues. Instead, simply use the ACcounts array as the hash key since it is unique for each AC conformation. To do this, the array needed to be wrapped inside an object so hashCode() would work.
2011-12-08 23:50:54 -05:00
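The wrapping described in the commit above is needed because a bare Java int[] inherits identity-based equals() and hashCode() from Object, so two arrays with the same contents would land in different hash buckets. A minimal sketch, with the hypothetical class name AlleleCountsKey (not the name used in the GATK source):

```java
import java.util.Arrays;

// Hypothetical wrapper: lets the contents of an allele-count array act as a hash key.
final class AlleleCountsKey {
    private final int[] counts;

    AlleleCountsKey(int[] counts) {
        // Defensive copy so later changes to the caller's array cannot alter the key.
        this.counts = counts.clone();
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof AlleleCountsKey
                && Arrays.equals(counts, ((AlleleCountsKey) other).counts);
    }

    @Override
    public int hashCode() {
        // Content-based hash, consistent with equals().
        return Arrays.hashCode(counts);
    }
}
```

With this wrapper, a HashMap keyed by AlleleCountsKey finds an entry from any array holding the same counts in the same order, and no integer index has to encode the whole conformation.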
Eric Banks
4aebe99445
Need to use longs for the set index (because we can run out of ints when there are too many alternate alleles). Integration tests now use the multiallelic implementation.
2011-12-08 15:31:02 -05:00
Eric Banks
7750bafb12
Fixed bug where last dependent set index wasn't properly being transferred for sites with many alleles. Adding debugging output.
2011-12-08 13:50:50 -05:00
Guillermo del Angel
252e0f3d0a
Merged bug fix from Stable into Unstable
2011-12-08 13:11:39 -05:00
Guillermo del Angel
1bfe28067f
Don't try to genotype an indel even bigger than the reference window size, or else we'll be out of bounds. Necessary to handle Phase 1 integrated callset with large deletions. Better error indication when validating a GenomeLoc.
2011-12-08 12:54:08 -05:00
Mark DePristo
9def841275
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 13:36:16 -05:00
Mark DePristo
4055877708
Prints 0.0 TiTv not NaN when there are no variants
-- Updated md5
2011-12-07 12:07:54 -05:00
Matt Hanna
15533e08df
Fixed issue with RODWalker parallelization.
Turns out that someone previously upped the declared size of a ROD shard to 100M bases, making
each ROD shard larger than the size of chr20. Why didn't we see this in Stable? Because the
ShardStrategy/ShardStrategyFactory mechanism was dutifully ignoring the shard size specification.
When I rolled the ShardStrategy/ShardStrategyFactory mechanics back into the DataSources as part
of the async I/O project, I inadvertently reenabled this specifier.
2011-12-07 11:55:42 -05:00
Mark DePristo
5d2212bc8e
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-07 09:03:17 -05:00
Mark DePristo
6bf18899df
Fix for variant summary -- now treats all 50 bp deletions or insertions as CNVs
2011-12-07 09:02:49 -05:00
Matt Hanna
c9b2cd8ba5
Fix for chartl's stale null representation issue.
2011-12-06 18:05:17 -05:00
Eric Banks
79d18dc078
Fixing indexing bug on the ACsets. Added unit tests for the Exact model code.
2011-12-06 16:17:18 -05:00
Matt Hanna
f5b977fc88
Merge branch 'master' of ssh://gsa1/humgen/gsa-scr1/gsa-engineering/git/unstable
2011-12-06 10:11:35 -05:00
Matt Hanna
4001c22a11
Better file count / buffering variation in test suite. Parameterized read shard buffering. Misc cleanup.
2011-12-06 10:10:38 -05:00
Khalid Shakir
677bea0abd
Right aligning GATKReport numeric columns and updated MD5s in tests.
PreQC parses file with spaces in sample names by using tabs only.
PostQC allows passing the file names for the evals so that flanks can be evaled.
BaseTest's network temp dir now adds the user name to the path so files aren't created in the root.
HybridSelectionPipeline:
- Updated to latest versions of reference data.
- Refactored Picard parsing code replacing YAML.
2011-12-05 23:22:15 -05:00
Eric Banks
7a0f6feda4
Make sure that too many alternate alleles aren't being passed to the genotyper (10 for now) and exit with a UserError if there are.
2011-12-05 16:18:52 -05:00
Eric Banks
7fac4afab3
Fixed priors (now initialized upon engine startup in a multi-dimensional array) and cell coefficients (properly handles the generalized closed form representation for multiple alleles).
2011-12-05 15:57:25 -05:00
Eric Banks
a7cb941417
The posteriors vector is now 2 dimensional so that it supports multiple alleles (although the UG is still hard-coded to use only array[0] for now); the exact model now collapses probabilities for all conformations over a given AC into the posteriors array (in the appropriate dimension). Fixed a bug where the priors and posteriors were being passed in swapped.
2011-12-04 13:02:53 -05:00
Eric Banks
eab2b76c9b
Added loads of comments for future reference
2011-12-03 23:54:42 -05:00
Eric Banks
29662be3d7
Fixed bug where k=2N case wasn't properly being computed. Added optimization for BB genotype case not in old model. At this point, integration tests pass except for 1 case where QUALs differ by 0.01 (this is okay because I occasionally need to compute extra cells in the matrix which affects the approximations) and 2 cases where multi-allelic indels are being genotyped (some work still needs to be done to support them).
2011-12-03 23:12:04 -05:00